Mo’ Nagios Projects

This last week, I taught a Nagios class and ironed out a few more integration pieces that I will probably incorporate into my next class. I’ve been getting a lot of email and interest in a “boot camp” (along the lines of the CCIE boot camps), and if I can get the corporate sponsorship and community interest, I will definitely think about helping put one together. (Any ideas on what that might look like would be greatly appreciated!)

The Nagios class turned out to be quite a cool experience. I had seven students, who came from SAIC (1), Horizon Technology (1), D-Link (4), and RK (1). LAMP skills varied widely among the students, and I’m hoping that everyone left the class with something significant. Some of them have joined our LinkedIn group and have been networking with others there.

For what it’s worth, here are five more integration projects that I started this week. I will iron out HOWTOs on NagiosWiki once I get some time. (Those wanting specifics beyond what I have listed are welcome to email me, but please realize that I may not have all the kinks worked out yet.)

1. ticketing and Nagios: once Nagios detects a down host or service, email OTRS (Open-source Ticket Request System) and autogenerate a trouble ticket. When that host or service comes back online, the email on status change closes out the ticket. Once I iron out a few wrinkles, I’ll integrate this into RT (which I think is “better” in some ways).
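For anyone who wants to see the shape of the email half of this, here is a minimal sketch in Python of the message a Nagios notification command could hand to a ticket queue. The addresses, the subject format, and the idea that the ticket system pairs a RECOVERY with the earlier PROBLEM by matching on the subject are all my assumptions for illustration, not something I have wired up yet.

```python
from email.message import EmailMessage

# Hypothetical addresses; replace with your own ticket queue and sender.
TICKET_QUEUE = "tickets@example.com"
SENDER = "nagios@example.com"


def build_ticket_email(notification_type, host, service, state, output):
    """Build the mail a Nagios notification command could send to a ticket queue.

    The subject carries a stable host/service key so that the ticket system
    (assumed to be configured to match on it) can pair a RECOVERY with the
    ticket opened by the earlier PROBLEM.
    """
    key = f"{host}/{service}" if service else host
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = TICKET_QUEUE
    msg["Subject"] = f"[nagios] {notification_type}: {key} is {state}"
    msg.set_content(
        f"Notification: {notification_type}\n"
        f"Host: {host}\n"
        f"Service: {service or '(host check)'}\n"
        f"State: {state}\n"
        f"Output: {output}\n"
    )
    return msg
```

The arguments stand in for Nagios macros such as $NOTIFICATIONTYPE$, $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$. Actually sending the message (smtplib, or piping it to the local mailer) is left out on purpose.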

2. Monitoring for ssh key corruption: When ssh keys get corrupted and need to be regenerated, check_ssh will not detect the login error (to my knowledge, at least). I’m not a Perl programmer, but I’m hoping that something in Net::Telnet (which can be told to use SSH for the underlying transport) or Net::SSH can help me prove that a login is failing on a few thousand routers. (Still googling for what others have done in this department. Any ideas here would be greatly appreciated!)
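One way this could look, sketched in Python around the stock OpenSSH client rather than the Perl modules mentioned above: attempt a non-interactive login and turn the result into a Nagios status. The function names and the "exit" remote command are my own placeholders (a router may want something else entirely), and I have not run this against a few thousand routers, so treat it as a starting point only.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def classify(returncode, stderr):
    """Map the result of an ssh run to a (Nagios status, message) pair."""
    if returncode == 0:
        return OK, "SSH OK - login succeeded"
    text = stderr.strip()
    last = text.splitlines()[-1] if text else "no error output"
    if "Host key verification failed" in text or "IDENTIFICATION HAS CHANGED" in text:
        return CRITICAL, "SSH CRITICAL - host key mismatch: " + last
    if "Permission denied" in text:
        return CRITICAL, "SSH CRITICAL - login rejected: " + last
    if returncode == 255:
        # OpenSSH exits with 255 when ssh itself fails (connect, key exchange, auth).
        return CRITICAL, "SSH CRITICAL - " + last
    # Any other code comes from the remote command, so the login itself worked.
    return WARNING, f"SSH WARNING - logged in, remote command exited {returncode}"


def check_login(host, remote_command="exit", timeout=15):
    """Attempt a non-interactive, key-based login and classify the outcome."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
           host, remote_command]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return CRITICAL, f"SSH CRITICAL - no answer from {host} within {timeout}s"
    except FileNotFoundError:
        return UNKNOWN, "SSH UNKNOWN - ssh client not found"
    return classify(proc.returncode, proc.stderr)
```

To use it as a plugin, call check_login() with the host name, print the message, and exit with the status. BatchMode=yes keeps ssh from sitting at a password prompt, which is what lets a broken key show up as a failure instead of a hang.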

3. service reliability checks using NagiosPluginsNT: If you’d like to run a check *from* some weird nook and cranny in your network and do not want to deploy a Linux box with NSCA so you can relay passive checks, consider doing the following:

a. installing NSclient++

b. dropping the NagiosPluginsNT plugins in your c:\path\to\nsclient++\scripts directory

c. modding your c:\path\to\nsclient++\nsc.ini file to include

check_http_google=C:\Program Files\nsclient++\scripts\check_http.exe -H www.google.com

(Of course, test this from your Nagios server: “check_nrpe -H windows-box -c check_http_google”)

4. check_disk on /proc/mounts: started adding the following NRPE handler in the nrpe.cfg of various Linux servers with weird disk partitioning.

command[check_disks_proc_mounts]=/usr/lib/nagios/plugins/check_disk -w 15% -c 10% $(for x in $(cat /proc/mounts |awk '{print $2}')\; do echo -n " -p $x "\; done)

(I had horrible problems with this command yesterday: vim commented out certain sections, and I wasted time trying to escape those characters. Muchos grassyass to my buddy Ed for helping me debug this one!)

Traditionally, one would just run “fdisk -l” or “df -h” and then write a separate NRPE handler for each volume. In environments with crazy partitioning (or, better yet, NO partitioning!) or crazy LUN volumes, you gotta just send one command that checks the collective health of everything and then reports back if any one of those volumes has crossed its warning or critical level. If a server is important enough that a particular volume or media needs to be checked against a specific parameter, then consider hard coding a specific NRPE handler for that server.
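If the shell one-liner gives you the same escaping headaches it gave me, the same idea can be sketched in Python: pull the mount points out of /proc/mounts and build one check_disk call that covers all of them. The function names are made up for illustration, and dropping duplicate mount points is my own addition that the one-liner does not do.

```python
def mount_points(proc_mounts_text):
    """Return the unique mount points (second field) in first-seen order."""
    seen = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] not in seen:
            seen.append(fields[1])
    return seen


def check_disk_command(mounts, warn="15%", crit="10%",
                       plugin="/usr/lib/nagios/plugins/check_disk"):
    """Build the argument list for a single check_disk run over all mounts."""
    cmd = [plugin, "-w", warn, "-c", crit]
    for mount in mounts:
        cmd += ["-p", mount]
    return cmd
```

On a real box you would read /proc/mounts, pass its text to mount_points(), and hand the resulting list to subprocess.run(), returning check_disk's own output and exit code to NRPE. Because the command is built as a list, there are no quotes or semicolons to escape.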

5. DNX + Nagios: This project offloads active checks to worker boxes, saving you (theoretically) lots and lots of time changing active checks to passive ones via NRPE and NSCA. I just untar’d the project and have been reading over the documentation. It looks easy enough to integrate, but I’ll know more when I put the rubber to the road.

The bottom line to these projects: automate, automate, automate! If you have to do it once, then do it manually. If you have to do it twice, do it manually *and* document. If you have to do it manually a hundred times? Please... automate, yo!

