Monday 19 September 2011

Nagios Plugin 'check_procs' incorrectly finds 0 processes

When checking for running processes on remote Linux systems via NRPE, the Nagios plugin check_procs -C <process commandname> occasionally returns unexpected results.

Example on a ZENworks 7 server

If we look for the tftpd daemon using ps:
ZEN03:/usr/local/nagios/libexec # ps -ef|grep tftpd
root     4103      1  0 Sep14 ?        00:01:31 /opt/novell/bin/novell-tftpd
root     20047 17950  0 13:06 pts/0    00:00:00 grep tftpd

Then we look for it with check_procs:
ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-tftpd
PROCS OK: 1 process with command name 'novell-tftpd'

The check_procs plugin correctly reports that one process has been found with this name.

However, if we look for the proxy dhcp daemon using ps:
ZEN03:/usr/local/nagios/libexec # ps -ef |grep proxy
root  21171     1  0 Sep18 ?        00:00:00 /opt/novell/bin/novell-proxydhcpd
root  20137 17950  0 13:07 pts/0    00:00:00 grep proxy

And then with check_procs:
ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-proxydhcpd
PROCS OK: 0 processes with command name 'novell-proxydhcpd'

In this case, the check_procs plugin has reported that 0 processes were found, even though ps clearly shows the daemon running.

The trick in these situations is to ask check_procs for more information using the -vv switch:
ZEN03:/usr/local/nagios/libexec # ./check_procs -vv -C novell-proxydhcpd
CMD: /bin/ps axwo 'stat uid pid ppid vsz rss pcpu comm args'
PROCS OK: 0 processes with command name 'novell-proxydhcpd'

Here check_procs has told us the exact ps command it runs to gather the information it reports back to us.

So let us use that parameter list for our own check:
ZEN03:/usr/local/nagios/libexec # /bin/ps axwo 'stat uid pid ppid vsz rss pcpu comm args'|grep proxy
S       0 21171     1   1412   396  0.0 novell-proxydhc /opt/novell/bin/novell-proxydhcpd
S+      0 20263 17950   1712   672  0.0 grep            grep proxy

And there it is: on this system (SLES9), ps reports only the first 15 characters of the command name in the comm column (novell-proxydhc), so check_procs finds no match for the full name.

So what we have to do here is ask check_procs to look for only the first 15 characters:
ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-proxydhc
PROCS OK: 1 process with command name 'novell-proxydhc'

We can now correctly check for our proxy dhcp daemon in Nagios.
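
To avoid counting characters by hand, the truncation can be done in the shell. This is only a minimal sketch: the function name comm_name is made up for illustration, and it assumes the 15-character limit seen in the ps output above applies on your system too.

```shell
#!/bin/bash
# comm_name (hypothetical helper): print the first 15 characters of a
# process name, i.e. what ps showed in its comm column on this system.
comm_name() {
    printf '%.15s\n' "$1"
}

short=$(comm_name novell-proxydhcpd)
echo "$short"    # prints: novell-proxydhc

# The result can then be handed to the plugin, for example:
#   ./check_procs -C "$short"
```

Bear in mind that a truncated name will also match any other process whose name starts with the same 15 characters.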

Monday 12 September 2011

Nagios check_vmfs Quirk

I came across Duncan Epping's check_vmfs bash script the other day when I was looking at what NRPE could do in our ESX 3.5 environment. It's fairly basic and did what it said it could do: report on vmfs space utilisation.

For interest, here is a copy of the script I wanted to use:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
RET=$?
if [[ $RET -ne 0 ]]
    then
    echo "query problem - No data received from host"
    exit $STATE_UNKNOWN
fi
vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' | while read output ; do
DISKSIZE=$(echo $output | awk '{ print $1 }' )
DISKUSED=$(echo $output | awk '{ print $2 }' )
DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
VOLNAME=$(echo $output | awk '{ print $5 }' )
CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
if [ "/vmfs/volumes/$MYVOL" = $VOLNAME ] ; then
    if [ $CUTPERC -lt $WARNTHRESH ] ; then
        echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_OK
    fi
    if [ $CUTPERC -ge $CRITTHRESH ] ; then
        echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_CRITICAL
    fi
    if [ $CUTPERC -ge $WARNTHRESH ] ; then
        echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_WARNING
    fi
fi
#echo "No data returned"
#exit $STATE_UNKNOWN
done

However, in a SAN environment where every ESX host has the same LUNs attached for HA, having every host report on space utilisation for the same list of LUNs is a bit over the top. I'm actually more interested in knowing whether a host has lost its path(s) to a LUN.

So I thought I would make a small modification to the script so that it would actually complain if the VMFS (LUN) disappeared. In its original form the script would produce no output to cover this eventuality, and I thought I could see why: the two commented lines just before the end of the while loop needed to be uncommented, changed to report a missing vmfs and moved outside the loop to pick up anything that didn't hit any of the exit statements. It should have taken about 30 seconds.

Well, I spent about an hour grappling with a very curious symptom: if any of the conditions inside the loop was met, the script would provide the correct response, but then, instead of exiting, would appear to carry on executing the statements after the loop. This meant that the script would report that it had found a vmfs volume AND report that it couldn't find it. I tried setting variables inside the loop to pick up back in the main section, even exporting them to make them exist outside the script, but I seemed to be stuck with some sort of scoping problem.

It was only when I came across an article by Craig Russell on BASH variable scope inside a While loop that I realised what the problem was: the while loop was sitting on the end of a pipe, and in Bash that meant the loop was running in a subshell, a separate process from the main script. So the exit statements and any variables set inside the loop have no effect in the outer script.
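
The behaviour is easy to reproduce in isolation. The following is a small sketch, not taken from the check_vmfs script; the variable names and the vol1/vol2 input lines are made up, and it assumes bash with its default settings.

```shell
#!/bin/bash
# Piped loop: with bash's default settings the loop body runs in a subshell.
found_piped=no
printf 'vol1\nvol2\n' | while read -r line; do
    found_piped=yes   # changes only the subshell's copy of the variable
    exit 0            # ends only the subshell, not this script
done
echo "after piped loop:      found_piped=$found_piped"

# Redirected loop: the loop runs in the current shell, so the change persists.
found_redirected=no
while read -r line; do
    found_redirected=yes
done <<EOF
vol1
vol2
EOF
echo "after redirected loop: found_redirected=$found_redirected"
```

Run under bash, the first echo reports found_piped=no and the second reports found_redirected=yes. An exit placed inside the second loop would end the whole script, which is the behaviour the plugin needs.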

Craig had a reasonable alternative: write the output to a temporary file and redirect that file into the while loop instead. I was then able to implement my properly performing script:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
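# Note: $? here is the exit status of the assignment above, which is always 0,
# so this check can never fire.
RET=$?

if [[ $RET -ne 0 ]]
then
  echo "query problem - No data received from host"
  exit $STATE_UNKNOWN
fi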

vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' >/tmp/vmfslist.tmp

while read output
  do
  DISKSIZE=$(echo $output | awk '{ print $1 }' )
  DISKUSED=$(echo $output | awk '{ print $2 }' )
  DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
  PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
  VOLNAME=$(echo $output | awk '{ print $5 }' )
  CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
  if [ "/vmfs/volumes/$MYVOL" = $VOLNAME ]
    then
    if [ $CUTPERC -lt $WARNTHRESH ]
      then
      echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_OK
    elif [ $CUTPERC -ge $CRITTHRESH ]
      then
      echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_CRITICAL
    elif [ $CUTPERC -ge $WARNTHRESH ]
      then
      echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_WARNING
    fi
  fi
done < /tmp/vmfslist.tmp

echo "No data returned. VMFS Unavailable?"
exit $STATE_CRITICAL
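
For what it's worth, the fixed temporary file can be avoided as well. A pipeline's exit status is that of its last command, so if the loop lives in a function at the end of the pipe, the function's return code still reaches the main script even though the function runs in a subshell. The following is only a sketch of that idea: the function name check_volume, the sample lines and the thresholds are all invented for illustration, and it has not been run against a real vdf.

```shell
#!/bin/bash
# check_volume (hypothetical helper): reads "size used avail pct mountpoint"
# lines on stdin and reports on volume $1 with warn/crit thresholds $2/$3.
check_volume() {
    local vol=$1 warn=$2 crit=$3
    local size used avail pct mount
    while read -r size used avail pct mount; do
        [ "$mount" = "/vmfs/volumes/$vol" ] || continue
        pct=${pct%\%}                      # "95%" -> "95"
        if [ "$pct" -ge "$crit" ]; then
            echo "CRITICAL - ${pct}% used"; return 2
        elif [ "$pct" -ge "$warn" ]; then
            echo "WARNING - ${pct}% used"; return 1
        fi
        echo "OK - ${pct}% used"; return 0
    done
    # No line matched the requested volume.
    echo "No data returned. VMFS Unavailable?"
    return 2
}

# Made-up sample input standing in for the filtered vdf output.
sample_lines='100G 50G 50G 50% /vmfs/volumes/lun1
200G 190G 10G 95% /vmfs/volumes/lun2'

printf '%s\n' "$sample_lines" | check_volume lun2 80 90
status=$?
echo "exit status: $status"
```

In the real plugin the printf would be replaced by the vdf, grep and awk pipeline from the script above, followed by exit $status.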


The lesson? There's no real substitute for knowing how stuff works... properly.