Monday 12 September 2011

Nagios check_vmfs Quirk

I came across Duncan Epping's check_vmfs bash script the other day while looking at what NRPE could do in our ESX 3.5 environment. It's fairly basic and does what it says: it reports on VMFS space utilisation.

For interest, here is a copy of the script I wanted to use:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
RET=$?
if [[ $RET -ne 0 ]]
    then
    echo "query problem - No data received from host"
    exit $STATE_UNKNOWN
fi
vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' | while read output ; do
DISKSIZE=$(echo $output | awk '{ print $1 }' )
DISKUSED=$(echo $output | awk '{ print $2 }' )
DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
VOLNAME=$(echo $output | awk '{ print $5 }' )
CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
if [ "/vmfs/volumes/$MYVOL" = $VOLNAME ] ; then
    if [ $CUTPERC -lt $WARNTHRESH ] ; then
        echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_OK
    fi
    if [ $CUTPERC -ge $CRITTHRESH ] ; then
        echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_CRITICAL
    fi
    if [ $CUTPERC -ge $WARNTHRESH ] ; then
        echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_WARNING
    fi
fi
#echo "No data returned"
#exit $STATE_UNKNOWN
done

However, in a SAN environment where every ESX host has the same LUNs attached for HA, having every host report space utilisation for the same list of LUNs is a bit over the top. I'm actually more interested in knowing whether a host has lost its path(s) to a LUN.

So I thought I would make a small modification to the script so that it would actually complain if the VMFS (LUN) disappeared. In its original form the script would produce no output to cover this eventuality, and I thought I could see why: the two commented lines just before the end of the while loop needed to be uncommented, changed to report a missing vmfs and moved outside the loop to pick up anything that didn't hit any of the exit statements. It should have taken about 30 seconds.

Well, I spent about an hour grappling with a very curious symptom: if any of the conditions inside the loop were met, the script would give the correct response but then, instead of exiting, would carry on executing the statements after the loop. This meant that the script would report that it had found a VMFS volume AND report that it couldn't find it. I tried setting variables inside the loop to pick up back in the main section, even exporting them in the hope that they would survive outside the loop, but I seemed to be stuck with some sort of scoping problem.

It was only when I came across an article by Craig Russell on BASH variable scope inside a While loop that I realised what the problem was: the while loop was sitting on the end of a pipe, and in Bash that means the loop runs in a subshell, which is a separate process. So an exit statement inside the loop only ends that subshell, and any variables set inside the loop are lost when it finishes; neither has any effect on the outer script.
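
The effect is easy to reproduce without an ESX host. What follows is a minimal sketch, not part of the original script: check_piped and check_redirected are made-up names and the two volume names are dummy data. Both functions run the same loop; the only difference is how the loop gets its input.

```shell
# Loop on the right-hand side of a pipe: it runs in a subshell, so "exit 2"
# only ends that subshell and the function carries on after the loop.
check_piped() {
    printf 'vol1\nvol2\n' | while read -r vol; do
        if [ "$vol" = "vol1" ]; then
            echo "found $vol"
            exit 2
        fi
    done
    echo "still running after the loop"
}

# Same loop fed by redirection (a here-document): no pipe, so the loop runs
# in the calling shell and "exit 2" really does end it.
check_redirected() {
    while read -r vol; do
        if [ "$vol" = "vol1" ]; then
            echo "found $vol"
            exit 2
        fi
    done <<EOF
vol1
vol2
EOF
    echo "still running after the loop"
}

# Each call is wrapped in $(...) so that the exit cannot end the demo itself.
piped_rc=0
piped_out=$(check_piped) || piped_rc=$?
redir_rc=0
redir_out=$(check_redirected) || redir_rc=$?

echo "piped (status $piped_rc):"
echo "$piped_out"
echo "redirected (status $redir_rc):"
echo "$redir_out"
```

With bash's default settings the piped version prints both "found vol1" and "still running after the loop" and finishes with status 0, while the redirected version stops at the exit with status 2.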

Craig had a reasonable alternative: send the pipeline's output to a temporary file and feed the while loop from that file by redirection instead. I was then able to implement my properly performing script:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
# Note: $? here is only the status of the assignment above, so it is always 0
RET=$?

if [[ $RET -ne 0 ]]
then
  echo "query problem - No data received from host"
  exit $STATE_UNKNOWN
fi

vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' >/tmp/vmfslist.tmp

while read output
  do
  DISKSIZE=$(echo $output | awk '{ print $1 }' )
  DISKUSED=$(echo $output | awk '{ print $2 }' )
  DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
  PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
  VOLNAME=$(echo $output | awk '{ print $5 }' )
  CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
  if [ "/vmfs/volumes/$MYVOL" = "$VOLNAME" ]
    then
    if [ $CUTPERC -lt $WARNTHRESH ]
      then
      echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_OK
    elif [ $CUTPERC -ge $CRITTHRESH ]
      then
      echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_CRITICAL
    elif [ $CUTPERC -ge $WARNTHRESH ]
      then
      echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_WARNING
    fi
  fi
done < /tmp/vmfslist.tmp

echo "No data returned. VMFS Unavailable?"
exit $STATE_CRITICAL
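
For anyone who would rather avoid the temporary file, one possible variation is to keep the pipe but put the "not found" fallback inside the same subshell as the loop, then hand that subshell's exit status back to the caller. This is only a sketch of that idea, not the script I deployed: sample_vdf is a hypothetical stand-in for "vdf -h -P" with invented rows (laid out in the six columns the awk fields above imply), check_vmfs is a made-up function name, and the output is trimmed to the status text.

```shell
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2

# Hypothetical stand-in for "vdf -h -P"; every row here is invented test data.
sample_vdf() {
    echo "/dev/sda1 4.9G 1.6G 3.0G 35% /"
    echo "/vmfs/volumes/uuid-1 500G 450G 50G 90% /vmfs/volumes/LUN01"
    echo "/vmfs/volumes/uuid-2 500G 300G 200G 60% /vmfs/volumes/LUN02"
    echo "/vmfs/volumes/uuid-3 500G 425G 75G 85% /vmfs/volumes/LUN03"
}

check_vmfs() {
    myvol=$1
    warn=$2
    crit=$3
    # The parentheses make the subshell explicit: the loop and the fallback
    # share it, so whichever "exit" runs decides the status of the pipeline.
    sample_vdf | grep -E '^/vmfs/volumes/' | (
        # read splits each row into fields, so no per-field awk calls are needed
        while read -r _dev size used avail pct mount; do
            [ "$mount" = "/vmfs/volumes/$myvol" ] || continue
            num=${pct%\%}   # strip the trailing percent sign
            if [ "$num" -ge "$crit" ]; then
                echo "CRITICAL - $pct used"
                exit $STATE_CRITICAL
            elif [ "$num" -ge "$warn" ]; then
                echo "WARNING - $pct used"
                exit $STATE_WARNING
            fi
            echo "OK - $pct used"
            exit $STATE_OK
        done
        # Reached only when no row matched the requested volume.
        echo "No data returned. VMFS Unavailable?"
        exit $STATE_CRITICAL
    )
    # A pipeline's status is that of its last element, the ( ... ) group.
    return $?
}

for vol in LUN01 LUN02 LUN99; do
    rc=0
    check_vmfs "$vol" 80 90 || rc=$?
    echo "exit status for $vol: $rc"
done
```

With the sample rows and thresholds of 80 and 90, LUN01 comes back CRITICAL (status 2), LUN02 comes back OK (status 0), and the non-existent LUN99 falls through to the "No data returned" message with status 2. In a real plugin the function's status would be passed straight to exit.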


The lesson? There's no real substitute for knowing how stuff works... properly.
