Bang Bang Sounds Like Machinery

When RPF Breaks Traceroute

2014-04-02T05:37:00.001-07:00

I came across an interesting little problem recently which was quite fun to unravel....

I work on a hub and spoke network, where most traffic from the spokes follows the default route back to the hubs, except for a few specific destinations which must be reached through a public-network-facing interface.

One day I tried to run a traceroute from one of our site routers towards one of these specific destinations, and found that I couldn't get past the first hop (the provider router).

fr-rt01#traceroute 155.231.48.203

Type escape sequence to abort.
Tracing the route to 155.231.48.203

  1 10.197.251.1 4 msec 0 msec 0 msec
  2  *  *  *
  3  *  *  *

Here's the interface config:

interface GigabitEthernet0/0.801
 description Exit to N3 gateway
 encapsulation dot1Q 801
 ip address 10.197.251.4 255.255.255.0
 ip access-group N3-ACCESS-IN in
 ip verify unicast reverse-path
 no ip unreachables
 ip inspect SDM_LOW in
 ip inspect SDM_LOW out
 ip nat outside
 ip virtual-reassembly
 no cdp enable
 crypto map N3-CM
end

So I removed the ACL, then the inspect statements and then the "no ip unreachables", but I still got the same result.

The next hop looked good but I checked my route just to make sure:

fr-rt01#sh ip ro 155.231.48.203
Routing entry for 155.231.48.0/24
  Known via "bgp 65139", distance 200, metric 1, type internal
  Last update from 10.139.202.11 18:44:40 ago
  Routing Descriptor Blocks:
  * 10.139.202.11, from 10.139.202.11, 18:44:40 ago
      Route metric is 1, traffic share count is 1
      AS Hops 0

fr-rt01#sh ip ro 10.139.202.11
Routing entry for 10.139.200.0/22
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 10.197.251.1
      Route metric is 0, traffic share count is 1

Then clutching at straws, I removed my "ip verify unicast reverse-path", and traceroute worked just like it should...

fr-rt01#traceroute 155.231.48.203

Type escape sequence to abort.
Tracing the route to 155.231.48.203

  1 10.197.251.1 4 msec 0 msec 0 msec
  2 81.147.220.170 8 msec 8 msec 8 msec
  3 172.16.216.193 8 msec 8 msec 8 msec
  4 217.36.152.160 8 msec 8 msec 8 msec
  5 217.36.152.64 8 msec 8 msec 8 msec
  6 10.100.100.67 12 msec 8 msec 8 msec

And then I went and read up on traceroute and RPF to try to understand what was going on.

When traceroute sends each UDP packet out, it expects to get an ICMP type 11 code 0 (time exceeded) back from each intermediate hop/router, with a source IP address of its exit interface (towards me).

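RPF (the unicast reverse-path check enabled by "ip verify unicast reverse-path") does a reverse lookup of the source address against CEF to check that the receiving interface is one of the best return paths to that address. If it isn't, the packet is dropped.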

Taking the first few addresses in the working traceroute:

fr-rt01#sh ip cef 81.147.220.170 
0.0.0.0/0, version 191479, epoch 0
0 packets, 0 bytes
  via 192.168.13.196, Tunnel13196, 0 dependencies
    next hop 192.168.13.196, Tunnel13196
    valid adjacency

fr-rt01#sh ip cef 172.16.216.193
0.0.0.0/0, version 191479, epoch 0
0 packets, 0 bytes
  via 192.168.13.196, Tunnel13196, 0 dependencies
    next hop 192.168.13.196, Tunnel13196

And there's the problem: I have no specific route to the intermediate addresses, so they match only my default route, which leaves by a different interface (Tunnel13196). So RPF is doing its job properly and breaking my traceroute.
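The check can be sketched in a few lines of Python. This is only a rough model, not how CEF is implemented: the FIB list below is a hand-built assumption based on the outputs shown in this post (with the BGP route already resolved to its exit interface), and the function names are made up for illustration.

```python
import ipaddress

# Hand-built FIB (an assumption): prefix -> interface the best return path uses.
FIB = [
    ("0.0.0.0/0", "Tunnel13196"),
    ("10.197.251.0/24", "GigabitEthernet0/0.801"),
    ("10.139.200.0/22", "GigabitEthernet0/0.801"),
    ("155.231.48.0/24", "GigabitEthernet0/0.801"),
]

def best_interface(address):
    """Return the interface of the longest prefix matching the address."""
    addr = ipaddress.ip_address(address)
    matches = [(ipaddress.ip_network(prefix), interface)
               for prefix, interface in FIB
               if addr in ipaddress.ip_network(prefix)]
    return max(matches, key=lambda match: match[0].prefixlen)[1]

def strict_rpf_pass(source, arrived_on):
    """Strict-mode check: the packet must arrive on the best return interface."""
    return best_interface(source) == arrived_on

for hop in ("10.197.251.1", "81.147.220.170", "172.16.216.193"):
    ok = strict_rpf_pass(hop, "GigabitEthernet0/0.801")
    print(hop, "pass" if ok else "drop")
```

The first hop sits on the connected subnet and passes; the next two only match the default route via the tunnel, so their time-exceeded replies arriving on Gi0/0.801 fail the check.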

In this case, it seems that the best solution is to tell RPF to ignore my traceroute replies by adding an ACL defining them as exceptions:

fr-rt01#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
fr-rt01(config)#access-list 101 permit icmp any host 10.197.251.4 11 0
fr-rt01(config)#int gi0/0.801
fr-rt01(config-subif)#ip verify unicast reverse-path 101
fr-rt01(config-subif)#^Z
fr-rt01#

Which allows both features to work without side effects:

fr-rt01#traceroute 155.231.48.203

Type escape sequence to abort.
Tracing the route to 155.231.48.203

  1 10.197.251.1 4 msec 0 msec 0 msec
  2 81.147.220.170 8 msec 8 msec 8 msec
  3 172.16.216.193 8 msec 8 msec 8 msec
  4 217.36.152.160 8 msec 8 msec 8 msec
  5 217.36.152.64 8 msec 8 msec 8 msec
  6 10.100.100.67 12 msec 8 msec 8 msec
  7  *  *  *

fr-rt01#sh access-list 101
Extended IP access list 101
    10 permit icmp any host 10.197.251.4 ttl-exceeded (15 matches)
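
As a rough illustration of what the ACL changes: a packet that fails the reverse-path check is now looked up in access-list 101, and is forwarded anyway if the ACL permits it. The Python below is a simplified sketch of that decision only; the function and parameter names are hypothetical.

```python
ROUTER_ADDRESS = "10.197.251.4"   # the sub-interface address from the config above
TIME_EXCEEDED = (11, 0)           # ICMP type 11, code 0

def acl_101_permits(protocol, icmp_type, icmp_code, destination):
    """Model of: permit icmp any host 10.197.251.4 11 0."""
    return (protocol == "icmp"
            and (icmp_type, icmp_code) == TIME_EXCEEDED
            and destination == ROUTER_ADDRESS)

def accept_packet(rpf_ok, protocol, icmp_type, icmp_code, destination):
    """Forward if RPF passes, or if the exception ACL permits the packet."""
    return rpf_ok or acl_101_permits(protocol, icmp_type, icmp_code, destination)

# A traceroute reply from a hop that fails RPF is now accepted...
print(accept_packet(False, "icmp", 11, 0, "10.197.251.4"))
# ...but other traffic failing RPF is still dropped.
print(accept_packet(False, "icmp", 8, 0, "10.197.251.4"))
```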

A Flexible RRD Checker for Nagios

2014-03-06T02:57:00.000-08:00

I was asked recently to get Nagios to flag *under*utilisation for a bunch of WAN links.

I had been using a shell script written, I think, by Garry Cook and Israel Brewster, with a number of hacks to add some extra functionality, but I couldn't get this additional mod going without a complete rewrite.

#!/usr/bin/python
#
# NAME:     check_rrd.py
# AUTHOR:   Philip Damian-Grint
# MODIFIED: 6th March 2014
# VERSION:  0.5
#
# DESCRIPTION:
#   Nagios Plugin to compare utilisation values from an RRD file with 
#   warning and critical thresholds.
#   Features:
#   1.  Threshold units and RRD units can be individually specified.
#       Thresholds default to Kilobytes/sec and RRD units default to Bytes/sec (MRTG default)
#   2.  RRD filepath can be supplied on the command line or via environment variable
#   3.  Threshold direction can be reversed so that low utilisation can also be checked
#   4.  Time period can be specified in minutes, hours, days or months; a basic mean average
#       is taken over multiple records. Defaults to 10 minutes
#   5.  Multipliers used for unit conversion can be decimal (default) or binary
#   6.  Threshold behaviour can be specified so that only one direction, both directions, 
#       any (default) direction, or the sum of both directions can be checked against the threshold.
#   7.  A maximum age of data threshold can be specified
#   8.  An http link (or any text) can be appended to line 1 output.
#
# Notes:
#   1.  This has been tested on a Centos 6.4 system with Nagios v4, rrdtool v1.4.8,
#       and Python 2.6.6
#   2.  All errors prior to fetching data or resulting in invalid or suspect data return UNKNOWN.
#   3.  At present, only AVERAGE values are processed
#   4.  Verbose includes a report on number of empty records, latest timestamp, RRD file processed,
#       threshold behaviour and threshold direction
#
#   Example configuration:
# 
# file: checkcommands.cfg
#
# # Check 7-day average sum of in and out not below supplied thresholds, 
# #   and insert a link to MRTG at the end of line 1
# #'check_under_util' command definition
# define command{
#    command_name    check_under_util
#    command_line    $USER1$/check_rrd -f /usr/local/mrtg/share/rrd/$ARG1$.rrd -w $ARG2$ -c $ARG3$ -r -p 7days -m sumonly -v -l '<a href=/mrtg/cgi-bin/mrtg-rrd.cgi/$ARG1$.html style=font-size:6pt target=_blank>MRTG</a>'
#    }
# 

import argparse
from argparse import RawTextHelpFormatter
import os
import re
import rrdtool
import sys
import time

class CheckRRD(object):
    '''Structure to store key variables and data'''

    # Nagios states - offsets = return code
    states = ('OK', 'WARNING', 'CRITICAL', 'UNKNOWN')
    
    # Units for thresholds and data - offsets used to index into multiplier table
    units = ('b', 'B', 'K', 'M', 'G')

    # 5x5 tables to convert data units into threshold units (b,B,K,M,G rows and columns)
    multi_bin = ((1,8,8192,8388608,8589934592),
                   (0.125,1,1024,1048576,1073741824),
                   (0.00012207,0.00097656,1,1024,1048576),
                   (1.19209E-07,9.53674E-07,0.000976563,1,1024),
                   (1.16415E-10,9.31323E-10,9.53674E-07,0.000976563,1))
    multi_dec = ((1,8,8000,8000000,8000000000),
                   (0.125,1,1000,1000000,1000000000),
                   (0.000125,0.001,1,1000,1000000),
                   (0.000000125,0.000001,0.001,1,1000),
                   (1.25E-10,0.000000001,0.000001,0.001,1))

    def __init__(self):
        self.version = '0.5'
        self.output = ''        # Nagios plugin output line 1
        self.info = ''          # additional output for line 2 onwards
        self.status = CheckRRD.states.index('OK')   # default to successful return code
        self.empty = 0          # number of empty records found in dataset
        self.stale_secs = False # conditional data age check
        self.verbose = False

def parse_args():
    '''Retrieve and sanity-check script arguments'''

    parser = argparse.ArgumentParser(
            description='RRD Threshold Check Script v{0}'.format(rrdchk.version),
            formatter_class=RawTextHelpFormatter,
            epilog='Notes:'
                    + '\n- Warning and critical thresholds are AVERAGE values.'
                    + '\n- Units for stored data and thresholds:'
                    + '\n  "b"=bps, "B"=Bps, "K"=KBps, "M"=MBps, "G"=GBps'
                    + '\n  output BW uses threshold units'
                    + '\n- Threshold behaviour:'
                    + '\n  "inout": both IN and OUT must breach'
                    + '\n  "sumonly": sum of IN and OUT must breach'
                    + '\n  "inonly"/"outonly": specified threshold must breach'
                    + '\n  "any": either threshold can breach')
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument( '-f', dest='rrd_file',
                            action='store',
                            help='rrd file-path')
    group.add_argument( '-e', dest='rrd_env',
                            action='store',
                            help='rrd environment variable')
    parser.add_argument('-r', dest='direction',
                            action='store_true',
                            default=False,
                            help='reverse threshold direction (low check)')
    parser.add_argument('-b', dest='binary',
                            action='store_true',
                            default=False,
                            help='use binary (1024) multiples instead of decimal (1000)')
    parser.add_argument('-m', dest='threshold',
                            action='store',
                            default='any',
                            choices=['any','inout','inonly','outonly','sumonly'],
                            help='threshold behaviour: (any|inout|inonly|outonly|sumonly)')
    parser.add_argument('-l', dest='embedded_link',
                            action='store',
                            help='http link to append to output line 1')
    parser.add_argument('-a', dest='age_check',
                            action='store',
                            default=False,
                            help='data age threshold in seconds')
    parser.add_argument('-w', dest='warning',
                            action='store',
                            required=True,
                            help='warning threshold')
    parser.add_argument('-c', dest='critical',
                            action='store',
                            required=True,
                            help='critical threshold')
    parser.add_argument('-p', dest='period',
                            action='store',
                            default='10minutes',
                            help='time period: N{minutes|hours|days|months}, default 10minutes')
    parser.add_argument('-d', dest='rrd_units',
                            action='store',
                            choices=['b','B','K','M','G'],
                            default='B',
                            help='rrd data units (default Bytes/sec)')
    parser.add_argument('-u', dest='thresh_units',
                            action='store',
                            choices=['b','B','K','M','G'],
                            default='K',
                            help='threshold units (default Kilobytes/sec)')
    parser.add_argument('-v', dest='verbose',
                            action='store_true',
                            help='verbose output')
    
    args = parser.parse_args()

    # Any arguments?
    if len(sys.argv) == 1:
        parser.print_usage()
        return False

    # How much verbosity?
    if args.verbose:
        rrdchk.verbose = True
        rrdchk.info = '\n'

    # Path to RRD file?
    if args.rrd_file:
        rrdchk.rrd_path = args.rrd_file
    elif args.rrd_env:
        try:
            rrdchk.rrd_path = os.environ[args.rrd_env]
        except KeyError:
            return bail('Error reading environment variable {0}'.format(args.rrd_env))
    if rrdchk.verbose:
        rrdchk.info += 'RRD file:{0}'.format(rrdchk.rrd_path)

    # Input and output units
    rrdchk.runits = args.rrd_units
    rrdchk.tunits = args.thresh_units

    # Warning and Critical supplied?
    try:
        rrdchk.warning = int(args.warning)
        rrdchk.critical = int(args.critical)
    except (TypeError,ValueError):
        return bail('Warning ({0}) and Critical ({1}) thresholds must be positive integers'.format(args.warning, args.critical))

    # Threshold higher or lower?
    if args.direction:
        rrdchk.opt_eq = '<='
        if rrdchk.verbose:
            rrdchk.info += ', checking for LOW threshold'
    else:
        rrdchk.opt_eq = '>='

    # Reasonable time period?
    # Anchor the pattern so trailing junk (e.g. "10minutesX") is rejected
    period = re.match(r'([0-9]+)(minutes|hours|days|months)$', args.period)
    if not period:
        return bail('Invalid time period')
    elif ((int(period.group(1)) > 12 and period.group(2) == 'months') or
          (int(period.group(1)) > 365 and period.group(2) == 'days') or
          (int(period.group(1)) > 8760 and period.group(2) == 'hours') or
          (int(period.group(1)) > 525600 and period.group(2) == 'minutes')):
        return bail('Unreasonable time period')
    else:
        rrdchk.period = args.period

    # Mandatory thresholds?
    rrdchk.behaviour = args.threshold

    # Binary vs decimal multipliers for calculations?
    if args.binary:
        rrdchk.multipliers = CheckRRD.multi_bin
    else:
        rrdchk.multipliers = CheckRRD.multi_dec

    # Record age check required?
    if args.age_check:
        try:
            rrdchk.stale_secs = int(args.age_check)
        except (TypeError,ValueError):
            return bail('Data age threshold must be a positive integer, when present')

    # Store embedded link if supplied
    if args.embedded_link:
        rrdchk.href = args.embedded_link
    else:
        rrdchk.href = ''

    return True

def fetch_data():
    '''Retrieve traffic samples for requested period'''

    # First check data age
    try:
        rrdchk.rrd_info = rrdtool.info(rrdchk.rrd_path)
    except rrdtool.error as e:
        return bail('Error from RRDTOOL.info: {0}'.format(e))

    if rrdchk.stale_secs:
        if int(time.time()) - rrdchk.rrd_info['last_update'] > rrdchk.stale_secs:
            return bail('Data age check failed: latest dataslot more than {0} minutes ago ({1})'.format(
                    int(rrdchk.stale_secs/60.0),
                    time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(rrdchk.rrd_info['last_update']))))

    # Then pull a dataset
    try:
        ((start_time,
          end_time,
          interval),
         (ds0, ds1),
          rrdchk.dataset) = rrdtool.fetch(rrdchk.rrd_path,
                                         'AVERAGE',
                                         '-s-{0}'.format(rrdchk.period))
    except rrdtool.error as e:
        return bail('Error from RRDTOOL.fetch: {0}'.format(e))

    if rrdchk.verbose:
        rrdchk.info += ', latest dataslot found: {0}'.format(
                time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(rrdchk.rrd_info['last_update'])))

    return True

def normalise_data():
    '''Convert data to same units as thresholds'''

    # Sum dataset, keep track of empty slots
    in_sum = out_sum = 0
    for (in_slot, out_slot) in rrdchk.dataset:
        if in_slot is not None and out_slot is not None:
            in_sum += in_slot
            out_sum += out_slot
        else:
            rrdchk.empty += 1
    
    if rrdchk.verbose:
        rrdchk.info += ', found {0} empty records out of {1}'.format(rrdchk.empty, len(rrdchk.dataset))

    # Check for empty dataset
    if rrdchk.empty == len(rrdchk.dataset):
        return bail('No records in the time period contained data')

    # Calculate averages
    rrdchk.in_average = in_sum / (len(rrdchk.dataset)-rrdchk.empty)
    rrdchk.out_average = out_sum / (len(rrdchk.dataset)-rrdchk.empty)

    # Convert averages into threshold units
    rrdchk.in_norm = rrdchk.in_average * rrdchk.multipliers[CheckRRD.units.index(rrdchk.tunits)][CheckRRD.units.index(rrdchk.runits)]
    rrdchk.out_norm = rrdchk.out_average * rrdchk.multipliers[CheckRRD.units.index(rrdchk.tunits)][CheckRRD.units.index(rrdchk.runits)]

    return True

def check_threshold():
    '''Carry out threshold checking on normalised data '''

    in_status = 0
    out_status = 0
    sum_status = 0

    # Calculate all possible statuses
    if eval("rrdchk.in_norm {0} rrdchk.warning".format(rrdchk.opt_eq)):
        in_status = CheckRRD.states.index('WARNING')
    if eval("rrdchk.in_norm {0} rrdchk.critical".format(rrdchk.opt_eq)):
        in_status = CheckRRD.states.index('CRITICAL')
    if eval("rrdchk.out_norm {0} rrdchk.warning".format(rrdchk.opt_eq)):
        out_status = CheckRRD.states.index('WARNING')
    if eval("rrdchk.out_norm {0} rrdchk.critical".format(rrdchk.opt_eq)):
        out_status = CheckRRD.states.index('CRITICAL')
    if eval("(rrdchk.out_norm + rrdchk.in_norm) {0} rrdchk.warning".format(rrdchk.opt_eq)):
        sum_status = CheckRRD.states.index('WARNING')
    if eval("(rrdchk.out_norm + rrdchk.in_norm) {0} rrdchk.critical".format(rrdchk.opt_eq)):
        sum_status = CheckRRD.states.index('CRITICAL')

    # Now determine which will contribute to Nagios output

    # ANY - triggered if either the IN or the OUT threshold is breached
    # INONLY - triggered only by the IN threshold, OUT not checked
    # OUTONLY - triggered only by the OUT threshold, IN not checked
    # INOUT - triggered only if both the IN and OUT thresholds are breached
    # SUMONLY - triggered only if the sum of IN and OUT breaches the threshold

    # Check IN, ignore OUT
    if rrdchk.behaviour == 'inonly':
        if rrdchk.verbose:
            rrdchk.info += ', IN threshold used, OUT ignored'
        if in_status > rrdchk.status:
            rrdchk.status = in_status
    # Check OUT, ignore IN
    elif rrdchk.behaviour == 'outonly':
        if rrdchk.verbose:
            rrdchk.info += ', OUT threshold used, IN ignored'
        if out_status > rrdchk.status:
            rrdchk.status = out_status
    # Check IN AND OUT
    elif rrdchk.behaviour == 'inout':
        if rrdchk.verbose:
            rrdchk.info += ', Both IN and OUT thresholds used'
        if (out_status > rrdchk.status and in_status > rrdchk.status):
            if in_status >= out_status:
                rrdchk.status = out_status
            else:
                rrdchk.status = in_status
    # Check the sum of IN and OUT
    elif rrdchk.behaviour == 'sumonly':
        if rrdchk.verbose:
            rrdchk.info += ', Sum of IN and OUT thresholds used'
        if sum_status > rrdchk.status:
            rrdchk.status = sum_status
    # default either/or case last
    else:
        if rrdchk.verbose:
            rrdchk.info += ', Either IN or OUT thresholds used'
        if in_status > rrdchk.status:
            rrdchk.status = in_status
        if out_status > rrdchk.status:
            rrdchk.status = out_status

    return True

def bail(msg):
    '''Set status for all processing errors to UNKNOWN'''
    rrdchk.output = 'UNKNOWN - ' + msg
    rrdchk.status = CheckRRD.states.index('UNKNOWN')
    return False

def build_output():
    '''Prepare Nagios-Plugin standard output'''
    rrdchk.output = '{0} - Average BW ({1}) in: {2:.4f}{4}{5}ps, out: {3:.4f}{4}{5}ps {6}'.format(
            CheckRRD.states[rrdchk.status],
            rrdchk.period,
            rrdchk.in_norm,
            rrdchk.out_norm,
            rrdchk.tunits,
            ('b' if rrdchk.tunits == 'b' 
                   else '' if rrdchk.tunits == 'B' 
                   else 'B'),
            rrdchk.href)

    return True

################################
# MAIN starts here
################################
rrdchk = CheckRRD()

if all(check() for check in (parse_args, fetch_data, normalise_data,check_threshold)):
    build_output()
    rrdchk.output += rrdchk.info

print(rrdchk.output)
sys.exit(rrdchk.status)
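
As a cross-check on the two 5x5 multiplier tables, the same factors can be derived from the size of each unit in bits. This is only a sketch of the arithmetic and not part of the plugin; the function name and its parameters are made up for illustration.

```python
def multiplier(data_unit, threshold_unit, binary=False):
    """Factor that converts a rate in data_unit into threshold_unit.

    Units follow the plugin: b = bits, B = bytes, K/M/G = kilo/mega/giga bytes.
    """
    base = 1024 if binary else 1000
    bits_per_unit = {
        "b": 1,
        "B": 8,
        "K": 8 * base,
        "M": 8 * base ** 2,
        "G": 8 * base ** 3,
    }
    return float(bits_per_unit[data_unit]) / bits_per_unit[threshold_unit]

# The plugin defaults: RRD data in Bytes/sec, thresholds in Kilobytes/sec
rate_bytes_per_sec = 250000
print(rate_bytes_per_sec * multiplier("B", "K"))
```

With the defaults (data in "B", thresholds in "K", decimal multiples) the factor is 0.001, which matches the entry in row K, column B of multi_dec.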

Traceroute through Cisco PIX / ASA

2013-04-06T09:09:00.000-07:00

I recently had to clear and redeploy a PIX firewall to a new location, and realised that I had forgotten some of the subtleties involved in getting management and troubleshooting tools to work properly. So this is more of a note to self....

Windows tracert is fairly straightforward and uses pure ICMP with incrementing TTL values. Linux traceroute with the -I switch works the same way.

The firewall is required to allow the following:
Outbound
- Echo Request

Inbound
- Echo Reply
- Time-Exceeded (needed for TTL=0 responses)

Cisco and Linux traceroute by default use incrementing UDP ports (from 33434) and incrementing TTL values.

The firewall is required to allow the following:
Outbound
- UDP ports 33434 - 33464

Inbound
- Time-Exceeded (needed for TTL=0 responses)
- Destination Unreachable (needed for the final hop's port-unreachable response)
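
The two lists above can be modelled roughly as a pair of permit checks. This is a deliberately simplified sketch: it ignores source and destination addresses and the firewall's stateful behaviour, and the function names are hypothetical. The ICMP type numbers are the standard IANA values.

```python
# ICMP type numbers (IANA assignments)
ECHO_REPLY, DEST_UNREACHABLE, ECHO_REQUEST, TIME_EXCEEDED = 0, 3, 8, 11

OUTBOUND_ICMP = {ECHO_REQUEST}
INBOUND_ICMP = {ECHO_REPLY, TIME_EXCEEDED, DEST_UNREACHABLE}
TRACEROUTE_UDP_PORTS = range(33434, 33465)   # 33434-33464 inclusive

def outbound_allowed(protocol, icmp_type=None, dst_port=None):
    """Probes leaving the inside network: ICMP echo or UDP traceroute ports."""
    if protocol == "icmp":
        return icmp_type in OUTBOUND_ICMP
    if protocol == "udp":
        return dst_port in TRACEROUTE_UDP_PORTS
    return False

def inbound_allowed(protocol, icmp_type=None):
    """Replies coming back: only the three ICMP response types."""
    return protocol == "icmp" and icmp_type in INBOUND_ICMP

print(outbound_allowed("udp", dst_port=33434))   # first UDP traceroute probe
print(inbound_allowed("icmp", TIME_EXCEEDED))    # reply from an intermediate hop
print(inbound_allowed("icmp", ECHO_REQUEST))     # unsolicited ping from outside
```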

Putting it all together we get a rule set that looks something like this:


object-group icmp-type ICMP-returns
 description Legit ICMP responses
 icmp-object echo-reply
 icmp-object time-exceeded
 icmp-object unreachable

object-group service Cisco_Traceroute_udp udp
 port-object range 33434 33464

access-list outside_access_in extended permit icmp any object-group External_nets object-group ICMP-returns log disable

access-list inside_access_in remark Permit outbound pings
access-list inside_access_in extended permit icmp object-group Internal_nets any echo log disable

access-list inside_access_in remark Permit traceroute from Cisco devices
access-list inside_access_in extended permit udp object-group Internal_nets any object-group Cisco_Traceroute_udp log disable

Obviously, this assumes sensible values for Internal_nets (e.g. 192.168.0.0/16) and External_nets (i.e. public IP ranges assigned to your external interface).

As an addendum, the firewall is not (strictly speaking) a router, and therefore in many cases will not decrement the TTL, so it does not show up as a hop in the trace. I have found this unnecessary in most cases, but if needed, it can be enabled as follows:


policy-map global_policy
 class class-default
  set connection decrement-ttl

Putty Class in VBScript

2012-10-12T01:16:00.000-07:00

We have a fairly busy network, comprising several hundred Cisco devices across some fifty sites, and putty is one of my mainstay tools for updating configs and general troubleshooting.

So when I started looking around for something quick and easy to carry out batched updates, I looked at Putty first. Using Putty for scripted tasks wasn't as easy as I thought it would be. The main problem was getting access to screen feedback, so that I could verify that my commands had had the expected effect.

One solution is to turn on logging and use that as a proxy screen. Here's a VBScript class which includes some basic send and "receive" functionality. Error handling is stripped to a bare minimum to keep the size of the script down here, but hopefully it gives a flavour of what is possible.

Option Explicit
'===========================================================================
'Name:    Putty class
'Author:  Philip Damian-Grint
'Version: 1.0
'Date:    12th Oct 2012
'
'Description:
'
'  A starter VB class used to drive Putty sessions typically for Cisco
'  devices, allowing sending of commands, and returning screen output
'  to allow the possibility of conditional processing.
'
'  Putty has a number of logging options; for Cisco vty sessions, only 
'  printable output is required for line-based output processing, but 
'  full session output at least is required where escape sequences
'  need to be captured for screen positioning. (Not demonstrated here)
'===========================================================================

' Constants

Const EXELOC            = """c:\Program Files\Linux Utilities\PuTTY\putty.exe"""
Const LOG_PRINT         = "1"
Const LOG_SESSION       = "2"
Const MODE_LINE         = 0
Const MODE_CHAR         = 1
Const REGPUTTY          = "HKCU\Software\SimonTatham\PuTTY\Sessions\Default%20Settings\"
Const REGLGFILE         = "HKCU\Software\SimonTatham\PuTTY\Sessions\Default%20Settings\LogFileName"
Const REGLGTYPE         = "HKCU\Software\SimonTatham\PuTTY\Sessions\Default%20Settings\LogType"
Const STATUS_SUCCESS    = 0
Const STATUS_FAILURE    = -1

Class Putty

  'CLASS PRIVATE VARIABLES
  Private p_iLastTideMark
  Private p_iMode
  Private p_iStatus
  Private p_iWait
  Private p_oFSO
  Private p_oSession
  Private p_oWShell
  Private p_sEnable
  Private p_sHost
  Private p_sLogName
  Private p_sLogType
  Private p_sPasswd
  Private p_sTempDir
  Private p_sUser

  'CLASS CREATOR & DESTRUCTOR

  Private Sub Class_Initialize()
    Set p_oWShell = WScript.CreateObject( "WScript.Shell" )
    Set p_oFSO = WScript.CreateObject( "Scripting.FileSystemObject" ) 
    p_sLogType = LOG_PRINT ' default to printable output
    p_iLastTideMark = 0 ' initial tide mark
    p_iWait = 5         ' default to 5 seconds wait after each command
    p_iMode = MODE_LINE ' default to reading lines
  End Sub

  Private Sub Class_Terminate()
    ResetLog()                       ' Clear our registry settings
    p_oFSO.DeleteFile( p_sLogName )  ' Get rid of the temporary file
    Set p_oWShell = Nothing
    Set p_oFSO = Nothing
    Set p_oSession = Nothing
  End Sub

  'CLASS PROPERTIES
 
 'enable() is WO
  Public Property Let enable( sEnable ) : p_sEnable = sEnable : End Property

 'host() is RW
  Public Property Let host( sHost ) : p_sHost = sHost : End Property
  Public Property Get host() : host = p_sHost : End Property

 'logtype() is RW
  Public Property Let logtype( sLogType ) : p_sLogType = sLogType : End Property
  Public Property Get logtype() : logtype = p_sLogType : End Property

 'mode() is RW
  Public Property Let mode( iMode ) : p_iMode = iMode : End Property
  Public Property Get mode() : mode = p_iMode : End Property

 'passwd() is WO
  Public Property Let passwd( sPasswd ) : p_sPasswd = sPasswd : End Property

 'status() is RO
  Public Property Get status() : status = p_iStatus : End Property

 'user() is RW
  Public Property Let user( sUser ) : p_sUser = sUser : End Property
  Public Property Get user() : user = p_sUser : End Property

 'wait() is RW
  Public Property Let wait( iWait ) : p_iWait = iWait : End Property
  Public Property Get wait() : wait = p_iWait : End Property

  'CLASS PRIVATE FUNCTIONS
  
  Private Function EnableLog ' Switch on Putty logging
    EnableLog = -1
    p_sLogName = p_oWShell.ExpandEnvironmentStrings( "%Temp%" ) & _
                "\" & p_oFSO.GetTempName()        
    If IsEmpty( p_oWShell.RegWrite( REGLGFILE, p_sLogName,"REG_SZ" ) ) AND _
       IsEmpty( p_oWShell.RegWrite( REGLGTYPE, p_sLogType, "REG_DWORD" ) ) Then
            EnableLog = 0
    End If
  End Function

  Private Function Quit( sReason ) ' Display message and Exit
    WScript.Echo sReason : WScript.Quit
  End Function

  Private Function ResetLog ' Switch off Putty logging
    p_oWShell.RegDelete( REGPUTTY )
  End Function

  Private Function ReadLog ' Read latest output from Putty log
    Dim oFile : Set oFile = p_oFSO.OpenTextFile( p_sLogName )
    Dim iCount : iCount = 0
    Dim aLogLines(), sLogChars

    Do Until oFile.AtEndOfStream    ' Find our old tide mark
        If iCount < p_iLastTideMark Then
            oFile.SkipLine
        Else
            Redim Preserve aLogLines( iCount - p_iLastTideMark ) 
            aLogLines( iCount - p_iLastTideMark ) = oFile.ReadLine
        End If
       iCount = iCount + 1
    Loop
    p_iLastTideMark = iCount        ' New tidemark
    ReadLog = aLogLines             ' Return everything since the last tidemark
    oFile.Close
    Set oFile = Nothing
  End Function

  Private Function SendInput( sInput )  ' find Putty's active window and send keystrokes to it
    WScript.Sleep 3000               ' Or greater if debugging to give time for window switching
    Do 
        WScript.Sleep 100
    Loop until p_oWShell.AppActivate( p_oSession.ProcessID )    ' Find our session window
    p_oWShell.SendKeys( sInput & "{ENTER}" )        ' Do the deed
  End Function

  'CLASS METHODS

  Public Function Connect  ' Launch Putty 
    p_iStatus = STATUS_FAILURE          ' assume failure
    If (NOT IsEmpty( p_sUser ) AND _
            NOT IsEmpty( p_sPasswd ) AND _
            NOT IsEmpty( p_sHost ) ) Then
        If EnableLog <> 0 Then Quit( "Aborting - Can't update registry" )
        On Error Resume Next            ' graceful error handling
        Set p_oSession = p_oWShell.exec( EXELOC & " " & p_sHost & " -l " & _
                        p_sUser & " -pw " & p_sPasswd )
        WScript.Sleep 2000              ' Allow some time to settle down
        If ( ( p_oSession Is Nothing ) OR ( p_oSession.Status <> 0 ) ) Then Exit Function
        On Error Goto 0
        p_iStatus = STATUS_SUCCESS
        Connect = ReadLog()             ' Pass the initial screen back
    End If
  End Function

  Public Function Send( sChars ) ' Send a command and read the output after waiting iWait seconds
    SendInput( sChars )
    WScript.Sleep p_iWait * 1000
    Send = ReadLog()
  End Function

End Class

And to demonstrate the class in use, we take the code above and store it in a file called "classes.vbi", and then pull that file in using an "Include" function to our puttytest.vbs below.

All this demo does is log on to a Cisco device, send a command and log out, relaying any Putty screen output to our screen.

Tested with Putty version 0.60 under Windows XP SP3:

Option Explicit
'===========================================================================
'Name:    puttytest.vbs
'
'Description:
'
'   Wrapper to test our putty class
'   Run from command line:
'        cscript  puttytest.vbs
'===========================================================================
 
'Utility Functions

Function Include ( sFileVBI ) ' include an external vbs/vbi file
    Dim oFSO : Set oFSO = WScript.CreateObject( "Scripting.FileSystemObject" )
    Dim oFile : Set oFile = oFSO.OpenTextFile( sFileVBI )
    ExecuteGlobal oFile.ReadAll()
    oFile.Close : Set oFile = Nothing
    Set oFSO = Nothing
End Function

Function GetUserInfo( sPrompt ) ' prompt for input
    WScript.StdOut.Write( sPrompt )
    GetUserInfo = WScript.StdIn.ReadLine
End Function

Function GetPassword( sPrompt ) ' prompt for hidden input
    Dim oPasswd : Set oPasswd = WScript.CreateObject( "ScriptPW.Password" )
    WScript.StdOut.Write( sPrompt )
    GetPassword = oPasswd.GetPassword()
    Set oPasswd = Nothing
End Function

Function WriteLines( aOut ) ' print array of strings
    Dim sLine : For Each sLine in aOut
        WScript.StdOut.Write( sLine & VbCrLf )
    Next
End Function

'========================
' Test our Putty Class
'========================

Include "classes.vbi"

Dim aOutPut
Dim sLineOut
Dim sTextToSend

Dim oSession : Set oSession = New Putty     ' Create a new instance of our class

oSession.host = GetUserInfo( "Please type hostname: " )  ' Get some basic info
oSession.user = GetUserInfo( "Please type username: " )
oSession.passwd = GetPassword( "Please type password: " )

aOutPut = oSession.Connect                  ' and launch our putty session

If oSession.Status = STATUS_SUCCESS Then

    WriteLines( aOutPut )
    oSession.wait = 3                       ' we can set a timer for each command
    aOutPut = oSession.Send( "show ver" )   ' show version IOS command
    WriteLines( aOutPut )
    aOutPut = oSession.Send( " " )          ' usually runs to 2 screens
    WriteLines( aOutPut )
    aOutPut = oSession.Send( "logout" )     ' close session
    WriteLines( aOutPut )

Else

    WScript.Echo "Failed to launch Putty"
    
End If

Two-way NAT / PAT on a VPN (Cisco) Stick

2012-08-04T10:47:00.000-07:00

Some time ago I was tasked with interfacing to a couple of other multi-site organisations across a large governmental network similar in operation to the Internet. This was an interim measure prior to integrating aspects of the three networks into a single entity, and prior to having any dedicated WAN links in place.

I had to provide connectivity between a variable number of users and servers across all three networks, and with many overlapping IP ranges in place. The idea was to have a flexible enough configuration that I could easily add and change routes at the far end to keep pace with any integration work.

The last requirement was the support of one or more AD trusts between the organisations, with DNS forwarding.

To make it as light a touch as possible for the far-end IT departments, I went for a single-interface Cisco router that could be connected directly to a DMZ on the far-end firewall.

The topology essentially looked like this (with extraneous devices stripped away) and Internet substituted for the (private) government network:

At a more useful level, showing physical interfaces and IP addresses:

NAT

Ideally, in order to allow variable numbers of users to cross the NAT boundary in either direction, one would be able to use PAT in both directions. However, this is only available to the “inside” interface.
As a large number of users were likely to be crossing from the remote site, and only a few from the hub site, I had to make the remote physical interface act as “inside” and the tunnel act as “outside”.
This allowed me to use PAT for remote users and dynamic NAT for hub users.
Servers were easily handled by static NAT in both directions.
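
To see why the busier direction wants PAT, here is a minimal Python sketch of the two translation styles. It is a toy model and not how IOS implements NAT: the class names and the starting port are my own assumptions, and the addresses are borrowed from the design notes further down. PAT shares a single address by handing out ports, while dynamic NAT uses up one pool address per source host.

```python
import ipaddress
from itertools import count


class Pat:
    """Toy PAT: all flows share one address, told apart by translated port."""

    def __init__(self, address, first_port=1024):  # first_port is an arbitrary choice
        self.address = address
        self._next_port = count(first_port)
        self.table = {}  # (source ip, source port) -> (shared address, new port)

    def translate(self, src_ip, src_port):
        key = (src_ip, src_port)
        if key not in self.table:
            self.table[key] = (self.address, next(self._next_port))
        return self.table[key]


class DynamicNat:
    """Toy dynamic NAT: each source host consumes one whole pool address."""

    def __init__(self, pool):
        self._free = iter(ipaddress.ip_network(pool).hosts())
        self.table = {}  # source ip -> pool address

    def translate(self, src_ip):
        if src_ip not in self.table:
            try:
                self.table[src_ip] = str(next(self._free))
            except StopIteration:
                raise RuntimeError("NAT pool exhausted") from None
        return self.table[src_ip]


if __name__ == "__main__":
    pat = Pat("192.168.200.1")
    print(pat.translate("192.168.0.10", 40000))
    print(pat.translate("192.168.1.20", 40000))
    nat = DynamicNat("192.168.198.0/24")
    print(nat.translate("192.168.32.5"))
```

With a /24 pool the dynamic side runs out after 254 concurrent hosts, which is acceptable for a handful of hub users but not for a whole remote site.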

Non-NAT-Compliant Applications

Nowadays most applications, including (to my surprise) Microsoft domain trusts, work quite well across (Cisco IOS) NAT boundaries. I found only one application which didn’t: an old version of HP OpenView ServiceDesk, which embeds the source IP of the HPOV server inside the Java client for use in a subsequent return connection.
In this particular instance, the server was based at the hub site, and no IP conflict existed at the remote site. I was able to create an identity NAT for the server in the direction of the Hub which worked fine once supporting routes were in place.

MTU issues

Because the remote firewall has not participated in creating the tunnel endpoint, it can’t respond correctly to hub-destined traffic with DF flag set, so we have to ensure that the remote firewall allows ICMP unreachables to be sent from our router to devices on its internal network.

Design Notes

Some basic notes might be required to clarify where all the addresses are coming from:

IPSEC and GRE tunnel end-points

The physical interfaces representing the Hub and Remote endpoints have internal 192.168 addresses and are mapped to NAT addresses on their upstream firewalls. I have used 10.1.1.10 and 10.2.2.10 respectively.

Inter-Org NAT Allocation

Subnet 192.168.198.0/24 is used to dynamically NAT all Hub users accessing Remote servers.
Subnet 192.168.199.0/24 is used to present Hub servers to Remote users.
IP address 192.168.200.1 is used to PAT all Remote users accessing Hub servers.
Subnet 192.168.200.0/24 is used to present Remote servers to Hub users.
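
As a quick sanity check on this plan, the following Python sketch records the allocations and confirms that the example static addresses used in the router configuration (192.168.199.5 and .6 for hub servers, 192.168.200.7 and .8 for remote servers) fall in the right blocks and do not collide with the PAT address. The variable and function names are my own, purely for illustration.

```python
import ipaddress

HUB_USERS_POOL = ipaddress.ip_network("192.168.198.0/24")   # dynamic NAT for hub users
HUB_SERVERS = ipaddress.ip_network("192.168.199.0/24")      # static NAT for hub servers
REMOTE_SERVERS = ipaddress.ip_network("192.168.200.0/24")   # static NAT for remote servers
REMOTE_PAT = ipaddress.ip_address("192.168.200.1")          # PAT for remote users

HUB_SERVER_NATS = [ipaddress.ip_address(a) for a in ("192.168.199.5", "192.168.199.6")]
REMOTE_SERVER_NATS = [ipaddress.ip_address(a) for a in ("192.168.200.7", "192.168.200.8")]


def check_plan():
    """Return a list of human-readable problems; empty means the plan is consistent."""
    problems = []
    blocks = [HUB_USERS_POOL, HUB_SERVERS, REMOTE_SERVERS]
    for i, first in enumerate(blocks):
        for second in blocks[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    for addr in HUB_SERVER_NATS:
        if addr not in HUB_SERVERS:
            problems.append(f"{addr} is outside {HUB_SERVERS}")
    for addr in REMOTE_SERVER_NATS:
        if addr not in REMOTE_SERVERS:
            problems.append(f"{addr} is outside {REMOTE_SERVERS}")
        if addr == REMOTE_PAT:
            problems.append(f"{addr} collides with the PAT address")
    return problems


if __name__ == "__main__":
    print(check_plan() or "plan is consistent")
```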

Configuration Fragments

The configurations below have been taken from working devices, with some minimal IP address obfuscation:

Hub Distribution Router:

! IKE Phase 1
crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 5

! Pre-shared key for remote site
crypto isakmp key RemoteSiteKey address 10.2.2.10

! IKE Phase 2
crypto ipsec transform-set AES256_SHA_tra esp-aes 256 esp-sha-hmac 
 mode transport

! Crypto ACL for GRE to remote site
ip access-list extended HUB-INTERNET-REMOTE-CRYACL
 remark Tunneled traffic over the Internet to Remote site
 permit gre host 192.168.12.9 host 10.2.2.10

! Crypto MAP entry for remote site
crypto map INTERNET-CM 15 ipsec-isakmp 
 set peer 10.2.2.10
 set transform-set AES256_SHA_tra 
 match address HUB-INTERNET-REMOTE-CRYACL

! Physical interface for termination of all WAN and Internet tunnels
interface GigabitEthernet0/1
 description Connects to Local Firewall inside
 ip address 192.168.12.9 255.255.255.248
 crypto map INTERNET-CM

! Tunnel to remote site
interface Tunnel14200
 description Tunnel over Internet to Remote Site
 ! low bandwidth used (EIGRP) for backup tunnels over Internet
 bandwidth 1000
 ip address 192.168.14.201 255.255.255.252
 ! Maximum starting MTU (1500 - 8(NAT-T) - 53(AES256) - 24(GRE))
 ip mtu 1415
 ! high delay used (EIGRP) for backup tunnels over Internet
 delay 2000
 tunnel source GigabitEthernet0/1
 tunnel destination 10.2.2.10
 ! Tell GRE to copy DF from inner to outer IP header
 tunnel path-mtu-discovery

ip prefix-list EIGRP-SITETUNNELS-OUT-PL description Route adverts to remote sites
ip prefix-list EIGRP-SITETUNNELS-OUT-PL seq 5 permit 0.0.0.0/0
ip prefix-list EIGRP-SITETUNNELS-OUT-PL seq 10 permit 192.168.0.0/16 le 32

router eigrp 192
 passive-interface GigabitEthernet0/1
 network 192.168.12.0
 network 192.168.14.0
 distribute-list prefix EIGRP-SITETUNNELS-OUT-PL out Tunnel14200
 no auto-summary
 no eigrp log-neighbor-changes

Hub Firewall:

PIX Version 7.2(2)

name 10.1.1.10 dist-rt02_INTERNET
name 192.168.12.9 dist-rt02_G01
name 10.2.2.10 remote-rt01_INTERNET

object-group service NAT-T udp
 description NAT Traversal
 port-object eq 4500

object-group service IPsec_udp udp
 description UDP protocols used by IPsec
 group-object NAT-T
 port-object eq isakmp

object-group network Cisco_Devices
 description Cisco devices' Internet interfaces
 network-object host remote-rt01_INTERNET

interface Ethernet0
 speed 100
 duplex full
 nameif outside
 security-level 0
 ip address 10.1.1.4 255.255.255.0 standby 10.1.1.5 

interface Ethernet1
 speed 100
 duplex full
 nameif inside
 security-level 100
 ip address 192.168.12.12 255.255.255.248 standby 192.168.12.13

route outside 0.0.0.0 0.0.0.0 10.1.1.1 1
route inside 192.168.0.0 255.255.0.0 192.168.12.9 1

! Mapping the routable Tunnel endpoint
static (inside,outside) dist-rt02_INTERNET dist-rt02_G01 netmask 255.255.255.255 

access-list inside-access-in remark Allow ISAKMP & NAT-T to sites using VPN-over-Internet
access-list inside-access-in extended permit udp host dist-rt02_G01 object-group Cisco_Devices object-group IPsec_udp log disable 

access-group inside-access-in in interface inside

Remote Firewall:

PIX Version 6.3(4)

interface ethernet0 100full
interface ethernet1 100full
interface ethernet4 100full

nameif ethernet0 outside security0
nameif ethernet1 inside security100
nameif ethernet4 HUBDMZ security49

ip address outside 10.2.2.4 255.255.255.0
ip address inside 192.168.0.5 255.255.254.0
ip address HUBDMZ 192.168.22.9 255.255.255.248

failover ip address outside 10.2.2.5
failover ip address inside 192.168.0.6
failover ip address HUBDMZ 192.168.22.10

object-group network HUB
  description HUBDMZ network
  network-object 192.168.22.8 255.255.255.248 
  description HUB users on this subnet
  network-object 192.168.198.0 255.255.255.0 
  description HUB servers on this subnet
  network-object 192.168.199.0 255.255.255.0 

! Minimal ACLs to permit traffic flow – not representative!
access-list inside_access_in permit ip any object-group HUB 
access-group inside_access_in in interface inside

access-list hubdmz_access_in permit icmp any any
access-list hubdmz_access_in permit ip host 192.168.22.13 host 10.1.1.10 
access-list hubdmz_access_in permit ip object-group HUB 192.168.0.0 255.255.0.0 
access-group hubdmz_access_in in interface HUBDMZ

access-list outside_access_in permit udp host 10.1.1.10 host 10.2.2.10 eq 4500 
access-list outside_access_in permit udp host 10.1.1.10 host 10.2.2.10 eq isakmp 
access-group outside_access_in in interface outside

! Bypass NAT for incoming HUB traffic (low security to high security)
access-list NO_NAT_HUBDMZ permit ip object-group HUB 192.168.0.0 255.255.0.0 
nat (HUBDMZ) 0 access-list NO_NAT_HUBDMZ

! Mapping the routable Tunnel endpoint
static (HUBDMZ,outside) 10.2.2.10 192.168.22.13 netmask 255.255.255.255 0 0 

! Hub users and servers respectively
route HUBDMZ 192.168.198.0 255.255.255.0 192.168.22.13 1
route HUBDMZ 192.168.199.0 255.255.255.0 192.168.22.13 1

Remote VPN Router:


! example hub hosts with pre(real) and post nat addresses (hub perspective)
ip host hubhost01 192.168.4.10 192.168.199.5
ip host hubhost02 192.168.4.20 192.168.199.6
! example remote hosts with "pre" and "post"(real) nat (hub perspective)
ip host remotehost01 192.168.200.7 192.168.4.50
ip host remotehost02 192.168.200.8 192.168.4.51

! need inspection to activate ALGs
ip inspect name INSPECT_LIST dns
ip inspect name INSPECT_LIST ftp
ip inspect name INSPECT_LIST https
ip inspect name INSPECT_LIST icmp
ip inspect name INSPECT_LIST imap
ip inspect name INSPECT_LIST pop3
ip inspect name INSPECT_LIST esmtp
ip inspect name INSPECT_LIST sqlnet
ip inspect name INSPECT_LIST streamworks
ip inspect name INSPECT_LIST tftp
ip inspect name INSPECT_LIST tcp
ip inspect name INSPECT_LIST udp
ip inspect name INSPECT_LIST vdolive
ip inspect name INSPECT_LIST kerberos
ip inspect name INSPECT_LIST ldap
ip inspect name INSPECT_LIST microsoft-ds

! IKE Phase 1
crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 5

! Pre-shared key for this site
crypto isakmp key RemoteSiteKey address 10.1.1.10

! IKE Phase 2
crypto ipsec transform-set AES256_SHA_tra esp-aes 256 esp-sha-hmac 
 mode transport

! Crypto ACL for GRE to hub site
ip access-list extended REMOTE-INTERNET-HUB-CRYACL
 remark Traffic tunnelled over Internet to HUB
 permit gre host 192.168.22.13 host 10.1.1.10

! Crypto MAP entry for hub site
crypto map INTERNET-CM 2 ipsec-isakmp 
 set peer 10.1.1.10
 set transform-set AES256_SHA_tra 
 match address REMOTE-INTERNET-HUB-CRYACL

! Single physical interface for LAN and VPN traffic
! in/out ACL not included in config
interface FastEthernet0/0
 description Exit to Internet and Remote LAN via Remote DMZ
 ip address 192.168.22.13 255.255.255.248
 no ip redirects
 ip inspect INSPECT_LIST in
 ! Treat the remote network as inside so we can use PAT
 ip nat inside
 ! enabled automatically with NAT config
 ip virtual-reassembly
 duplex full
 speed 100
 no cdp enable
 crypto map INTERNET-CM

interface Loopback0
 description Remote PAT address for overlapping client subnets
 ip address 192.168.200.1 255.255.255.0

interface Tunnel14200
 description Tunnel over Internet to Hub network
 ! low bandwidth used (EIGRP) for backup tunnels over Internet
 bandwidth 1000
 ip address 192.168.14.202 255.255.255.252
 ! Maximum starting MTU (1500-8(NAT-T)-53(AES256)-24(GRE))
 ip mtu 1415
 ! Required to allow PAT in the opposite direction
 ip nat outside
 ! enabled automatically with NAT config
 ip virtual-reassembly
 ! high delay used (EIGRP) for backup tunnels over Internet
 delay 2000
 tunnel source FastEthernet0/0
 tunnel destination 10.1.1.10
 tunnel path-mtu-discovery

router eigrp 192
 passive-interface Loopback0
 network 192.168.14.0
 network 192.168.200.0
 distribute-list prefix EIGRP-TUNNEL-OUT-PL out Tunnel14200
 no auto-summary
 eigrp stub connected

! Floating default route back to the hub over the tunnel
ip route 0.0.0.0 0.0.0.0 192.168.14.201 200

! Example remote site networks - 192.168.4.0 chosen to demonstrate overlaps
ip route 192.168.0.0 255.255.254.0 192.168.22.9
ip route 192.168.4.0 255.255.254.0 192.168.22.9
ip route 192.168.35.0 255.255.254.0 192.168.22.9

! Explicit route for our tunnel destination to avoid recursion
ip route 10.1.1.0 255.255.255.0 192.168.22.9

! We need the flexibility of PAT to be applied to the remote network
ip nat inside source list REMOTE-USERS interface Loopback0 overload

! Which leaves us on the "outside" using dynamic NAT
ip nat pool HUB-POOL 192.168.198.1 192.168.198.254 prefix-length 24
ip nat outside source list HUB-USERS pool HUB-POOL

! Example remote servers - DNS ALG will use these to translate our queries
ip nat inside source static 192.168.4.50 192.168.200.7
ip nat inside source static 192.168.4.51 192.168.200.8

! Example hub servers - DNS ALG will use these to translate their queries
ip nat outside source static 192.168.4.10 192.168.199.5
ip nat outside source static 192.168.4.20 192.168.199.6

! Define which remote subnets hide behind PAT
ip access-list standard REMOTE-USERS
 remark Remote main site
 permit 192.168.0.0 0.0.1.255
 remark Remote secondary site example
 permit 192.168.35.0 0.0.1.255

! Define which hub subnets hide behind Dynamic NAT
ip access-list standard HUB-USERS
 remark Hub IT department
 permit 192.168.32.0 0.0.0.255
 remark Hub main site
 permit 192.168.125.0 0.0.0.255
 remark Hub secondary site example
 permit 192.168.35.0 0.0.0.255

ip prefix-list EIGRP-TUNNEL-OUT-PL description Routes to be advertised from site
ip prefix-list EIGRP-TUNNEL-OUT-PL seq 5 permit 192.168.200.0/24

Postscript

The creation and ongoing support of Microsoft domain trusts across this two-way NAT boundary was reasonably straightforward. There were a couple of issues, neither of which was caused by or really impinged upon the configuration itself, but which might be worth mentioning:

1. Problems creating a domain Trust across two-way NAT
I found it useful to ensure that all DNS servers in both domains could see and forward to each other. In one of the organisations I needed to connect to, this was tiresome because they had at least 6 DCs of which 4 were DNS servers. This requires static NAT entries to be configured for each server.
I also found that physical DCs were more reliable than VMs, in part due to VMware Tools not being installed thoughtfully - the Shared Folders option should not be installed as it causes network (RPC) problems. However, you can't choose in advance which DCs will participate on each side, so it becomes useful to be able to mask off the suspect ones by removing their NAT entries.

2. Kerberos-related fragmentation
Depending upon the server and workstation versions, Kerberos may still default to UDP, which may cause performance problems due to fragmentation. This is particularly noticeable where W2K3 and XP are in use, and where there are many nested groups and SID histories to bloat the packets. It manifests itself as a delay in accessing resources across the trust. Debugging ip virtual-reassembly may show the maximum fragments or fragmentation buffer being exceeded, and some additional tweaking may be required to prevent timeouts and retransmissions within Kerberos.

Sources

I found the following document very useful in getting this to production:
NAT Order of Operation

MRTG Log Aggregator

2012-02-04T16:56:00.000-08:00

Occasionally, I have needed to provide percentiles on a combined set of interfaces.
This requires a way of adding together samples from a number of log files, even though the sample timestamps might differ from file to file by a few minutes.

Here then is my current hack for doing this. The merged data set is implemented here as a doubly-linked list using nested hashes, not because I make use of these here, but because I lifted it from one of my other log manipulation tools. I will probably return to clean it up as time goes on.

#!/usr/bin/env perl
#
# NAME:         aggregate.pl
#
# AUTHOR:       Philip Damian-Grint
#
# DESCRIPTION:  Synthesize a new MRTG log file from 2 or more other log files.
#
#               This utility expects and generates version 2 MRTG log files
#               (see http://oss.oetiker.ch/mrtg/doc/mrtg-logfile.en.html), based on a
#               default sampling time of 5 minutes.
#               In general there are 600 samples each of 5mins, 30mins, 120mins
#               and 1440mins (86400 seconds). Each dataset is a quintuple:
#               {epoch, in_average, out_average, in_maximum, out_maximum}
#
#               The file with the newest timestamp is used as a template for generating
#               the output file, processed backwards in time.
#
#               Samples from the second and further logfiles are combined with the template
#               according to the following rules:
#
#               1.  Samples from the input logfile which fall between two samples in the
#                   template are combined into the sample with the lower timestamp
#                   (the newest template sample at or before them)
#
#               2.  Samples are combined using basic addition only
#
#               Each of the input files is checked for time synchronisation. If the
#               starting times of any of the second and subsequent input files are more 
#               than 5 minutes adrift from the first input file, the utility aborts.
#
# INPUTS:       Options, Logfile1, Logfile2, ...
#               aggregate.pl [--verbose] Logfile1 [, Logfile2, ...]
#
# OUTPUTS:      Logfile in MRTG format version 2
#               This is written to STDOUT
#
# NOTES:        1.  It should go without saying that running this against live log files while
#                   MRTG is running will have unpredictable results - copy the logfiles to
#                   a location where they will not be disturbed while being processed.
#
#               2.  It is possible that due to occasional variations at sample period
#                   boundaries (e.g. 5mins / 30 mins) and between files, some "samples" in the
#                   merged file might combine one or two samples more than expected.
#                   It would be possible to avoid this by say, adding a further field to each hash
#                   record to count and possibly restrict the samples combined from subsequent files.
#
# HISTORY:      3/2/2012: v1.0 created
#               8/2/2012: v1.1 header detection corrected
#

# PRAGMAS
use strict;

# GLOBALS
local $| = 1;                               # Autoflush STDOUT

# MODULES
use Getopt::Long;

# VARIABLES

# Parameters
my $verbose;

# Working Storage
my @fields;                                 # Holds fields from last record read
my $file_no;                                # Tracks current file being processed
my $inbytes_master;                         # Inbytes counter from the first file
my @keys;                                   # Holds sorted keys for merged dataset
my $outbytes_master;                        # Outbytes counter from the first file
my $prev_time;                              # Remember our previous timestamp
my $record_no;                              # Tracks last record read from current file
my $time_master;                            # First timestamp from first file
my $run_state;                              # Tracks processing phase (first file, subsequent file...)
my %samples;                                # Doubly-linked list representing merged file

# Subroutines
sub record_count {
    $record_no++;                           # Always count; later checks rely on this
    print STDERR "\rRecord ".$record_no." of file ".$file_no if ($verbose);
}

# INITIALISATION

GetOptions ("verbose" => \$verbose );       # Check for verbosity
$prev_time = 0;                             # Reset previous timestamp copy
$run_state = 'INIT';                        # Reset state
$time_master = 0;                           # Reset starting epoch

# MAIN BODY

# Process All Logfiles
while (<>) {
    chomp();                                # Remove trailing newline
    @fields = split;                        # Split up our tuple
    next unless (@fields);                  # Skip blank lines

    # Start of File Processing
    if (scalar(@fields) == 3) {             # Check for start of file
        # Progress messages go to STDERR so they never mix with the merged log on STDOUT
        print STDERR "\nStart of input file, datestamp: ".(scalar localtime($fields[0]))."\n" if ($verbose);
        $record_no = 0;                     # Reset record counter

        # First file
        if ($run_state eq 'INIT') {         # If this is our first file
            $time_master = $fields[0];      # Capture the header timestamp
            $inbytes_master = $fields[1];   # Capture the header inbytes
            $outbytes_master = $fields[2];  # Capture the header outbytes
            $run_state = 'FIRST';           # And update our state
            $file_no = 1;                   # Start counting input files

        # Subsequent files
        } else {
            # At the end of the first file (only)
            if ($run_state eq 'FIRST') {
                @keys = reverse sort { $a <=> $b } (keys %samples); # Sort our keys, newest first
                $run_state = 'SUBSQ';                               # Note that first file has ended
            }
            # And in all cases
            $file_no++;                     # Count input files
            $inbytes_master += $fields[1];  # Add header inbytes to master
            $outbytes_master += $fields[2]; # Add header outbytes to master

            # Other files must be within 5 minutes of the first
            die("Header timestamp difference > 5 minutes found in file ".$file_no."\n") if (abs($time_master - $fields[0]) > 300);
        }
        record_count();                     # Count the header (shown on screen if verbose)
        $prev_time = $fields[0];            # Take a copy of this timestamp
        next;                               # Now start processing non-header records
    }

    # Check for "all-files" data mangling
    die("\nIncreasing timestamp found in record ".$record_no." of file ".$file_no."\n") if ($fields[0] > $prev_time);

    # First file just populates our template
    if ($run_state eq 'FIRST') {

        # Check for "first-file" data mangling
        die("\nDuplicate timestamp found in record ".$record_no." of file ".$file_no."\n") if (exists ($samples{$fields[0]}));

        # Create a hash entry indexed by datestamp
        $samples{$fields[0]} = {PREV => ($prev_time == $fields[0]) ? undef : $prev_time, NEXT => undef, TUPLE => [@fields[1..4]]};

        # If not the first item in the list, update the last item's NEXT pointer
        $samples{$prev_time}{NEXT} = $fields[0] if ($record_no > 1);

    # Subsequent files must be merged
    } else {
        # Add to the newest template sample at or before this timestamp
        foreach (@keys) {
            if ($_ <= $fields[0]) {
                $samples{$_}{TUPLE}[0] += $fields[1];
                $samples{$_}{TUPLE}[1] += $fields[2];
                $samples{$_}{TUPLE}[2] += $fields[3];
                $samples{$_}{TUPLE}[3] += $fields[4];
                last;
            }
        }
    }
    $prev_time = $fields[0];                # Take a copy of this timestamp
    record_count();                         # Count the record (shown on screen if verbose)
}

# Were we only given one file? @keys only populated on detection of a second file
die("\nError - only one input file supplied\n") unless (@keys);

# Output Merged File

# First our updated header record
print "$time_master $inbytes_master $outbytes_master\n";

# And then our records in reverse order
foreach (@keys) {
    print "$_ $samples{$_}{TUPLE}[0] $samples{$_}{TUPLE}[1] $samples{$_}{TUPLE}[2] $samples{$_}{TUPLE}[3]\n";
}
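
For anyone who finds the merge step hard to follow in the Perl, here is a small Python sketch of the same rule as I read it from the merge loop: each sample from a later file is added to the newest template sample whose timestamp is at or before it, and samples older than every template sample are dropped. The function name and the in-memory tuple layout are assumptions for illustration; it does not read or write MRTG files.

```python
from bisect import bisect_right


def merge_samples(template, other):
    """Merge MRTG-style samples from `other` into `template`.

    Both arguments are lists of (epoch, in_avg, out_avg, in_max, out_max).
    Returns the merged samples, newest first, like an MRTG log.
    """
    merged = {row[0]: list(row[1:]) for row in template}
    keys = sorted(merged)  # ascending, so bisect can find the floor entry
    for epoch, *values in other:
        idx = bisect_right(keys, epoch) - 1
        if idx < 0:
            continue  # older than every template sample: nothing to add it to
        target = merged[keys[idx]]
        for i, value in enumerate(values):
            target[i] += value  # basic addition only
    return [(k, *merged[k]) for k in sorted(merged, reverse=True)]


if __name__ == "__main__":
    template = [(1000, 1, 1, 1, 1), (700, 2, 2, 2, 2)]
    other = [(1010, 10, 10, 10, 10), (900, 5, 5, 5, 5), (600, 7, 7, 7, 7)]
    for row in merge_samples(template, other):
        print(*row)
```

Using bisect on a sorted key list avoids rescanning the keys for every record, which matters little at a couple of thousand samples per file but keeps the intent obvious.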

IPSEC: Tunnel vs Transport Mode

2012-01-19T05:25:00.000-08:00

If you go looking for it, there is a whole stack of IPSEC documentation out there. It's mostly fairly dense, and tends to concentrate on explaining the somewhat complex operation and configuration details rather than exploring design choices.

One typical scenario is that you find yourself tasked with managing a multisite topology with redundant paths, and over third-party provider networks (private and public). The result is a requirement to implement encryption for all intersite traffic, for which the usual, and often only contender is IPSEC.

Those implementing IPSEC for the first time find that there are a large number of choices to be made, and all of them may seem to be equally important. As a result, the final implementation often bears a strong resemblance to one of the examples which can be found on the Cisco site (with all the subtleties hidden and important decisions pre-made).

One key decision involves the choice of operating mode: Tunnel or Transport.

Typically you find the differences between the two described in a number of ways such as:

Tunnel mode is used between gateways while Transport mode is used between endstations.
Tunnel mode is used for pass-through traffic, while Transport mode is used for end-to-end traffic.
Tunnel mode encrypts the whole packet and provides a new header, while Transport mode only encrypts the data (payload).

These descriptions have hints and clues inside them but they don't really tell you why and when you should use them. But once you understand what the basic choice means, IPSEC suddenly becomes a lot more friendly.

Here's the question I think you should be asking:
"Which mode will best support my routing model?"

So, do you use dynamic routing or static routing? This is important because much of the reasoning behind your routing choice also applies to the Tunnel vs Transport choice.

Let's look at the static routing approach:
"I have a simple network and by using static routes I have complete control over what traffic is sent across my links."

So each static route is created at the 'source' of a link, directing traffic to the other end. This is much the same as a typical IPSEC Tunnel-mode link which uses ACLs to define "interesting traffic" at the 'source', to be sent to the other end.

But in order to get your traffic to traverse that IPSEC link, you must have a static route, and you must have a corresponding (crypto) ACL present. Without both in place, the forwarding won't happen. They must be matched by a mirror route/ACL at the other end. So that's four manual entries for each definable flow which must be updated if subnets or paths change.

What about the dynamic routing approach?
"I have a complex network with multiple paths which I want to be discovered and utilised as needed by the network"

So someone taking this approach doesn't really want to be clumsily routing by ACL, but dynamically with a routing protocol. Trouble is, IPSEC Tunnel mode only handles unicast traffic, which would leave you with BGP as the only usable routing protocol.

You don't really want your routing configuration to have any dependencies on your IPSEC configuration at all. This is where Transport mode comes in:

Create an IPSEC Transport mode link between your pair of site routers and use it to carry only GRE traffic to create a GRE tunnel with its own /30 subnet and addressing, independent of the IPSEC link addressing.

Now, routing protocols which rely on multicast, such as OSPF and EIGRP, will run over the link and take care of all the other traffic.
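
As a small illustration of the /30 idea, this Python sketch shows that a /30 gives exactly two usable addresses, one for each end of the GRE tunnel. The subnet is borrowed from the Tunnel14200 interfaces in the two-way NAT configuration above; the function name is my own.

```python
import ipaddress


def tunnel_endpoints(cidr):
    """Return the two usable host addresses of a /30 tunnel subnet as strings."""
    network = ipaddress.ip_network(cidr)
    if network.prefixlen != 30:
        raise ValueError("expected a /30 point-to-point subnet")
    first, second = network.hosts()
    return str(first), str(second)


if __name__ == "__main__":
    hub_end, remote_end = tunnel_endpoints("192.168.14.200/30")
    print(hub_end, remote_end)
```

Because these addresses belong to the tunnel alone, they can stay the same even if the physical IPSEC endpoints are renumbered.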

The only ACL you need to define for IPSEC is one that identifies the peer router for GRE traffic, which won't change even if routing paths and subnet locations do. So that's four manual entries to take care of any number of definable flows, and which don't need to change unless one of the two site routers actually changes its address.
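
The scaling argument can be put into numbers with a trivial Python sketch. It simply encodes the figures quoted in this post (four manual entries per flow for Tunnel mode, four in total for Transport mode with GRE); the function and mode names are made up for the example.

```python
ENTRIES_PER_FLOW = 4  # static route + crypto ACL entry, at each end (figure from the text)
GRE_ENTRIES = 4       # fixed total quoted in the text for Transport mode + GRE


def manual_entries(flows, mode):
    """Return how many manual config entries are needed for `flows` definable flows."""
    if mode == "tunnel":
        return ENTRIES_PER_FLOW * flows
    if mode == "transport+gre":
        return GRE_ENTRIES
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for flows in (1, 10, 50):
        print(flows, manual_entries(flows, "tunnel"), manual_entries(flows, "transport+gre"))
```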

And that's all there is to it (at a high level).

So in summary:

Tunnel mode IPSEC forces you to implement "Routing by Crypto-Map", which is ugly and unscalable, but appropriate for links between your external firewall and some other organisation, for instance.
Transport mode IPSEC (+GRE) frees up the routing design and makes it independent of encryption implementation; it is therefore ideal for any internal links, WAN or LAN.

This is, in some ways, counter-intuitive: use Transport mode to carry tunnels and use Tunnel mode to transport raw packets.

PS: Don't forget if you use Transport mode IPSEC with GRE, that there are now two layers of encapsulation and you will need to take extra care with fragmentation and MTU issues. At a minimum you should have path MTU discovery enabled, ICMP unreachables NOT blocked and DF bits copied from the original IP header to the GRE header.
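
To make the MTU point concrete, here is the arithmetic as a short Python sketch. The overhead figures are the ones I used in the tunnel comments of the two-way NAT configuration above (8 bytes for NAT-T, 53 for AES256, 24 for GRE); treat them as a conservative starting point for that setup and not as universal constants. The names are made up for the example.

```python
LINK_MTU = 1500

# Per-packet overheads in bytes, as quoted in the tunnel interface comments
OVERHEADS = {
    "NAT-T": 8,
    "AES256": 53,
    "GRE": 24,
}


def tunnel_ip_mtu(link_mtu=LINK_MTU, overheads=OVERHEADS):
    """Largest inner IP packet that fits without fragmenting the outer packet."""
    return link_mtu - sum(overheads.values())


if __name__ == "__main__":
    print(tunnel_ip_mtu())
```

The result, 1415, is the value set with "ip mtu" on both tunnel interfaces in that configuration.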

Anatomy of a Netgear WNCE2001 Wireless Transceiver

2011-10-08T07:27:00.000-07:00

Recently I bought a wireless transceiver from Netgear. It’s quite tidy at 8cm x 6cm x 1.5cm. I needed it to connect a bunch of Ethernet-only devices on a switch to a WPA2 wireless network.

It comes with two connectors: a standard RJ-45 Ethernet port and a power jack which can connect to USB or a standard power socket.

In order to configure this device (with SSID and PSK), the power is connected to the USB on your PC and the ethernet is connected to your ethernet port. After the power and LAN lights go solid, start up your browser and you will automatically be taken to a page to configure the WLAN. Once configured, it operates as a bridge between ethernet and wireless segments.

The level of sophistication required to perform the configuration means that this little box must at least have an embedded DHCP server, web server and DNS server on it. Even then it still has to subvert my request for a home page out on the internet and give me back a configuration wizard page.

So I started up Wireshark and watched what happened.
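
The traces below list each DHCP option as (t=type,l=length). As a rough aid to reading them, here is a minimal Python sketch of how that type-length-value encoding is walked. It only covers the options region (not the fixed BOOTP header or magic cookie), the function names are my own, and the sample bytes are hand-built from the Offer shown in packet 15 and not taken from the capture file.

```python
import ipaddress

MESSAGE_TYPES = {1: "Discover", 2: "Offer", 3: "Request", 4: "Decline",
                 5: "ACK", 6: "NAK", 7: "Release", 8: "Inform"}


def parse_options(data):
    """Split a DHCP options field into (type, length, value) tuples."""
    options = []
    i = 0
    while i < len(data):
        opt = data[i]
        if opt == 255:          # End option: stop
            break
        if opt == 0:            # Pad option: single byte, no length
            i += 1
            continue
        if i + 1 >= len(data):  # Truncated option: stop quietly
            break
        length = data[i + 1]
        options.append((opt, length, data[i + 2:i + 2 + length]))
        i += 2 + length
    return options


def describe(opt, value):
    """Render a few well-known options; fall back to hex for the rest."""
    if opt == 53:
        return "DHCP " + MESSAGE_TYPES.get(value[0], "unknown")
    if opt in (1, 54):          # Subnet mask, server identifier
        return str(ipaddress.ip_address(value))
    if opt == 51:               # Lease time in seconds
        return f"{int.from_bytes(value, 'big')} seconds"
    return value.hex()


if __name__ == "__main__":
    offer = bytes([53, 1, 2,
                   54, 4, 192, 168, 1, 251,
                   51, 4, 0, 0, 0, 60,
                   1, 4, 255, 255, 255, 0,
                   255])
    for opt, length, value in parse_options(offer):
        print(f"(t={opt},l={length}) {describe(opt, value)}")
```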

This first section shows the startup through settling down in readiness for configuration:

No.     Time                       Source                Destination           Protocol Info
      1 2011-10-06 16:53:59.741855 CompalIn_fb:67:4c     Nearest               EAPOL    Start
        *** The PC is a potential supplicant requesting authentication in an 802.1X port access control environment (ignored hereafter)

      2 2011-10-06 16:54:02.519609 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0x3df3ee15
        *** DHCP requests repeated (and omitted from here) a few times during INIT-REBOOT state

     10 2011-10-06 16:54:25.718628 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0x2ca33bbe
        *** Here is the first DHCP request from the PC (requesting verification of its last used address) that receives a response

        User Datagram Protocol, Src Port: 68 (68), Dst Port: 67 (67)
        Bootstrap Protocol
            Message type: Boot Request (1)
            Transaction ID: 0x2ca33bbe
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Request
            Option: (t=61,l=7) Client identifier
            Option: (t=50,l=4) Requested IP Address = 192.168.1.73
            Option: (t=12,l=6) Host Name = "PCNAME"
            Option: (t=81,l=31) Client Fully Qualified Domain Name
            Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
            Option: (t=55,l=11) Parameter Request List
            Option: (t=43,l=3) Vendor-Specific Information

     12 2011-10-06 16:54:25.757167 192.168.1.251         255.255.255.255       DHCP     DHCP NAK      - Transaction ID 0x2ca33bbe
        *** The Netgear DHCP server tells the PC not to use the address in its last DHCP request
        *** Interesting choice of address by Netgear - keeping away from first or last in case of overlap with the outside WLAN DHCP server?

        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0x2ca33bbe
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP NAK
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.251

     13 2011-10-06 16:54:26.838914 0.0.0.0               255.255.255.255       DHCP     DHCP Discover - Transaction ID 0x5391d124
        *** The PC starts from scratch and asks for a new allocation

        Bootstrap Protocol
            Message type: Boot Request (1)
            Transaction ID: 0x5391d124
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Discover
            Option: (t=116,l=1) DHCP Auto-Configuration = AutoConfigure
            Option: (t=61,l=7) Client identifier
            Option: (t=12,l=6) Host Name = "PCNAME"
            Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
            Option: (t=55,l=11) Parameter Request List
            Option: (t=43,l=2) Vendor-Specific Information

     14 2011-10-06 16:54:26.862836 Netgear_77:cf:fe      Broadcast             ARP      Who has 192.168.1.100?  Tell 192.168.1.251
        *** Netgear checks that no one is using the address it wants to use for itself

     15 2011-10-06 16:54:28.929641 192.168.1.251         192.168.1.100         DHCP     DHCP Offer    - Transaction ID 0x5391d124
        *** Netgear gives the PC a new address, and tells it to use the Netgear address for gateway and DNS queries

        User Datagram Protocol, Src Port: 67 (67), Dst Port: 68 (68)
        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0x5391d124
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 192.168.1.100 (192.168.1.100)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Offer
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.251
            Option: (t=51,l=4) IP Address Lease Time = 1 minute
            Option: (t=1,l=4) Subnet Mask = 255.255.255.0
            Option: (t=3,l=4) Router = 192.168.1.251
            Option: (t=6,l=4) Domain Name Server = 192.168.1.251

     16 2011-10-06 16:54:28.930805 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0x5391d124
        *** PC selects its best (only) offer and asks for confirmation

        Bootstrap Protocol
            Message type: Boot Request (1)
            Transaction ID: 0x5391d124
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Request
            Option: (t=61,l=7) Client identifier
            Option: (t=50,l=4) Requested IP Address = 192.168.1.100
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.251
            Option: (t=12,l=6) Host Name = "PCNAME"
            Option: (t=81,l=31) Client Fully Qualified Domain Name
            Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
            Option: (t=55,l=11) Parameter Request List
            Option: (t=43,l=3) Vendor-Specific Information

     17 2011-10-06 16:54:28.973606 192.168.1.251         192.168.1.100         DHCP     DHCP ACK      - Transaction ID 0x5391d124
        *** And Netgear confirms. Note that the lease time is only 1 minute

        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0x5391d124
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 192.168.1.100 (192.168.1.100)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP ACK
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.251
            Option: (t=51,l=4) IP Address Lease Time = 1 minute
            Option: (t=1,l=4) Subnet Mask = 255.255.255.0
            Option: (t=3,l=4) Router = 192.168.1.251
            Option: (t=6,l=4) Domain Name Server = 192.168.1.251


     18 2011-10-06 16:54:28.999218 CompalIn_fb:67:4c     Broadcast             ARP      Gratuitous ARP for 192.168.1.100 (Request)
        *** PC sends out a few of these to make sure other devices update their ARP tables

     22 2011-10-06 16:54:31.933507 CompalIn_fb:67:4c     Broadcast             ARP      Who has 192.168.1.251?  Tell 192.168.1.100
     23 2011-10-06 16:54:31.933929 Netgear_77:cf:fe      CompalIn_fb:67:4c     ARP      192.168.1.251 is at c4:3d:c7:77:cf:fe
        *** Then updates its own

     *** This is a work PC from an AD and Novell environment. Interesting that Netgear chooses to answer these queries
     *** The strategy seems to be to resolve all names to itself
     28 2011-10-06 16:54:32.002898 192.168.1.100         192.168.1.251         DNS      Standard query SRV _ldap._tcp.WORKCampus._sites.dc._msdcs.subdom.workdomain.co.uk
     29 2011-10-06 16:54:32.003300 192.168.1.251         192.168.1.100         DNS      Standard query response SRV
     30 2011-10-06 16:54:32.005632 192.168.1.100         192.168.1.251         DNS      Standard query SOA PCNAME.subdom.workdomain.co.uk
     31 2011-10-06 16:54:32.006002 192.168.1.251         192.168.1.100         DNS      Standard query response SOA[Malformed Packet]
     33 2011-10-06 16:54:32.046312 192.168.1.100         192.168.1.251         DNS      Standard query A time1.workdomain.co.uk
     34 2011-10-06 16:54:32.046677 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251

     35 2011-10-06 16:54:32.054459 192.168.1.100         192.168.1.251         NTP      NTP symmetric active
     36 2011-10-06 16:54:32.055120 192.168.1.251         192.168.1.100         ICMP     Destination unreachable (Port unreachable)
        *** Of course, having inadvertently told the PC that it is a time server, it can't actually handle the ntp request

     *** We get a similar story with SLP:
     37 2011-10-06 16:54:32.237637 192.168.1.100         192.168.1.251         DNS      Standard query A slp1.workdomain.co.uk
     38 2011-10-06 16:54:32.237933 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
     39 2011-10-06 16:54:32.238284 192.168.1.100         192.168.1.251         SRVLOC   Service Request, V2 XID - 3755
     40 2011-10-06 16:54:32.238387 192.168.1.100         192.168.1.251         DNS      Standard query A slp2.workdomain.co.uk
     41 2011-10-06 16:54:32.238623 192.168.1.251         192.168.1.100         ICMP     Destination unreachable (Port unreachable)
     42 2011-10-06 16:54:32.238640 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
     43 2011-10-06 16:54:32.238909 192.168.1.100         192.168.1.251         SRVLOC   Service Request, V1 Transaction ID - 3756
     44 2011-10-06 16:54:32.239248 192.168.1.251         192.168.1.100         ICMP     Destination unreachable (Port unreachable)

     *** Must be time to refresh the arp cache:
     75 2011-10-06 16:54:37.053065 Netgear_77:cf:fe      CompalIn_fb:67:4c     ARP      Who has 192.168.1.100?  Tell 192.168.1.251
     76 2011-10-06 16:54:37.053085 CompalIn_fb:67:4c     Netgear_77:cf:fe      ARP      192.168.1.100 is at 00:1b:38:fb:67:4c
     89 2011-10-06 16:54:40.749850 192.168.1.100         192.168.1.251         ICMP     Echo (ping) request  (id=0x0300, seq(be/le)=16896/66, ttl=1)
     90 2011-10-06 16:54:40.750236 192.168.1.251         192.168.1.100         ICMP     Echo (ping) reply    (id=0x0300, seq(be/le)=16896/66, ttl=64)

     *** Now the Netgear appears to be taking on NetBIOS master browser functions for the segment - only the first of several exchanges is shown
     92 2011-10-06 16:54:41.009843 192.168.1.100         192.168.1.255         NBNS     Registration NB PCNAME<00>
        NetBIOS Name Service
            Transaction ID: 0x8906
            Flags: 0x2910 (Registration)
                0... .... .... .... = Response: Message is a query
                .010 1... .... .... = Opcode: Registration (5)
                .... ..0. .... .... = Truncated: Message is not truncated
                .... ...1 .... .... = Recursion desired: Do query recursively
                .... .... ...1 .... = Broadcast: Broadcast packet
            Questions: 1
            Answer RRs: 0
            Authority RRs: 0
            Additional RRs: 1
            Queries
                PCNAME<00>: type NB, class IN
                    Name: PCNAME<00> (Workstation/Redirector)
                    Type: NB
                    Class: IN
            Additional records
                PCNAME<00>: type NB, class IN
                    Name: PCNAME<00> (Workstation/Redirector)
                    Type: NB
                    Class: IN
                    Time to live: 3 days, 11 hours, 20 minutes
                    Data length: 6
                    Flags: 0x6000 (H-node, unique)
                        0... .... .... .... = Unique name
                        .11. .... .... .... = H-node
                    Addr: 192.168.1.100

     93 2011-10-06 16:54:41.011045 192.168.1.251         192.168.1.100         NBNS     Name query response NB 192.168.1.251
        NetBIOS Name Service
            Transaction ID: 0x8906
            Flags: 0x8400 (Name query response, No error)
                1... .... .... .... = Response: Message is a response
                .000 0... .... .... = Opcode: Name query (0)
                .... .1.. .... .... = Authoritative: Server is an authority for domain
                .... ..0. .... .... = Truncated: Message is not truncated
                .... ...0 .... .... = Recursion desired: Don't do query recursively
                .... .... 0... .... = Recursion available: Server can't do recursive queries
                .... .... ...0 .... = Broadcast: Not a broadcast packet
                .... .... .... 0000 = Reply code: No error (0)
            Questions: 0
            Answer RRs: 1
            Authority RRs: 0
            Additional RRs: 1
            Answers
                PCNAME<00>: type NB, class IN
                    Name: PCNAME<00> (Workstation/Redirector)
                    Type: NB
                    Class: IN
                    Time to live: 3117 days, 12 hours, 16 minutes
                    Data length: 1536
                    Flags: 0x4000 (M-node, unique)
                        0... .... .... .... = Unique name
                        .10. .... .... .... = M-node
                    Addr: 192.168.1.251

    *** Netgear's strategy of resolving all names to itself means it has to fend off Novell traffic as well:
    102 2011-10-06 16:54:43.389929 192.168.1.100         192.168.1.251         DNS      Standard query A NOVELL-NDS-TREE.subdom.workdomain.co.uk
    103 2011-10-06 16:54:43.390306 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
    104 2011-10-06 16:54:43.390479 192.168.1.100         192.168.1.251         TCP      3438 > 524 [SYN] Seq=0 Win=65535 Len=0 MSS=1260 SACK_PERM=1
    105 2011-10-06 16:54:43.390809 192.168.1.251         192.168.1.100         TCP      524 > 3438 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
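
The resolve-everything-to-itself pattern seen in this first section can be checked mechanically. The following is a minimal sketch, not anything the Netgear or the PC actually runs: the function name all_answers_are_resolver is made up for illustration, and it expects "name address" pairs on standard input, here copied by hand from frames 34, 38 and 103.

```shell
#!/bin/sh
# Hypothetical helper: reads "name address" pairs on stdin and succeeds
# only if there is at least one pair and every address equals the
# resolver address given as the first argument.
all_answers_are_resolver() {
    resolver="$1"
    awk -v r="$resolver" '
        NF >= 2 { total++; if ($2 == r) same++ }
        END { exit ((total > 0 && same == total) ? 0 : 1) }
    '
}

# Answers copied from frames 34, 38 and 103 of the capture.
captured='time1.workdomain.co.uk 192.168.1.251
slp1.workdomain.co.uk 192.168.1.251
NOVELL-NDS-TREE.subdom.workdomain.co.uk 192.168.1.251'

if printf '%s\n' "$captured" | all_answers_are_resolver 192.168.1.251; then
    echo "resolver answers every name with its own address"
fi
```

A single answer pointing anywhere else makes the function return non-zero, which is what you would expect from a normal resolver.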

The middle section runs from launching our browser through to the start of the wizard.

    *** Now we start up Firefox - interesting to note that Firefox appears to look up the home page address before it checks for proxy settings
    164 2011-10-06 16:55:00.309853 192.168.1.100         192.168.1.251         DNS      Standard query A en-gb.start3.mozilla.com
    165 2011-10-06 16:55:00.310404 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
        *** Predictably, Netgear resolves the address to itself

    167 2011-10-06 16:55:00.669057 192.168.1.100         192.168.1.251         DNS      Standard query A wpad.subdom.workdomain.co.uk
    168 2011-10-06 16:55:00.669391 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
        *** Standard behaviour when automatically detect proxy is set - go looking for WPAD

    171 2011-10-06 16:55:00.680058 192.168.1.100         192.168.1.251         TCP      3444 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1260 SACK_PERM=1
    172 2011-10-06 16:55:00.680401 192.168.1.251         192.168.1.100         TCP      80 > 3444 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 SACK_PERM=1
    173 2011-10-06 16:55:00.680419 192.168.1.100         192.168.1.251         TCP      3444 > 80 [ACK] Seq=1 Ack=1 Win=65535 Len=0
        *** TCP handshake with the Netgear web server

    174 2011-10-06 16:55:00.680489 192.168.1.100         192.168.1.251         HTTP     GET /wpad.dat HTTP/1.1 
        *** Firefox asks for the javascript proxy script

    176 2011-10-06 16:55:00.689129 192.168.1.251         192.168.1.100         HTTP     HTTP/1.1 200 OK  (text/html)
        *** And Netgear tries to redirect it; however, the reply doesn't include a "function FindProxyForURL(url, host)", so it will be ignored

        Hypertext Transfer Protocol
        Line-based text data: text/html
            <html>\r\n
            \t<head>\r\n
            \r\n
            \t\t<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n
            \t\t<meta http-equiv="pragma" content="no-cache"> \n
            \t\t<meta http-equiv="cache-control" content="no-cache, must-revalidate"> \n
            \t\t<meta http-equiv="expires" content="0">\t\t<script language="javascript" type="text/javascript">\r\n
            \t\t\tlocation.replace("http://www.mywifiext.net/");\r\n
            \t\t</script>\r\n
            \t</head>\r\n
            \t<body> \r\n
            \t</body>\r\n
            </html>\r\n
            \r\n

    *** Sure enough, Firefox ploughs on to open its original site directly
    177 2011-10-06 16:55:00.761313 192.168.1.100         192.168.1.251         TCP      3445 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1260 SACK_PERM=1
    178 2011-10-06 16:55:00.761650 192.168.1.251         192.168.1.100         TCP      80 > 3445 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 SACK_PERM=1
    179 2011-10-06 16:55:00.761663 192.168.1.100         192.168.1.251         TCP      3445 > 80 [ACK] Seq=1 Ack=1 Win=65535 Len=0
    180 2011-10-06 16:55:00.761738 192.168.1.100         192.168.1.251         HTTP     GET /firefox?client=firefox-a&rls=org.mozilla:en-GB:official HTTP/1.1 
        Hypertext Transfer Protocol
            GET /firefox?client=firefox-a&rls=org.mozilla:en-GB:official HTTP/1.1\r\n
            Host: en-gb.start3.mozilla.com\r\n
            User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23 ( .NET CLR 3.5.30729; .NET4.0E)\r\n
            Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
            Accept-Language: en-gb,en;q=0.5\r\n
            Accept-Encoding: gzip,deflate\r\n
            Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
            Keep-Alive: 115\r\n
            Connection: keep-alive\r\n
            Cookie: WT_FPC=id=194.176.105.39-1877985552.30159488:lv=1313671495942:ss=1313671495942\r\n
            \r\n

    182 2011-10-06 16:55:00.764144 192.168.1.251         192.168.1.100         HTTP     HTTP/1.1 200 OK  (text/html)
        *** Netgear has another attempt at redirecting. Note the headers have revealed that Netgear is using lighttpd web server

        Hypertext Transfer Protocol
            HTTP/1.1 200 OK\r\n
            Expires: Sat, 01 Jan 2000 00:00:12 GMT\r\n
            Cache-Control: max-age=1\r\n
            Content-Type: text/html\r\n
            Accept-Ranges: bytes\r\n
            Content-Length: 419\r\n
            Date: Sat, 01 Jan 2000 00:00:55 GMT\r\n
            Server: lighttpd/1.4.18\r\n
            \r\n
        Line-based text data: text/html
            <html>\r\n
            \t<head>\r\n
            \r\n
            \t\t<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n
            \t\t<meta http-equiv="pragma" content="no-cache"> \n
            \t\t<meta http-equiv="cache-control" content="no-cache, must-revalidate"> \n
            \t\t<meta http-equiv="expires" content="0">\t\t<script language="javascript" type="text/javascript">\r\n
            \t\t\tlocation.replace("http://www.mywifiext.net/");\r\n
            \t\t</script>\r\n
            \t</head>\r\n
            \t<body> \r\n
            \t</body>\r\n
            </html>\r\n
            \r\n

    183 2011-10-06 16:55:00.839579 192.168.1.100         192.168.1.251         DNS      Standard query A www.mywifiext.net
    184 2011-10-06 16:55:00.839907 192.168.1.251         192.168.1.100         DNS      Standard query response A 192.168.1.251
        *** The redirection appears to work as Firefox now locates the new webpage

    185 2011-10-06 16:55:00.840931 192.168.1.100         192.168.1.251         TCP      3446 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1260 SACK_PERM=1
    186 2011-10-06 16:55:00.841281 192.168.1.251         192.168.1.100         TCP      80 > 3446 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 SACK_PERM=1
    187 2011-10-06 16:55:00.841300 192.168.1.100         192.168.1.251         TCP      3446 > 80 [ACK] Seq=1 Ack=1 Win=65535 Len=0
    188 2011-10-06 16:55:00.841371 192.168.1.100         192.168.1.251         HTTP     GET / HTTP/1.1 
        *** And we're off - the first page of our wizard
        *** Yards of omitted stuff after this...
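
The reason the wpad.dat reply in frame 176 is ignored can be illustrated with a rough test: a proxy auto-config script has to define FindProxyForURL, and the Netgear's HTML redirect page does not. This is only a sketch of that one check, with a made-up function name (looks_like_pac) and an abridged copy of the reply body; a real browser parses the script properly rather than pattern-matching it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if stdin contains a definition of
# FindProxyForURL, the entry point a proxy auto-config script must have.
looks_like_pac() {
    grep -Eq 'function[[:space:]]+FindProxyForURL[[:space:]]*\('
}

# Body modelled on the Netgear reply in frame 176 (abridged).
netgear_reply='<html><head><script language="javascript" type="text/javascript">
location.replace("http://www.mywifiext.net/");
</script></head><body></body></html>'

# A minimal valid PAC file, for comparison only.
valid_pac='function FindProxyForURL(url, host) { return "DIRECT"; }'

for sample in "$netgear_reply" "$valid_pac"; do
    if printf '%s\n' "$sample" | looks_like_pac; then
        echo "PAC script"
    else
        echo "not a PAC script"
    fi
done
```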

Through the wizard, Netgear lists all of the visible SSIDs, allowing one to be selected and a key to be entered. The last section shows what happens as the Netgear associates to the WLAN and then turns into a bridge (or does it?)

   *** ...and as the last bit of data is finally posted from the wizard:
   1375 2011-10-06 16:56:20.549963 192.168.1.100         192.168.1.251         HTTP     POST /my_cgi.cgi?0.774414189696751 HTTP/1.1  (application/x-www-form-urlencoded)
   1379 2011-10-06 16:56:20.745822 192.168.1.251         192.168.1.100         HTTP/XML HTTP/1.1 200 OK 

   1381 2011-10-06 16:56:21.033167 0.0.0.0               255.255.255.255       DHCP     DHCP Discover - Transaction ID 0x7f52875c
        *** Netgear behaviour suggests that it is now attempting to become a DHCP client on someone else's segment
        *** However, Netgear now has two segments: WLAN and Ethernet, so it appears to be bridging between both mediums

        Bootstrap Protocol
            Message type: Boot Request (1)
            Transaction ID: 0x7f52875c
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: Netgear_77:cf:fe (c4:3d:c7:77:cf:fe)
            Option: (t=53,l=1) DHCP Message Type = DHCP Discover
            Option: (t=61,l=7) Client identifier
            Option: (t=12,l=8) Host Name = "wnce2001"
            Option: (t=60,l=15) Vendor class identifier = "udhcp 0.9.9-pre"
            Option: (t=55,l=10) Parameter Request List

   *** This is confirmed, because we can now at last see traffic from the WLAN coming through:
   1382 2011-10-06 16:56:21.358369 ThomsonT_1b:1c:14     Broadcast             ARP      Who has 192.168.1.74?  Tell 192.168.1.254
   1383 2011-10-06 16:56:21.972481 ThomsonT_1b:1c:14     Broadcast             ARP      Who has 192.168.1.74?  Tell 192.168.1.254
   1385 2011-10-06 16:56:23.061375 0.0.0.0               255.255.255.255       DHCP     DHCP Discover - Transaction ID 0x7f52875c
   1386 2011-10-06 16:56:23.202151 ThomsonT_1b:1c:14     Broadcast             ARP      Who has 192.168.1.74?  Tell 192.168.1.254
        *** This looks like the WLAN router checking a free address:

        Address Resolution Protocol (request)
            Hardware type: Ethernet (0x0001)
            Protocol type: IP (0x0800)
            Hardware size: 6
            Protocol size: 4
            Opcode: request (0x0001)
            [Is gratuitous: False]
            Sender MAC address: ThomsonT_1b:1c:14 (00:26:44:1b:1c:14)
            Sender IP address: 192.168.1.254 (192.168.1.254)
            Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
            Target IP address: 192.168.1.74 (192.168.1.74)

   1387 2011-10-06 16:56:23.205098 192.168.1.254         255.255.255.255       DHCP     DHCP Offer    - Transaction ID 0x7f52875c
        *** ... before giving it to Netgear. Note the lease is a decent length now.

        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0x7f52875c
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 192.168.1.74 (192.168.1.74)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 192.168.1.254 (192.168.1.254)
            Client MAC address: Netgear_77:cf:fe (c4:3d:c7:77:cf:fe)
            Option: (t=53,l=1) DHCP Message Type = DHCP Offer
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.254
            Option: (t=51,l=4) IP Address Lease Time = 1 day
            Option: (t=1,l=4) Subnet Mask = 255.255.255.0
            Option: (t=6,l=4) Domain Name Server = 192.168.1.254
            Option: (t=15,l=4) Domain Name = "home"
            Option: (t=3,l=4) Router = 192.168.1.254


   1388 2011-10-06 16:56:23.261309 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0x7f52875c
        *** Netgear asks for confirmation that it can use the address
        Bootstrap Protocol
            Transaction ID: 0x7f52875c
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: Netgear_77:cf:fe (c4:3d:c7:77:cf:fe)
            Option: (t=53,l=1) DHCP Message Type = DHCP Request
            Option: (t=61,l=7) Client identifier
            Option: (t=12,l=8) Host Name = "wnce2001"
            Option: (t=60,l=15) Vendor class identifier = "udhcp 0.9.9-pre"
            Option: (t=50,l=4) Requested IP Address = 192.168.1.74
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.254
            Option: (t=55,l=10) Parameter Request List

   *** slightly different behaviour from the main WLAN router - it is also arping at offer stage
   1389 2011-10-06 16:56:24.123034 ThomsonT_1b:1c:14     Broadcast             ARP      Who has 192.168.1.74?  Tell 192.168.1.254
   1390 2011-10-06 16:56:25.044698 ThomsonT_1b:1c:14     Broadcast             ARP      Who has 192.168.1.74?  Tell 192.168.1.254

   *** ...and we see the Netgear DHCP transaction go on to complete on that IP address
   *** I have a mild concern here about why a wireless bridge with a sophisticated embedded Linux system would require layer 3 membership on my WLAN...
   1391 2011-10-06 16:56:25.309071 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0x7f52875c
   1392 2011-10-06 16:56:26.275711 192.168.1.254         255.255.255.255       DHCP     DHCP Offer    - Transaction ID 0x7f52875c
   1393 2011-10-06 16:56:26.279179 192.168.1.254         255.255.255.255       DHCP     DHCP ACK      - Transaction ID 0x7f52875c
   1394 2011-10-06 16:56:26.282191 192.168.1.254         255.255.255.255       DHCP     DHCP ACK      - Transaction ID 0x7f52875c
   1395 2011-10-06 16:56:27.157069 Netgear_77:cf:fe      Broadcast             ARP      Who has 192.168.1.254?  Tell 192.168.1.74

   *** But what about our PC? It seems somehow to have twigged that something has changed - DHCP requests (although still unicast), EAPOL etc
   *** I didn't notice Netgear dropping the ethernet link, but that would have been one way to do it
   1396 2011-10-06 16:56:28.252537 192.168.1.100         192.168.1.251         DHCP     DHCP Request  - Transaction ID 0x7e41999d
   1397 2011-10-06 16:56:36.393373 192.168.1.100         192.168.1.251         ICMP     Echo (ping) request  (id=0x0300, seq(be/le)=17152/67, ttl=1)
   1398 2011-10-06 16:56:36.420095 CompalIn_fb:67:4c     Nearest               EAPOL    Start
   1399 2011-10-06 16:56:37.414054 192.168.1.100         192.168.1.251         ICMP     Echo (ping) request  (id=0x0300, seq(be/le)=17408/68, ttl=1)
   1401 2011-10-06 16:56:38.915020 192.168.1.100         192.168.1.251         ICMP     Echo (ping) request  (id=0x0300, seq(be/le)=17664/69, ttl=1)

   *** After about 12 secs, the PC starts broadcasting its request to extend the lease on its existing IP address
   1402 2011-10-06 16:56:40.418751 192.168.1.100         255.255.255.255       DHCP     DHCP Request  - Transaction ID 0xc0964cdc
        Bootstrap Protocol
            Message type: Boot Request (1)
            Transaction ID: 0xc0964cdc
            Client IP address: 192.168.1.100 (192.168.1.100)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 0.0.0.0 (0.0.0.0)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Request
            Option: (t=61,l=7) Client identifier
            Option: (t=12,l=6) Host Name = "PCNAME"
            Option: (t=81,l=31) Client Fully Qualified Domain Name
            Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
            Option: (t=55,l=11) Parameter Request List
            Option: (t=43,l=3) Vendor-Specific Information

   1403 2011-10-06 16:56:40.717438 192.168.1.254         255.255.255.255       DHCP     DHCP NAK      - Transaction ID 0xc0964cdc
        *** But, predictably, WLAN router says no!

        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0xc0964cdc
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 0.0.0.0 (0.0.0.0)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 192.168.1.254 (192.168.1.254)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP NAK
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.254
            Option: (t=56,l=26) Message = "REQUEST for invalid lease"

   1409 2011-10-06 16:56:45.836342 0.0.0.0               255.255.255.255       DHCP     DHCP Discover - Transaction ID 0xc9b00679
        *** So we start from scratch again

   1410 2011-10-06 16:56:45.939624 192.168.1.254         255.255.255.255       DHCP     DHCP Offer    - Transaction ID 0xc9b00679
        *** New address for use on WLAN

        Bootstrap Protocol
            Message type: Boot Reply (2)
            Transaction ID: 0xc9b00679
            Client IP address: 0.0.0.0 (0.0.0.0)
            Your (client) IP address: 192.168.1.73 (192.168.1.73)
            Next server IP address: 0.0.0.0 (0.0.0.0)
            Relay agent IP address: 192.168.1.254 (192.168.1.254)
            Client MAC address: CompalIn_fb:67:4c (00:1b:38:fb:67:4c)
            Option: (t=53,l=1) DHCP Message Type = DHCP Offer
            Option: (t=54,l=4) DHCP Server Identifier = 192.168.1.254
            Option: (t=51,l=4) IP Address Lease Time = 1 day
            Option: (t=1,l=4) Subnet Mask = 255.255.255.0
            Option: (t=15,l=4) Domain Name = "home"
            Option: (t=6,l=4) Domain Name Server = 192.168.1.254
            Option: (t=3,l=4) Router = 192.168.1.254

   *** And then the transaction completes as before
   1411 2011-10-06 16:56:45.940058 0.0.0.0               255.255.255.255       DHCP     DHCP Request  - Transaction ID 0xc9b00679
   1412 2011-10-06 16:56:46.246842 192.168.1.254         255.255.255.255       DHCP     DHCP ACK      - Transaction ID 0xc9b00679
   1414 2011-10-06 16:56:46.273772 CompalIn_fb:67:4c     Broadcast             ARP      Gratuitous ARP for 192.168.1.73 (Request)
   1415 2011-10-06 16:56:46.421869 CompalIn_fb:67:4c     Nearest               EAPOL    Start
   1416 2011-10-06 16:56:46.914798 CompalIn_fb:67:4c     Broadcast             ARP      Gratuitous ARP for 192.168.1.73 (Request)
   1420 2011-10-06 16:56:47.471850 CompalIn_fb:67:4c     ThomsonT_1b:1c:14     ARP      192.168.1.73 is at 00:1b:38:fb:67:4c


   *** Lastly, of mild interest only, is the difference in behaviour between the Netgear DNS and the broadband router DNS
   *** This is now back to normal behaviour
   1447 2011-10-06 16:56:49.006741 192.168.1.73          192.168.1.254         DNS      Standard query SRV _ldap._tcp.WORKCampus._sites.dc._msdcs.subdom.workdomain.co.uk
   1449 2011-10-06 16:56:49.046225 192.168.1.254         192.168.1.73          DNS      Standard query response, No such name
   1450 2011-10-06 16:56:49.062867 192.168.1.73          192.168.1.254         DNS      Standard query SRV _ldap._tcp.dc._msdcs.subdom.workdomain.co.uk
   1451 2011-10-06 16:56:49.071474 192.168.1.73          192.168.1.254         DNS      Standard query SOA PCNAME.subdom.workdomain.co.uk
   1452 2011-10-06 16:56:49.094456 192.168.1.254         192.168.1.73          DNS      Standard query response, No such name
   1453 2011-10-06 16:56:49.101329 192.168.1.254         192.168.1.73          DNS      Standard query response, No such name

So quite a slick operation, but one that requires me to be happy having a third-party Linux server sitting on my network disguised as a dumb transceiver. As far as I could see (from my broadband router ARP table), the server went to sleep once it had done its job of creating a WLAN bridge, but who knows if it could be woken up...

Nagios Plugin 'check_procs' incorrectly finds 0 processes

2011-09-19T05:33:00.000-07:00

When checking for running processes on remote Linux systems via NRPE, the Nagios plugin check_procs -C <process commandname> occasionally responds with unexpected results.

Example on a Zenworks 7 server

If we look for the tftpd daemon using ps:

ZEN03:/usr/local/nagios/libexec # ps -ef|grep tftpd

root 4103 1 0 Sep14 ? 00:01:31 /opt/novell/bin/novell-tftpd

root 20047 17950 0 13:06 pts/0 00:00:00 grep tftpd

Then we look for it with check_procs:

ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-tftpd

PROCS OK: 1 process with command name 'novell-tftpd'

The check_procs plugin correctly reports that one process has been found with this name.

However, if we look for the proxy dhcp daemon using ps:

ZEN03:/usr/local/nagios/libexec # ps -ef |grep proxy

root 21171 1 0 Sep18 ? 00:00:00 /opt/novell/bin/novell-proxydhcpd

root 20137 17950 0 13:07 pts/0 00:00:00 grep proxy

And then with check_procs:

ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-proxydhcpd

PROCS OK: 0 processes with command name 'novell-proxydhcpd'

In this case, the check_procs plugin has reported that 0 processes have been found, even though we can clearly see that this is not the case.

The trick in these situations is to ask check_procs for more information using the -vv switch:

ZEN03:/usr/local/nagios/libexec # ./check_procs -vv -C novell-proxydhcpd

CMD: /bin/ps axwo 'stat uid pid ppid vsz rss pcpu comm args'

PROCS OK: 0 processes with command name 'novell-proxydhcpd'

Here check_procs has told us what it is passing to ps to find out the information it is reporting back to us.

So let us use that parameter list for our own check:

ZEN03:/usr/local/nagios/libexec # /bin/ps axwo 'stat uid pid ppid vsz rss pcpu comm args'|grep proxy

S 0 21171 1 1412 396 0.0 novell-proxydhc /opt/novell/bin/novell-proxydhcpd

S+ 0 20263 17950 1712 672 0.0 grep grep proxy

And there it is: on this system (SLES9), ps reports only the first 15 characters of the command name in the comm column, so the full 17-character name 'novell-proxydhcpd' never matches.

So here, what we have to do is ask check_procs to look for only the first 15 characters:

ZEN03:/usr/local/nagios/libexec # ./check_procs -C novell-proxydhc

PROCS OK: 1 process with command name 'novell-proxydhc'

We can now correctly check for our proxy dhcp daemon in Nagios.
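
The workaround can be captured in a few lines. The following Python sketch is only an illustration: the helper name comm_name is hypothetical, and the 15-character limit is assumed from the ps output above. It derives the string to hand to check_procs -C from a daemon's full path.

```python
import os

# Length of the command name as ps reported it on this system
# (an assumption taken from the output above).
COMM_LIMIT = 15

def comm_name(executable_path):
    """Return the command name as the ps 'comm' column would show it."""
    return os.path.basename(executable_path)[:COMM_LIMIT]

if __name__ == "__main__":
    for path in ("/opt/novell/bin/novell-tftpd",
                 "/opt/novell/bin/novell-proxydhcpd"):
        print("./check_procs -C " + comm_name(path))
```

Short names such as novell-tftpd pass through unchanged, so the same helper can be used for every process being monitored.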

Nagios check_vmfs Quirk

2011-09-12T16:53:00.000-07:00

I came across Duncan Epping's check_vmfs bash script the other day when I was looking at what NRPE could do in our ESX 3.5 environment. It's fairly basic and did what it said it could do: report on vmfs space utilisation.

For interest, here is a copy of the script I wanted to use:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
RET=$?
if [[ $RET -ne 0 ]]
    then
    echo "query problem - No data received from host"
    exit $STATE_UNKNOWN
fi
vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' | while read output ; do
DISKSIZE=$(echo $output | awk '{ print $1 }' )
DISKUSED=$(echo $output | awk '{ print $2 }' )
DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
VOLNAME=$(echo $output | awk '{ print $5 }' )
CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
if [ "/vmfs/volumes/$MYVOL" = $VOLNAME ] ; then
    if [ $CUTPERC -lt $WARNTHRESH ] ; then
        echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_OK
    fi
    if [ $CUTPERC -ge $CRITTHRESH ] ; then
        echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_CRITICAL
    fi
    if [ $CUTPERC -ge $WARNTHRESH ] ; then
        echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
        exit $STATE_WARNING
    fi
fi
#echo "No data returned"
#exit $STATE_UNKNOWN
done

However, in a SAN environment where every host in an ESX environment has the same LUNs attached for HA, having every host report on space utilisation for the same list of LUNs is a bit over the top, and I'm actually more interested in knowing if a host has lost its path(s) to a LUN.

So I thought I would make a small modification to the script so that it would actually complain if the VMFS (LUN) disappeared. In its original form the script would produce no output to cover this eventuality, and I thought I could see why: the two commented lines just before the end of the while loop needed to be uncommented, changed to report a missing vmfs and moved outside the loop to pick up anything that didn't hit any of the exit statements. It should have taken about 30 seconds.

Well I spent about an hour grappling with a very curious symptom: if any of the conditions inside the loop were met, the script would provide the correct response, but then, instead of exiting, would appear to carry on executing any statements after the loop. This meant that the script would report that it had found a vmfs volume AND report that it couldn't find it. I tried setting variables inside the loop to pick up back in the main section, even exporting them to make them exist outside the script; but I seemed to be stuck with some sort of scoping problem.

It was only when I came across an article by Craig Russell on Bash variable scope inside a while loop that I realised what the problem was: the while loop was sitting on the end of a pipe, and in Bash that means the loop runs in a subshell, a separate child process. So an exit statement inside the loop ends only the subshell, and any variables set inside the loop vanish with it; neither has any effect on the outer script.
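
The effect is easy to reproduce away from ESX. The sketch below is a hedged illustration, not part of the plugin: it uses Python to run two tiny shell snippets (the names PIPED, FROM_FILE, run and demo are made up for this example) and assumes that sh is bash, dash or a similar shell that runs the last stage of a pipeline in a subshell.

```python
import os
import subprocess
import tempfile

# Loop fed by a pipe: it runs in a subshell, so "exit 2" only leaves the subshell.
PIPED = """
printf 'vol1\\n' | while read v; do echo "found $v"; exit 2; done
echo "not found"; exit 3
"""

# Loop fed by a file redirect: it runs in the main shell, so "exit 2" ends the script.
FROM_FILE = """
while read v; do echo "found $v"; exit 2; done < "$1"
echo "not found"; exit 3
"""

def run(script, *args):
    """Run a script with sh and return (exit status, list of output lines)."""
    proc = subprocess.run(["sh", "-c", script, "sh", *args],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout.splitlines()

def demo():
    """Run both variants against the same one-line input."""
    with tempfile.NamedTemporaryFile("w", suffix=".tmp", delete=False) as tmp:
        tmp.write("vol1\n")
    try:
        return run(PIPED), run(FROM_FILE, tmp.name)
    finally:
        os.remove(tmp.name)

if __name__ == "__main__":
    for status, lines in demo():
        print(status, lines)
```

Under those assumptions, the piped variant prints both "found vol1" and "not found" and exits with status 3, which is exactly the double report described above. The redirected variant stops after "found vol1" with status 2.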

Craig had a reasonable alternative - pipe out to a temporary file and hang the while loop onto the file instead. I was then able to implement my properly performing script:

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

MYVOL=$1
WARNTHRESH=$2
CRITTHRESH=$3
RET=$?

if [[ $RET -ne 0 ]]
then
  echo "query problem - No data received from host"
  exit $STATE_UNKNOWN
fi

vdf -h -P | grep -E '^/vmfs/volumes/' | awk '{ print $2 " " $3 " " $4 " " $5 " " $6 }' >/tmp/vmfslist.tmp

while read output
  do
  DISKSIZE=$(echo $output | awk '{ print $1 }' )
  DISKUSED=$(echo $output | awk '{ print $2 }' )
  DISKAVAILABLE=$(echo $output | awk '{ print $3 }' )
  PERCENTINUSE=$(echo $output | awk '{ print $4 }' )
  VOLNAME=$(echo $output | awk '{ print $5 }' )
  CUTPERC=$(echo $PERCENTINUSE | cut -d'%' -f1 )
  if [ "/vmfs/volumes/$MYVOL" = $VOLNAME ]
    then
    if [ $CUTPERC -lt $WARNTHRESH ]
      then
      echo "OK - $PERCENTINUSE used | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_OK
    elif [ $CUTPERC -ge $CRITTHRESH ]
      then
      echo "CRITICAL - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_CRITICAL
    elif [ $CUTPERC -ge $WARNTHRESH ]
      then
      echo "WARNING - *$PERCENTINUSE used* | Volume=$MYVOL Size=$DISKSIZE Used=$DISKUSED Available=$DISKAVAILABLE PercentUsed=$PERCENTINUSE"
      exit $STATE_WARNING
    fi
  fi
done < /tmp/vmfslist.tmp

echo "No data returned. VMFS Unavailable?"
exit $STATE_CRITICAL

The lesson? There's no real substitute for knowing how stuff works... properly.

Nagios AD Authentication

2011-08-30T13:44:00.000-07:00

Installing a new Nagios/MRTG/OpenNMS server recently under SUSE10, I had a few problems getting AD credentials to work consistently across web and ssh interfaces.

While I'm not going to bore you with my full step-by-step server build doc, I have extracted the Kerberos steps, hoping it will make someone else's life a little easier.

So we start at the point where we have built and activated the server, configured networking and installed vmware tools, but we haven't installed Nagios.

Time Synchronisation - this is very important as Kerberos by default tolerates a clock difference of only 5 minutes. So as well as installing ntp, I still add clock=pit to the end of my kernel line in /boot/grub/menu.lst

Install Kerberos client through yast networking services. Ensure the default dns domain matches resolv.conf; ensure that the default realm matches your AD, and is in CAPITALS; supply your nearest DC's FQDN as KDC server.

Edit krb5.conf

vi /etc/krb5.conf

(insert after [libdefaults])

dns_lookup_realm = true

dns_lookup_kdc = true

With just this first change, you should be able to authenticate using your AD credentials

SRV09 # kinit -V myadusername
Password for myadusername@MYADREALM.INTERNAL:

Authenticated to Kerberos v5

SRV09 #

If you get any errors containing "...not match expectations..." check your krb5.conf for typos

Now we have to enable kerberos in PAM.

Edit pam_unix2.conf and check that the following lines are present:

vi /etc/security/pam_unix2.conf
auth: use_krb5 nullok

account: use_krb5

password: use_krb5 nullok

session: none

Every participating AD user must have a local account exactly matching their AD username:

useradd -c "My Full Name" -m myadusername

Finish the job by sorting out su and sudo - add the user to the wheel group:

usermod -G wheel myadusername

Enforce use of the wheel group

chown root:wheel /bin/su

chmod o-rx /bin/su

chmod u+s /bin/su

Configure sudo - I just allow access from the wheel group, but you can configure how you like.

visudo

(modify)

#Defaults targetpw

#ALL ALL=(ALL) ALL

%wheel ALL=(ALL) ALL

At this point you should be able to log on via ssh with your AD credentials and use su to root or execute programs with root privilege through sudo if necessary.

Once you have verified this, you can disable root access to sshd:

vi /etc/ssh/sshd_config

(modify)

PermitRootLogin no

Restart SSHD

/etc/init.d/sshd restart

Now we're ready to take on Apache...

Run yast and install the following packages:

apache2

apache-mod_php5

This gets enough Apache in place to install Nagios.
So we run through the Nagios core and plugins installation until we get to

...

make install-webconf

htpasswd2 -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Now when we restart Apache we should be able to get to the basic Nagios website, to be prompted for the local nagiosadmin credentials.

/etc/init.d/apache2 restart

Ensure all authenticated users have required accesses

vi /usr/local/nagios/etc/cgi.cfg

(modify)

authorized_for_all_services=*

authorized_for_all_hosts=*

authorized_for_all_service_commands=*

authorized_for_system_information=*

authorized_for_configuration_information=*

Now we need to update Apache to use PAM. First run yast and install the following packages:

apache2-devel

pam-devel

Download and expand the pam module source tar file

cd /var/tmp

wget http://pam.sourceforge.net/mod_auth_pam/dist/mod_auth_pam-2.0-1.1.1.tar.gz

tar xzpf mod_auth_pam-2.0-1.1.1.tar.gz

cd mod_auth_pam/

Correction for Apache 2

vi Makefile

(replace) APXS=apxs

(with) APXS=/usr/sbin/apxs2

Compile and install module

make

make install

Tell Apache about the new module:

vi /etc/sysconfig/apache2

(find line beginning with: APACHE_MODULES and add auth_pam and auth_sys_group to the list)

Change system-wide authentication / authorisation to per-directory

vi /etc/apache2/httpd.conf

(under) <Directory />

(replace) AllowOverride None

(with) AllowOverride AuthConfig

Configure PAM parameters for Apache

vi /etc/pam.d/httpd

(remove everything and insert:)

auth required /lib/security/pam_env.so

auth sufficient /lib/security/pam_krb5.so minimum_uid=1

auth required /lib/security/pam_deny.so

account required /lib/security/pam_krb5.so

Edit Nagios apache configuration to force PAM authentication

vi /etc/apache2/conf.d/nagios.conf

(replace everything in <Directory…> brackets with):

SSLRequireSSL (optional - I'm assuming that you compiled with SSL):

Options ExecCGI

AllowOverride None

Order allow,deny

Allow from all

AuthName "My AD Domain"

AuthPAM_Enabled on

AuthPAM_FallThrough off

AuthBasicAuthoritative off

AuthGROUP_Enabled off

AuthUserFile /dev/null

AuthType Basic

Require valid-user

Restart Apache (check for errors)

/etc/init.d/apache2 restart

And that's the main enabling configuration done!

So in future, when we want to add an AD user to Nagios, these are the steps we have to carry out:

Create a local username identical to the AD username

useradd -c "User Full Name" -m adusername

Add the username to the wheel group

usermod -G wheel adusername

Add the username to Nagios Contacts

vi /usr/local/nagios/etc/objects/contacts.cfg

define contact{

contact_name adusername

use generic-contact

alias User Full Name

host_notification_commands notify-host-by-email

service_notification_commands notify-service-by-email

email adusermailname@somedomain.net

}

Add the username to Nagios Admins Group

define contactgroup{

contactgroup_name admins

alias Nagios Administrators

members nagiosadmin,adusername

}

Create a Nagios account using the username (password is ignored, leave blank). Leave out the -c switch this time: it creates the file afresh and would wipe the existing entries.

htpasswd2 /usr/local/nagios/etc/htpasswd.users adusername

If the user is only to have readonly rights:

vi /usr/local/nagios/etc/cgi.cfg

(modify)

authorized_for_read_only=adusername,user2,etc

Restart Nagios

/etc/init.d/nagios restart

Reading the Clipboard from VBScript

2011-08-28T05:11:00.000-07:00

I was recently working on some scripting for managing a couple of hundred Cisco devices, to automate bulk ACL changes, backups and suchlike via PuTTY. At one point I came to the conclusion that it would be useful to be able to access the clipboard from a vbscript running under cscript.exe, and went looking for some starter code. I was surprised to find that no such code existed, or that if it existed, it relied on an external program such as clip.exe.

The problem seems to be that the clipboard lives in GUI userland, and my scripts live in text userland. So any solution would have to reach into the Windows environment to access the abstraction that is the clipboard, and bring its contents back to my text environment.

My first cut solution uses a pseudo-netsocket approach - spawn another process and establish two-way communication using PIDs, then paste into its windows interface, and have it send what it receives to my stdin.

It works surprisingly well (in my environment) and although ultimately I decided not to use it for my Cisco project, I have added it to my libraries for process management. Here it is with only minimal error checking for clarity.

Option Explicit

'===========================================================================
'Name:    GetClipBoard() function
'Author:  Philip Damian-Grint
'Version: 1.0
'Date:    28th Aug 2011
'
'Description:
'
'  From a vbs script running under cscript.exe, read the contents of the 
'  Clipboard into a string.
'===========================================================================

Function GetClipBoard

    ' First we create a text file to hold our child
    dim objFS : Set objFS = CreateObject("Scripting.FileSystemObject")
    dim strFName : strFName = objFS.GetTempName
    dim objTS : Set objTS = objFS.CreateTextFile( strFName, True )

    ' Our child requests her parent's PID, and then provides a paste buffer, all off-screen
    objTS.WriteLine("dim pid : pid=inputbox(""PID"",,,0,-3000) : " & _
            "dim str : str=inputbox(""STR"",,,0,-3000) : " & _
            "set shell=wscript.createobject(""wscript.shell"") : " & _
            "shell.appactivate pid : wscript.sleep 100 : " & _
            "shell.sendkeys str & ""{ENTER}""")
    objTS.Close

    ' Spawn our child as a running process
    Dim objWshShell : Set objWshShell = WScript.CreateObject("WScript.Shell")
    Dim objChild : Set objChild = objWshShell.exec( "cscript.exe //E:vbscript " & strFName )
    Dim intChildPID : intChildPID = clng(objChild.ProcessID)

    ' Now use our child's PID to find our own
    Dim strObjPath : strObjPath = "winmgmts:{impersonationLevel=impersonate}!\\.\root\cimv2"
    Dim objProcess, intParentPID

    For Each objProcess In getObject( strObjPath ).instancesOf("Win32_Process")
        If intChildPID = (clng(objProcess.processID)) Then
            intParentPID= objProcess.parentProcessID : Exit For
        End If
    Next

    ' Find our child's first input box, and write our PID to it
    Do until objWshShell.AppActivate( intChildPID )
        WScript.Sleep 100
    Loop : objWshShell.SendKeys intParentPID & "{ENTER}"

    ' Find our child's second input box, and paste the clipboard contents
    Do Until objWshShell.AppActivate( intChildPID )
        WScript.Sleep 100
    Loop : objWshShell.SendKeys "^v{ENTER}"

    ' Receive the paste buffer contents from our child
    GetClipBoard  = WScript.StdIn.ReadLine

    ' And clear up after our child
    objFS.DeleteFile strFName, True

End Function

' Demonstrate the function

wscript.echo "We read: " & GetClipBoard() & " from the clipboard"

MRTG Percentile Calculation

2010-12-08T10:12:00.000-08:00

I was recently asked to provide 95th percentile utilisation figures for around 50 WAN interfaces on our network. I've been using MRTG for years, and assumed someone would have contributed something which I could use or customise, but I found nothing.

This then, is my fairly basic hack for processing mrtg-2 log files and calculating the required information.

It's in Perl and contains more documentation than code... You will note that I wasn't brave (stupid?) enough to write my own percentile algorithm...

The actual calculation code is trivial - most of the code is contriving to implement a primitive weighting system to cope with samples covering variable time periods. The code has been tested on Windows under ActivePerl 5.2.12.
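
As a rough illustration of the weighting idea, here is a short Python sketch. It is not a translation of the Perl script: the function names and the sample figures are made up, and the nearest-rank percentile used here is an assumption that may not match what Statistics::Descriptive does internally.

```python
import math

SLOT_SECONDS = 300  # one MRTG 5 minute sample

def weighted_values(start_time, records):
    """Repeat each value once per whole 5 minute slot it covers.

    records is a list of (epoch, value) pairs, newest first, as in the log.
    """
    values = []
    last_time = start_time
    for epoch, value in records:
        elapsed = last_time - epoch
        values.extend([value] * (elapsed // SLOT_SECONDS))
        last_time = epoch
    return values

def percentile(values, pct):
    """Nearest-rank percentile (an assumption; the Perl module may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

if __name__ == "__main__":
    # Made-up figures: two 5 minute samples followed by one 30 minute sample.
    records = [(9700, 10), (9400, 20), (7600, 50)]
    values = weighted_values(10000, records)
    print(values, percentile(values, 95))
```

The 30 minute sample ends up counted six times against once for each 5 minute sample, so long-interval records carry weight in proportion to the time they cover.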

#!/usr/bin/env perl
# NAME:   mrtg-ptile.pl
#
# AUTHOR:  Philip Damian-Grint
#
# DESCRIPTION: 
#    Generate percentile calculations for in and out values found
#    in an MRTG log file (version 2).
#    (See http://oss.oetiker.ch/mrtg/doc/mrtg-logfile.en.html)
#
#    In general there are 600 samples each of 5mins, 30mins, 120mins
#    and 1440mins (86400 seconds). Each dataset is a quintuple:
#    {epoch, in_average, out_average, in_maximum, out_maximum}
#
#    We want to be able to ask for a variable percentile over a variable
#    length of time stretching back from now.
#
#    We keep track of the effective elapsed time as we go back through the
#    log file. Examination of log files shows that there can be a number
#    of inconsistencies such as variations in timestamp greater or
#    less than expected, and a greater or less number of samples in each
#    bracket.
#
#    To overcome this we compare each timestamp with the previous, and
#    divide the difference by 300 (seconds), discarding any remainder.
#    The values are repeated the number of times indicated by the quotient.
#
#    So each 5 minute value set will be added once, each 30 minute value set
#    will be added 6 times, and each 2 hour value set will be added 24 times
#    so that we have a number of datasets equivalent to the number of 5min
#    chunks evenly spread over the period being evaluated.
#
# INPUTS:  
#    Logfile (I haven't coded for wildcards)
#    Percentile (I restrict this to an integer between 1 and 99)
#    Time period (I restrict this to 90 days or less)
#
#    The command line arguments are:
#    mrtg-ptile.pl --logfile={filename} \
#                  --percentile={0<x<100} \
#                  --period={y days} \
#                  --verbose
#    Where "\" indicates line wrap.
#
# OUTPUTS:  
#    Percentiles for average bytes per second in, out, maximum in
#    and maximum out
#    These 4 values are output to STDOUT
#
# NOTES:  
#    1. The percentile figures output are based on the figures input,
#       and on the units input. If these relate to router interfaces,
#       they will normally represent bytes per second.
#    2. It should go without saying that running this against live log
#       files while MRTG is running will have unpredictable results.
#       Copy the logfiles to a location where they will not be disturbed
#       while being processed.
#
# HISTORY:  7/12/2010: v1.0 created
#

# PRAGMAS
use strict;

#
# PACKAGES
use Getopt::Long;
use Statistics::Descriptive;

#
# VARIABLES

# Parameters
my $logfile;      # Name of logfile to process
my $percentile;   # Percentile to calculate
my $period;       # Length of time in 24 hour days
my $verbose;      # Flag to request diagnostic information

#
# Working Storage
my $elapsed;      # Seconds between current and previous record's epoch times
my $first_line;   # Used to skim off the first (unused) line in the log file
my $i;            # General purpose loop counter variable
my $in_avg;       # Bps value from field 2 in the current record
my $in_max;       # Bps value from field 4 in the current record
my $inavgstat;    # Statistics::Descriptive object for average IN values
my $inmaxstat;    # Statistics::Descriptive object for maximum IN values
my $last_time;    # Epoch timestamp from the previous record
my $multiplier;   # Number of 5 minute slots represented by the current record
my $out_avg;      # Bps value from field 3 in the current record
my $out_max;      # Bps value from field 5 in the current record
my $outavgstat;   # Statistics::Descriptive object for average OUT values
my $outmaxstat;   # Statistics::Descriptive object for maximum OUT values
my $samplesecs;   # Remaining (reporting) period in seconds
my $time;         # Epoch time value from field 1 in the current record

#
# Check that we were called intelligently

GetOptions ("logfile=s" => \$logfile,
   "percentile=i" => \$percentile,
   "period=i" => \$period,
   "verbose" => \$verbose );

if (!($logfile) || !($percentile) || !($period)) {
   die "\nUsage: mrtg-ptile.pl \t--logfile={filename}".
       " \\\n\t\t\t--percentile={integer}".
       " \\\n\t\t\t--period={integer days}\n";
}

#
# Sanity checks on numbers
if ($percentile < 1 || $percentile > 99) {
   die "Percentile must lie between 1 and 99";
}
if ($period > 90) {
   die "Period cannot be greater than 90 days";
} # Only 'cos some of my data older than this is mangled :)

#
# INITIALISATION
$elapsed = 0;                           # Zero elapsed time tracker
open(FILE, "$logfile") or die("Couldn't open file: $logfile \n");
$first_line = <FILE>;             # get header line out of the way
($last_time) = split(" ", $first_line); # capture the first sample time
$samplesecs = $period * 24 * 3600;      # Set up countdown timer

$inavgstat = Statistics::Descriptive::Full->new(); # Initialise stats objects
$inmaxstat = Statistics::Descriptive::Full->new();
$outavgstat = Statistics::Descriptive::Full->new();
$outmaxstat = Statistics::Descriptive::Full->new();

#
# MAIN
while (<FILE>) {
   # Split up our tuple
   ($time, $in_avg, $out_avg, $in_max, $out_max) = (split)[0,1,2,3,4];

   if ( $samplesecs > $elapsed) {       # if we haven't run out of time...
      $elapsed = $last_time - $time;    # Count elapsed seconds
      $multiplier = int($elapsed/300);  # Count 5 minute slots in this record
      $samplesecs -= $elapsed;          # Adjust remaining time period

      if ($verbose) {
         print "Time: ", $time."(".$last_time.
         "), In_Avg: ".$in_avg.", Out_Avg: ".$out_avg.
         ", In_Max: ".$in_max.", Out_Max: ".$out_max.
         ", Elapsed: ".$elapsed.": Post ".$multiplier." times, ".
         $samplesecs . " seconds of samples left\n";
      }

      $last_time = $time;               # track for the next sample
      # post the sample once for every elapsed 5 minutes
      for ($i=1; $i<=$multiplier; $i++) {
         $inavgstat->add_data($in_avg);
         $inmaxstat->add_data($in_max);
         $outavgstat->add_data($out_avg);
         $outmaxstat->add_data($out_max);
      }
   }
}

# FINISH
close(FILE);

# Check to see if we ran out of samples
if ($samplesecs > $elapsed) {
   print "Warning: not enough samples found to cover requested period\n";
}

# Output our percentiles
print "In_Avg: ".$inavgstat->percentile($percentile).
      ", Out_Avg: ".$outavgstat->percentile($percentile).
      ", In_Max: ".$inmaxstat->percentile($percentile).
      ", Out_Max: " . $outmaxstat->percentile($percentile) . "\n";