Wednesday 8 December 2010

MRTG Percentile Calculation

I was recently asked to provide 95th percentile utilisation figures for around 50 WAN interfaces on our network. I've been using MRTG for years, and assumed someone would have contributed something which I could use or customise, but I found nothing.

This, then, is my fairly basic hack for processing mrtg-2 log files and calculating the required information.
It's in Perl and contains more documentation than code. You will note that I wasn't brave (stupid?) enough to write my own percentile algorithm; the script leans on Statistics::Descriptive for that.

The actual calculation code is trivial. Most of the code implements a primitive weighting system to cope with samples covering variable time periods. The code has been tested on Windows under ActivePerl 5.2.12.
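
To make the weighting idea concrete, here is a minimal standalone sketch of it. It is not lifted from the script below: the `expand_samples` helper and the sample figures are made up for illustration, and it assumes records arrive newest first, as they do in an MRTG log. Each value is repeated once for every whole 5 minute slot its record covers.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of mrtg-ptile.pl: expand each record into one
# value per whole 5 minute slot it covers. Records are [epoch, value] pairs,
# newest first; $start_time is the timestamp the first record is measured from.
sub expand_samples {
    my ($start_time, @records) = @_;
    my $last_time = $start_time;
    my @expanded;
    for my $record (@records) {
        my ($time, $value) = @$record;
        my $slots = int(($last_time - $time) / 300);  # whole 5 minute slots
        push @expanded, ($value) x $slots;            # repeat the value per slot
        $last_time = $time;
    }
    return @expanded;
}

# Made-up data: two 5 minute samples followed by one 30 minute sample.
my @weighted = expand_samples(3000, [2700, 10], [2400, 20], [600, 30]);
print scalar(@weighted), " values: @weighted\n";
```

With these figures the 30 minute sample is counted six times, so it carries the same weight as six 5 minute samples would.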

#!/usr/bin/env perl
# NAME:   mrtg-ptile.pl
#
# AUTHOR:  Philip Damian-Grint
#
# DESCRIPTION: 
#    Generate percentile calculations for in and out values found
#    in an MRTG log file (version 2).
#    (See http://oss.oetiker.ch/mrtg/doc/mrtg-logfile.en.html)
#
#    In general there are 600 samples each of 5mins, 30mins, 120mins
#    and 1440mins (86400 seconds). Each dataset is a quintuple:
#    {epoch, in_average, out_average, in_maximum, out_maximum}
#
#    We want to be able to ask for a variable percentile over a variable
#    length of time stretching back from now.
#
#    We keep track of the effective elapsed time as we go back through the
#    log file. Examination of log files shows that there can be a number
#    of inconsistencies such as variations in timestamp greater or
#    less than expected, and a greater or less number of samples in each
#    bracket.
#
#    To overcome this we subtract each timestamp from the previous one,
#    divide the difference by 300 (seconds) and truncate it to a whole
#    number. The values are repeated the number of times indicated by
#    the quotient.
#
#    So each 5 minute value set will be added once, each 30 minute value set
#    will be added 6 times, and each 2 hour value set will be added 24 times
#    so that we have a number of datasets equivalent to the number of 5min
#    chunks evenly spread over the period being evaluated.
#
# INPUTS:  
#    Logfile (I haven't coded for wildcards)
#    Percentile (I restrict this to an integer between 1 and 99)
#    Time period (I restrict this to 90 days or less)
#
#    The command line arguments are:
#    mrtg-ptile.pl --logfile={filename} \
#                  --percentile={0<x<100} \
#                  --period={y days} \
#                  --verbose
#    Where "\" indicates line wrap.
#
# OUTPUTS:  
#    Percentiles for average bytes per second in, out, maximum in
#    and maximum out
#    These 4 values are output to STDOUT
#
# NOTES:  
#    1. The percentile figures output are based on the figures input,
#       and on the units input. If these relate to router interfaces,
#       they will normally represent bytes per second.
#    2. It should go without saying that running this against live log
#       files while MRTG is running will have unpredictable results.
#       Copy the logfiles to a location where they will not be disturbed
#       while being processed.
#
# HISTORY:  7/12/2010: v1.0 created
#

# PRAGMAS
use strict;

#
# PACKAGES
use Getopt::Long;
use Statistics::Descriptive;

#
# VARIABLES

# Parameters
my $logfile;      # Name of logfile to process
my $percentile;   # Percentile to calculate
my $period;       # Length of time in 24 hour days
my $verbose;      # Flag to request diagnostic information

#
# Working Storage
my $elapsed;      # Seconds between current and previous record's epoch times
my $first_line;   # Used to skim off the first (unused) line in the log file
my $i;            # General purpose loop counter variable
my $in_avg;       # Bps value from field 2 in the current record
my $in_max;       # Bps value from field 4 in the current record
my $inavgstat;    # Statistics::Descriptive object for average IN values
my $inmaxstat;    # Statistics::Descriptive object for maximum IN values
my $last_time;    # Epoch timestamp from the previous record
my $multiplier;   # Number of 5 minute slots represented by the current record
my $out_avg;      # Bps value from field 3 in the current record
my $out_max;      # Bps value from field 5 in the current record
my $outavgstat;   # Statistics::Descriptive object for average OUT values
my $outmaxstat;   # Statistics::Descriptive object for maximum OUT values
my $samplesecs;   # Remaining (reporting) period in seconds
my $time;         # Epoch time value from field 1 in the current record

#
# Check that we were called intelligently

GetOptions ("logfile=s" => \$logfile,
   "percentile=i" => \$percentile,
   "period=i" => \$period,
   "verbose" => \$verbose );

if (!($logfile) || !($percentile) || !($period)) {
   die "\nUsage: mrtg-ptile.pl \t--logfile={filename}".
       " \\\n\t\t\t--percentile={integer}".
       " \\\n\t\t\t--period={integer days}\n";
}

#
# Sanity checks on numbers
if ($percentile < 1 || $percentile > 99) {
   die "Percentile must lie between 1 and 99";
}
if ($period < 1 || $period > 90) {
   die "Period must lie between 1 and 90 days";
} # Only 'cos some of my data older than this is mangled :)

#
# INITIALISATION
$elapsed = 0;                           # Zero elapsed time tracker
open(FILE, '<', $logfile) or die("Couldn't open file: $logfile: $!\n");
$first_line = <FILE>;             # get header line out of the way
($last_time) = split(" ", $first_line); # capture the first sample time
$samplesecs = $period * 24 * 3600;      # Set up countdown timer

$inavgstat = Statistics::Descriptive::Full->new(); # Initialise stats objects
$inmaxstat = Statistics::Descriptive::Full->new();
$outavgstat = Statistics::Descriptive::Full->new();
$outmaxstat = Statistics::Descriptive::Full->new();

#
# MAIN
while (<FILE>) {
   # Split up our tuple
   ($time, $in_avg, $out_avg, $in_max, $out_max) = (split)[0,1,2,3,4];
   if ( $samplesecs > $elapsed) {       # if we haven't run out of time...
      $elapsed = $last_time - $time;    # Count elapsed seconds
      $multiplier = int($elapsed/300);  # Count whole 5 minute slots in this interval
      $samplesecs -= $elapsed;          # Adjust remaining time period

      if ($verbose) {
         print "Time: ", $time."(".$last_time.
         "), In_Avg: ".$in_avg.", Out_Avg: ".$out_avg.
         ", In_Max: ".$in_max.", Out_Max: ".$out_max.
         ", Elapsed: ".$elapsed.": Post ".$multiplier." times, ".
         $samplesecs . " seconds of samples left\n";
      }

      $last_time = $time;               # track for the next sample
      # post the sample once for every elapsed 5 minutes
      for ($i=1; $i<=$multiplier; $i++) {
         $inavgstat->add_data($in_avg);
         $inmaxstat->add_data($in_max);
         $outavgstat->add_data($out_avg);
         $outmaxstat->add_data($out_max);
      }
   }
}

#
# FINISH
close(FILE);

# Check to see if we ran out of samples
if ($samplesecs > $elapsed) {
   print "Warning: not enough samples found to cover requested period\n";
}

# Output our percentiles
print "In_Avg: ".$inavgstat->percentile($percentile).
      ", Out_Avg: ".$outavgstat->percentile($percentile).
      ", In_Max: ".$inmaxstat->percentile($percentile).
      ", Out_Max: ".$outmaxstat->percentile($percentile)."\n";

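For anyone wanting to sanity-check the figures by hand, the sketch below shows one common definition of a percentile, the nearest-rank method. This is an assumption on my part about what a reasonable check looks like: I have not verified that Statistics::Descriptive uses exactly this rule, so consult the module's documentation before comparing results. The `nearest_rank_percentile` name and the sample data are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical stand-in for illustration only; the script itself relies on
# Statistics::Descriptive. Nearest-rank: the smallest value with at least
# $p percent of the samples at or below it.
sub nearest_rank_percentile {
    my ($p, @values) = @_;
    return unless @values;                 # nothing to rank
    my @sorted = sort { $a <=> $b } @values;
    my $count  = scalar @sorted;
    my $rank   = ceil($p * $count / 100);  # 1-based rank into the sorted list
    $rank = 1 if $rank < 1;
    return $sorted[$rank - 1];
}

my @samples = (1 .. 20);   # made-up data
print nearest_rank_percentile(95, @samples), "\n";
```

For twenty samples the 95th percentile under this rule is the 19th smallest value, so a single extreme spike is ignored.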