Apache access log URL type counter

We cannot always configure our monitoring system to detect every type of error that may occur on a large web application platform.
From time to time we need to get our hands dirty, open the good old console and take a look inside the Apache log files to see what’s going wrong.

Every real boss will ask you from time to time what the hell these increasing 404 status codes on our platform are, so I wrote a very simple Perl script that counts, for a predefined URL depth, how many times each type of URL occurs in the log file.

#!/usr/bin/perl
use strict;
use warnings;

my %countHash;

while (my $line = <STDIN>) {
  main($line);
}

# print the url classes sorted by hit count, most frequent last
foreach my $key (sort hashValueAscendingNum (keys(%countHash))) {
   print "\t$countHash{$key}\t$key\n";
}


sub main {
  my $logLine = shift;

  # full regular expression that splits a log line into all of its fields
  #(my $domain, my $rfc931, my $authuser, my $TimeDate, my $Request, my $Status, my $Bytes, my $Referrer, my $Agent) = $logLine =~ /^(\S+) (\S+) (\S+) (\[[^\]\[]+\]) \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

  # get the request out of the apache log line; skip lines that do not match
  (my $Request) = $logLine =~ /^\S+ \S+ \S+ \[[^\]\[]+\] \"([^"]*)\" \S+ \S+ \"?[^"]*\"? \"[^"]*\"/o;
  return unless defined $Request;

  # cut off GET / POST .... at the beginning of the string and HTTP/1.1 .... at the end
  (my $cleanUrl) = $Request =~ /^[^\s]*\s*(\S+)\s/;
  return unless defined $cleanUrl;

  # count similar url classes: here second level
  # add '[\/]?[^\/]*' inside the parentheses to differentiate deeper levels
  (my $cmpStr) = $cleanUrl =~ /^(\/[^\/]*[\/]?[^\/]*)/;
  return unless defined $cmpStr;

  $countHash{$cmpStr}++;
}

# sort comparator: highest count first (not used by default)
sub hashValueDescendingNum {
   $countHash{$b} <=> $countHash{$a};
}

# sort comparator: lowest count first
sub hashValueAscendingNum {
   $countHash{$a} <=> $countHash{$b};
}

The script also contains, as a comment, the ‘official’ or, more importantly, ‘working’ regular expression for splitting a standard Apache log line into its fields.
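
To see what the depth grouping does to a URL, the second-level expression from the script can be tried in isolation. The following is only a rough sketch: it uses sed, sort and uniq instead of Perl, and the request paths are made up for illustration.

```shell
# Sample request paths (hypothetical, not taken from a real log).
urls='/shop/shoes/item-42.html
/shop/shoes/item-7.html
/shop/hats/
/about'

# Keep only the first two path levels, mirroring the script's
# /^(\/[^\/]*[\/]?[^\/]*)/ expression, then count identical results
# and sort them by count in ascending order.
depth_counts=$(printf '%s\n' "$urls" \
  | sed -E 's#^(/[^/]*/?[^/]*).*#\1#' \
  | sort | uniq -c | sort -n)

printf '%s\n' "$depth_counts"
```

Both item pages collapse into /shop/shoes, so that class is counted twice and ends up on the last line, just as the most frequent class does in the script's output.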

Usage:

You have to pipe the log file into this script:

cat /var/log/httpd/access.log|./count_hits.pl

More importantly, thanks to the holy ‘grep’, you can create lists for specific error types.
To get a list of the most frequently hit 404 pages (the script sorts in ascending order, so the most frequent URLs appear at the bottom):

cat /var/log/httpd/access.log|grep '" 404 '|./count_hits.pl
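
The grep pattern includes the closing quote of the request and the surrounding spaces so that it matches the status field and not just any 404 in the line. A small sketch to check this, using two made-up log lines in combined log format (addresses, URLs and user agents are invented for illustration):

```shell
# Only the first line has status 404; the second merely contains
# "404" in its URL and in its byte count.
log='203.0.113.5 - - [10/Oct/2023:13:55:36 +0200] "GET /old/page.html HTTP/1.1" 404 512 "-" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2023:13:55:40 +0200] "GET /item/404 HTTP/1.1" 200 4040 "-" "Mozilla/5.0"'

# Count the lines whose status field is 404.
matches=$(printf '%s\n' "$log" | grep -c '" 404 ')
echo "$matches"   # prints 1
```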

You can even find out how many bad 500 pages were delivered to everyone's best friend, Mr. Google:

cat /var/log/httpd/access.log|grep '" 500 '|grep -i 'googlebot'|./count_hits.pl