How to aggregate weekly data to create custom statistics

Recently, I parsed the logs of several applications to generate custom weekly reports. It was a very interesting exercise, so I created two shell scripts that illustrate the whole idea using HAProxy log files, to remember it in the future.

Display top 404 pages

Shell script used to display the top 404 pages over the last three weeks for the web frontend and the blog and statistics backends.
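Both scripts compute week boundaries with the relative-date arithmetic of GNU date. A minimal sketch of the idea, assuming GNU coreutils date is available:

```shell
# "last monday - N week + M day" walks back N weeks from the most
# recent Monday, then forward M days within that week (GNU date only).
week_start=$(date +%Y%m%d --date "last monday - 1 week + 0 day")
week_end=$(date +%Y%m%d --date "last monday - 1 week + 6 day")
echo "Previous week spans ${week_start} to ${week_end}"
```

The computed range always starts on a Monday and ends on a Sunday, regardless of the day the script runs on.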

#!/bin/bash
# Display weekly top 404 requests for n previous weeks

# number of previous weeks
number_of_weeks="3"

# directory to keep aggregated data
aggregated_logs_directory="/tmp/aggregated"

# application name
application="haproxy"

# application log files
log_filename="/var/log/haproxy.log*"

# date format to search for: [15/Mar/2018:
file_log_date_format="\[%d/%b/%Y:"

# frontends to filter
limit_frontends="^web$"
#limit_frontends=".*"

# backends to filter
limit_backends="blog\|statistics"
#limit_backends=".*"

# Print current date
echo "Current date: $(date)"
echo

# create aggregated log directory if it is missing
if [ ! -d "${aggregated_logs_directory}" ]; then
  echo "Creating aggregated log directory \"${aggregated_logs_directory}\""
  mkdir "${aggregated_logs_directory}"
else
  echo "Using aggregated log directory \"${aggregated_logs_directory}\""
fi

# loop over previous weeks
for n_weeks_ago in $(seq 1 ${number_of_weeks}); do
  # define pretty date from/to
  loop_pretty_date_from=$(date +%d/%b/%Y --date "last monday - ${n_weeks_ago} week + 0 day")
  loop_pretty_date_to=$(date +%d/%b/%Y --date "last monday - ${n_weeks_ago} week + 6 day")

  # define machine date from/to
  loop_txt_date_from=$(date +%Y%m%d --date "last monday - ${n_weeks_ago} week + 0 day")
  loop_txt_date_to=$(date +%Y%m%d --date "last monday - ${n_weeks_ago} week + 6 day")

  # define log filename
  aggregated_log_filename="${application}_${loop_txt_date_from}-${loop_txt_date_to}.log"

  # aggregate data
  if [ ! -f "${aggregated_logs_directory}/${aggregated_log_filename}" ]; then
    echo "Creating ${aggregated_log_filename} log file to store data from ${loop_pretty_date_from} to ${loop_pretty_date_to}"
    for weekday in $(seq 0 6); do
      zgrep "$(date +${file_log_date_format} --date "last monday - ${n_weeks_ago} weeks + ${weekday} days")" ${log_filename} >> "${aggregated_logs_directory}/${aggregated_log_filename}"
    done
  else
    echo "Using existing ${aggregated_log_filename} log file that contains data from ${loop_pretty_date_from} to ${loop_pretty_date_to}"
  fi

  # parse data
  if [ -f "${aggregated_logs_directory}/${aggregated_log_filename}" ]; then
    echo "Parsing data from ${loop_pretty_date_from} to ${loop_pretty_date_to} (${n_weeks_ago} week/weeks ago)"

    # filter frontends
    frontends=$(awk '{if ($8 !~ ":" && $8 !~ "~"  && !seen_arr[$8]++) print $8}' ${aggregated_logs_directory}/${aggregated_log_filename} | grep "${limit_frontends}")

    # filter backends and highlight nosrv
    backends=$(awk '{split($9,backend,"/");if ($8 !~ ":" && !seen_arr[backend[1]]++) {if (backend[2] !~ "NOSRV" )  print backend[1]; else print "NOSRV";}}' ${aggregated_logs_directory}/${aggregated_log_filename} | grep "${limit_backends}" | sort)

    # parse each log file for top 404 pages
    for frontend in ${frontends}; do
      echo "${frontend} frontend"
      for backend in ${backends}; do
        echo "->${backend}"
        if [ "${backend}" = "NOSRV" ]; then
          not_found_list=$(grep "${frontend}\([~]\)\? ${frontend}/<NOSRV>"  ${aggregated_logs_directory}/${aggregated_log_filename} | awk '$11 == "404" {query=substr($0,index($0,$18)); print query}' | sort  | uniq -c | sort -hr | head)
        else
          not_found_list=$(grep "${frontend}\([~]\)\? ${backend}/"  ${aggregated_logs_directory}/${aggregated_log_filename} | awk '$11 == "404" {query=substr($0,index($0,$18)); print query}' | sort  | uniq -c | sort -hr | head)
        fi

        if [ -z "$not_found_list" ]; then
          echo "  --- none ---"
        else
          echo  "$not_found_list"
        fi
      done
    done
    echo
  fi
done

Sample output.

Current date: Fri Mar 16 19:06:41 CET 2018

Creating aggregated log directory "/tmp/aggregated"
Creating haproxy_20180305-20180311.log log file to store data from 05/Mar/2018 to 11/Mar/2018
Parsing data from 05/Mar/2018 to 11/Mar/2018 (1 week/weeks ago)
web frontend
->web-blog-production
    892 "GET /wp-login.php HTTP/1.1"
    596 "GET /apple-touch-icon.png HTTP/1.1"
    560 "GET /apple-touch-icon-precomposed.png HTTP/1.1"
    470 "GET /xfavicon.png.pagespeed.ic.ITJELUENXe.png HTTP/1.1"
     74 "GET /assets/images/blog_sleeplessbeastie_eu_image.png HTTP/1.1"
     72 "GET /tags/index.php HTTP/1.0"
     72 "GET /index.php HTTP/1.0"
     66 "GET /2013/01/21/how-to-automate-mouse-and-keyboard/index.php HTTP/1.0"
     66 "GET /01/21/how-to-automate-mouse-and-keyboard/index.php HTTP/1.0"
     40 "GET /favicon.png.pagespeed.ce.I9KrGowxSl.png HTTP/1.1"
->web-statistics-production
  --- none ---

Creating haproxy_20180226-20180304.log log file to store data from 26/Feb/2018 to 04/Mar/2018
Parsing data from 26/Feb/2018 to 04/Mar/2018 (2 week/weeks ago)
web frontend
->web-blog-production
   1012 "GET /wp-login.php HTTP/1.1"
    568 "GET /apple-touch-icon.png HTTP/1.1"
    554 "GET /apple-touch-icon-precomposed.png HTTP/1.1"
    502 "GET /xfavicon.png.pagespeed.ic.ITJELUENXe.png HTTP/1.1"
     72 "GET /tags/index.php HTTP/1.0"
     72 "GET /index.php HTTP/1.0"
     72 "GET /assets/images/blog_sleeplessbeastie_eu_image.png HTTP/1.1"
     44 "GET /favicon.png.pagespeed.ce.I9KrGowxSl.png HTTP/1.1"
     26 "HEAD /apple-touch-icon-precomposed.png HTTP/1.1"
     26 "HEAD /apple-touch-icon.png HTTP/1.1"
->web-statistics-production
  --- none ---

Creating haproxy_20180219-20180225.log log file to store data from 19/Feb/2018 to 25/Feb/2018
Parsing data from 19/Feb/2018 to 25/Feb/2018 (3 week/weeks ago)
web frontend
->web-blog-production
   1068 "GET /wp-login.php HTTP/1.1"
    846 "GET /apple-touch-icon.png HTTP/1.1"
    816 "GET /apple-touch-icon-precomposed.png HTTP/1.1"
    134 "GET /xfavicon.png.pagespeed.ic.ITJELUENXe.png HTTP/1.1"
     66 "GET /tags/index.php HTTP/1.0"
     66 "GET /index.php HTTP/1.0"
     44 "GET /2013/01/21/how-to-automate-mouse-and-keyboard/index.php HTTP/1.0"
     42 "GET /01/21/how-to-automate-mouse-and-keyboard/index.php HTTP/1.0"
     40 "GET /assets/images/blog_sleeplessbeastie_eu_image.png HTTP/1.1"
     32 "HEAD /apple-touch-icon-precomposed.png HTTP/1.1"
->web-statistics-production
      4 "HEAD /https://statistics.sleeplessbeastie.eu/ HTTP/1.1"
      4 "GET /rules.abe HTTP/1.1"
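The awk snippets above pick fields out of the default HAProxy HTTP log format. A quick illustration on a single made-up log line (the host, addresses, and backend names below are invented):

```shell
# Hypothetical log line in the default HAProxy HTTP log format.
log_line='Mar 15 10:00:00 lb haproxy[1234]: 10.0.0.1:52431 [15/Mar/2018:10:00:00.123] web~ web-blog-production/node1 0/0/1/2/3 404 1024 - - ---- 10/10/0/0/0 0/0 "GET /wp-login.php HTTP/1.1"'

# Field 8 is the frontend (a trailing "~" marks an SSL connection),
# field 9 is backend/server, field 11 is the HTTP status code.
echo "${log_line}" | awk '{split($9,backend,"/"); print $8, backend[1], $11}'

# Field 18 starts the quoted request, so substr()/index() recover it whole.
echo "${log_line}" | awk '$11 == "404" {print substr($0, index($0, $18))}'
```

The first awk invocation prints `web~ web-blog-production 404`, the second prints the complete quoted request `"GET /wp-login.php HTTP/1.1"`.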

Display specified file type occurrences

Shell script used to display weekly statistics for occurrences of specified file types.

#!/bin/bash
# Display weekly statistics for several file types for n previous weeks

# display mode
# 1 - pretty
# 2 - regular
display_mode="1"

# number of previous weeks
number_of_weeks="3"

# directory to keep aggregated data
aggregated_logs_directory="/tmp/aggregated"

# application name
application="haproxy"

# application log files
log_filename="/var/log/haproxy.log*"

# date format to search for: [15/Mar/2018:
file_log_date_format="\[%d/%b/%Y:"

# file types to search for: [a-zA-Z0-9]\+\.\(php\|html\|txt\|png\)
file_types="php html txt png"

# frontends to filter
limit_frontends="^web$"
#limit_frontends=".*"

# backends to filter
limit_backends="NOSRV\|blog\|statistics"
#limit_backends=".*"

# Print current date
echo "Current date: $(date)"
echo

# create aggregated log directory if it is missing
if [ ! -d "${aggregated_logs_directory}" ]; then
  if [ "${display_mode}" -eq "1" ]; then
    echo "Creating aggregated log directory \"${aggregated_logs_directory}\""
  fi
  mkdir "${aggregated_logs_directory}"
else
  if [ "${display_mode}" -eq "1" ]; then
    echo "Using aggregated log directory \"${aggregated_logs_directory}\""
  fi
fi

# loop over previous weeks
for n_weeks_ago in $(seq 1 ${number_of_weeks}); do
  # define pretty date from/to
  loop_pretty_date_from=$(date +%d/%b/%Y --date "last monday - ${n_weeks_ago} week + 0 day")
  loop_pretty_date_to=$(date +%d/%b/%Y --date "last monday - ${n_weeks_ago} week + 6 day")

  # define machine date from/to
  loop_txt_date_from=$(date +%Y%m%d --date "last monday - ${n_weeks_ago} week + 0 day")
  loop_txt_date_to=$(date +%Y%m%d --date "last monday - ${n_weeks_ago} week + 6 day")

  # define log filename
  aggregated_log_filename="${application}_${loop_txt_date_from}-${loop_txt_date_to}.log"

  # aggregate data
  if [ ! -f "${aggregated_logs_directory}/${aggregated_log_filename}" ]; then
    if [ "${display_mode}" -eq "1" ]; then
      echo "Creating ${aggregated_log_filename} log file to store data from ${loop_pretty_date_from} to ${loop_pretty_date_to}"
    fi
    for weekday in $(seq 0 6); do
      zgrep "$(date +${file_log_date_format} --date "last monday - ${n_weeks_ago} weeks + ${weekday} days")" ${log_filename} >> "${aggregated_logs_directory}/${aggregated_log_filename}"
    done
  else
    if [ "${display_mode}" -eq "1" ]; then
      echo "Using existing ${aggregated_log_filename} log file that contains data from ${loop_pretty_date_from} to ${loop_pretty_date_to}"
    fi
  fi

  # parse data
  if [ -f "${aggregated_logs_directory}/${aggregated_log_filename}" ]; then
    if [ "${display_mode}" -eq "1" ]; then
      echo "Parsing data from ${loop_pretty_date_from} to ${loop_pretty_date_to} (${n_weeks_ago} week/weeks ago)"
    fi

    # filter frontends
    frontends=$(awk '{if ($8 !~ ":" && $8 !~ "~"  && !seen_arr[$8]++) print $8}' ${aggregated_logs_directory}/${aggregated_log_filename} | grep "${limit_frontends}")

    # filter backends
    #backends=$(awk '{split($9,backend,"/");if ($8 !~ ":" && !seen_arr[backend[1]]++) print backend[1]}' ${aggregated_logs_directory}/${aggregated_log_filename} | grep "${limit_backends}")
    # highlight nosrv
    backends=$(awk '{split($9,backend,"/");if ($8 !~ ":" && !seen_arr[backend[1]]++) {if (backend[2] !~ "NOSRV" )  print backend[1]; else print "NOSRV";}}' ${aggregated_logs_directory}/${aggregated_log_filename} | grep "${limit_backends}" | sort)

    # parse each file type/element
    for frontend in ${frontends}; do
      if [ "${display_mode}" -eq "1" ]; then
        echo "${frontend} frontend"
      fi
      for backend in ${backends}; do
        if [ "${display_mode}" -eq "1" ]; then
          echo "->${backend}"
        fi
        for element in ${file_types}; do
          if [ "${backend}" = "NOSRV" ]; then
            count=$(grep "${frontend}\([~]\)\? ${frontend}/<NOSRV>"  ${aggregated_logs_directory}/${aggregated_log_filename} | grep -c "[a-zA-Z0-9]\+\.${element}")
          else
            # grep for frontend and frontend~ (ssl)
            count=$(grep "${frontend}\([~]\)\? ${backend}/"  ${aggregated_logs_directory}/${aggregated_log_filename} | grep -c "[a-zA-Z0-9]\+\.${element}")
          fi
          if [ "${display_mode}" -eq "2" ]; then
            echo "${loop_pretty_date_from} - ${loop_pretty_date_to} (${n_weeks_ago} week/weeks ago) ${frontend}->${backend}: ${element} file found ${count} times"
          elif [ "${display_mode}" -eq "1" ]; then
            if [ "${count}" -gt "0" ]; then
              echo "  ${element} file found ${count} times"
            fi
          fi
        done
      done
    done
    echo
  fi
done

Sample output.

Current date: Fri Mar 16 19:27:59 CET 2018

Creating aggregated log directory "/tmp/aggregated"
Creating haproxy_20180305-20180311.log log file to store data from 05/Mar/2018 to 11/Mar/2018
Parsing data from 05/Mar/2018 to 11/Mar/2018 (1 week/weeks ago)
web frontend
->NOSRV
  php file found 2030 times
  html file found 8272 times
  txt file found 2622 times
  png file found 1044 times
->web-blog-production
  php file found 1184 times
  html file found 206 times
  txt file found 2602 times
  png file found 160770 times
->web-statistics-production
  php file found 360992 times
  html file found 608 times
  txt file found 50 times
  png file found 836 times

Creating haproxy_20180226-20180304.log log file to store data from 26/Feb/2018 to 04/Mar/2018
Parsing data from 26/Feb/2018 to 04/Mar/2018 (2 week/weeks ago)
web frontend
->NOSRV
  php file found 1822 times
  html file found 9682 times
  txt file found 2722 times
  png file found 950 times
->web-blog-production
  php file found 1276 times
  html file found 216 times
  txt file found 2604 times
  png file found 159288 times
->web-statistics-production
  php file found 269462 times
  html file found 822 times
  txt file found 52 times
  png file found 1108 times

Creating haproxy_20180219-20180225.log log file to store data from 19/Feb/2018 to 25/Feb/2018
Parsing data from 19/Feb/2018 to 25/Feb/2018 (3 week/weeks ago)
web frontend
->NOSRV
  php file found 2028 times
  html file found 10712 times
  txt file found 2956 times
  png file found 796 times
->web-blog-production
  php file found 1376 times
  html file found 360 times
  txt file found 2816 times
  png file found 166808 times
->web-statistics-production
  php file found 352380 times
  html file found 1218 times
  txt file found 98 times
  png file found 1278 times
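The per-extension counting boils down to grep -c with a basic regular expression. A small self-contained sketch with made-up request lines standing in for aggregated log entries:

```shell
# Made-up request lines standing in for aggregated log entries.
sample_log='"GET /index.php HTTP/1.1"
"GET /wp-login.php HTTP/1.1"
"GET /robots.txt HTTP/1.1"'

# Count lines containing a filename with the given extension; note that
# a-z and A-Z must be spelled separately, because [a-Z] is not a valid
# range in most locales.
for element in php txt png; do
  count=$(printf '%s\n' "${sample_log}" | grep -c "[a-zA-Z0-9]\+\.${element}")
  echo "${element} file found ${count} times"
done
```

This prints 2 for php, 1 for txt, and 0 for png; grep -c conveniently prints 0 rather than nothing when no line matches.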

These shell scripts merely illustrate the idea of generating weekly reports from existing log files, so feel free to improve them further.

About Milosz Galazka

Milosz is a Linux Foundation Certified Engineer working as a system administrator for a successful Polish company, and a long-time supporter of the Free Software Foundation and the Debian operating system.