Should sysadmins use Python instead of Linux tools?

Hi everybody. I regularly get messages from visitors of this blog who cannot read French fluently and who ask for articles in English. So I’ve decided to write this first post in English.

For many years, I’ve been using the whole set of Linux tools to take care of my daily tasks as a sysadmin. Sed, awk and cut, along with regular expressions, are extremely powerful tools to process raw data (log files, for instance) and present it in a suitable format.

But two years ago, as I was learning Python, I noticed that it is not only suited to developing complex programs but also perfectly suitable for my sysadmin work, as it can complete some automated tasks much faster than a pipeline of traditional Linux tools.

In this post, let’s compare the traditional way of dealing with log files with a Python approach. I hope you’ll be as convinced as I am.

Let’s take a quick example. I currently have an Apache log file of 1 million lines. The logs are stored in the combined format.

...
5.255.253.44 - - [23/Jul/2016:07:18:33 +0200] "GET /ul/4-air.png HTTP/1.1" 304 186 "-" "Mozilla...
107.23.92.82 - - [23/Jul/2016:07:18:57 +0200] "GET /feed/ HTTP/1.1" 200 7572 "-" "Mozilla...
5.255.253.44 - - [23/Jul/2016:07:20:38 +0200] "GET /ul/wrt54G-wireless-sec.jpg HTTP/1.1" 304 185 "-" "Mozilla...
52.30.101.176 - - [23/Jul/2016:07:20:53 +0200] "GET /robots.txt HTTP/1.1" 200 370 "-" "Mozilla...
146.185.251.139 - - [23/Jul/2016:07:21:49 +0200] "POST /xmlrpc.php HTTP/1.0" 404 8603 "-" "Mozilla...
5.196.4.212 - - [23/Jul/2016:07:22:02 +0200] "GET /feed/ HTTP/1.1" 304 177 "-" "Feed2Imap...
37.187.101.2 - - [23/Jul/2016:07:22:02 +0200] "GET /feed/ HTTP/1.1" 304 158 "-" "Tiny...
5.255.253.44 - - [23/Jul/2016:07:22:43 +0200] "GET /ul/5.jpg HTTP/1.1" 304 185 "-" "Mozilla...
5.255.253.44 - - [23/Jul/2016:07:24:48 +0200] "GET /ul/3-active-console.png HTTP/1.1" 304 186 "-" "Mozilla...
...
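
To make the field layout concrete, here is a small Python 3 sketch that splits one combined-format line into named fields with a regular expression. The pattern and the names `COMBINED_RE` and `parse_line` are only an illustration, not something Apache ships, and the user agent is left as the truncated placeholder from the excerpt above, with a closing quote added so the line is well-formed.

```python
import re

# Hypothetical pattern for the combined log format:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None


sample = (
    '107.23.92.82 - - [23/Jul/2016:07:18:57 +0200] '
    '"GET /feed/ HTTP/1.1" 200 7572 "-" "Mozilla..."'
)
fields = parse_line(sample)
print(fields["host"], fields["status"], fields["request"])
```

For the rest of this post only the first field, the client IP, matters, which is why a simple split on spaces is enough.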

Let’s say we want to count the number of times each IP address accessed a resource on the server, keeping just the 10 most frequent visitors. We would like to present the result in the following form, to create a CSV file for instance.

5.196.197.68;11820
104.236.104.28;7740
107.23.92.82;6940
...
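
As a side note, this semicolon-separated layout can be produced by the standard `csv` module. The snippet below is a minimal Python 3 sketch that reuses the three sample rows above; the variable names are arbitrary, and writing to an in-memory buffer instead of a file is just a convenience for the example.

```python
import csv
import io

# (ip, hits) pairs copied from the sample rows above
rows = [
    ("5.196.197.68", 11820),
    ("104.236.104.28", 7740),
    ("107.23.92.82", 6940),
]

# An in-memory buffer stands in for a real output file
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue(), end="")
```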

Bash approach

To do it in bash, we use several tools and chain them together with the pipe character.

cat access1M.log | cut -d" " -f1 | sort | uniq -c | sed -e "s/^ *//" | sort -rn | \
   awk -F" " '{print $2";"$1}'| head -n10

We end up with this result:

91.134.167.121;22750
146.185.251.139;16070
146.185.251.137;13240
5.196.197.68;11820
104.236.104.28;7740
107.23.92.82;6940
37.187.109.125;6250
5.196.4.212;5940
37.187.101.2;5940
176.159.66.16;5270

Processing those 1 million lines of log took 17.6 seconds on my MacBook Pro (3.1 GHz Core i7, 16 GB RAM, 1 TB SSD).

real 0m17.671s
user 0m18.353s
sys 0m0.342s

Python approach

Now, let’s process the same log file with Python. I wrote a little Python script that aims to do the same processing as above.

#!/usr/bin/python
# ----------------------------------------------------
# Name : 10MostFrequentVisitors.py
# ----------------------------------------------------
import re

visits = {}
# (pattern, replacement) pairs applied to every line before splitting
replacements = [(r"\[|\]", "")]

try:
    fh = open("annexe/access1M.log", "r")
except IOError:
    print("error opening file")
else:
    with fh:
        for line in fh:
            # Clean the line first: strip the brackets around the timestamp
            for src_str, dst_str in replacements:
                line = re.sub(src_str, dst_str, line)

            # The client IP is the first space-separated field
            ip = line.split(" ")[0]
            visits[ip] = visits.get(ip, 0) + 1

# IPs ordered by number of hits, most frequent first
visits_sorted = sorted(visits, key=visits.__getitem__, reverse=True)

# Slicing avoids an IndexError when fewer than 10 IPs were seen
for ip in visits_sorted[:10]:
    print("{0};{1}".format(ip, visits[ip]))

Of course, I obtain the same output here, but in just 6.24 seconds. Python was 2.83 times faster than the bash pipeline at processing the data.

real 0m6.244s
user 0m5.980s
sys 0m0.154s
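
For comparison, the counting and sorting steps can also be expressed with `collections.Counter` from the standard library. This is a variant sketch, not the script timed above, and it is not benchmarked here. The function name `top_visitors` is made up for the example, and the sample lines are shortened from the log excerpt earlier in the post.

```python
from collections import Counter


def top_visitors(lines, limit=10):
    """Count the first space-separated field of each line and return
    the `limit` most frequent values as (ip, hits) pairs."""
    counts = Counter(line.split(" ", 1)[0] for line in lines)
    return counts.most_common(limit)


# Shortened sample lines; with a real log, pass the open file object instead.
sample_lines = [
    '5.255.253.44 - - [23/Jul/2016:07:18:33 +0200] "GET /ul/4-air.png HTTP/1.1" 304 186\n',
    '107.23.92.82 - - [23/Jul/2016:07:18:57 +0200] "GET /feed/ HTTP/1.1" 200 7572\n',
    '5.255.253.44 - - [23/Jul/2016:07:22:43 +0200] "GET /ul/5.jpg HTTP/1.1" 304 185\n',
]

for ip, hits in top_visitors(sample_lines, limit=2):
    print("{0};{1}".format(ip, hits))
```

`most_common` returns the pairs already ordered by count, so the separate sort and the loop over a range of indexes are no longer needed.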

Of course, writing a script in Python usually takes longer than combining ten commands with pipe characters. But how many times do you perform the same tasks again and again? Isn’t it worth writing a script once and for all?
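
One way to make a script worth writing once is to take the log path and the number of rows as command-line arguments instead of hard-coding them. The sketch below assumes hypothetical argument names (`logfile`, `--top`) and hypothetical functions `build_parser` and `main`; none of them appear in the script above.

```python
import argparse


def build_parser():
    """Describe the command line: one log path and an optional row count."""
    parser = argparse.ArgumentParser(
        description="Print the most frequent client IPs of an access log."
    )
    parser.add_argument("logfile", help="path to the access log")
    parser.add_argument("--top", type=int, default=10,
                        help="number of IPs to print (default: 10)")
    return parser


def main(argv=None):
    """Count the first field of each line and print 'ip;hits' rows."""
    args = build_parser().parse_args(argv)

    visits = {}
    with open(args.logfile, "r") as fh:
        for line in fh:
            ip = line.split(" ", 1)[0]
            visits[ip] = visits.get(ip, 0) + 1

    # Most frequent IPs first, limited to the requested number of rows
    for ip in sorted(visits, key=visits.get, reverse=True)[:args.top]:
        print("{0};{1}".format(ip, visits[ip]))
    return 0
```

Calling `main()` at the bottom of the file, under the usual `if __name__ == "__main__":` guard, would turn this into a command that accepts any log file and any number of rows.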

I hope you are now convinced that writing a Python script can sometimes save a substantial amount of time when it comes to repetitive tasks performed on a frequent (or even regular) basis.
