Implementation : History Log Terminal Analysis
Implementation
This study focuses on how the input data is processed and what finally comes out as output. The input data is unstructured, and the expected output is structured data. Given the system design above, which uses ETL, the implementation phase is divided into two main parts: backend and frontend. Before describing them, each technology in the stack used in this implementation is described.
Technology Stack
MySQL
The database used is MySQL. MySQL is an open-source relational database management system. Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter My, and “SQL”, the abbreviation for Structured Query Language. The database plays the role of Load in the ETL pipeline.
Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational environment for creating notebook documents. Jupyter Notebook is built using several open-source libraries, including IPython, ZeroMQ, Tornado, jQuery, Bootstrap, and MathJax. A Jupyter Notebook document is a browser-based REPL containing an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots, and rich media. Underneath the interface, a notebook is a JSON document, following a versioned schema, usually ending with the “.ipynb” extension. Jupyter Notebook plays the roles of Extract and Transform in the ETL pipeline.
Metabase
Metabase is an easy, open-source way to help everyone in a company work with data like an analyst. It is similar to a database client interface, but with extra intelligence that can visualize the dataset.
Codebase Explanation
The full code can be found in the GitHub repository: https://github.com/pramudityad/assignment-data-science-unstructured
ETL Backend
import pandas as pd
pd.options.display.max_columns = None
import re
import os
import time
from tqdm import tqdm
for dirname, _, filenames in os.walk('/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive/client_hostname.csv
/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive/access.log
!ls -lsSh 'input/archive/access.log'
3,3G -rw-rw-r-- 1 damar.pramuditya damar.pramuditya 3,3G Feb 13 2021 input/archive/access.log
!head -n 4 'input/archive/access.log'
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
# Log Format
# This approach assumes the common log format and/or the combined one, which are two of the most commonly used. Eventually other formats can be incorporated. We start with the regular expression below, taken from:
# Regular Expressions Cookbook
# by Jan Goyvaerts, Steven Levithan
# Publisher: O'Reilly Media, Inc. Release Date: August 2012
combined_regex = r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
columns = ['client', 'userid', 'datetime', 'method', 'request', 'status', 'size', 'referer', 'user_agent']
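As a quick illustration (not part of the original notebook), the regular expression can be checked against one of the sample lines shown above; the captured groups line up with the column names:
import re

# Sanity check (a sketch, not in the original notebook): parse one of the
# sample lines printed by `head` above and pair the captured groups with
# the column names defined above.
sample = ('40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] '
          '"GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" '
          '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"')
parsed = re.findall(combined_regex, sample)[0]
print(dict(zip(columns, parsed)))
# Expected output (abridged): {'client': '40.77.167.129', 'userid': '-',
# 'datetime': '22/Jan/2019:03:56:17 +0330', 'method': 'GET', ...}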
# The Approach
# Loop through the lines of the input log file one by one. This ensures minimal memory consumption.
# For each line, check it against the regular expression, and process it:
# Match: append the matched line to a parsed_lines list
# No match: append the non-matching line to the errors_file
# Once parsed_lines reaches 250,000 elements, convert the list to a DataFrame and save it to a parquet file in the output_dir. Clear the list. This also ensures minimal memory usage, and the 250k can be tweaked if necessary.
# Read all the files of the output_dir with read_parquet into a pandas DataFrame. This function handles reading all the files and combines them.
# Optimize the columns by using more efficient data types, most notably the pandas categorical type.
# Write the DataFrame to a single file, for more convenient handling, and with the more efficient datatypes. This results in even faster reading.
# Delete the files in output_dir.
# Read in the final file with read_parquet.
# Start analyzing.
import time
import re
import pandas as pd
def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in tqdm(source_file):
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            df = pd.DataFrame(parsed_lines, columns=columns)
            df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
            parsed_lines.clear()
%%capture
!pip install advertools
# Times will vary from system to system
%time logs_to_df(logfile='input/archive/access.log', output_dir='output/', errors_file='errors.txt')
10365152it [01:09, 149847.68it/s]
CPU times: user 49.8 s, sys: 4.89 s, total: 54.7 s
Wall time: 1min 9s
# check errors
!wc errors.txt
589 14416 461229 errors.txt
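To see what kinds of lines did not match (a sketch, not part of the original notebook), the first few entries of errors.txt can be printed:
# Optional check (a sketch): inspect a few of the 589 lines that failed to parse.
with open('errors.txt') as f:
    for _, err_line in zip(range(3), f):
        print(err_line.rstrip())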
%time logs_df = pd.read_parquet('output/')
CPU times: user 7.65 s, sys: 1.96 s, total: 9.61 s
Wall time: 4.72 s
# Reading the whole directory takes about five seconds of wall time.
!du -sh output/
257M output/
# 257 ÷ 3,300 ≈ 0.08.
# The resulting files are about 8% of the size of the original log.
# Let's see how much memory it takes:
val = 257/3300
print(val)
0.07787878787878788
logs_df.info(show_counts=True, verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10364865 entries, 0 to 10364864
Data columns (total 9 columns):
 #   Column      Non-Null Count     Dtype
---  ------      --------------     -----
 0   client      10364865 non-null  object
 1   userid      10364865 non-null  object
 2   datetime    10364865 non-null  object
 3   method      10364865 non-null  object
 4   request     10364865 non-null  object
 5   status      10364865 non-null  object
 6   size        10364865 non-null  object
 7   referer     10364865 non-null  object
 8   user_agent  10364865 non-null  object
dtypes: object(9)
memory usage: 711.7+ MB
# 711 MB in memory. We now remove the files in the output dir and optimize the columns with more efficient data types.
# %rm -r output/
# optimize by changing the data types and dropping the userid field
logs_df['client'] = logs_df['client'].astype('category')
del logs_df['userid']
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')
logs_df['method'] = logs_df['method'].astype('category')
logs_df['status'] = logs_df['status'].astype('int16')
logs_df['size'] = logs_df['size'].astype('int32')
logs_df['referer'] = logs_df['referer'].astype('category')
logs_df['user_agent'] = logs_df['user_agent'].astype('category')
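As an illustration of why the categorical type helps (a sketch, not in the original notebook), the memory of a single column can be compared when stored as plain object strings versus as a category:
# Illustration (a sketch): compare the memory of one column as plain object
# strings versus the pandas categorical type it was just converted to.
ua_as_object = logs_df['user_agent'].astype('object').memory_usage(deep=True)
ua_as_category = logs_df['user_agent'].memory_usage(deep=True)
print(f'object: {ua_as_object / 1e6:,.1f} MB  category: {ua_as_category / 1e6:,.1f} MB')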
logs_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10364865 entries, 0 to 10364864
Data columns (total 8 columns):
 #   Column      Non-Null Count     Dtype
---  ------      --------------     -----
 0   client      10364865 non-null  category
 1   datetime    10364865 non-null  datetime64[ns, pytz.FixedOffset(210)]
 2   method      10364865 non-null  category
 3   request     10364865 non-null  object
 4   status      10364865 non-null  int16
 5   size        10364865 non-null  int32
 6   referer     10364865 non-null  category
 7   user_agent  10364865 non-null  category
dtypes: category(4), datetime64[ns, pytz.FixedOffset(210)](1), int16(1), int32(1), object(1)
memory usage: 342.3+ MB
# Memory usage was reduced further from 711 to 342 MB (342 ÷ 711 ≈ 0.48 of the previous size).
# We now save it to a single file, and read again.
%time logs_df.to_parquet('logs_df.parquet')
CPU times: user 3.7 s, sys: 891 ms, total: 4.59 s
Wall time: 4.33 s
# Reading the single optimized file is faster than reading the whole directory (3.4 s vs. 4.7 s wall time).
%time logs_df = pd.read_parquet('logs_df.parquet')
CPU times: user 3.26 s, sys: 1.5 s, total: 4.77 s
Wall time: 3.41 s
logs_df.shape
(10364865, 8)
logs_df
 | client | datetime | method | request | status | size | referer | user_agent |
---|---|---|---|---|---|---|---|---|
0 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /image/29314?name=%D8%AF%DB%8C%D8%A8%D8%A7-7.j... | 200 | 1105 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
1 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /static/images/zanbil-kharid.png | 200 | 358 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
2 | 85.9.73.119 | 2019-01-22 12:38:27+03:30 | GET | /static/images/next.png | 200 | 3045 | https://znbl.ir/static/bundle-bundle_site_head... | Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple... |
3 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /image/29314?name=%D8%AF%DB%8C%D8%A8%D8%A7-4.j... | 200 | 1457 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
4 | 85.9.73.119 | 2019-01-22 12:38:27+03:30 | GET | /static/images/checked.png | 200 | 1083 | https://znbl.ir/static/bundle-bundle_site_head... | Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10364860 | 86.104.110.254 | 2019-01-26 16:01:31+03:30 | GET | /settings/logo | 200 | 4120 | https://www.zanbil.ir/m/browse/tv/%D8%AA%D9%84... | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like M... |
10364861 | 5.125.254.169 | 2019-01-26 16:01:31+03:30 | GET | /image/5/brand | 200 | 2171 | https://www.zanbil.ir/m/filter/p62%2Cstexists | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like... |
10364862 | 65.49.68.192 | 2019-01-26 16:01:31+03:30 | GET | /image/64646/productModel/150x150 | 200 | 5318 | https://www.zanbil.ir/browse/audio-and-video-e... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... |
10364863 | 5.125.254.169 | 2019-01-26 16:01:31+03:30 | GET | /image/1/brand | 200 | 3924 | https://www.zanbil.ir/m/filter/p62%2Cstexists | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like... |
10364864 | 65.49.68.192 | 2019-01-26 16:01:31+03:30 | GET | /image/56698/productModel/150x150 | 200 | 3570 | https://www.zanbil.ir/browse/audio-and-video-e... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... |
10364865 rows × 8 columns
!pip install SQLAlchemy
Requirement already satisfied: SQLAlchemy in /home/damar.pramuditya/.pyenv/versions/anaconda3-2022.05/lib/python3.9/site-packages (1.4.32)
Requirement already satisfied: greenlet!=0.4.17 in /home/damar.pramuditya/.pyenv/versions/anaconda3-2022.05/lib/python3.9/site-packages (from SQLAlchemy) (1.1.1)
from sqlalchemy import create_engine
# Connect to the database
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
try:
    with engine.begin() as connection:
        logs_df.to_sql(name='logs_df', con=engine, if_exists='replace', index=False)
        print('Successfully written to Database!!!')
except Exception as e:
    print(e)
Successfully written to Database!!!
Setup MySQL
Docker was chosen to set up the database because it is a handy and effortless way to do so. Run this command in your terminal:
docker run -p 3307:3306 --hostname localhost --name data-science-asg -e MYSQL_ROOT_PASSWORD=root -d mysql
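The connection string used below targets a schema named logs, which a fresh MySQL container does not create on its own. A minimal sketch of one way to create it, assuming the same root credentials as above (this step is not shown in the original notebook):
from sqlalchemy import create_engine, text

# Assumed setup step (a sketch): connect to the MySQL server without selecting
# a schema, then create the `logs` database that the load step writes into.
server_engine = create_engine('mysql+pymysql://root:root@localhost:3307')
with server_engine.begin() as conn:
    conn.execute(text('CREATE DATABASE IF NOT EXISTS logs'))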
To load the data into MySQL we use the SQLAlchemy library. SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. It provides a full suite of well-known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.
from sqlalchemy import create_engine
# Connect to the database
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
try:
    with engine.begin() as connection:
        logs_df.to_sql(name='logs_df', con=engine, if_exists='replace', index=False)
        print('Successfully written to Database!!!')
except Exception as e:
    print(e)
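After the load step, the table can be checked directly from MySQL (a sketch, not part of the original notebook); the row count should match the shape of logs_df:
import pandas as pd
from sqlalchemy import create_engine

# Sanity check (a sketch): count the rows of the freshly written logs_df table.
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
print(pd.read_sql('SELECT COUNT(*) AS n FROM logs_df', con=engine))
# Expected: 10364865 if the full access.log was parsed and loaded.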
Visualize Data
As the final step in transforming the data from unstructured to structured form, this study provides a frontend tool to serve the data. We chose an open-source tool:
Setup Metabase
We use Docker for setting up this tool as well.
This quick start runs the open-source version of Metabase locally. Assuming you have Docker installed and running, get the latest Docker image:
docker pull metabase/metabase:latest
Then start the Metabase container:
docker run -d -p 12345:3000 --name metabase metabase/metabase
Connect to your data source: open Metabase at http://localhost:12345, complete the initial setup, and add the MySQL database created earlier (port 3307, database logs, user root). Note that from inside the Metabase container, localhost refers to the container itself, so the Docker host's address is typically needed for the MySQL host field.
Save the connection, and the dashboard should be ready to build.