Implementation : History Log Terminal Analysis
Implementation
This study focuses on how the input data is processed and what finally comes out as output. The input data is unstructured, and the expected output is structured data. Given the system design above, which uses ETL, the implementation phase is divided into two main parts: backend and frontend. Before describing them, each technology in the stack used in this implementation is described.
Technology Stack
MySQL
The database used is MySQL. MySQL is an open-source relational database management system. Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter My, and “SQL”, the abbreviation for Structured Query Language. The database plays the role of Load in the ETL pipeline.
Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational environment for creating notebook documents. Jupyter Notebook is built using several open-source libraries, including IPython, ZeroMQ, Tornado, jQuery, Bootstrap, and MathJax. A Jupyter Notebook document is a browser-based REPL containing an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots, and rich media. Underneath the interface, a notebook is a JSON document, following a versioned schema, usually ending with the “.ipynb” extension. Jupyter Notebook plays the roles of Extract and Transform in the ETL pipeline.
Metabase
Metabase is an easy, open-source way to help everyone in a company work with data like an analyst. It is similar to a database client interface, but with extra intelligence that can visualize the dataset.
Codebase Explanation
The full code can be found in the GitHub repository: https://github.com/pramudityad/assignment-data-science-unstructured
ETL Backend
import pandas as pd
pd.options.display.max_columns = None
import re
import os
import time
from tqdm import tqdm
for dirname, _, filenames in os.walk('/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive/client_hostname.csv
/home/damar.pramuditya/School/Data Science/unstructure_assignment/input/archive/access.log
!ls -lsSh 'input/archive/access.log'
3,3G -rw-rw-r-- 1 damar.pramuditya damar.pramuditya 3,3G Feb 13 2021 input/archive/access.log
!head -n 4 'input/archive/access.log'
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
# Log Format
# This approach assumes the common log format and/or the combined one, which are two of the most commonly used. Eventually other formats can be incorporated. We start with the regular expression below, taken from:
# Regular Expressions Cookbook
# by Jan Goyvaerts, Steven Levithan
# Publisher: O'Reilly Media, Inc. Release Date: August 2012
combined_regex = r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
columns = ['client', 'userid', 'datetime', 'method', 'request', 'status', 'size', 'referer', 'user_agent']
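As a quick illustration (not part of the original notebook), the regular expression can be checked against one of the sample lines shown above; the captured groups line up with the column names:
import re

# Sanity check (a sketch, not in the original notebook): parse one of the
# sample lines printed by `head` above and pair the captured groups with
# the column names defined above.
sample = ('40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] '
          '"GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" '
          '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"')
parsed = re.findall(combined_regex, sample)[0]
print(dict(zip(columns, parsed)))
# Expected output (abridged): {'client': '40.77.167.129', 'userid': '-',
# 'datetime': '22/Jan/2019:03:56:17 +0330', 'method': 'GET', ...}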
# The Approach
# Loop through the lines of the input log file one by one. This ensures minimal memory consumption.
# For each line, check it against the regular expression, and process it:
# Match: append the matched line to a parsed_lines list
# No match: append the non-matching line to the errors_file
# Once parsed_lines reaches 250,000 elements, convert the list to a DataFrame and save it to a parquet file in the output_dir. Clear the list. This also ensures minimal memory usage, and the 250k can be tweaked if necessary.
# Read all the files of the output_dir with read_parquet into a pandas DataFrame. This function handles reading all the files and combines them.
# Optimize the columns by using more efficient data types, most notably the pandas categorical type.
# Write the DataFrame to a single file, for more convenient handling, and with the more efficient datatypes. This results in even faster reading.
# Delete the files in output_dir.
# Read in the final file with read_parquet.
# Start analyzing.
import time
import re
import pandas as pd
def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in tqdm(source_file):
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            df = pd.DataFrame(parsed_lines, columns=columns)
            df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
            parsed_lines.clear()
%%capture
!pip install advertools
# Times will vary from system to system
%time logs_to_df(logfile='input/archive/access.log', output_dir='output/', errors_file='errors.txt')
10365152it [01:09, 149847.68it/s]
CPU times: user 49.8 s, sys: 4.89 s, total: 54.7 s
Wall time: 1min 9s
# check errors
!wc errors.txt
589 14416 461229 errors.txt
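To see what kinds of lines did not match (a sketch, not part of the original notebook), the first few entries of errors.txt can be printed:
# Optional check (a sketch): inspect a few of the 589 lines that failed to parse.
with open('errors.txt') as f:
    for _, err_line in zip(range(3), f):
        print(err_line.rstrip())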
%time logs_df = pd.read_parquet('output/')
CPU times: user 7.65 s, sys: 1.96 s, total: 9.61 s
Wall time: 4.72 s
# Reading the whole directory takes about five seconds of wall time.
!du -sh output/
257M output/
# 257 ÷ 3,300 ≈ 0.08.
# The resulting files are about 8% of the size of the original log.
# Let's see how much memory it takes:
val = 257/3300
print(val)
0.07787878787878788
logs_df.info(show_counts=True, verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10364865 entries, 0 to 10364864
Data columns (total 9 columns):
 #   Column      Non-Null Count     Dtype
---  ------      --------------     -----
 0   client      10364865 non-null  object
 1   userid      10364865 non-null  object
 2   datetime    10364865 non-null  object
 3   method      10364865 non-null  object
 4   request     10364865 non-null  object
 5   status      10364865 non-null  object
 6   size        10364865 non-null  object
 7   referer     10364865 non-null  object
 8   user_agent  10364865 non-null  object
dtypes: object(9)
memory usage: 711.7+ MB
# 711 MB in memory. We now remove the files in the output dir and optimize the columns with more efficient data types.
# %rm -r output/
# optimize by changing the data types and dropping the userid field
logs_df['client'] = logs_df['client'].astype('category')
del logs_df['userid']
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')
logs_df['method'] = logs_df['method'].astype('category')
logs_df['status'] = logs_df['status'].astype('int16')
logs_df['size'] = logs_df['size'].astype('int32')
logs_df['referer'] = logs_df['referer'].astype('category')
logs_df['user_agent'] = logs_df['user_agent'].astype('category')
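As an illustration of why the categorical type helps (a sketch, not in the original notebook), the memory of a single column can be compared when stored as plain object strings versus as a category:
# Illustration (a sketch): compare the memory of one column as plain object
# strings versus the pandas categorical type it was just converted to.
ua_as_object = logs_df['user_agent'].astype('object').memory_usage(deep=True)
ua_as_category = logs_df['user_agent'].memory_usage(deep=True)
print(f'object: {ua_as_object / 1e6:,.1f} MB  category: {ua_as_category / 1e6:,.1f} MB')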
logs_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10364865 entries, 0 to 10364864
Data columns (total 8 columns):
 #   Column      Non-Null Count     Dtype
---  ------      --------------     -----
 0   client      10364865 non-null  category
 1   datetime    10364865 non-null  datetime64[ns, pytz.FixedOffset(210)]
 2   method      10364865 non-null  category
 3   request     10364865 non-null  object
 4   status      10364865 non-null  int16
 5   size        10364865 non-null  int32
 6   referer     10364865 non-null  category
 7   user_agent  10364865 non-null  category
dtypes: category(4), datetime64[ns, pytz.FixedOffset(210)](1), int16(1), int32(1), object(1)
memory usage: 342.3+ MB
# Memory usage was reduced further from 711 to 342 MB (342 ÷ 711 ≈ 0.48 of the previous size).
# We now save it to a single file, and read again.
%time logs_df.to_parquet('logs_df.parquet')
CPU times: user 3.7 s, sys: 891 ms, total: 4.59 s
Wall time: 4.33 s
# Reading the single optimized file is faster than reading the whole directory (3.4 s vs. 4.7 s wall time).
%time logs_df = pd.read_parquet('logs_df.parquet')
CPU times: user 3.26 s, sys: 1.5 s, total: 4.77 s
Wall time: 3.41 s
logs_df.shape
(10364865, 8)
logs_df
 | client | datetime | method | request | status | size | referer | user_agent |
---|---|---|---|---|---|---|---|---|
0 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /image/29314?name=%D8%AF%DB%8C%D8%A8%D8%A7-7.j... | 200 | 1105 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
1 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /static/images/zanbil-kharid.png | 200 | 358 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
2 | 85.9.73.119 | 2019-01-22 12:38:27+03:30 | GET | /static/images/next.png | 200 | 3045 | https://znbl.ir/static/bundle-bundle_site_head... | Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple... |
3 | 37.152.163.59 | 2019-01-22 12:38:27+03:30 | GET | /image/29314?name=%D8%AF%DB%8C%D8%A8%D8%A7-4.j... | 200 | 1457 | https://www.zanbil.ir/product/29314/%DA%A9%D8%... | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.... |
4 | 85.9.73.119 | 2019-01-22 12:38:27+03:30 | GET | /static/images/checked.png | 200 | 1083 | https://znbl.ir/static/bundle-bundle_site_head... | Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10364860 | 86.104.110.254 | 2019-01-26 16:01:31+03:30 | GET | /settings/logo | 200 | 4120 | https://www.zanbil.ir/m/browse/tv/%D8%AA%D9%84... | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like M... |
10364861 | 5.125.254.169 | 2019-01-26 16:01:31+03:30 | GET | /image/5/brand | 200 | 2171 | https://www.zanbil.ir/m/filter/p62%2Cstexists | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like... |
10364862 | 65.49.68.192 | 2019-01-26 16:01:31+03:30 | GET | /image/64646/productModel/150x150 | 200 | 5318 | https://www.zanbil.ir/browse/audio-and-video-e... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... |
10364863 | 5.125.254.169 | 2019-01-26 16:01:31+03:30 | GET | /image/1/brand | 200 | 3924 | https://www.zanbil.ir/m/filter/p62%2Cstexists | Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like... |
10364864 | 65.49.68.192 | 2019-01-26 16:01:31+03:30 | GET | /image/56698/productModel/150x150 | 200 | 3570 | https://www.zanbil.ir/browse/audio-and-video-e... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... |
10364865 rows × 8 columns
!pip install SQLAlchemy
Requirement already satisfied: SQLAlchemy in /home/damar.pramuditya/.pyenv/versions/anaconda3-2022.05/lib/python3.9/site-packages (1.4.32)
Requirement already satisfied: greenlet!=0.4.17 in /home/damar.pramuditya/.pyenv/versions/anaconda3-2022.05/lib/python3.9/site-packages (from SQLAlchemy) (1.1.1)
from sqlalchemy import create_engine
# Connect to the database
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
try:
    with engine.begin() as connection:
        logs_df.to_sql(name='logs_df', con=engine, if_exists='replace', index=False)
        print('Successfully written to Database!!!')
except Exception as e:
    print(e)
Successfully written to Database!!!
Setup MySQL
Docker was chosen to set up the database because it is a handy and effortless way to do so. Run this command in your terminal:
docker run -p 3307:3306 --hostname localhost --name data-science-asg -e MYSQL_ROOT_PASSWORD=root -d mysql
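The connection string used below targets a schema named logs, which a fresh MySQL container does not create on its own. A minimal sketch of one way to create it, assuming the same root credentials as above (this step is not shown in the original notebook):
from sqlalchemy import create_engine, text

# Assumed setup step (a sketch): connect to the MySQL server without selecting
# a schema, then create the `logs` database that the load step writes into.
server_engine = create_engine('mysql+pymysql://root:root@localhost:3307')
with server_engine.begin() as conn:
    conn.execute(text('CREATE DATABASE IF NOT EXISTS logs'))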
To load the data into MySQL we use the SQLAlchemy library. SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. It provides a full suite of well-known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.
from sqlalchemy import create_engine
# Connect to the database
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
try:
    with engine.begin() as connection:
        logs_df.to_sql(name='logs_df', con=engine, if_exists='replace', index=False)
        print('Successfully written to Database!!!')
except Exception as e:
    print(e)
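After the load step, the table can be checked directly from MySQL (a sketch, not part of the original notebook); the row count should match the shape of logs_df:
import pandas as pd
from sqlalchemy import create_engine

# Sanity check (a sketch): count the rows of the freshly written logs_df table.
engine = create_engine('mysql+pymysql://root:root@localhost:3307/logs')
print(pd.read_sql('SELECT COUNT(*) AS n FROM logs_df', con=engine))
# Expected: 10364865 if the full access.log was parsed and loaded.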
Visualize Data
As the final step in transforming the data from unstructured to structured form, this study provides a frontend tool to serve the data. We chose an open-source tool:
Setup Metabase
We use Docker for setting up this tool as well.
This quick start runs the open-source version of Metabase locally. Assuming you have Docker installed and running, get the latest Docker image:
docker pull metabase/metabase:latest
Then start the Metabase container:
docker run -d -p 12345:3000 --name metabase metabase/metabase
Connect to your data source: open Metabase at http://localhost:12345, complete the initial setup, and add the MySQL database created earlier (port 3307, database logs, user root). Note that from inside the Metabase container, localhost refers to the container itself, so the Docker host's address is typically needed for the MySQL host field.
Save the connection, and the dashboard should be ready to build.