Bank Marketing Data Set Classification

9 minute read

The aim of this project is to show how machine learning can help in a bank marketing campaign. Our classifier's goal is to predict, using logistic regression (and several other algorithms for comparison), whether a client is likely to subscribe to a fixed-term deposit.

Data Set Information: source https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone, and often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). This analysis uses bank.csv.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).

Attribute Information:

Input variables:

Bank client data:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6 - housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

Related to the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular', 'telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')

Social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y: has the client subscribed to a term deposit? (binary: 'yes', 'no')
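A practical consequence of the note on duration above: since it leaks the call outcome, a realistic model should drop it before training. A minimal sketch, assuming the dataframe is loaded into df as in the next section (keep duration only when reproducing benchmark numbers):

# drop the leaky 'duration' column for a realistic predictive model
df_realistic = df.drop(columns=['duration'])

Also note that bank.csv, used below, is the older 17-input version, so some columns differ from the list above (it has day and balance, and lacks the social and economic indicators).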

Load Dataset

import pandas as pd
import numpy as np

df = pd.read_csv('./bank/bank.csv', sep=';')

print('Data read into a pandas dataframe!')
Data read into a pandas dataframe!
df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
df.tail()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
4516 33 services married secondary no -333 yes no cellular 30 jul 329 5 -1 0 unknown no
4517 57 self-employed married tertiary yes -3313 yes yes unknown 9 may 153 1 -1 0 unknown no
4518 57 technician married secondary no 295 no no cellular 19 aug 151 11 -1 0 unknown no
4519 28 blue-collar married secondary no 1137 no no cellular 6 feb 129 4 211 3 other no
4520 44 entrepreneur single tertiary no 1136 yes yes cellular 3 apr 345 2 249 7 other no
df.shape
(4521, 17)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB
df.describe().T
count mean std min 25% 50% 75% max
age 4521.0 41.170095 10.576211 19.0 33.0 39.0 49.0 87.0
balance 4521.0 1422.657819 3009.638142 -3313.0 69.0 444.0 1480.0 71188.0
day 4521.0 15.915284 8.247667 1.0 9.0 16.0 21.0 31.0
duration 4521.0 263.961292 259.856633 4.0 104.0 185.0 329.0 3025.0
campaign 4521.0 2.793630 3.109807 1.0 1.0 2.0 3.0 50.0
pdays 4521.0 39.766645 100.121124 -1.0 -1.0 -1.0 -1.0 871.0
previous 4521.0 0.542579 1.693562 0.0 0.0 0.0 0.0 25.0
print("Train Data:")
print(df.isnull().sum(), "\n")
Train Data:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64 
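No column contains NaN values. In this dataset, however, missing information is encoded as the literal string 'unknown' in several categorical columns, so it does not show up in isnull(). A quick sketch (not part of the original notebook) to surface those:

# count 'unknown' entries per column; these act as de-facto missing values
unknown_counts = (df == 'unknown').sum()
print(unknown_counts[unknown_counts > 0])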

Encode the target column y as binary (yes → 1, no → 0)

df['y'] = df.y.replace({"yes": 1, "no": 0})
# collect the names of the categorical (object-dtype) columns for later plotting
cat_col = [n for n in df.columns if df[n].dtypes == 'object']
df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown 0
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure 0
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure 0
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown 0
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown 0

Analyse the dependent variable y to check for class imbalance

%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt
all_row = len(df)
no_sub = len(df[df['y'] == 0])
sub = len(df[df['y']==1])
percentage_no_sub = (no_sub/all_row) * 100
percentage_sub = (sub/all_row) * 100

print('% no subscription: ', percentage_no_sub)
print('% subscription: ', percentage_sub)
df['y'].value_counts().plot(kind='bar')
% no subscription:  88.47600088476001
% subscription:  11.523999115239992

[Figure: bar chart of y value counts ('no' vs 'yes')]

From the analysis above, about 88% of clients did not subscribe, so the dataset is heavily skewed toward the 'no' class.
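Since roughly 88% of the rows are 'no', a classifier that always predicts the majority class already reaches about 88% accuracy, which is the baseline every model below should beat. A one-line sketch of that baseline:

# accuracy of always predicting the majority class ('no')
baseline_accuracy = df['y'].value_counts(normalize=True).max()
print('majority-class baseline accuracy:', baseline_accuracy)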

Visualisation

# visualize each categorical column against y to see if any insights emerge
for col in cat_col:
    pd.crosstab(df[col], df.y).plot(kind = 'bar')
    plt.title(col)

[Figures: one bar chart per categorical feature (job, marital, education, default, housing, loan, contact, month, poutcome), each split by y]

df['pdays'].value_counts()
-1      3705
 182      23
 183      20
 363      12
 92       12
        ... 
 222       1
 210       1
 206       1
 162       1
 28        1
Name: pdays, Length: 292, dtype: int64
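The attribute notes above describe the newer 20-input files, where pdays = 999 marks clients never contacted before; in this older bank.csv the marker is -1, and it dominates the column (3,705 of 4,521 rows). One option, not taken in this notebook, is to split the column into an indicator plus the actual gap; a sketch on a copy so the shapes below stay unchanged:

# hypothetical feature split: a was-contacted flag alongside the raw gap
df_feat = df.copy()
df_feat['previously_contacted'] = (df_feat['pdays'] != -1).astype(int)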
df['contact'].value_counts()
cellular     2896
unknown      1324
telephone     301
Name: contact, dtype: int64
# encode the contact channel as integers: cellular=1, unknown=0, telephone=2
df['contact'] = df.contact.replace({"cellular": 1, "unknown": 0,"telephone":2})
df = pd.get_dummies(df,columns=['job','marital','education','default','housing','loan','month','poutcome'],drop_first = True)
df.head()
age balance contact day duration campaign pdays previous y job_blue-collar ... month_jul month_jun month_mar month_may month_nov month_oct month_sep poutcome_other poutcome_success poutcome_unknown
0 30 1787 1 19 79 1 -1 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
1 33 4789 1 11 220 1 339 4 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 35 1350 1 16 185 1 330 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 30 1476 0 3 199 4 -1 0 0 0 ... 0 1 0 0 0 0 0 0 0 1
4 59 0 0 5 226 1 -1 0 0 1 ... 0 0 0 1 0 0 0 0 0 1

5 rows × 42 columns

df.shape
(4521, 42)
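One caveat on the encoding above: mapping contact to 0/1/2 imposes an artificial ordering on a purely nominal feature. A sketch of the alternative, one-hot encoding contact together with the other categoricals (this would yield a different column count than the 42 shown above):

# hypothetical alternative: treat 'contact' as nominal and one-hot encode it too
df_alt = pd.read_csv('./bank/bank.csv', sep=';')
df_alt['y'] = df_alt.y.replace({"yes": 1, "no": 0})
df_alt = pd.get_dummies(df_alt,
                        columns=['contact', 'job', 'marital', 'education',
                                 'default', 'housing', 'loan', 'month', 'poutcome'],
                        drop_first=True)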
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns != 'y']
y = df['y']  # a 1-D Series avoids sklearn's column-vector DataConversionWarning
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2)
print('x train ',len(x_train))
print('x test ',len(x_cv))
print('y train ',len(y_train))
print('y test ',len(y_cv))
x train  3616
x test  905
y train  3616
y test  905
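Because the positive class is rare, a plain random split can leave the validation set with a noticeably different class balance, and without a fixed seed the exact numbers below may not reproduce. train_test_split supports both fixes; a sketch (not the split used for the results below):

# stratify preserves the yes/no ratio in both splits; random_state makes it reproducible
x_train, x_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)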

LOGISTIC REGRESSION

# Logistic Regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raise max_iter so lbfgs converges on these unscaled features

model.fit(x_train,y_train)
LogisticRegression(max_iter=1000)
pred_cv=model.predict(x_cv)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

print(accuracy_score(y_cv,pred_cv))
matrix=confusion_matrix(y_cv,pred_cv)
print(matrix)
0.9005524861878453
[[793   7]
 [ 83  22]]
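Raising max_iter is one way to get lbfgs to converge; the other is to standardize the features, since columns like balance and duration sit on very different scales. A minimal sketch using a pipeline, reusing the split and metric imports above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale to zero mean / unit variance, then fit; lbfgs usually converges
# on standardized inputs without extra iterations
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(x_train, y_train)
print(accuracy_score(y_cv, scaled_lr.predict(x_cv)))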

K-NEAREST NEIGHBOR (kNN) ALGORITHM

# K-Nearest Neighbor (kNN)

from sklearn.neighbors import KNeighborsClassifier
kNN=KNeighborsClassifier()
kNN.fit(x_train,y_train)

pred_cv5=kNN.predict(x_cv)

print(accuracy_score(y_cv,pred_cv5))
matrix5=confusion_matrix(y_cv,pred_cv5)
print(matrix5)
0.8784530386740331
[[777  23]
 [ 87  18]]
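KNeighborsClassifier defaults to k = 5 neighbours, and as a distance-based method it is also sensitive to feature scale. A sketch of tuning k with 5-fold cross-validation (hypothetical grid, not part of the original run):

from sklearn.model_selection import GridSearchCV

# try a small grid of neighbour counts and keep the best by CV accuracy
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                    cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)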

SUPPORT VECTOR MACHINE (SVM) ALGORITHM

from sklearn import svm
svm_model=svm.SVC()
svm_model.fit(x_train,y_train)
pred_cv3=svm_model.predict(x_cv)
print(accuracy_score(y_cv,pred_cv3))
matrix3=confusion_matrix(y_cv,pred_cv3)
print(matrix3)
0.8839779005524862
[[800   0]
 [105   0]]
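The confusion matrix shows the SVM never predicts the positive class: its 88.4% accuracy is exactly the share of 'no' cases in this split. Per-class metrics make that failure visible where accuracy hides it; a sketch:

from sklearn.metrics import classification_report

# precision/recall/F1 per class expose the all-'no' behaviour
print(classification_report(y_cv, pred_cv3, zero_division=0))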

DECISION TREE ALGORITHM

from sklearn import tree
dt=tree.DecisionTreeClassifier()
dt.fit(x_train,y_train)
pred_cv1=dt.predict(x_cv)
print(accuracy_score(y_cv,pred_cv1))
matrix1=confusion_matrix(y_cv,pred_cv1)
print(matrix1)
0.8895027624309392
[[750  50]
 [ 50  55]]
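An unconstrained decision tree grows until its leaves are pure and tends to overfit. Capping the depth is the usual remedy; a sketch with a hypothetical, untuned depth:

# limit tree depth to trade training fit for better generalisation
dt_pruned = tree.DecisionTreeClassifier(max_depth=5, random_state=42)
dt_pruned.fit(x_train, y_train)
print(accuracy_score(y_cv, dt_pruned.predict(x_cv)))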

RANDOM FOREST ALGORITHM

from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
pred_cv2=rf.predict(x_cv)
print(accuracy_score(y_cv,pred_cv2))
matrix2=confusion_matrix(y_cv,pred_cv2)
print(matrix2)
0.8983425414364641
[[791   9]
 [ 83  22]]
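A side benefit of the random forest is its feature_importances_ attribute, which ranks the inputs by how much they reduce impurity across the trees. A sketch listing the top ten:

# rank features by the fitted forest's impurity-based importances
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))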

NAIVE BAYES ALGORITHM


from sklearn.naive_bayes import GaussianNB 
nb=GaussianNB()
nb.fit(x_train,y_train)

pred_cv4=nb.predict(x_cv)

print(accuracy_score(y_cv,pred_cv4))
matrix4=confusion_matrix(y_cv,pred_cv4)
print(matrix4)
0.8640883977900552
[[725  75]
 [ 48  57]]
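GaussianNB assumes each feature is normally distributed within a class, a rough fit for the mostly 0/1 dummy columns produced above. BernoulliNB is designed for binary features; a sketch of trying it on the same split (an alternative, not the original choice):

from sklearn.naive_bayes import BernoulliNB

# BernoulliNB models binary features directly (non-binary columns are binarized)
bnb = BernoulliNB()
bnb.fit(x_train, y_train)
print(accuracy_score(y_cv, bnb.predict(x_cv)))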


print("Logistic Regression:", accuracy_score(y_cv,pred_cv))
print("Decision Tree:", accuracy_score(y_cv,pred_cv1))
print("Random Forest:", accuracy_score(y_cv,pred_cv2))
print("SVM:", accuracy_score(y_cv,pred_cv3))
print("Naive Bayes:", accuracy_score(y_cv,pred_cv4))
print("KNN:", accuracy_score(y_cv,pred_cv5))
Logistic Regression: 0.9005524861878453
Decision Tree: 0.8895027624309392
Random Forest: 0.8983425414364641
SVM: 0.8839779005524862
Naive Bayes: 0.8640883977900552
KNN: 0.8784530386740331
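Logistic regression comes out on top at 90.1%, but recall that the majority-class baseline is about 88.5%, so raw accuracy flatters every model here. Comparing positive-class F1 scores gives a fairer picture of how well each model actually finds subscribers; a closing sketch:

from sklearn.metrics import f1_score

# F1 on the positive class penalizes models that ignore subscribers
for name, preds in [("Logistic Regression", pred_cv),
                    ("Decision Tree", pred_cv1),
                    ("Random Forest", pred_cv2),
                    ("SVM", pred_cv3),
                    ("Naive Bayes", pred_cv4),
                    ("KNN", pred_cv5)]:
    print(name, "F1:", f1_score(y_cv, preds))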