This is my solution to Homework 2 of 95851 Data Science for Product Managers. It is a good opportunity to apply a simple classification model to predict bank failure.
The dataset provides a time-series snapshot of the status of 396 banks, with indicators of financial health. We care about two questions: which factors lead to failure, and can we predict failure in the future?
Import Dataset
# import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
bank = pd.read_excel('Bank failure data.xlsx')
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4060 entries, 0 to 4059
Data columns (total 14 columns):
Bank Name 4060 non-null object
Quarter 4060 non-null object
Tier One 4060 non-null float64
Texas 3997 non-null float64
Size 4060 non-null float64
Brokered Deposits 4040 non-null float64
Net Chargeoffs 4054 non-null float64
Constr and Land Dev Loans 4060 non-null float64
Change in Portfolio Mix 4060 non-null float64
NP CRE to Assets 4060 non-null float64
Volatile Liabilities to Assets 4060 non-null float64
Securities 4060 non-null float64
Failed during 2010Q2 4060 non-null object
Cert Number 4060 non-null int64
dtypes: float64(10), int64(1), object(3)
memory usage: 444.1+ KB
bank.head()
| | Bank Name | Quarter | Tier One | Texas | Size | Brokered Deposits | Net Chargeoffs | Constr and Land Dev Loans | Change in Portfolio Mix | NP CRE to Assets | Volatile Liabilities to Assets | Securities | Failed during 2010Q2 | Cert Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Exchange Bank | 2007Q4 | 14.90 | 19.36 | 32.852108 | 0.0 | 0.03 | 23.13 | 3.38 | 0.190681 | 20.16 | 99.07 | No | 160 |
| 1 | Exchange Bank | 2008Q1 | 14.30 | 20.86 | 33.542390 | 0.0 | 0.00 | 32.96 | 4.96 | 0.000000 | 21.23 | 99.45 | No | 160 |
| 2 | Exchange Bank | 2008Q2 | 14.15 | 20.89 | 34.140007 | 0.0 | 0.31 | 33.71 | 1.53 | 0.022408 | 19.69 | 97.94 | No | 160 |
| 3 | Exchange Bank | 2008Q3 | 14.13 | 18.74 | 34.038758 | 0.0 | -0.02 | 34.99 | 3.80 | 0.147452 | 19.83 | 98.84 | No | 160 |
| 4 | Exchange Bank | 2008Q4 | 14.21 | 21.82 | 34.059328 | 0.0 | 1.21 | 37.14 | 3.86 | 0.057306 | 15.29 | 99.84 | No | 160 |
len(bank['Bank Name'].unique())
396
What are the top two predictors of bank failure?
First, assume that whether a bank will collapse depends on its financial status in 2010Q1, so all rows from earlier quarters are removed. After also dropping incomplete rows, information on 388 banks remains, which is then split into training and test sets. I choose a decision tree classifier; to find good hyperparameters, 10-fold cross-validation is applied on the training set. The classifier is then fit on the entire training set and evaluated on the test set.
The top two predictors of bank failure are Tier One and Change in Portfolio Mix.
Q1 = bank[bank['Quarter'] == '2010Q1'].copy()  # .copy() avoids SettingWithCopyWarning on the in-place calls below
Q1.drop('Cert Number', axis=1, inplace=True)
Q1.dropna(inplace=True)
This leaves usable records for 388 banks.
Q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 388 entries, 9 to 4049
Data columns (total 13 columns):
Bank Name 388 non-null object
Quarter 388 non-null object
Tier One 388 non-null float64
Texas 388 non-null float64
Size 388 non-null float64
Brokered Deposits 388 non-null float64
Net Chargeoffs 388 non-null float64
Constr and Land Dev Loans 388 non-null float64
Change in Portfolio Mix 388 non-null float64
NP CRE to Assets 388 non-null float64
Volatile Liabilities to Assets 388 non-null float64
Securities 388 non-null float64
Failed during 2010Q2 388 non-null object
dtypes: float64(10), object(3)
memory usage: 42.4+ KB
The dataset is somewhat imbalanced: only 33 of the 388 banks failed.
Q1["Failed during 2010Q2"].value_counts()
No 355
Yes 33
Name: Failed during 2010Q2, dtype: int64
Q1 = Q1.reset_index(drop=True)
Train-test split
Here, given the small number of positive ("Yes") cases, we split the data in a stratified way.
First, split the data into training and test sets. We run a decision tree classifier, tune its hyperparameters with 10-fold CV on the training set, and evaluate it on the test set.
from sklearn.model_selection import train_test_split
X = Q1.drop(['Bank Name', 'Quarter', 'Failed during 2010Q2'], axis=1)
y = Q1['Failed during 2010Q2']
# Oversampling was considered but not used (see the sketch below):
# from imblearn.over_sampling import RandomOverSampler
# ros = RandomOverSampler(random_state=0)
# X_resampled, y_resampled = ros.fit_sample(X, y)
# stratify=y keeps the No/Yes class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25, random_state=25)
pd.Series(y_train).value_counts()
No 266
Yes 25
Name: Failed during 2010Q2, dtype: int64
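The stratified split preserves the roughly 10:1 class ratio in both sets. The commented-out RandomOverSampler above hints at an alternative: if the imbalance were more severe, the minority class could be oversampled, on the training set only, so that the test set keeps its natural distribution. A minimal sketch, assuming the imbalanced-learn package is installed:
# Sketch only (not used below): oversample the minority class in the
# training split so both classes are equally represented.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=25)
# fit_resample is the current imblearn API (older releases used fit_sample)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())  # should show 266 of each class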
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)  # drop=True keeps y_train a Series
Tune hyperparameters for decision tree classifiers
Because the dataset is imbalanced, we use stratified folds.
from sklearn.model_selection import StratifiedKFold
# shuffle=True is required for random_state to take effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=25)
# 10-fold CV accuracy for the chosen configuration (max_depth=2,
# max_features=3 performed best among the settings tried)
accurate = []
for train_index, dev_index in skf.split(X_train, y_train):
    X_Train, X_dev = X_train.iloc[train_index], X_train.iloc[dev_index]
    y_Train, y_dev = y_train.iloc[train_index], y_train.iloc[dev_index]
    dt = DecisionTreeClassifier(criterion="entropy", max_features=3,
                                max_depth=2, random_state=25)
    dt.fit(X_Train, y_Train)
    pred = dt.predict(X_dev)
    accurate.append(accuracy_score(y_dev, pred))
print(np.mean(accurate))
0.986075533662
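The loop above reports CV accuracy for one configuration. To search over several settings at once, sklearn's GridSearchCV wraps the same stratified CV; a minimal sketch of how the tuning could be run (the grid values are illustrative):
# Sketch: search max_depth and max_features jointly with the same
# stratified 10-fold CV instead of fixing them by hand.
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': list(range(1, 6)), 'max_features': list(range(1, 6))}
grid = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=25),
                    param_grid, cv=skf, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)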
Train the model on the training data and predict on the test data.
Accuracy alone is not adequate for imbalanced data, so we also print the classification report and confusion matrix to see how the model performs on each class. The evaluation looks good: only one surviving bank was mislabelled as "Yes", overall accuracy is high, and precision on the minority class is acceptable.
dt=DecisionTreeClassifier(criterion="entropy",max_features=3,max_depth=2,random_state=25)
dt.fit(X_train,y_train)
print(classification_report(y_test,dt.predict(X_test)))
precision recall f1-score support
No 1.00 0.99 0.99 89
Yes 0.89 1.00 0.94 8
avg / total 0.99 0.99 0.99 97
accuracy_score(y_test,dt.predict(X_test))
0.98969072164948457
confusion_matrix(y_test,dt.predict(X_test))
array([[88, 1],
[ 0, 8]], dtype=int64)
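Since accuracy can be misleading on imbalanced data, a threshold-independent metric such as ROC AUC is a useful extra check; a minimal sketch:
# Sketch: ROC AUC is computed from predicted probabilities, so it does
# not depend on the default 0.5 decision threshold.
from sklearn.metrics import roc_auc_score
y_test_bin = (y_test == 'Yes').astype(int)   # positive class = failure
y_score = dt.predict_proba(X_test)[:, 1]     # dt.classes_ is ['No', 'Yes'], so column 1 is P(Yes)
print(roc_auc_score(y_test_bin, y_score))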
Find feature importance in decision tree model
dt.feature_importances_
array([ 0.98154267, 0. , 0. , 0. , 0. ,
0. , 0.01845733, 0. , 0. , 0. ])
X.columns[dt.feature_importances_>0]
Index(['Tier One', 'Change in Portfolio Mix'], dtype='object')
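For readability, the importances can be paired with the column names and sorted; a small sketch:
# Sketch: label each importance with its feature name and rank them.
importances = pd.Series(dt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())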
Which banks are most likely to fail in the near future (and why)?
A: Apply the model to the banks that survived 2010Q2. The model predicts whether a bank fails in 2010Q2; even though these banks did not fail then, a "Yes" prediction signals a high risk of failure. Assuming their financial status does not change much over time, failure is likely coming soon. Four banks are flagged: State Central Bank, Gulf State Community Bank, ShoreBank, and Darby Bank & Trust Company.
potential1 = Q1[Q1['Failed during 2010Q2']=="No"].reset_index().drop('index',axis=1)
potential = potential1.drop(['Bank Name','Quarter','Failed during 2010Q2'],axis=1)
potential.head()
| | Tier One | Texas | Size | Brokered Deposits | Net Chargeoffs | Constr and Land Dev Loans | Change in Portfolio Mix | NP CRE to Assets | Volatile Liabilities to Assets | Securities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.25 | 54.35 | 34.285794 | 0.00 | 0.12 | 34.49 | 1.53 | 1.770006 | 21.31 | 102.34 |
| 1 | 16.64 | 21.32 | 48.378092 | 6.98 | 0.61 | 20.27 | 2.66 | 4.507692 | 43.05 | 100.94 |
| 2 | 14.01 | 7.34 | 13.818530 | 0.00 | -0.06 | 2.38 | 3.22 | 0.646950 | 26.74 | 103.17 |
| 3 | 16.00 | 7.04 | 48.322100 | 0.00 | 0.00 | 2.37 | 3.93 | 0.247313 | 8.21 | 100.76 |
| 4 | 12.51 | 39.16 | 73.719818 | 19.35 | 1.12 | 22.74 | 6.14 | 1.664389 | 11.14 | 101.76 |
Find the probability of failure for each bank
Find the banks whose predicted probability of failure is high (above 0.8).
# Column order follows dt.classes_, which is alphabetical: ['No', 'Yes']
(pd.DataFrame(dt.predict_proba(potential),columns=["No","Yes"])["Yes"]>0.8).sort_values(ascending=False).head()
63 True
236 True
167 True
153 True
354 False
Name: Yes, dtype: bool
potential1["Bank Name"][[63,236,167,153]]
63 State Central Bank
236 Gulf State Community Bank
167 ShoreBank
153 Darby Bank & Trust Company
Name: Bank Name, dtype: object
State Central Bank, Gulf State Community Bank (closed in Nov. 2010), ShoreBank (closed in Aug. 2010), and Darby Bank & Trust Company (closed in Nov. 2010).
The predictions hold up well: three of the four flagged banks actually closed later in 2010. Adjusting the probability threshold downward would flag more at-risk banks, as sketched below.
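A minimal sketch of that threshold adjustment, ranking the surviving banks by predicted failure probability (the 0.5 cutoff here is illustrative):
# Sketch: attach P(failure) to each surviving bank and rank; lowering
# the cutoff from 0.8 flags additional at-risk banks.
p_fail = dt.predict_proba(potential)[:, 1]   # column 1 = 'Yes'
ranked = potential1.assign(p_fail=p_fail).sort_values('p_fail', ascending=False)
print(ranked.loc[ranked['p_fail'] > 0.5, ['Bank Name', 'p_fail']])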