This is my solution to Homework 2 of 95851 Data Science for Product Managers. I believe it is a good chance to apply a simple classification model to predict bank failure.

The dataset provides a quarterly time-series snapshot of the status of 396 banks, with indicators of financial health. We care about which factors lead to failure, and whether we can predict failures in the future.

Import Dataset

# import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# silence library warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

bank = pd.read_excel('Bank failure data.xlsx')

bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4060 entries, 0 to 4059
Data columns (total 14 columns):
Bank Name                         4060 non-null object
Quarter                           4060 non-null object
Tier One                          4060 non-null float64
Texas                             3997 non-null float64
Size                              4060 non-null float64
Brokered Deposits                 4040 non-null float64
Net Chargeoffs                    4054 non-null float64
Constr and Land Dev Loans         4060 non-null float64
Change in Portfolio Mix           4060 non-null float64
NP CRE to Assets                  4060 non-null float64
Volatile Liabilities to Assets    4060 non-null float64
Securities                        4060 non-null float64
Failed during 2010Q2              4060 non-null object
Cert Number                       4060 non-null int64
dtypes: float64(10), int64(1), object(3)
memory usage: 444.1+ KB
bank.head()
Bank Name Quarter Tier One Texas Size Brokered Deposits Net Chargeoffs Constr and Land Dev Loans Change in Portfolio Mix NP CRE to Assets Volatile Liabilities to Assets Securities Failed during 2010Q2 Cert Number
0 Exchange Bank 2007Q4 14.90 19.36 32.852108 0.0 0.03 23.13 3.38 0.190681 20.16 99.07 No 160
1 Exchange Bank 2008Q1 14.30 20.86 33.542390 0.0 0.00 32.96 4.96 0.000000 21.23 99.45 No 160
2 Exchange Bank 2008Q2 14.15 20.89 34.140007 0.0 0.31 33.71 1.53 0.022408 19.69 97.94 No 160
3 Exchange Bank 2008Q3 14.13 18.74 34.038758 0.0 -0.02 34.99 3.80 0.147452 19.83 98.84 No 160
4 Exchange Bank 2008Q4 14.21 21.82 34.059328 0.0 1.21 37.14 3.86 0.057306 15.29 99.84 No 160
len(bank['Bank Name'].unique())
396

What are the top two predictors of bank failure?

First, assume that whether a bank will collapse depends on its financial status in 2010Q1, so we remove all rows from earlier quarters. Records for 388 banks remain, and the dataset is split into training and test sets. I choose a decision tree classifier; to find good hyperparameters, 10-fold cross-validation is applied on the training set. The classifier is then refit on the entire training set and evaluated on the test set.

The top two predictors of bank failure are Tier One and Change in Portfolio Mix.

# keep only the 2010Q1 snapshot; .copy() avoids a SettingWithCopyWarning
Q1 = bank[bank['Quarter'] == '2010Q1'].copy()
Q1.drop('Cert Number', axis=1, inplace=True)
Q1.dropna(inplace=True)

This leaves 388 banks with complete records.

Q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 388 entries, 9 to 4049
Data columns (total 13 columns):
Bank Name                         388 non-null object
Quarter                           388 non-null object
Tier One                          388 non-null float64
Texas                             388 non-null float64
Size                              388 non-null float64
Brokered Deposits                 388 non-null float64
Net Chargeoffs                    388 non-null float64
Constr and Land Dev Loans         388 non-null float64
Change in Portfolio Mix           388 non-null float64
NP CRE to Assets                  388 non-null float64
Volatile Liabilities to Assets    388 non-null float64
Securities                        388 non-null float64
Failed during 2010Q2              388 non-null object
dtypes: float64(10), object(3)
memory usage: 42.4+ KB

The dataset is somewhat imbalanced.

Q1["Failed during 2010Q2"].value_counts()
No     355
Yes     33
Name: Failed during 2010Q2, dtype: int64
Q1 = Q1.reset_index(drop=True)

Train-test split

Here, given the small number of positive (“Yes”) cases, we split the data in a stratified way.

First, split the data into training and test sets. We run a decision tree classifier, tune hyperparameters with 10-fold CV on the training set, and evaluate on the test set.

from sklearn.model_selection import train_test_split

X = Q1.drop(['Bank Name', 'Quarter', 'Failed during 2010Q2'], axis=1)
y = Q1['Failed during 2010Q2']

# stratify on y so the Yes/No ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=25)


pd.Series(y_train).value_counts()
No     266
Yes     25
Name: Failed during 2010Q2, dtype: int64
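The training split keeps roughly the same 10:1 imbalance. One remedy I experimented with (but did not keep) is random oversampling of the minority class, applied to the training split only so the test set stays untouched. A minimal sketch, assuming the imbalanced-learn package is installed; fit_resample is the current API (older versions used fit_sample):

from imblearn.over_sampling import RandomOverSampler

# oversample only the training split so the test set stays untouched
ros = RandomOverSampler(random_state=0)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
pd.Series(y_train_res).value_counts()  # classes are now balanced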
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# reset indices so positional CV splits line up; drop=True keeps y_train a Series
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

Tune hyperparameters for the decision tree classifier

Considering the imbalanced dataset, the cross-validation folds are also stratified.

from sklearn.model_selection import StratifiedKFold

# 10-fold stratified CV; with shuffle off, random_state has no effect,
# so it is omitted (recent scikit-learn versions reject it otherwise)
skf = StratifiedKFold(n_splits=10)
accurate = []
for train_index, dev_index in skf.split(X_train, y_train):
    X_Train, X_dev = X_train.iloc[train_index], X_train.iloc[dev_index]
    y_Train, y_dev = y_train.iloc[train_index], y_train.iloc[dev_index]
    # rerun this cell with other max_depth / max_features values to tune;
    # max_depth=2 with max_features=3 was the final choice
    dt = DecisionTreeClassifier(criterion="entropy", max_features=3,
                                max_depth=2, random_state=25)
    dt.fit(X_Train, y_Train)
    accurate.append(accuracy_score(y_dev, dt.predict(X_dev)))
print(np.mean(accurate))
0.986075533662
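For reference, the same stratified 10-fold sweep can be written more compactly with GridSearchCV, which defaults to stratified folds for classifiers. This is a sketch of an equivalent route, not the code that produced the score above:

from sklearn.model_selection import GridSearchCV

# sweep max_depth over 1..5; cv=10 gives stratified 10-fold CV for classifiers
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", max_features=3, random_state=25),
    param_grid={"max_depth": range(1, 6)},
    scoring="accuracy", cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)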

Train the model on the training data and predict on the test data.

Accuracy alone is not adequate for imbalanced data, so we also print the classification report and confusion matrix to see how the model performs on each class. The evaluation looks good: only one surviving bank was mislabeled as “Yes”, recall on the minority class is perfect, and its precision (0.89) is acceptable.

dt = DecisionTreeClassifier(criterion="entropy", max_features=3, max_depth=2, random_state=25)
dt.fit(X_train, y_train)
print(classification_report(y_test, dt.predict(X_test)))
             precision    recall  f1-score   support

         No       1.00      0.99      0.99        89
        Yes       0.89      1.00      0.94         8

avg / total       0.99      0.99      0.99        97
accuracy_score(y_test,dt.predict(X_test))
0.98969072164948457
confusion_matrix(y_test,dt.predict(X_test))
array([[88,  1],
       [ 0,  8]], dtype=int64)
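As a further check (not part of the original assignment), two imbalance-aware scores can complement plain accuracy; balanced_accuracy_score requires scikit-learn 0.20 or later:

from sklearn.metrics import balanced_accuracy_score, roc_auc_score

print(balanced_accuracy_score(y_test, dt.predict(X_test)))
# ROC AUC needs the positive-class probability; column order follows dt.classes_
proba_yes = dt.predict_proba(X_test)[:, list(dt.classes_).index("Yes")]
print(roc_auc_score((y_test == "Yes").astype(int), proba_yes))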

Find feature importance in the decision tree model

dt.feature_importances_
array([ 0.98154267,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.01845733,  0.        ,  0.        ,  0.        ])
X.columns[dt.feature_importances_>0]
Index(['Tier One', 'Change in Portfolio Mix'], dtype='object')
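The same ranking can be read off directly by pairing each importance with its column name:

# pair importances with feature names and rank them
importances = pd.Series(dt.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head()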

Which banks are most likely to fail in the near future (and why)?

A: Apply the model to the banks that survived 2010Q2. The model predicts whether a bank fails in 2010Q2, so if it flags a bank that did not actually fail then, that bank still carries a high risk of failure. Assuming its financial status does not change much over time, its failure is likely coming soon. Four banks are flagged: State Central Bank, Gulf State Community Bank, ShoreBank, and Darby Bank & Trust Company.

potential1 = Q1[Q1['Failed during 2010Q2'] == "No"].reset_index(drop=True)
potential = potential1.drop(['Bank Name', 'Quarter', 'Failed during 2010Q2'], axis=1)
potential.head()
Tier One Texas Size Brokered Deposits Net Chargeoffs Constr and Land Dev Loans Change in Portfolio Mix NP CRE to Assets Volatile Liabilities to Assets Securities
0 13.25 54.35 34.285794 0.00 0.12 34.49 1.53 1.770006 21.31 102.34
1 16.64 21.32 48.378092 6.98 0.61 20.27 2.66 4.507692 43.05 100.94
2 14.01 7.34 13.818530 0.00 -0.06 2.38 3.22 0.646950 26.74 103.17
3 16.00 7.04 48.322100 0.00 0.00 2.37 3.93 0.247313 8.21 100.76
4 12.51 39.16 73.719818 19.35 1.12 22.74 6.14 1.664389 11.14 101.76

Find the probability of failure for each bank

Find the banks whose predicted probability of failure exceeds 0.8.

# predict_proba columns follow dt.classes_, i.e. ["No", "Yes"]
(pd.DataFrame(dt.predict_proba(potential), columns=["No", "Yes"])["Yes"] > 0.8).sort_values(ascending=False).head()
63      True
236     True
167     True
153     True
354    False
Name: Yes, dtype: bool
potential1["Bank Name"][[63,236,167,153]]
63             State Central Bank
236     Gulf State Community Bank
167                     ShoreBank
153    Darby Bank & Trust Company
Name: Bank Name, dtype: object
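A more direct way to pull the flagged names and their probabilities together (same fitted model, same 0.8 cutoff):

# attach the failure probability to each surviving bank, then filter
proba_yes = dt.predict_proba(potential)[:, list(dt.classes_).index("Yes")]
flagged = potential1.assign(p_fail=proba_yes)
flagged.loc[flagged["p_fail"] > 0.8, ["Bank Name", "p_fail"]]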

State Central Bank, Gulf State Community Bank (closed Nov. 2010), ShoreBank (closed Aug. 2010), Darby Bank & Trust Company (closed Nov. 2010)

The predictions turned out to be quite good: three of the four flagged banks did close later in 2010. Lowering the probability threshold would flag more banks.
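For instance, reusing proba_yes from the sketch above, a lower cutoff (0.5 here, chosen purely for illustration) would surface additional at-risk banks:

# 0.5 is an arbitrary lower cutoff, for illustration only
potential1.loc[proba_yes > 0.5, "Bank Name"]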