This is my solution to Homework 2 of 95851 Data Science for Product Managers. It is a good opportunity to apply a simple classification model to predict bank failure.
The dataset provides a time-series snapshot of the status of 396 banks, with indicators of financial health. We care about two questions: which factors lead to failure, and can we predict failure in the future?
Import Dataset
# import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
bank = pd.read_excel('Bank failure data.xlsx')
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4060 entries, 0 to 4059
Data columns (total 14 columns):
Bank Name 4060 non-null object
Quarter 4060 non-null object
Tier One 4060 non-null float64
Texas 3997 non-null float64
Size 4060 non-null float64
Brokered Deposits 4040 non-null float64
Net Chargeoffs 4054 non-null float64
Constr and Land Dev Loans 4060 non-null float64
Change in Portfolio Mix 4060 non-null float64
NP CRE to Assets 4060 non-null float64
Volatile Liabilities to Assets 4060 non-null float64
Securities 4060 non-null float64
Failed during 2010Q2 4060 non-null object
Cert Number 4060 non-null int64
dtypes: float64(10), int64(1), object(3)
memory usage: 444.1+ KB
bank.head()
| | Bank Name | Quarter | Tier One | Texas | Size | Brokered Deposits | Net Chargeoffs | Constr and Land Dev Loans | Change in Portfolio Mix | NP CRE to Assets | Volatile Liabilities to Assets | Securities | Failed during 2010Q2 | Cert Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Exchange Bank | 2007Q4 | 14.90 | 19.36 | 32.852108 | 0.0 | 0.03 | 23.13 | 3.38 | 0.190681 | 20.16 | 99.07 | No | 160 |
| 1 | Exchange Bank | 2008Q1 | 14.30 | 20.86 | 33.542390 | 0.0 | 0.00 | 32.96 | 4.96 | 0.000000 | 21.23 | 99.45 | No | 160 |
| 2 | Exchange Bank | 2008Q2 | 14.15 | 20.89 | 34.140007 | 0.0 | 0.31 | 33.71 | 1.53 | 0.022408 | 19.69 | 97.94 | No | 160 |
| 3 | Exchange Bank | 2008Q3 | 14.13 | 18.74 | 34.038758 | 0.0 | -0.02 | 34.99 | 3.80 | 0.147452 | 19.83 | 98.84 | No | 160 |
| 4 | Exchange Bank | 2008Q4 | 14.21 | 21.82 | 34.059328 | 0.0 | 1.21 | 37.14 | 3.86 | 0.057306 | 15.29 | 99.84 | No | 160 |
len(bank['Bank Name'].unique())
396
What are the top two predictors of bank failure?
First, assume that whether a bank will collapse depends on its financial status in 2010Q1, so all rows from earlier quarters are removed. After also dropping incomplete rows, information on 388 banks remains, which is then split into training and test sets. I choose a decision tree classifier; to find good hyperparameters, 10-fold cross-validation is applied on the training set. The classifier is then fit on the entire training set and evaluated on the test set.
The top two predictors of bank failure are Tier One and Change in Portfolio Mix.
Q1 = bank[bank['Quarter'] == '2010Q1'].copy()  # .copy() avoids SettingWithCopyWarning on the in-place calls below
Q1.drop('Cert Number', axis=1, inplace=True)
Q1.dropna(inplace=True)
This leaves usable records for 388 banks.
Q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 388 entries, 9 to 4049
Data columns (total 13 columns):
Bank Name 388 non-null object
Quarter 388 non-null object
Tier One 388 non-null float64
Texas 388 non-null float64
Size 388 non-null float64
Brokered Deposits 388 non-null float64
Net Chargeoffs 388 non-null float64
Constr and Land Dev Loans 388 non-null float64
Change in Portfolio Mix 388 non-null float64
NP CRE to Assets 388 non-null float64
Volatile Liabilities to Assets 388 non-null float64
Securities 388 non-null float64
Failed during 2010Q2 388 non-null object
dtypes: float64(10), object(3)
memory usage: 42.4+ KB
The dataset is somewhat imbalanced: only 33 of the 388 banks failed.
Q1["Failed during 2010Q2"].value_counts()
No 355
Yes 33
Name: Failed during 2010Q2, dtype: int64
Q1 = Q1.reset_index(drop=True)
Train-test split
Here, given the small number of positive ("Yes") cases, we split the data in a stratified way.
First, split the data into training and test sets. We run a decision tree classifier, tune its hyperparameters with 10-fold CV on the training set, and evaluate it on the test set.
from sklearn.model_selection import train_test_split
X = Q1.drop(['Bank Name', 'Quarter', 'Failed during 2010Q2'], axis=1)
y = Q1['Failed during 2010Q2']
# Oversampling was considered but not used (see the sketch below):
# from imblearn.over_sampling import RandomOverSampler
# ros = RandomOverSampler(random_state=0)
# X_resampled, y_resampled = ros.fit_sample(X, y)
# stratify=y keeps the No/Yes class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25, random_state=25)
pd.Series(y_train).value_counts()
No 266
Yes 25
Name: Failed during 2010Q2, dtype: int64
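The stratified split preserves the roughly 10:1 class ratio in both sets. The commented-out RandomOverSampler above hints at an alternative: if the imbalance were more severe, the minority class could be oversampled, on the training set only, so that the test set keeps its natural distribution. A minimal sketch, assuming the imbalanced-learn package is installed:
# Sketch only (not used below): oversample the minority class in the
# training split so both classes are equally represented.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=25)
# fit_resample is the current imblearn API (older releases used fit_sample)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())  # should show 266 of each class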
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)  # drop=True keeps y_train a Series
Tune hyperparameters for decision tree classifiers
Because the dataset is imbalanced, we use stratified folds.
from sklearn.model_selection import StratifiedKFold
# shuffle=True is required for random_state to take effect
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=25)
# 10-fold CV accuracy for the chosen configuration (max_depth=2,
# max_features=3 performed best among the settings tried)
accurate = []
for train_index, dev_index in skf.split(X_train, y_train):
    X_Train, X_dev = X_train.iloc[train_index], X_train.iloc[dev_index]
    y_Train, y_dev = y_train.iloc[train_index], y_train.iloc[dev_index]
    dt = DecisionTreeClassifier(criterion="entropy", max_features=3,
                                max_depth=2, random_state=25)
    dt.fit(X_Train, y_Train)
    pred = dt.predict(X_dev)
    accurate.append(accuracy_score(y_dev, pred))
print(np.mean(accurate))
0.986075533662
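The loop above reports CV accuracy for one configuration. To search over several settings at once, sklearn's GridSearchCV wraps the same stratified CV; a minimal sketch of how the tuning could be run (the grid values are illustrative):
# Sketch: search max_depth and max_features jointly with the same
# stratified 10-fold CV instead of fixing them by hand.
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': list(range(1, 6)), 'max_features': list(range(1, 6))}
grid = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=25),
                    param_grid, cv=skf, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)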
Train the model on the training data and predict on the test data.
Accuracy alone is not adequate for imbalanced data, so we also print the classification report and confusion matrix to see how the model performs on each class. The evaluation looks good: only one surviving bank was mislabelled as "Yes", overall accuracy is high, and precision on the minority class is acceptable.
dt=DecisionTreeClassifier(criterion="entropy",max_features=3,max_depth=2,random_state=25)
dt.fit(X_train,y_train)
print(classification_report(y_test,dt.predict(X_test)))
precision recall f1-score support
No 1.00 0.99 0.99 89
Yes 0.89 1.00 0.94 8
avg / total 0.99 0.99 0.99 97
accuracy_score(y_test,dt.predict(X_test))
0.98969072164948457
confusion_matrix(y_test,dt.predict(X_test))
array([[88, 1],
[ 0, 8]], dtype=int64)
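Since accuracy can be misleading on imbalanced data, a threshold-independent metric such as ROC AUC is a useful extra check; a minimal sketch:
# Sketch: ROC AUC is computed from predicted probabilities, so it does
# not depend on the default 0.5 decision threshold.
from sklearn.metrics import roc_auc_score
y_test_bin = (y_test == 'Yes').astype(int)   # positive class = failure
y_score = dt.predict_proba(X_test)[:, 1]     # dt.classes_ is ['No', 'Yes'], so column 1 is P(Yes)
print(roc_auc_score(y_test_bin, y_score))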
Find feature importance in decision tree model
dt.feature_importances_
array([ 0.98154267, 0. , 0. , 0. , 0. ,
0. , 0.01845733, 0. , 0. , 0. ])
X.columns[dt.feature_importances_>0]
Index(['Tier One', 'Change in Portfolio Mix'], dtype='object')
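For readability, the importances can be paired with the column names and sorted; a small sketch:
# Sketch: label each importance with its feature name and rank them.
importances = pd.Series(dt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())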
Which banks are most likely to fail in the near future (and why)?
A: Apply the model to the banks that survived 2010Q2. The model predicts whether a bank fails in 2010Q2; even though these banks did not fail then, a "Yes" prediction signals a high risk of failure. Assuming their financial status does not change much over time, failure is likely coming soon. Four banks are flagged: State Central Bank, Gulf State Community Bank, ShoreBank, and Darby Bank & Trust Company.
potential1 = Q1[Q1['Failed during 2010Q2']=="No"].reset_index().drop('index',axis=1)
potential = potential1.drop(['Bank Name','Quarter','Failed during 2010Q2'],axis=1)
potential.head()
| | Tier One | Texas | Size | Brokered Deposits | Net Chargeoffs | Constr and Land Dev Loans | Change in Portfolio Mix | NP CRE to Assets | Volatile Liabilities to Assets | Securities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.25 | 54.35 | 34.285794 | 0.00 | 0.12 | 34.49 | 1.53 | 1.770006 | 21.31 | 102.34 |
| 1 | 16.64 | 21.32 | 48.378092 | 6.98 | 0.61 | 20.27 | 2.66 | 4.507692 | 43.05 | 100.94 |
| 2 | 14.01 | 7.34 | 13.818530 | 0.00 | -0.06 | 2.38 | 3.22 | 0.646950 | 26.74 | 103.17 |
| 3 | 16.00 | 7.04 | 48.322100 | 0.00 | 0.00 | 2.37 | 3.93 | 0.247313 | 8.21 | 100.76 |
| 4 | 12.51 | 39.16 | 73.719818 | 19.35 | 1.12 | 22.74 | 6.14 | 1.664389 | 11.14 | 101.76 |
Find the probability of failure for each bank
Find the banks whose predicted probability of failure is high (above 0.8).
# Column order follows dt.classes_, which is alphabetical: ['No', 'Yes']
(pd.DataFrame(dt.predict_proba(potential),columns=["No","Yes"])["Yes"]>0.8).sort_values(ascending=False).head()
63 True
236 True
167 True
153 True
354 False
Name: Yes, dtype: bool
potential1["Bank Name"][[63,236,167,153]]
63 State Central Bank
236 Gulf State Community Bank
167 ShoreBank
153 Darby Bank & Trust Company
Name: Bank Name, dtype: object
State Central Bank, Gulf State Community Bank (closed in Nov. 2010), ShoreBank (closed in Aug. 2010), and Darby Bank & Trust Company (closed in Nov. 2010).
The predictions hold up well: three of the four flagged banks actually closed later in 2010. Adjusting the probability threshold downward would flag more at-risk banks, as sketched below.
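A minimal sketch of that threshold adjustment, ranking the surviving banks by predicted failure probability (the 0.5 cutoff here is illustrative):
# Sketch: attach P(failure) to each surviving bank and rank; lowering
# the cutoff from 0.8 flags additional at-risk banks.
p_fail = dt.predict_proba(potential)[:, 1]   # column 1 = 'Yes'
ranked = potential1.assign(p_fail=p_fail).sort_values('p_fail', ascending=False)
print(ranked.loc[ranked['p_fail'] > 0.5, ['Bank Name', 'p_fail']])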