Credit Default Risk Analysis: Understanding and Predicting Customer Loan Eligibility

Sanlap Ray
6 min read · Jul 4, 2021


Let us first understand what we mean by a bank loan: it is the extension of money from a bank to another party with the agreement that the money will be repaid.

Objective and need of the aforesaid analysis

The main goal of credit analysis is to determine the creditworthiness of potential borrowers and their ability to honor their debt obligations. If the borrower presents an acceptable level of default risk, the analyst can recommend the approval of the credit application at the agreed terms. The outcome of the credit risk analysis determines the risk rating assigned to the borrower and their ability to access credit. Without proper risk analysis, the bank will likely adopt poor lending practices, which can result in serious losses.
One objective of this case study is to classify potential borrowers into two categories, Yes or No, based on the borrower's details provided to us.

Yes — Eligible for getting a loan
No- Not eligible for getting a loan

Now, apart from the aforesaid classification, in real-life scenarios one of the most important activities banks perform while building credit risk models is predicting the probability of default. Default is the event that a borrower fails to meet their payment obligations during the duration of the loan. The probability of default (PD) is the likelihood that the borrower will default on their obligations during a given time period.
So, we will be using Logistic Regression to model the probability of default.
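
As a quick illustration of the idea (a minimal sketch with made-up numbers, not part of the case study data), logistic regression passes a linear combination of a borrower's features through the sigmoid function, so its output can be read directly as a probability between 0 and 1:

import numpy as np

def sigmoid(z):
    # maps any real-valued score into (0, 1), so it can be interpreted as a probability
    return 1 / (1 + np.exp(-z))

# purely illustrative feature values and fitted coefficients
features = np.array([1.0, 0.3, 2.5])
coefficients = np.array([0.8, -1.2, 0.1])
score = features @ coefficients   # linear score
print(sigmoid(score))             # predicted probability, e.g. of repayment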

Understanding the Dataset

Data Source

Loan Predication | Kaggle

Loading the dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import f1_score
from sklearn import metrics
from sklearn.model_selection import train_test_split

data = pd.read_csv("C:/Users/admin/Downloads/loan_data_set - loan_data_set.csv")
data.head()

Checking the variable types

data.info()

Shape of the dataset

data.shape  # returns (614, 13): the dataset consists of 614 rows and 13 columns

Now, let us drop the Loan_ID column, as it has no importance in the analysis.

data1 = data.drop('Loan_ID', axis=1)

Dataset Summary

data1.describe()

So, we have a basic understanding of the dataset. Now we will check whether there are any missing/null values in the dataset and handle them accordingly.

Missing Values

data1.isnull().sum()  # checking for missing values

As most of the columns with missing values are categorical, a reasonable approach is to replace the missing values with the mode of their corresponding columns.

# replacing the missing values with the mode of each column
for i in range(data1.shape[1]):
    data1.iloc[:, i] = data1.iloc[:, i].fillna(data1.iloc[:, i].mode()[0])

EDA

Univariate EDA

# Univariate EDA
from pandas_profiling import ProfileReport
profile = ProfileReport(data1, title='Data Profile')  # creates a summarised profile of the dataset
profile.to_notebook_iframe()
sns.distplot(data1.iloc[:, 7], kde=True)  # distribution of the numeric column at index 7 (LoanAmount)

Bivariate EDA

# Bivariate EDA: stacked bar plots of each categorical predictor against Loan_Status
t = [i for i in range(data1.shape[1] - 1) if data1.iloc[:, i].dtype == 'O']  # indices of the categorical (object) predictors, excluding the target
i = 0
j = 0
fig, axs = plt.subplots(2, 3, figsize=(30, 20))
for k in t:
    ct = pd.crosstab(data1.iloc[:, k], data1['Loan_Status'])
    ct.plot(kind='bar', stacked=True, ax=axs[i][j])
    j += 1
    if j % 3 == 0:
        i += 1
        j = 0
plt.show()

The bar plots give us an idea of the sub-categories of each categorical variable and what proportion of each was eligible for the loan.

Multivariate EDA

#multivariate eda
from pandas.plotting import scatter_matrix
scatter_matrix(data1,figsize=(15,9))

From the scatter plots, we can see that there may be several outliers in our dataset.

Outlier Detection

# z-score based outlier detection on the numeric columns
# (indices 5, 6, 7: ApplicantIncome, CoapplicantIncome, LoanAmount)
mean = []
std = []
for i in [5, 6, 7]:
    mean.append(data1.iloc[:, i].mean())
    std.append(data1.iloc[:, i].std())

k = 0
outlier = []
for j in [5, 6, 7]:
    z = (data1.iloc[:, j] - mean[k]) / std[k]
    k = k + 1
    for i in range(data1.shape[0]):
        if z[i] > 3.5:  # flag rows whose z-score exceeds 3.5
            outlier.append(i)
outlier

Dropping Outliers

df=data1.drop(outlier,axis=0)
df.shape
(594, 12)

Distribution Plot

# Distribution plots of the three numeric columns
fig, axs = plt.subplots(1, 3, figsize=(20, 3))
i = 0
for k in [5, 6, 7]:
    sns.distplot(data1.iloc[:, k], ax=axs[i])
    i += 1
plt.show()

Now we are interested in checking the inter-correlation between the numeric variables.

dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)

Label Encoding

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
O = [i for i in range(df.shape[1]) if df.iloc[:, i].dtype == 'O']  # indices of the categorical (object) columns; defined here as it is used below
for i in O:
    df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
df.head()

Then, we are interested in checking whether the categorical variables have a significant relationship with the target variable.

col = df.columns
from scipy.stats import chi2_contingency
from scipy.stats import chi2
for k in O:
    table = pd.crosstab(df.iloc[:, k], df['Loan_Status'])
    stat, p, dof, expected = chi2_contingency(table)
    print("\n feature name: %s" % col[k])
    alpha = .05
    print('dof=%.3f, significance=%.3f, p=%.3f' % (dof, alpha, p))
    if p <= alpha:
        print('Dependent')
    else:
        print('Independent')

Extracting the target variable

# First extract the target variable Loan_Status
Y = df.Loan_Status.values
# Drop the target from the dataframe and store the feature matrix in X
X = df.drop(['Loan_Status'], axis=1).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Fitting the model

# building the model and fitting the data
logit_model = sm.Logit(Y_train, X_train).fit()
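
One caveat worth flagging: statsmodels' Logit does not add an intercept term automatically. If an intercept is desired, a minimal variant (a sketch, not the fit used in the rest of this post) would be:

# variant: fit the same model with an explicit intercept term
X_train_const = sm.add_constant(X_train)
logit_with_const = sm.Logit(Y_train, X_train_const).fit()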

Model Summary

logit_model.summary()

After fitting the model, we are interested in checking the probability of repaying the loan for the borrowers in the test dataset.

yhat = logit_model.predict(X_test)  # predicted probability of Loan_Status = 1 (eligible)
prob_default = 1 - yhat  # probability of default (not named pd, to avoid shadowing the pandas alias)
prob_default

Here, we have set the threshold at 0.5, i.e., if the predicted probability is greater than 0.5, we will consider the borrower a non-defaulter and eligible for a loan.
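
A minimal sketch making that 0.5 threshold explicit (equivalent in practice to the rounding used in the next step):

threshold = 0.5
# label a test borrower as eligible (1) only if the predicted probability exceeds the threshold
prediction_explicit = [1 if p > threshold else 0 for p in yhat]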

Classification

prediction = list(map(round, yhat))
prediction
[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, ...]  # 119 predicted class labels for the test set (1 = eligible, 0 = not eligible); output truncated

Model Accuracy

from sklearn.metrics import (confusion_matrix, accuracy_score)

# confusion matrix
cm = confusion_matrix(Y_test, prediction)
print("Confusion Matrix : \n", cm)

# accuracy score of the model
print('Test accuracy = ', accuracy_score(Y_test, prediction))
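
Since f1_score was imported at the top of the notebook, the same predictions can also be evaluated with it; a small optional addition:

# F1 score of the model on the test set
print('Test F1 score = ', f1_score(Y_test, prediction))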

Conclusion

So, after performing the analysis, we are now able to check whether a new borrower is eligible for a loan based on the details provided, and we are also able to model the client's probability of default.
