In this project, we'll be investigating historical loan data, with the ultimate goal of developing the best tree-based classification model to separate defaults from safe loans. We'll compare the performances of Decision Trees, Random Forests, Gradient Boosted Trees, and Neural Networks.
We'll be using historical personal loan data from LendingClub.com. LendingClub is a P2P lending platform that connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers who show a profile with a low probability of defaulting.
The data is from 2010-2013. I chose loans from this time range for two reasons. First, the maximum term for LendingClub loans is 5 years, so we would have realized all defaults by the date this data was pulled (early 2020). Second, I wanted to avoid including loans heavily affected by the Great Recession (2007-2009).
There are a few reasons why I decided to do this project.
I used to work as a risk analyst for a marketplace lender that refinanced student loans. In my job, I built the company's underwriting and pricing models. While I evaluated applicants on factors like FICO score, DTI, free cash flow, and delinquencies to approve these loans, many of the cutoffs were predetermined by our financing partners, and I never actually built a predictive ML model.
I'm eager to get a bit more experience working with different tree classifiers. I'd like to improve my understanding of the strengths and weaknesses of each model, and see how each performs under different hyperparameters. Finally, I'd like to create a strategy that identifies the best model for different cost assumptions (cost of misclassifying defaults, cost of misclassifying safe loans).
I'm considering becoming an investor on LendingClub's platform. I'd like to have a perspective on whether a loan will default, and building a classification model with LendingClub's historical loan data will help me do that.
We'll take a systematic approach to exploring our data and developing our models.
If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 😊
Alright! I'm excited to revisit my old lending stomping grounds, gain experience working with tree-based models, and build the best predictive strategy for my investing needs. Let's get started!!!
# Import useful libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
df = pd.read_csv('loans_2010_to_2013.csv')
df.head()
df.info()
The dataset has over 150 columns, many with duplicative or missing information. Let's just use the following for now:
df = df[['loan_status', 'int_rate', 'grade', 'sub_grade', 'term',
'dti', 'loan_amnt', 'home_ownership', 'installment', 'purpose',
'emp_length', 'delinq_2yrs', 'earliest_cr_line', 'issue_d', 'annual_inc',
'fico_range_low', 'fico_range_high']]
df.head()
Now let's see how many null values we have and how best to address them.
# Create function to show number & share of nulls for each feature
def null_percentage(df):
total = df.isnull().sum().sort_values(ascending = False)
total = total[total != 0]
percent = round(100 * total / len(df), 4)
return pd.concat([total, percent], axis=1, keys=['Total Nulls', 'Percent Null'])
null_percentage(df)
# Let's also visualize the nulls in our table with missingno
import missingno as msno
msno.matrix(df)
plt.show()
Great! Since employment length is categorical, and there may be a meaningful reason why this information is missing, we'll fill the nulls with their own 'N/A' category.
# 'N/A' if null
for i in ['emp_length']:
df[i] = df[i].fillna('N/A')
null_percentage(df)
Using the data in our table, we'll create the following features:
# Create fico
df['fico'] = (df['fico_range_low'] + df['fico_range_high']) / 2
# Create cr_hist_days
df['issue_d'] = pd.to_datetime(df['issue_d'])
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
df['cr_hist_days'] = (df['issue_d'] - df['earliest_cr_line']).dt.days
# Drop unnecessary features
df.drop(['issue_d', 'earliest_cr_line', 'fico_range_low', 'fico_range_high'],
axis=1, inplace=True)
# Change delinq_2yrs datatype to int
df['delinq_2yrs'] = df['delinq_2yrs'].astype(int)
df.head()
Let's take a look at the different classes in our target variable, loan_status.
plt.figure(figsize=(12,8))
g = sns.countplot(df['loan_status'],
edgecolor = 'darkslategray',
palette = sns.color_palette('BrBG_r', 7))
g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_xlabel(None)
g.set_ylabel('Loan Count')
g.set_title(label = "Loan Status", fontsize=25, fontweight='bold', pad=20)
for p in g.patches:
g.text(p.get_x() + p.get_width() * .5,
p.get_height() + 1000,
'{0:.2%}'.format(p.get_height()/len(df)),
ha = 'center')
plt.show()
Great! We're only interested in loans that were either Fully Paid or Charged Off (i.e. Defaulted). Let's drop the other loan statuses (which represent <0.5% of the data) and change the values of Fully Paid to 0 and Charged Off to 1.
# Drop other loan_status outcomes
desired_statuses = ['Fully Paid', 'Charged Off']
df = df[df['loan_status'].isin(desired_statuses)]
# Change loan_status to 0 (Fully Paid) or 1 (Charged Off)
df['loan_status'] = df['loan_status'].apply(lambda x: 0 if x=='Fully Paid' else 1)
print("{:.0f} loans remain.".format(len(df)))
print()
df['loan_status'].value_counts()
Let's take a look at correlations between our features.
We'll assign numerical values to our non-numeric features using .astype('category').cat.codes. This converts the feature data to a categorical data type, identifies the unique values in lexicographic order, and assigns them the integers 0, 1, 2, ..., respectively. For ordinal features like grade, sub_grade, and most of emp_length, this preserves the information contained in the feature's order. This blog post does a good job of explaining how this works in more detail.
plt.figure(figsize=(16, 12))
sns.set_context('paper', font_scale = 1)
sns.heatmap(df.assign(home_ownership = df.home_ownership.astype('category').cat.codes,
purpose = df.purpose.astype('category').cat.codes,
grade = df.grade.astype('category').cat.codes,
sub_grade = df.sub_grade.astype('category').cat.codes,
emp_length = df.emp_length.astype('category').cat.codes,
term = df.term.astype('category').cat.codes).corr(),
annot = True, cmap = 'RdYlGn', vmin = -1, vmax = 1, linewidths = 0.5)
plt.show()
Correlations with Target Variable
Correlations between Features
Since grade and sub_grade are essentially rank encodings of int_rate, and installment is determined by loan_amnt, term, and int_rate, these three features are largely redundant. Let's drop them.
df.drop(['grade','sub_grade','installment'], axis=1, inplace=True)
Time to explore our data with some visualizations!
Let's start by looking at the distribution of each feature individually. We'll finish up by investigating how loan_status varies across both its most-correlated features as well as nominal features (e.g. purpose).
fig, axes = plt.subplots(10,1,figsize=(16,60))
# List features & titles for each chart
var = ['int_rate', 'dti', 'fico', 'cr_hist_days', 'loan_amnt',
'annual_inc', 'emp_length', 'home_ownership', 'purpose', 'term']
titles = ['Interest Rate', 'Debt to Income', 'Fico Score',
'Length of Credit History (Days)', 'Loan Amount', 'Annual Income',
'Employment Length', 'Home Ownership', 'Purpose', 'Term']
# Graph each feature by enumerating axes and using a for loop
for i, ax in enumerate(axes.flatten()):
if i in [0,1,2,3,4]:
sns.distplot(df[var[i]], ax = ax, bins = 80,
kde_kws = {'color' : 'darkolivegreen',
'label' : 'Kde',
'gridsize' : 1000,
'linewidth' : 3},
hist_kws = {'color' : 'goldenrod',
'label' : "Histogram",
'edgecolor' : 'darkslategray'})
if i in [5]:
sns.boxplot(df[var[i]], ax = ax)
if i == 6:
sns.countplot(df[var[i]], ax = ax,
order = ['N/A', '< 1 year', '1 year', '2 years', '3 years',
'4 years', '5 years', '6 years', '7 years', '8 years',
'9 years', '10+ years'])
if i in [7, 8, 9]:
sns.countplot(df[var[i]], ax = ax, order = df[var[i]].value_counts().index)
if i == 8:
ax.set_xticklabels(ax.get_xticklabels(), rotation = 30)
ax.set_title(label = titles[i], fontsize = 25, fontweight = 'bold', pad = 15)
ax.set_xlabel(None)
fig.suptitle('Individual Feature Distributions', position = (.52, 1.01),
fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)
plt.show()
Now let's take a look at how loan_status varies across our nominal and correlated features.
Let's start with the nominal features: purpose, term, and home_ownership
import matplotlib as mpl
fig = plt.figure(figsize = (14, 10))
g = fig.add_gridspec(2, 2)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[0, 1])
ax3 = fig.add_subplot(g[1, :])
axes = [ax1, ax2, ax3]
titles = ['Term', 'Home Ownership', 'Purpose']
var = ['term', 'home_ownership', 'purpose']
def to_percent(y,position):
return str(str(int(round(y * 100, 0))) + "%")
for i, ax in enumerate(axes):
sns.barplot(x = var[i], y = 'loan_status', data = df, palette = 'Blues',
ax = ax, edgecolor = 'darkslategray')
ax.set_ylabel('Default Rate')
ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
ax.set_xlabel(None)
if i in [0, 1, 2]:
ax.set_ylabel('Default Rate')
ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
j = 0
for p in ax.patches:
ax.text(p.get_x() + p.get_width() * .25, p.get_height() + .0025,
'{0:.0%}'.format(p.get_height()), ha = 'center')
j += 1
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)
fig.suptitle('Nominal Features', position = (.5,1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)
Let's take a look at the most correlated features: fico & int_rate
fig = plt.figure(figsize = (10, 10))
g = fig.add_gridspec(2, 1)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[1, 0])
axes = [ax1, ax2]
titles = ['Loans by Interest Rate', 'Loans by Fico',
'Defaults by Interest Rate', 'Defaults by Fico']
var = ['int_rate', 'fico', 'int_rate', 'fico']
for i, ax in enumerate(axes):
ax.hist(df[df['loan_status'] == 0][var[i]], bins = 25, color = 'blue',
label = 'Fully Paid', alpha = .5)
ax.hist(df[df['loan_status'] == 1][var[i]], bins = 25, color = 'red',
label = 'Defaulted', alpha = .5)
ax.legend()
ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
ax.set_ylabel('Loan Count')
if i == 0:
ax.annotate('Share of defaulted loans\nincreases with interest rate',
xy = (21.5, 5000), xytext = (22, 10000),
arrowprops = dict(facecolor = 'Green',
shrink = 0.05))
if i == 1:
ax.annotate('Share of defaulted loans\ndecreases with fico',
xy = (760, 7500), xytext = (780, 15000),
arrowprops = dict(facecolor = 'Green',
shrink = 0.05))
for i in [2, 3]:
sns.lmplot(var[i], 'loan_status', df, height = 5, aspect = 2, y_jitter = .04)
h = plt.gca()
h.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
h.set(xlabel = None, ylabel = 'Default Rate', ylim = (-0.1, 1.19))
h.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
fig.suptitle('Correlated Features', position = (.52, 1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 4)
In this last section of visualizations, we'll explore relationships between a few different features, and we'll compare the behavior of defaults to fully paid loans across different subpopulations of the data.
Let's start by investigating the trend between int_rate and fico. In general, we would expect interest rate to decrease as fico increases. Not only does this make intuitive sense, but we also see these features are negatively correlated (-0.54) in our correlation plot above. Let's see if this trend wavers at all within the four term by loan_status subpopulations.
g = sns.jointplot(x = 'fico', y = 'int_rate', data = df,
color = 'purple', kind = 'kde', height = 10)
g.fig.suptitle("Interest Rate by Fico", fontsize = 30, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.91)
h = sns.lmplot('fico', 'int_rate', df, row = 'loan_status', col = 'term',
palette = 'Set1', height = 5)
h.fig.suptitle("Subpopulations: Term by Loan Status", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.9)
plt.show()
Observations
Okay cool, nothing out of the ordinary here. Across all term and loan status subpopulations, we see a consistent decrease in interest rate as fico increases.
Now, let's investigate the trend between int_rate and annual_inc. Assuming that borrowers who make more money are less likely to default, we would expect interest rate to decrease as income increases. However, per the correlation plot above, it appears that interest rate has almost no correlation with annual income whatsoever (-0.01). This seems odd, given it makes intuitive sense that higher income means less risk... Let's see if this lack of correlation is consistent within term by loan_status subpopulations as well.
Note: annual_inc is highly skewed. To best visualize trends in the data, we will use the log of annual income (log_annual_inc).
df['log_annual_inc'] = np.log(df['annual_inc'])
g = sns.lmplot('log_annual_inc', 'int_rate', df, height = 5,
aspect = 2, palette = 'coolwarm', col = 'term')
g.fig.suptitle("Interest Rate by Log(Annual Income)", fontsize = 25, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.75)
h = sns.lmplot('log_annual_inc', 'int_rate', df, hue = 'loan_status', height = 5,
aspect = 2, palette = 'coolwarm', col = 'term')
h.fig.suptitle("Loan Status Breakout", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.8)
plt.show()
df.drop(['log_annual_inc'], axis = 1, inplace = True)
Observations
I'm not sure why this would be... putting a pin in this for now.
Finally, let's take a look at loan_amnt by term.
plt.figure(figsize = (12.5, 2))
g = sns.boxplot(x = 'loan_amnt', y = 'term', data = df)
g.set_xlabel(None)
g.set_ylabel(None)
g.set(xticklabels=[], yticklabels=[])
g.set(xticks=[], yticks=[])
plt.suptitle("Distribution of Loan Amount", fontsize = 20, fontweight = 'bold', position = (.52, 1.2))
fig = sns.FacetGrid(df, hue = 'term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend()
plt.show()
print('\n')
df['loan_status_term'] = df['term'] + df['loan_status'].apply(lambda x: ' default' if x==1 else ' fully paid')
plt.figure(figsize = (12.5, 2))
h = sns.boxplot(x = 'loan_amnt', y = 'loan_status_term', data = df, order = sorted(df['loan_status_term'].unique()))
h.set_xlabel(None)
h.set_ylabel(None)
h.set(xticklabels=[], yticklabels=[])
h.set(xticks=[], yticks=[])
plt.suptitle("Loan Status Breakout", fontsize = 18, fontweight = 'bold', position = (.51, 1.2))
fig = sns.FacetGrid(df, hue = 'loan_status_term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend(label_order = sorted(df['loan_status_term'].unique()))
plt.show()
df.drop(['loan_status_term'], axis = 1, inplace = True)
Observations
Overall, the loan amount for 60-month loans tends to be much higher than for 36-month loans. The average 60-month loan is for about $20K, whereas the average 36-month loan is about half that much.
Few 36-month loans exceed $25K.
Note: Tree-based algorithms are not sensitive to feature magnitudes. So standardizing our data (e.g. scaling and normalizing) is not necessary before fitting our three models. Here's a helpful article that explains this in more detail: When and Why to Standardize Your Data. However, at the end of this notebook I'd like to compare the results of my tree-based models to a neural network, for which standardizing data is recommended (link). We'll standardize the data now to keep everything consistent for all models.
First, we'll normalize our features with a Box-Cox transformation.
# Boxcox transform
from scipy.stats import boxcox
numerical = df.columns[df.dtypes == 'float64']
for i in numerical:
if df[i].min() > 0:
transformed, lamb = boxcox(df.loc[df[i].notnull(), i])
if np.abs(1 - lamb) > 0.02:
df.loc[df[i].notnull(), i] = transformed
Next, let's create dummy variables for our categorical features.
df_final = pd.get_dummies(df, drop_first = True)
df_final.head(3)
Great! Now, let's define our train and test populations and scale our data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df_final.drop('loan_status', axis = 1)
y = df_final['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 13)
# Scale data
sc = StandardScaler()
numerical = X_train.columns[(X_train.dtypes =='float64') | (X_train.dtypes == 'int64')].tolist()
X_train[numerical] = sc.fit_transform(X_train[numerical])
X_test[numerical] = sc.transform(X_test[numerical])
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values
Before we proceed with training & testing our classifiers, let's create a function to cleanly plot a confusion matrix.
# Define function to plot confusion matrix
import itertools
def plot_cm(cm, classes, normalize = False, title = 'Confusion matrix', cmap = 'Blues'):
if normalize:
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
print('Normalized confusion matrix')
else:
print('Confusion matrix, without normalization')
plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
plt.title(title, fontsize = 14)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation = 45)
plt.yticks(tick_marks, classes)
plt.ylabel('Actual')
plt.xlabel('Predicted')
fmt = '.4f' if normalize else 'd'
thresh = cm.max() / 1.5
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment = "center",
color = "white" if cm[i, j] > thresh else "black")
plt.tight_layout()
And last, let's check our class imbalance.
df['loan_status'].value_counts()/len(df)
Alright, we're ready to build our tree-based classifiers! We'll build a simple Decision Tree, a Random Forest, and an XGBoosted Tree.
Let's start with our Decision Tree.
First, a few implementation details:
SMOTE Oversampling: There is a significant class imbalance in our data (84% Fully Paid, 16% Defaulted). It is important that we create a 50/50 split in our training data because otherwise we risk the model overfitting to the majority class (i.e. always predicting "Fully Paid" would give 84% accuracy). Another way to think about this is that we want to train our model to recognize the characteristics of defaults, rather than assume that most loans will pay in full.
To create this 50/50 split in our training data, we'll oversample the defaulted loans with the SMOTE technique: we grow the minority class (Defaulted) by generating synthetic samples between existing minority-class loans that sit close to each other in feature space, until the minority class matches the majority class (Fully Paid) in size. A small standalone sketch of this step follows these notes.
GridSearchCV: To optimize our model's performance, we'll tune our hyperparameters by looking at all combinations with GridSearchCV.
make_pipeline: We'll create a pipeline with make_pipeline to train our model on SMOTE oversampled data and cross-validate our results using GridSearchCV to ensure that we choose the best hyperparameters.
# Import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
# Define hyperparameter values to grid search
tree_params = {"criterion" : ["gini", "entropy"],
"max_depth" : [4],
"min_samples_leaf" : [2],
"min_samples_split" : [2]}
# Tune hyperparameters with GridSearchCV
grid_tree = GridSearchCV(estimator = DecisionTreeClassifier(),
param_grid = tree_params,
cv = 3,
verbose = 0)
# Train Decision Tree on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_tree)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_tree_best = grid_tree.best_estimator_
predictions = grid_tree_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
tree_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", cmap = 'Greens')
Alright, looks like our Decision Tree model was able to recognize 51% of defaults and 72% of fully paid loans. Another way to look at this is that our model sacrificed 15,702 safe loans to identify 5,337 defaults.
Let's take a look at the optimal hyperparameters chosen by GridSearchCV:
grid_tree.best_params_
Looks like entropy outperformed gini for measuring impurity. The model also used a max_depth of 4, a min_samples_leaf of 2, and a min_samples_split of 2.
Let's see which features were most important in our model. To do this, we'll use the .feature_importances_ attribute, which returns each feature's importance as its (normalized) total reduction in impurity. Here's more detail on how feature importances are measured.
tree_model = grid_tree_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Greens_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 30)
g.set_ylabel('Relative Importance')
g.set_title(label = "Decision Tree Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Interest rate was by far the most important feature for predicting defaults... which makes sense. Term also played a role in prediction, which seems consistent with the stark difference in default rate we observed between 36 and 60 month loans earlier in our analysis.
Let's visualize the decision tree to see exactly how our model is making decisions.
# Visualize grid_tree_best
from sklearn import tree
plt.figure(figsize = (30, 10))
tree.plot_tree(grid_tree.best_estimator_,
feature_names = df_final.drop('loan_status', axis = 1).columns,
class_names = ['Fully Paid', 'Defaulted'],
filled = True)
plt.show()
Great! We can see what a significant role interest rate plays in reducing entropy throughout our model. For more on how to read and visualize decision trees, check out this article.
Finally, let's create a dataframe to track the results of our models to compare their performances later on.
model_df = pd.DataFrame(columns = ['model', 'false_negatives', 'false_positives'])
model_df.loc[len(model_df)] = ['Baseline', sum(y_test), 0]
model_df.loc[len(model_df)] = ['Decision Tree', tree_cm[1, 0], tree_cm[0, 1]]
model_df
Awesome! Now let's train and test our Random Forest model.
A Random Forest classification model consists of many Decision Trees. It performs classification by choosing the most common class predicted by its trees (see Understanding Random Forests).
RandomizedSearchCV: To optimize the performance of our Random Forest, we'll tune our hyperparameters with RandomizedSearchCV instead of GridSearchCV. With GridSearchCV, every combination of hyperparameters is tried. RandomizedSearchCV, on the other hand, tries only a predefined number (n_iter) of random combinations of hyperparameters. RandomizedSearchCV is preferable here because testing every combination of hyperparameters would be too costly and take too much time. It will also be preferable for training our XGBoost model, where we'll want to define certain hyperparameters as continuous distributions. This article gives a good, high-level comparison of GridSearch and RandomizedSearch.
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define hyperparameter values to random search
rf_params = {'n_estimators' : [100, 300, 600, 1000],
'max_features' : ['auto', 'log2'],
'max_depth' : randint(2, 5),
'min_samples_split' : randint(2, 5),
'min_samples_leaf' : randint(2, 5),
'bootstrap' : [True, False]}
# Tune hyperparameters with RandomizedSearchCV
grid_rf = RandomizedSearchCV(estimator = RandomForestClassifier(),
param_distributions = rf_params,
n_iter = 200,
cv = 3,
verbose = 0,
random_state = 13,
n_jobs = -1)
# Train Random Forest on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_rf)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_rf_best = grid_rf.best_estimator_
predictions = grid_rf_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
rf_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", cmap = 'Purples')
Good, looks like our Random Forest model was able to recognize 61% of defaults and 65% of fully paid loans. Another way to look at this is that our model sacrificed 19,532 safe loans to identify 6,323 defaults.
Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV:
grid_rf.best_params_
Looks like bootstrapping performed best, as did the log2 method of selecting features to consider at each split. The model chose our highest max_depth value (4), our highest min_samples_leaf value (4), and a min_samples_split value of 3. Finally, our Random Forest performed best when including 600 decision trees.
Let's see which features were most important in our model.
tree_model = grid_rf_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Purples_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "Random Forest Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Once again, we see that interest rate and loan term are the two most important features, respectively. In our Random Forest, however, term seems to play a larger role than it did in our simple Decision Tree model. We also see a handful of additional features with lower importance (loan_amnt, home_ownership, dti, etc.).
model_df.loc[len(model_df)] = ['Random Forest', rf_cm[1, 0], rf_cm[0, 1]]
model_df
Great! Now we're ready to train and test our XGBoost Tree model.
XGBoosted Trees are similar to Random Forests in that both models combine the results of a set of Decision Trees. XGBoost differs from Random Forest in the way it builds those Decision Trees. Random Forests build each tree independently, whereas XGBoost builds one tree at a time in a forward stage-wise manner, with each new weak learner trained to correct the shortcomings of the trees built so far. More on this.
By carefully tuning parameters, gradient boosting can result in better performance than random forests. However, gradient boosting may not be a good choice if you have a lot of noise, as it can result in overfitting. It also tends to be harder to tune than random forests.
# Import libraries
import xgboost as xgb
from scipy.stats import uniform
# Define hyperparameter values to random search
xgb_params = {'n_estimators' : randint(50, 300),
'max_depth' : randint(2, 5),
'min_samples_split' : randint(2, 5),  # note: scikit-learn-style names; XGBoost doesn't use
'min_samples_leaf' : randint(2, 5),   # these and may warn that they go unused
'min_child_weight' : uniform(loc = 1, scale = 0.5),
'gamma' : uniform(loc = 0.6, scale = 0.4),
'reg_lambda' : uniform(loc = 1, scale = 2),
'reg_alpha' : uniform(loc = 0, scale = 1),
'learning_rate' : uniform(loc = .001, scale = .009)}
# Tune hyperparameters with RandomizedSearchCV
grid_xgb = RandomizedSearchCV(estimator = xgb.XGBClassifier(),
param_distributions = xgb_params,
n_iter = 250,
cv = 3,
verbose = 0,
random_state = 13,
n_jobs = -1)
# Train XGBoost on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_xgb)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_xgb_best = grid_xgb.best_estimator_
predictions = grid_xgb_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
xgb_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6,5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", cmap = 'Reds')
Great, looks like our XGBoost model was able to recognize 57% of defaults and 68% of fully paid loans. Another way to look at this is that our model sacrificed 18,125 safe loans to identify 5,919 defaults.
Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV, as well as the features that were most important in our model:
grid_xgb.best_params_
tree_model = grid_xgb_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10,6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Reds_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "XGBoost Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Interesting! Term is now at the top of the list, followed by home ownership type, followed by interest rate. Purpose = debt consolidation also played a significant role.
model_df.loc[len(model_df)] = ['XGBoost', xgb_cm[1, 0], xgb_cm[0, 1]]
model_df
Now let's implement a simple Neural Network to see how it performs against our tree-based models!
To create our Neural Network, let's use a first hidden layer with the same number of nodes as features (plus a bias node), a second hidden layer with 38 nodes, and an output layer with two softmax nodes classifying the loan as 0 (Fully Paid) or 1 (Defaulted).
import itertools
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
sm = SMOTE(sampling_strategy = 'minority', random_state = 13)
X_train_sm, y_train_sm = sm.fit_sample(X_train, y_train)
NN = Sequential([Dense(X_train.shape[1], input_shape = (X_train.shape[1], ), activation = 'relu'),
Dense(38, activation = 'relu'),
Dense(2, activation = 'softmax')])
NN.summary()
NN.compile(Adam(lr = 0.001), metrics = ['accuracy'], loss = 'sparse_categorical_crossentropy')
NN.fit(X_train_sm, y_train_sm, validation_split = 0.2, batch_size = 250, epochs = 25,
shuffle = True, verbose = 2)
pred = NN.predict_classes(X_test)
NN_cm = confusion_matrix(y_test, pred)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", normalize = True)
plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", cmap = 'Oranges')
Awesome, looks like our Neural Network was able to recognize 39% of defaults and 80% of fully paid loans. Another way to look at this is that our model sacrificed 11,183 safe loans to identify 4,025 defaults.
model_df.loc[len(model_df)] = ['Neural Network', NN_cm[1, 0], NN_cm[0, 1]]
model_df
We're ready to compare the results of our models.
Evaluating Performance: A perfect model would be able to predict all loans correctly, allowing us to avoid all future defaults and only accept loans that will pay in full. While that's a nice thought, none of our models are perfect. Instead, we want to find which of our models does the best job of minimizing total loss. Loss comes from misclassifications. There are two types:
Misclassifying loans that will default: When our model approves a loan that will default in the future, we lose the remaining balance plus interest to be paid. These are false_negatives.
Misclassifying loans that will pay in full: When our model denies loans that will pay in full, we lose the interest we could have made on that loan. These are false_positives.
We can calculate our total loss as:
Total Loss = (false_negatives * false_negative_avg_cost) + (false_positives * false_positive_avg_cost)
While we don't know the average cost of false negatives or the average cost of false positives, we can look at the ratio between the two to evaluate the above equation. Using this ratio allows us to simplify the equation as follows:
loss_ratio = false_positive_avg_cost / false_negative_avg_cost
=> Total Loss = (false_negatives * false_negative_avg_cost) +
(false_positives * false_negative_avg_cost * loss_ratio)
=> Total Loss / false_negative_avg_cost = false_negatives + false_positives * loss_ratio
The best model is the one that minimizes false_negatives + false_positives * loss_ratio for a given loss_ratio.
It makes sense that false negatives are more costly than false positives (i.e. we lose more misclassifying defaults than safe loans). Thus, we'll just look at loss_ratios between 0 and 1 to evaluate model performance.
loss_ratio = np.linspace(0, 1, 1000) #false_positive_avg_cost / false_negative_avg_cost
# Plot total loss for all models across loss_ratios from 0 to 1
plt.figure(figsize = (16, 10))
for i in range(len(model_df)):
x = loss_ratio
y = model_df['false_negatives'][i] * np.ones(len(loss_ratio)) + model_df['false_positives'][i] * loss_ratio
plt.plot(x, y)
plt.xticks(np.arange(0, 1, 0.02), rotation = 90)
plt.grid()
plt.xlabel('\nLoss Ratio\n\n(cost of misclassifying a safe loan) /\n(cost of misclassifying a default)')
plt.ylabel('Total Loss\n\n(divided by avg cost of misclassifying a default)')
plt.title('\nComparing Model Performance: Total Loss', fontsize = 20, pad = 20, fontweight = 'bold')
labels = model_df['model']
plt.legend(labels)
plt.show()
Great! Looking at the minimum values in the graph above, the model we should choose changes with the loss_ratio we assume.
For no loss_ratio was the Decision Tree model the best choice (see Why Random Forests Outperform Decision Trees).
So what's a reasonable loss_ratio to assume? Well, that depends on how much we expect to make on a safe loan vs how much we expect to lose on a default. While both of these values depend on a wide range of factors, we can (grossly) oversimplify our estimation as follows
Assumptions
Amount we expect to make if loan is fully paid (link to calculator) = $132.27
Amount we expect to lose on default = $566.14
loss_ratio = 132.27 / 566.14 = 23.4%
In this case, our best model would be the Random Forest.
Thanks for going on this journey with me! If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 🌲😊🌲
! jupyter nbconvert --to html Personal_Loans_and_Decision_Trees.ipynb