In this project, we'll be investigating historical loan data, with the ultimate goal of developing the best tree-based classification model to separate defaults from safe loans. We'll compare the performances of Decision Trees, Random Forests, Gradient Boosted Trees, and Neural Networks.
We'll be using historical personal loan data from LendingClub.com. LendingClub is a P2P lending platform that connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers who show a profile with a low probability of defaulting.
The data is from 2010-2013. I chose loans from this time range for two reasons. First, the maximum term for LendingClub loans is 5 years, so we would have realized all defaults by the date this data was pulled (early 2020). Second, I wanted to avoid including loans heavily affected by the Great Recession (2007-2009).
There are a few reasons why I decided to do this project.
I used to work as a risk analyst for a marketplace lender that refinanced student loans. In my job, I built the company's underwriting and pricing models. While I evaluated applicants on factors like FICO score, DTI, free cash flow, and delinquencies to approve these loans, many of the cutoffs were predetermined by our financing partners, and I never actually built a predictive ML model.
I'm eager to get a bit more experience working with different tree classifiers. I'd like to improve my understanding of the strengths and weaknesses of each model, and see how each performs under different hyperparameters. Finally, I'd like to create a strategy that identifies the best model for different cost assumptions (cost of misclassifying defaults, cost of misclassifying safe loans).
I'm considering becoming an investor on LendingClub's platform. I'd like to have a perspective on whether a loan will default, and building a classification model with LendingClub's historical loan data will help me do that.
We'll take a systematic approach to exploring our data and developing our models.
If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 😊
Alright! I'm excited to revisit my old lending stomping grounds, gain experience working with tree-based models, and build the best predictive strategy for my investing needs. Let's get started!!!
# Import useful libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
df = pd.read_csv('loans_2010_to_2013.csv')
df.head()
df.info()
The dataset has over 150 columns, many with duplicative or missing information. Let's just use the following for now:
df = df[['loan_status', 'int_rate', 'grade', 'sub_grade', 'term',
'dti', 'loan_amnt', 'home_ownership', 'installment', 'purpose',
'emp_length', 'delinq_2yrs', 'earliest_cr_line', 'issue_d', 'annual_inc',
'fico_range_low', 'fico_range_high']]
df.head()
Now let's see how many null values we have and how best to address them.
# Create function to show number & share of nulls for each feature
def null_percentage(df):
total = df.isnull().sum().sort_values(ascending = False)
total = total[total != 0]
percent = round(100 * total / len(df), 4)
return pd.concat([total, percent], axis=1, keys=['Total Nulls', 'Percent Null'])
null_percentage(df)
# Let's also visualize the nulls in our table with missingno
import missingno as msno
msno.matrix(df)
plt.show()
Great! Since employment length is categorical, and there may be a meaningful reason why this information is missing, we'll fill the nulls with their own 'N/A' category.
# 'N/A' if null
for i in ['emp_length']:
df[i] = df[i].fillna('N/A')
null_percentage(df)
Using the data in our table, we'll create the following features:
# Create fico
df['fico'] = (df['fico_range_low'] + df['fico_range_high']) / 2
# Create cr_hist_days
df['issue_d'] = pd.to_datetime(df['issue_d'])
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
df['cr_hist_days'] = (df['issue_d'] - df['earliest_cr_line']).dt.days
# Drop unnecessary features
df.drop(['issue_d', 'earliest_cr_line', 'fico_range_low', 'fico_range_high'],
axis=1, inplace=True)
# Change delinq_2yrs datatype to int
df['delinq_2yrs'] = df['delinq_2yrs'].astype(int)
df.head()
Let's take a look at the different classes in our target variable, loan_status.
plt.figure(figsize=(12,8))
g = sns.countplot(df['loan_status'],
edgecolor = 'darkslategray',
palette = sns.color_palette('BrBG_r', 7))
g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_xlabel(None)
g.set_ylabel('Loan Count')
g.set_title(label = "Loan Status", fontsize=25, fontweight='bold', pad=20)
for p in g.patches:
g.text(p.get_x() + p.get_width() * .5,
p.get_height() + 1000,
'{0:.2%}'.format(p.get_height()/len(df)),
ha = 'center')
plt.show()
Great! We're only interested in loans that were either Fully Paid or Charged Off (i.e. Defaulted). Let's drop the other loan statuses (which represent <0.5% of the data) and change the values of Fully Paid to 0 and Charged Off to 1.
# Drop other loan_status outcomes
desired_statuses = ['Fully Paid', 'Charged Off']
df = df[df['loan_status'].isin(desired_statuses)]
# Change loan_status to 0 (Fully Paid) or 1 (Charged Off)
df['loan_status'] = df['loan_status'].apply(lambda x: 0 if x=='Fully Paid' else 1)
print("{:.0f} loans remain.".format(len(df)))
print()
df['loan_status'].value_counts()
Let's take a look at correlations between our features.
We'll assign numerical values to our non-numeric features using .astype('category').cat.codes. This converts the feature data to a categorical data type, identifies the unique values in lexicographic order, and assigns them the integers 0, 1, 2, ..., respectively. For ordinal features like grade, sub_grade, and most of emp_length, this preserves the information contained in the feature's order. This blog post does a good job of explaining how this works in more detail.
plt.figure(figsize=(16, 12))
sns.set_context('paper', font_scale = 1)
sns.heatmap(df.assign(home_ownership = df.home_ownership.astype('category').cat.codes,
purpose = df.purpose.astype('category').cat.codes,
grade = df.grade.astype('category').cat.codes,
sub_grade = df.sub_grade.astype('category').cat.codes,
emp_length = df.emp_length.astype('category').cat.codes,
term = df.term.astype('category').cat.codes).corr(),
annot = True, cmap = 'RdYlGn', vmin = -1, vmax = 1, linewidths = 0.5)
plt.show()
Correlations with Target Variable
Correlations between Features
Since grade and sub_grade are essentially rank encodings of int_rate, and installment is determined by loan_amnt, term, and int_rate, these three features are largely redundant. Let's drop them.
df.drop(['grade','sub_grade','installment'], axis=1, inplace=True)
Time to explore our data with some visualizations!
Let's start by looking at the distribution of each feature individually. We'll finish up by investigating how loan_status varies across both its most-correlated features as well as nominal features (e.g. purpose).
fig, axes = plt.subplots(10,1,figsize=(16,60))
# List features & titles for each chart
var = ['int_rate', 'dti', 'fico', 'cr_hist_days', 'loan_amnt',
'annual_inc', 'emp_length', 'home_ownership', 'purpose', 'term']
titles = ['Interest Rate', 'Debt to Income', 'Fico Score',
'Length of Credit History (Days)', 'Loan Amount', 'Annual Income',
'Employment Length', 'Home Ownership', 'Purpose', 'Term']
# Graph each feature by enumerating axes and using a for loop
for i, ax in enumerate(axes.flatten()):
if i in [0,1,2,3,4]:
sns.distplot(df[var[i]], ax = ax, bins = 80,
kde_kws = {'color' : 'darkolivegreen',
'label' : 'Kde',
'gridsize' : 1000,
'linewidth' : 3},
hist_kws = {'color' : 'goldenrod',
'label' : "Histogram",
'edgecolor' : 'darkslategray'})
if i in [5]:
sns.boxplot(df[var[i]], ax = ax)
if i == 6:
sns.countplot(df[var[i]], ax = ax,
order = ['N/A', '< 1 year', '1 year', '2 years', '3 years',
'4 years', '5 years', '6 years', '7 years', '8 years',
'9 years', '10+ years'])
if i in [7, 8, 9]:
sns.countplot(df[var[i]], ax = ax, order = df[var[i]].value_counts().index)
if i == 8:
ax.set_xticklabels(ax.get_xticklabels(), rotation = 30)
ax.set_title(label = titles[i], fontsize = 25, fontweight = 'bold', pad = 15)
ax.set_xlabel(None)
fig.suptitle('Individual Feature Distributions', position = (.52, 1.01),
fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)
plt.show()
Now let's take a look at how loan_status varies across our nominal and correlated features.
Let's start with the nominal features: purpose, term, and home_ownership
import matplotlib as mpl
fig = plt.figure(figsize = (14, 10))
g = fig.add_gridspec(2, 2)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[0, 1])
ax3 = fig.add_subplot(g[1, :])
axes = [ax1, ax2, ax3]
titles = ['Term', 'Home Ownership', 'Purpose']
var = ['term', 'home_ownership', 'purpose']
def to_percent(y,position):
return str(str(int(round(y * 100, 0))) + "%")
for i, ax in enumerate(axes):
sns.barplot(x = var[i], y = 'loan_status', data = df, palette = 'Blues',
ax = ax, edgecolor = 'darkslategray')
ax.set_ylabel('Default Rate')
ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
ax.set_xlabel(None)
if i in [0, 1, 2]:
ax.set_ylabel('Default Rate')
ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
j = 0
for p in ax.patches:
ax.text(p.get_x() + p.get_width() * .25, p.get_height() + .0025,
'{0:.0%}'.format(p.get_height()), ha = 'center')
j += 1
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)
fig.suptitle('Nominal Features', position = (.5,1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)
Let's take a look at the most correlated features: fico & int_rate
fig = plt.figure(figsize = (10, 10))
g = fig.add_gridspec(2, 1)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[1, 0])
axes = [ax1, ax2]
titles = ['Loans by Interest Rate', 'Loans by Fico',
'Defaults by Interest Rate', 'Defaults by Fico']
var = ['int_rate', 'fico', 'int_rate', 'fico']
for i, ax in enumerate(axes):
ax.hist(df[df['loan_status'] == 0][var[i]], bins = 25, color = 'blue',
label = 'Fully Paid', alpha = .5)
ax.hist(df[df['loan_status'] == 1][var[i]], bins = 25, color = 'red',
label = 'Defaulted', alpha = .5)
ax.legend()
ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
ax.set_ylabel('Loan Count')
if i == 0:
ax.annotate('Share of defaulted loans\nincreases with interest rate',
xy = (21.5, 5000), xytext = (22, 10000),
arrowprops = dict(facecolor = 'Green',
shrink = 0.05))
if i == 1:
ax.annotate('Share of defaulted loans\ndecreases with fico',
xy = (760, 7500), xytext = (780, 15000),
arrowprops = dict(facecolor = 'Green',
shrink = 0.05))
for i in [2, 3]:
sns.lmplot(var[i], 'loan_status', df, height = 5, aspect = 2, y_jitter = .04)
h = plt.gca()
h.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
h.set(xlabel = None, ylabel = 'Default Rate', ylim = (-0.1, 1.19))
h.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
fig.suptitle('Correlated Features', position = (.52, 1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 4)
In this last section of visualizations, we'll explore relationships between a few different features, and we'll compare the behavior of defaults to fully paid loans across different subpopulations of the data.
Let's start by investigating the trend between int_rate and fico. In general, we would expect interest rate to decrease as fico increases. Not only does this make intuitive sense, but we also see these features are negatively correlated (-0.54) in our correlation plot above. Let's see if this trend wavers at all within the four term by loan_status subpopulations.
g = sns.jointplot(x = 'fico', y = 'int_rate', data = df,
color = 'purple', kind = 'kde', height = 10)
g.fig.suptitle("Interest Rate by Fico", fontsize = 30, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.91)
h = sns.lmplot('fico', 'int_rate', df, row = 'loan_status', col = 'term',
palette = 'Set1', height = 5)
h.fig.suptitle("Subpopulations: Term by Loan Status", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.9)
plt.show()
Observations
Okay cool, nothing out of the ordinary here. Across all term and loan status subpopulations, we see a consistent decrease in interest rate as fico increases.
Now, let's investigate the trend between int_rate and annual_inc. Assuming that borrowers who make more money are less likely to default, we would expect interest rate to decrease as income increases. However, per the correlation plot above, it appears that interest rate has almost no correlation with annual income whatsoever (-0.01). This seems odd, given it makes intuitive sense that higher income means less risk... Let's see if this lack of correlation is consistent within term by loan_status subpopulations as well.
Note: annual_inc is highly skewed. To best visualize trends in the data, we will use the log of annual income (log_annual_inc).
df['log_annual_inc'] = np.log(df['annual_inc'])
g = sns.lmplot('log_annual_inc', 'int_rate', df, height = 5,
aspect = 2, palette = 'coolwarm', col = 'term')
g.fig.suptitle("Interest Rate by Log(Annual Income)", fontsize = 25, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.75)
h = sns.lmplot('log_annual_inc', 'int_rate', df, hue = 'loan_status', height = 5,
aspect = 2, palette = 'coolwarm', col = 'term')
h.fig.suptitle("Loan Status Breakout", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.8)
plt.show()
df.drop(['log_annual_inc'], axis = 1, inplace = True)
Observations
I'm not sure why this would be... putting a pin in this for now.
Finally, let's take a look at loan_amnt by term.
plt.figure(figsize = (12.5, 2))
g = sns.boxplot(x = 'loan_amnt', y = 'term', data = df)
g.set_xlabel(None)
g.set_ylabel(None)
g.set(xticklabels=[], yticklabels=[])
g.set(xticks=[], yticks=[])
plt.suptitle("Distribution of Loan Amount", fontsize = 20, fontweight = 'bold', position = (.52, 1.2))
fig = sns.FacetGrid(df, hue = 'term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend()
plt.show()
print('\n')
df['loan_status_term'] = df['term'] + df['loan_status'].apply(lambda x: ' default' if x==1 else ' fully paid')
plt.figure(figsize = (12.5, 2))
h = sns.boxplot(x = 'loan_amnt', y = 'loan_status_term', data = df, order = sorted(df['loan_status_term'].unique()))
h.set_xlabel(None)
h.set_ylabel(None)
h.set(xticklabels=[], yticklabels=[])
h.set(xticks=[], yticks=[])
plt.suptitle("Loan Status Breakout", fontsize = 18, fontweight = 'bold', position = (.51, 1.2))
fig = sns.FacetGrid(df, hue = 'loan_status_term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend(label_order = sorted(df['loan_status_term'].unique()))
plt.show()
df.drop(['loan_status_term'], axis = 1, inplace = True)
Observations
Overall, the loan amount for 60-month loans tends to be much higher than for 36-month loans. The average 60-month loan is for about $20K, whereas the average 36-month loan is about half that much.
Few 36-month loans exceed $25K.
Note: Tree-based algorithms are not sensitive to feature magnitudes. So standardizing our data (e.g. scaling and normalizing) is not necessary before fitting our three models. Here's a helpful article that explains this in more detail: When and Why to Standardize Your Data. However, at the end of this notebook I'd like to compare the results of my tree-based models to a neural network, for which standardizing data is recommended (link). We'll standardize the data now to keep everything consistent for all models.
First, we'll normalize our features with a Box-Cox transformation.
# Boxcox transform
from scipy.stats import boxcox
numerical = df.columns[df.dtypes == 'float64']
for i in numerical:
if df[i].min() > 0:
transformed, lamb = boxcox(df.loc[df[i].notnull(), i])
if np.abs(1 - lamb) > 0.02:
df.loc[df[i].notnull(), i] = transformed
Next, let's create dummy variables for our categorical features.
df_final = pd.get_dummies(df, drop_first = True)
df_final.head(3)
Great! Now, let's define our train and test populations and scale our data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df_final.drop('loan_status', axis = 1)
y = df_final['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 13)
# Scale data
sc = StandardScaler()
numerical = X_train.columns[(X_train.dtypes =='float64') | (X_train.dtypes == 'int64')].tolist()
X_train[numerical] = sc.fit_transform(X_train[numerical])
X_test[numerical] = sc.transform(X_test[numerical])
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values
Before we proceed with training & testing our classifiers, let's create a function to cleanly plot a confusion matrix.
# Define function to plot confusion matrix
import itertools
def plot_cm(cm, classes, normalize = False, title = 'Confusion matrix', cmap = 'Blues'):
if normalize:
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
print('Normalized confusion matrix')
else:
print('Confusion matrix, without normalization')
plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
plt.title(title, fontsize = 14)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation = 45)
plt.yticks(tick_marks, classes)
plt.ylabel('Actual')
plt.xlabel('Predicted')
fmt = '.4f' if normalize else 'd'
thresh = cm.max() / 1.5
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment = "center",
color = "white" if cm[i, j] > thresh else "black")
plt.tight_layout()
And last, let's check our class imbalance.
df['loan_status'].value_counts()/len(df)
Alright, we're ready to build our tree-based classifiers! We'll build a simple Decision Tree, a Random Forest, and an XGBoosted Tree.
Let's start with our Decision Tree.
First, a few implementation details:
SMOTE Oversampling: There is a significant class imbalance in our data (84% Fully Paid, 16% Defaulted). It is important that we create a 50/50 split in our training data because otherwise we risk the model overfitting to the majority class (i.e. always predicting "Fully Paid" would give 84% accuracy). Another way to think about this is that we want to train our model to recognize the characteristics of defaults, rather than assume that most loans will pay in full.
To create this 50/50 split in our training data, we'll oversample the defaulted loans with the SMOTE technique: we grow the minority class (Defaulted) by generating synthetic samples between existing minority-class loans that sit close to each other in feature space, until the minority class matches the majority class (Fully Paid) in size. A small standalone sketch of this step follows these notes.
GridSearchCV: To optimize our model's performance, we'll tune our hyperparameters by looking at all combinations with GridSearchCV.
make_pipeline: We'll create a pipeline with make_pipeline to train our model on SMOTE oversampled data and cross-validate our results using GridSearchCV to ensure that we choose the best hyperparameters.
# Import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
# Define hyperparameter values to grid search
tree_params = {"criterion" : ["gini", "entropy"],
"max_depth" : [4],
"min_samples_leaf" : [2],
"min_samples_split" : [2]}
# Tune hyperparameters with GridSearchCV
grid_tree = GridSearchCV(estimator = DecisionTreeClassifier(),
param_grid = tree_params,
cv = 3,
verbose = 0)
# Train Decision Tree on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_tree)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_tree_best = grid_tree.best_estimator_
predictions = grid_tree_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
tree_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", cmap = 'Greens')
Alright, looks like our Decision Tree model was able to recognize 51% of defaults and 72% of fully paid loans. Another way to look at this is that our model sacrificed 15,702 safe loans to identify 5,337 defaults.
Let's take a look at the optimal hyperparameters chosen by GridSearchCV:
grid_tree.best_params_
Looks like entropy outperformed gini for measuring impurity. The model also used a max_depth of 4, a min_samples_leaf of 2, and a min_samples_split of 2.
Let's see which features were most important in our model. To do this, we'll use the .feature_importances_ attribute, which returns each feature's importance as its (normalized) total reduction in impurity. Here's more detail on how feature importances are measured.
tree_model = grid_tree_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Greens_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 30)
g.set_ylabel('Relative Importance')
g.set_title(label = "Decision Tree Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Interest rate was by far the most important feature for predicting defaults... which makes sense. Term also played a role in prediction, which seems consistent with the stark difference in default rate we observed between 36 and 60 month loans earlier in our analysis.
Let's visualize the decision tree to see exactly how our model is making decisions.
# Visualize grid_tree_best
from sklearn import tree
plt.figure(figsize = (30, 10))
tree.plot_tree(grid_tree.best_estimator_,
feature_names = df_final.drop('loan_status', axis = 1).columns,
class_names = ['Fully Paid', 'Defaulted'],
filled = True)
plt.show()
Great! We can see what a significant role interest rate plays in reducing entropy throughout our model. For more on how to read and visualize decision trees, check out this article.
Finally, let's create a dataframe to track the results of our models to compare their performances later on.
model_df = pd.DataFrame(columns = ['model', 'false_negatives', 'false_positives'])
model_df.loc[len(model_df)] = ['Baseline', sum(y_test), 0]
model_df.loc[len(model_df)] = ['Decision Tree', tree_cm[1, 0], tree_cm[0, 1]]
model_df
Awesome! Now let's train and test our Random Forest model.
A Random Forest classification model consists of many Decision Trees. It performs classification by choosing the most common class predicted by its trees (see Understanding Random Forests).
RandomizedSearchCV: To optimize the performance of our Random Forest, we'll tune our hyperparameters with RandomizedSearchCV instead of GridSearchCV. With GridSearchCV, every combination of hyperparameters is tried. RandomizedSearchCV, on the other hand, tries only a predefined number (n_iter) of random combinations of hyperparameters. RandomizedSearchCV is preferable here because testing every combination of hyperparameters would be too costly and take too much time. It will also be preferable for training our XGBoost model, where we'll want to define certain hyperparameters as continuous distributions. This article gives a good, high-level comparison of GridSearch and RandomizedSearch.
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define hyperparameter values to random search
rf_params = {'n_estimators' : [100, 300, 600, 1000],
'max_features' : ['auto', 'log2'],
'max_depth' : randint(2, 5),
'min_samples_split' : randint(2, 5),
'min_samples_leaf' : randint(2, 5),
'bootstrap' : [True, False]}
# Tune hyperparameters with RandomizedSearchCV
grid_rf = RandomizedSearchCV(estimator = RandomForestClassifier(),
param_distributions = rf_params,
n_iter = 200,
cv = 3,
verbose = 0,
random_state = 13,
n_jobs = -1)
# Train Random Forest on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_rf)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_rf_best = grid_rf.best_estimator_
predictions = grid_rf_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
rf_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", cmap = 'Purples')
Good, looks like our Random Forest model was able to recognize 61% of defaults and 65% of fully paid loans. Another way to look at this is that our model sacrificed 19,532 safe loans to identify 6,323 defaults.
Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV:
grid_rf.best_params_
Looks like bootstrapping performed best, as did the log2 method of selecting features to consider at each split. The model chose our highest max_depth value (4), our highest min_samples_leaf value (4), and a min_samples_split value of 3. Finally, our Random Forest performed best when including 600 decision trees.
Let's see which features were most important in our model.
tree_model = grid_rf_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Purples_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "Random Forest Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Once again, we see that interest rate and loan term are the two most important features, respectively. In our Random Forest, however, term seems to play a larger role than it did in our simple Decision Tree model. We also see a handful of additional features with lower importance (loan_amnt, home_ownership, dti, etc.).
model_df.loc[len(model_df)] = ['Random Forest', rf_cm[1, 0], rf_cm[0, 1]]
model_df
Great! Now we're ready to train and test our XGBoost Tree model.
XGBoosted Trees are similar to Random Forests in that both models combine the results of a set of Decision Trees. XGBoost differs from Random Forest in the way it builds those Decision Trees. Random Forests build each tree independently, whereas XGBoost builds one tree at a time in a forward stage-wise manner, with each new weak learner trained to correct the shortcomings of the trees built so far. More on this.
By carefully tuning parameters, gradient boosting can result in better performance than random forests. However, gradient boosting may not be a good choice if you have a lot of noise, as it can result in overfitting. It also tends to be harder to tune than random forests.
# Import libraries
import xgboost as xgb
from scipy.stats import uniform
# Define hyperparameter values to random search
xgb_params = {'n_estimators' : randint(50, 300),
'max_depth' : randint(2, 5),
'min_samples_split' : randint(2, 5),  # note: scikit-learn-style names; XGBoost doesn't use
'min_samples_leaf' : randint(2, 5),   # these and may warn that they go unused
'min_child_weight' : uniform(loc = 1, scale = 0.5),
'gamma' : uniform(loc = 0.6, scale = 0.4),
'reg_lambda' : uniform(loc = 1, scale = 2),
'reg_alpha' : uniform(loc = 0, scale = 1),
'learning_rate' : uniform(loc = .001, scale = .009)}
# Tune hyperparameters with RandomizedSearchCV
grid_xgb = RandomizedSearchCV(estimator = xgb.XGBClassifier(),
param_distributions = xgb_params,
n_iter = 250,
cv = 3,
verbose = 0,
random_state = 13,
n_jobs = -1)
# Train XGBoost on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_xgb)
pipeline.fit(X_train, y_train)
# Choose best hyperparameters and make predictions
grid_xgb_best = grid_xgb.best_estimator_
predictions = grid_xgb_best.predict(X_test)
# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
xgb_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", normalize = True)
plt.show()
plt.figure(figsize = (6,5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", cmap = 'Reds')
Great, looks like our XGBoost model was able to recognize 57% of defaults and 68% of fully paid loans. Another way to look at this is that our model sacrificed 18,125 safe loans to identify 5,919 defaults.
Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV, as well as the features that were most important in our model:
grid_xgb.best_params_
tree_model = grid_xgb_best
feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)
plt.figure(figsize = (10,6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Reds_r',
data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "XGBoost Feature Importance", fontsize = 18,
fontweight = 'bold', pad = 20)
plt.show()
Interesting! Term is now at the top of the list, followed by home ownership type, followed by interest rate. Purpose = debt consolidation also played a significant role.
model_df.loc[len(model_df)] = ['XGBoost', xgb_cm[1, 0], xgb_cm[0, 1]]
model_df
Now let's implement a simple Neural Network to see how it performs against our tree-based models!
To create our Neural Network, let's use a first hidden layer with the same number of nodes as features (plus a bias node), a second hidden layer with 38 nodes, and an output layer with two softmax nodes classifying the loan as 0 (Fully Paid) or 1 (Defaulted).
import itertools
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
sm = SMOTE(sampling_strategy = 'minority', random_state = 13)
X_train_sm, y_train_sm = sm.fit_sample(X_train, y_train)
NN = Sequential([Dense(X_train.shape[1], input_shape = (X_train.shape[1], ), activation = 'relu'),
Dense(38, activation = 'relu'),
Dense(2, activation = 'softmax')])
NN.summary()
NN.compile(Adam(lr = 0.001), metrics = ['accuracy'], loss = 'sparse_categorical_crossentropy')
NN.fit(X_train_sm, y_train_sm, validation_split = 0.2, batch_size = 250, epochs = 25,
shuffle = True, verbose = 2)
pred = NN.predict_classes(X_test)
NN_cm = confusion_matrix(y_test, pred)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", normalize = True)
plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", cmap = 'Oranges')
Awesome, looks like our Neural Network was able to recognize 39% of defaults and 80% of fully paid loans. Another way to look at this is that our model sacrificed 11,183 safe loans to identify 4,025 defaults.
model_df.loc[len(model_df)] = ['Neural Network', NN_cm[1, 0], NN_cm[0, 1]]
model_df
We're ready to compare the results of our models.
Evaluating Performance: A perfect model would be able to predict all loans correctly, allowing us to avoid all future defaults and only accept loans that will pay in full. While that's a nice thought, none of our models are perfect. Instead, we want to find which of our models does the best job of minimizing total loss. Loss comes from misclassifications. There are two types:
Misclassifying loans that will default: When our model approves a loan that will default in the future, we lose the remaining balance plus interest to be paid. These are false_negatives.
Misclassifying loans that will pay in full: When our model denies loans that will pay in full, we lose the interest we could have made on that loan. These are false_positives.
We can calculate our total loss as:
Total Loss = (false_negatives * false_negative_avg_cost) + (false_positives * false_positive_avg_cost)
While we don't know the average cost of false negatives or the average cost of false positives, we can look at the ratio between the two to evaluate the above equation. Using this ratio allows us to simplify the equation as follows:
loss_ratio = false_positive_avg_cost / false_negative_avg_cost
=> Total Loss = (false_negatives * false_negative_avg_cost) +
(false_positives * false_negative_avg_cost * loss_ratio)
=> Total Loss / false_negative_avg_cost = false_negatives + false_positives * loss_ratio
The best model is the one that minimizes false_negatives + false_positives * loss_ratio for a given loss_ratio.
It makes sense that false negatives are more costly than false positives (i.e. we lose more misclassifying defaults than safe loans). Thus, we'll just look at loss_ratios between 0 and 1 to evaluate model performance.
loss_ratio = np.linspace(0, 1, 1000) #false_positive_avg_cost / false_negative_avg_cost
# Plot total loss for all models across loss_ratios from 0 to 1
plt.figure(figsize = (16, 10))
for i in range(len(model_df)):
x = loss_ratio
y = model_df['false_negatives'][i] * np.ones(len(loss_ratio)) + model_df['false_positives'][i] * loss_ratio
plt.plot(x, y)
plt.xticks(np.arange(0, 1, 0.02), rotation = 90)
plt.grid()
plt.xlabel('\nLoss Ratio\n\n(cost of misclassifying a safe loan) /\n(cost of misclassifying a default)')
plt.ylabel('Total Loss\n\n(divided by avg cost of misclassifying a default)')
plt.title('\nComparing Model Performance: Total Loss', fontsize = 20, pad = 20, fontweight = 'bold')
labels = model_df['model']
plt.legend(labels)
plt.show()
Great! Looking at the minimum values in the graph above, the model we should choose changes with the loss_ratio we assume.
For no loss_ratio was the Decision Tree model the best choice (see Why Random Forests Outperform Decision Trees).
So what's a reasonable loss_ratio to assume? Well, that depends on how much we expect to make on a safe loan vs how much we expect to lose on a default. While both of these values depend on a wide range of factors, we can (grossly) oversimplify our estimation as follows
Assumptions
Amount we expect to make if loan is fully paid (link to calculator) = $132.27
Amount we expect to lose on default = $566.14
loss_ratio = 132.27 / 566.14 = 23.4%
In this case, our best model would be the Random Forest.
Thanks for going on this journey with me! If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 🌲😊🌲
! jupyter nbconvert --to html Personal_Loans_and_Decision_Trees.ipynb