Personal Loans & Tree-Based Models 🌲

In this project, we'll be investigating historical loan data, with the ultimate goal of developing the best tree-based classification model to separate defaults from safe loans. We'll compare the performances of Decision Trees, Random Forests, Gradient Boosted Trees, and Neural Networks.

The Data

We'll be using historical personal loan data from LendingClub.com. LendingClub is a P2P lending platform that connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers who show a profile with a low probability of defaulting.

The data is from 2010-2013. I chose loans from this time range for two reasons. First, the maximum term for LendingClub loans is 5 years, so we would have realized all defaults by the date this data was pulled (early 2020). Second, I wanted to avoid including loans heavily affected by the Great Recession (2007-2009).

Project Inspiration

There are a few reasons why I decided to do this project.

  1. I used to work as a risk analyst for a marketplace lender that refinanced student loans. In my job, I built the company's underwriting and pricing models. While I evaluated applicants on factors like FICO score, DTI, free cash flow, and delinquencies to approve these loans, many of the cutoffs were predetermined by our financing partners, and I never actually built a predictive ML model.

  2. I'm eager to get a bit more experience working with different tree classifiers. I'd like to improve my understanding of the strengths and weaknesses of each model, and see how each performs under different hyperparameters. Finally, I'd like to create a strategy that identifies the best model for different cost assumptions (cost of misclassifying defaults, cost of misclassifying safe loans).

  3. I'm considering becoming an investor on LendingClub's platform. I'd like to have a perspective on whether a loan will default, and building a classification model with LendingClub's historical loan data will help me do that.

Approach

We'll take a systematic approach to exploring our data and developing our models.

  1. Import Data
  2. Data Preprocessing
    • Remove Unnecessary Columns
    • Null Values
    • Feature Engineering
    • Exploring Our Target Variable
  3. Correlations
  4. Visualizations
    • Feature Distributions
    • Nominal & Correlated Features
    • Additional Views
  5. Additional Preprocessing
    • Box-Cox
    • Dummy Variables
    • Scaling
    • Confusion Matrix Function
  6. Tree-Based Classifiers
    • Decision Tree
    • Random Forest
    • XGBoost
  7. Neural Network
  8. Conclusion
    • Final Thoughts

If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 😊

Alright! I'm excited to revisit my old lending stomping grounds, gain experience working with tree-based models, and build the best predictive strategy for my investing needs. Let's get started!!!

Import Data

In [1]:
# Import useful libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")
In [2]:
pd.set_option('display.max_columns', None)

df = pd.read_csv('loans_2010_to_2013.csv')
df.head()
Out[2]:
Unnamed: 0 id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 1611879 1077501 NaN 5000.0 5000.0 4975.0 36 months 10.65 162.87 B B2 NaN 10+ years RENT 24000.0 Verified Dec-2011 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... Borrower added on 12/22/11 > I need to upgra... credit_card Computer 860xx AZ 27.65 0.0 Jan-1985 735.0 739.0 1.0 NaN NaN 3.0 0.0 13648.0 83.7 9.0 f 0.0 0.0 5863.155187 5833.84 5000.00 863.16 0.00 0.0 0.00 Jan-2015 171.62 NaN Dec-2018 749.0 745.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 1611880 1077430 NaN 2500.0 2500.0 2500.0 60 months 15.27 59.83 C C4 Ryder < 1 year RENT 30000.0 Source Verified Dec-2011 Charged Off n https://lendingclub.com/browse/loanDetail.acti... Borrower added on 12/22/11 > I plan to use t... car bike 309xx GA 1.00 0.0 Apr-1999 740.0 744.0 5.0 NaN NaN 3.0 0.0 1687.0 9.4 4.0 f 0.0 0.0 1014.530000 1014.53 456.46 435.17 0.00 122.9 1.11 Apr-2013 119.66 NaN Oct-2016 499.0 0.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 1611881 1077175 NaN 2400.0 2400.0 2400.0 36 months 15.96 84.33 C C5 NaN 10+ years RENT 12252.0 Not Verified Dec-2011 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... NaN small_business real estate business 606xx IL 8.72 0.0 Nov-2001 735.0 739.0 2.0 NaN NaN 2.0 0.0 2956.0 98.5 10.0 f 0.0 0.0 3005.666844 3005.67 2400.00 605.67 0.00 0.0 0.00 Jun-2014 649.91 NaN Jun-2017 739.0 735.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 1611882 1076863 NaN 10000.0 10000.0 10000.0 36 months 13.49 339.31 C C1 AIR RESOURCES BOARD 10+ years RENT 49200.0 Source Verified Dec-2011 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... Borrower added on 12/21/11 > to pay for prop... other personel 917xx CA 20.00 0.0 Feb-1996 690.0 694.0 1.0 35.0 NaN 10.0 0.0 5598.0 21.0 37.0 f 0.0 0.0 12231.890000 12231.89 10000.00 2214.92 16.97 0.0 0.00 Jan-2015 357.48 NaN Apr-2016 604.0 600.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
4 1611883 1075358 NaN 3000.0 3000.0 3000.0 60 months 12.69 67.79 B B5 University Medical Group 1 year RENT 80000.0 Source Verified Dec-2011 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... Borrower added on 12/21/11 > I plan on combi... other Personal 972xx OR 17.94 0.0 Jan-1996 695.0 699.0 0.0 38.0 NaN 15.0 0.0 27783.0 53.9 38.0 f 0.0 0.0 4066.908161 4066.91 3000.00 1066.91 0.00 0.0 0.00 Jan-2017 67.30 NaN Apr-2018 684.0 680.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222439 entries, 0 to 222438
Columns: 152 entries, Unnamed: 0 to settlement_term
dtypes: float64(115), int64(2), object(35)
memory usage: 258.0+ MB

Data Preprocessing

Remove Unnecessary Columns

The dataset has 152 columns, many with duplicative or missing information. Let's just use the following for now:

  • loan_status (target variable): Whether the loan was paid off or charged off (defaulted).
  • int_rate: The interest rate of the loan, expressed as a percentage (a rate of 11% is stored as 11.0). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • grade: Rating for the quality/riskiness of the loan (riskiness increases alphabetically).
  • sub_grade: Sub-rating for the quality/riskiness of the loan (riskiness increases numerically).
  • term: Length of time to payoff loan.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • loan_amnt: Total amount lent.
  • home_ownership: Applicant's housing type.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • purpose: The borrower's stated purpose for the loan (e.g. "credit_card", "debt_consolidation", "small_business", "car", "other").
  • emp_length: Applicant's length of employment.
  • delinq_2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • earliest_cr_line: Start date of earliest credit line.
  • issue_d: Date loan was issued.
  • annual_inc: The self-reported annual income of the borrower.
  • fico_range_high, fico_range_low: Upper and lower bounds of the borrower's fico score.
In [4]:
df = df[['loan_status', 'int_rate', 'grade', 'sub_grade', 'term',
         'dti', 'loan_amnt', 'home_ownership', 'installment', 'purpose',
         'emp_length', 'delinq_2yrs', 'earliest_cr_line', 'issue_d', 'annual_inc',
         'fico_range_low', 'fico_range_high']]

df.head()
Out[4]:
loan_status int_rate grade sub_grade term dti loan_amnt home_ownership installment purpose emp_length delinq_2yrs earliest_cr_line issue_d annual_inc fico_range_low fico_range_high
0 Fully Paid 10.65 B B2 36 months 27.65 5000.0 RENT 162.87 credit_card 10+ years 0.0 Jan-1985 Dec-2011 24000.0 735.0 739.0
1 Charged Off 15.27 C C4 60 months 1.00 2500.0 RENT 59.83 car < 1 year 0.0 Apr-1999 Dec-2011 30000.0 740.0 744.0
2 Fully Paid 15.96 C C5 36 months 8.72 2400.0 RENT 84.33 small_business 10+ years 0.0 Nov-2001 Dec-2011 12252.0 735.0 739.0
3 Fully Paid 13.49 C C1 36 months 20.00 10000.0 RENT 339.31 other 10+ years 0.0 Feb-1996 Dec-2011 49200.0 690.0 694.0
4 Fully Paid 12.69 B B5 60 months 17.94 3000.0 RENT 67.79 other 1 year 0.0 Jan-1996 Dec-2011 80000.0 695.0 699.0

Null Values

Now let's see how many null values we have and how best to address them.

In [5]:
# Create function to show number & share of nulls for each feature
def null_percentage(df):
    total = df.isnull().sum().sort_values(ascending = False)
    total = total[total != 0]
    percent = round(100 * total / len(df), 4)
    return pd.concat([total, percent], axis=1, keys=['Total Nulls', 'Percent Null'])

null_percentage(df)
Out[5]:
Total Nulls Percent Null
emp_length 8999 4.0456
In [6]:
# Let's also visualize the nulls in our table with missingno
import missingno as msno

msno.matrix(df)
plt.show()

Great! Given employment length is categorical, and there may be a meaningful reason why this information is missing, we'll fill the nulls with an 'N/A' category rather than dropping them.

In [7]:
# 'N/A' if null
for i in ['emp_length']:
    df[i] = df[i].fillna('N/A')

null_percentage(df)
Out[7]:
Total Nulls Percent Null

Feature Engineering

Using the data in our table, we'll create the following features:

  • fico: Average of fico_range_low and fico_range_high.
  • cr_hist_days: The number of days since borrower first opened a credit line.
In [8]:
# Create fico
df['fico'] = (df['fico_range_low'] + df['fico_range_high']) / 2

# Create cr_hist_days
df['issue_d'] = pd.to_datetime(df['issue_d'])
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
df['cr_hist_days'] = (df['issue_d'] - df['earliest_cr_line']).dt.days

# Drop unnecessary features
df.drop(['issue_d', 'earliest_cr_line', 'fico_range_low', 'fico_range_high'],
        axis=1, inplace=True)

# Change delinq_2yrs datatype to int
df['delinq_2yrs'] = df['delinq_2yrs'].astype(int)

df.head()
Out[8]:
loan_status int_rate grade sub_grade term dti loan_amnt home_ownership installment purpose emp_length delinq_2yrs annual_inc fico cr_hist_days
0 Fully Paid 10.65 B B2 36 months 27.65 5000.0 RENT 162.87 credit_card 10+ years 0 24000.0 737.0 9830
1 Charged Off 15.27 C C4 60 months 1.00 2500.0 RENT 59.83 car < 1 year 0 30000.0 742.0 4627
2 Fully Paid 15.96 C C5 36 months 8.72 2400.0 RENT 84.33 small_business 10+ years 0 12252.0 737.0 3682
3 Fully Paid 13.49 C C1 36 months 20.00 10000.0 RENT 339.31 other 10+ years 0 49200.0 692.0 5782
4 Fully Paid 12.69 B B5 60 months 17.94 3000.0 RENT 67.79 other 1 year 0 80000.0 697.0 5813

Exploring Our Target Variable

Let's take a look at the different classes in our target variable, loan_status.

In [9]:
plt.figure(figsize=(12,8))

g = sns.countplot(df['loan_status'],
                  edgecolor = 'darkslategray',
                  palette = sns.color_palette('BrBG_r', 7))

g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_xlabel(None)
g.set_ylabel('Loan Count')
g.set_title(label = "Loan Status", fontsize=25, fontweight='bold', pad=20)

for p in g.patches:
    g.text(p.get_x() + p.get_width() * .5,
           p.get_height() + 1000,
           '{0:.2%}'.format(p.get_height()/len(df)),
           ha = 'center')

plt.show()

Great! We're only interested in loans that were either Fully Paid or Charged Off (i.e. Defaulted). Let's drop the other loan statuses (which represent <0.5% of the data) and change the values of Fully Paid to 0 and Charged Off to 1.

In [10]:
# Drop other loan_status outcomes
desired_statuses = ['Fully Paid', 'Charged Off']
df = df[df['loan_status'].isin(desired_statuses)]

# Change loan_status to 0 (Fully Paid) or 1 (Charged Off)
df['loan_status'] = df['loan_status'].apply(lambda x: 0 if x=='Fully Paid' else 1)

print("{:.0f} loans remain.".format(len(df)))
print()
df['loan_status'].value_counts()
221428 loans remain.

Out[10]:
0    186976
1     34452
Name: loan_status, dtype: int64

Correlations

Let's take a look at correlations between our features.

  • Correlations with Target Variable: Identify which features best predict defaults.
  • Correlations between Other Features: Identify which features are highly correlated and may be duplicative.

We'll assign numerical values to our non-numeric features using .astype('category').cat.codes. This converts the feature data to a categorical data type, identifies the unique values in lexicographical order, and assigns them the integers [0,1,2,...], respectively. For ordinal features like grade, sub_grade, and most of emp_length, this preserves the information contained in the feature's order. This blog post does a good job of explaining how this works in more detail.
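For example, here's a minimal sketch (toy values, not our actual data) of how cat.codes assigns integer codes:

# Toy illustration of .astype('category').cat.codes (hypothetical values)
s = pd.Series(['B', 'A', 'C', 'B'])
print(s.astype('category').cat.codes.tolist())   # [1, 0, 2, 1] -- categories are sorted (A, B, C) before coding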

In [11]:
plt.figure(figsize=(16, 12))
sns.set_context('paper', font_scale = 1)

sns.heatmap(df.assign(home_ownership = df.home_ownership.astype('category').cat.codes,
                      purpose = df.purpose.astype('category').cat.codes,
                      grade = df.grade.astype('category').cat.codes,
                      sub_grade = df.sub_grade.astype('category').cat.codes,
                      emp_length = df.emp_length.astype('category').cat.codes,
                      term = df.term.astype('category').cat.codes).corr(),
            annot = True, cmap = 'RdYlGn', vmin = -1, vmax = 1, linewidths = 0.5)
plt.show()

Correlations with Target Variable

  • While our features are not highly correlated with loan_status in general, we will visualize how loan_status varies with some of its more correlated features (as well as nominal features) later in this notebook.

Correlations between Features

  • int_rate, grade, and sub_grade are all highly correlated with each other (.95+) and provide duplicative information. Our tree algorithms will be most precise with a continuous numerical variable, so let's keep int_rate and drop grade and sub_grade.
  • loan_amnt is highly correlated with installment. Given installment is less correlated with loan_status, let's remove it as well.
In [12]:
df.drop(['grade','sub_grade','installment'], axis=1, inplace=True)

Visualizations

Time to explore our data with some visualizations!

Let's start by looking at the distribution of each feature individually. We'll finish up by investigating how loan_status varies across both its most-correlated features as well as nominal features (e.g. purpose).

Feature Distributions

In [13]:
fig, axes = plt.subplots(10,1,figsize=(16,60))

# List features & titles for each chart
var = ['int_rate', 'dti', 'fico', 'cr_hist_days', 'loan_amnt',
       'annual_inc', 'emp_length', 'home_ownership', 'purpose', 'term']

titles = ['Interest Rate', 'Debt to Income', 'Fico Score',
          'Length of Credit History (Days)', 'Loan Amount', 'Annual Income',
          'Employment Length', 'Home Ownership', 'Purpose', 'Term']

# Graph each feature by enumerating axes and using a for loop
for i, ax in enumerate(axes.flatten()):
    if i in [0,1,2,3,4]:
        sns.distplot(df[var[i]], ax = ax, bins = 80,
                     kde_kws = {'color' : 'darkolivegreen',
                                'label' : 'Kde',
                                'gridsize' : 1000,
                                'linewidth' : 3},
                     hist_kws = {'color' : 'goldenrod',
                                 'label' : "Histogram",
                                 'edgecolor' : 'darkslategray'})
    if i in [5]:
        sns.boxplot(df[var[i]], ax = ax)
    if i == 6:
        sns.countplot(df[var[i]], ax = ax,
                      order = ['N/A', '< 1 year', '1 year', '2 years', '3 years',
                             '4 years', '5 years', '6 years', '7 years', '8 years',
                             '9 years', '10+ years'])
    if i in [7, 8, 9]:
        sns.countplot(df[var[i]], ax = ax, order = df[var[i]].value_counts().index)
    if i == 8:
        ax.set_xticklabels(ax.get_xticklabels(), rotation = 30)
    ax.set_title(label = titles[i], fontsize = 25, fontweight = 'bold', pad = 15)
    ax.set_xlabel(None)

fig.suptitle('Individual Feature Distributions', position = (.52, 1.01),
             fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)
plt.show()

Observations

  • Interest Rate: Roughly normally distributed between 5-25%, mean around 13%.
  • Debt to Income: Roughly normally distributed between 0-35%, mean around 15%.
  • Fico Score: Lognormally distributed between 660-850. It would appear LendingClub's underwriting cutoff is 660.
  • Length of Credit History (Days): While the distribution does tail off to the right, it's fairly normally distributed within 0-10,000 days around a mean of about 5,000 days.
  • Loan Amount: Ranges between $0 and $35K with a slight right skew. Appears LendingClub did not issue loans above $35K.
  • Annual Income: Heavily skewed to the right.
  • Employment Length: No clear pattern, about a third of the population has 10+ years of employment.
  • Home Ownership: About 50% of borrowers have a mortgage, roughly 40% rent, and most of the rest own.
  • Purpose: More than half the borrowers are using their loans for debt consolidation, next most popular is to pay off a credit card balance.

Nominal & Correlated Features

Now let's take a look at how loan_status varies across our nominal and correlated features.

Let's start with the nominal features: purpose, term, and home_ownership

In [14]:
import matplotlib as mpl

fig = plt.figure(figsize = (14, 10))

g = fig.add_gridspec(2, 2)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[0, 1])
ax3 = fig.add_subplot(g[1, :])

axes = [ax1, ax2, ax3]

titles = ['Term', 'Home Ownership', 'Purpose']

var = ['term', 'home_ownership', 'purpose']

def to_percent(y,position):
    return str(str(int(round(y * 100, 0))) + "%")

for i, ax in enumerate(axes):
    sns.barplot(x = var[i], y = 'loan_status', data = df, palette = 'Blues',
                ax = ax, edgecolor = 'darkslategray')
    ax.set_ylabel('Default Rate')
    ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
    ax.set_xlabel(None)
    if i in [0, 1, 2]:
        ax.set_ylabel('Default Rate')
    ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
    j = 0
    for p in ax.patches:
        ax.text(p.get_x() + p.get_width() * .25, p.get_height() + .0025,
                '{0:.0%}'.format(p.get_height()), ha = 'center')
        j += 1
    ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

fig.suptitle('Nominal Features', position = (.5,1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 2)

Observations

  • Term: The probability of default on a 5-year loan is twice as high as the probability of default on a 3-year loan. This makes intuitive sense, as a longer term consists of more payments and thus more opportunities to default. More about this on Investopedia.
  • Home Ownership: There do not appear to be any significant trends between home_ownership and loan_status. While the average default rate appears high for the OTHER category, there are not enough samples for us to be confident it is truly higher.
  • Purpose: Clearly, certain loan purposes are riskier than others. For example, a small business loan (26% default rate) is much more risky than an auto loan (11% default rate).

Let's take a look at the most correlated features: fico & int_rate

In [15]:
fig = plt.figure(figsize = (10, 10))

g = fig.add_gridspec(2, 1)
ax1 = fig.add_subplot(g[0, 0])
ax2 = fig.add_subplot(g[1, 0])

axes = [ax1, ax2]

titles = ['Loans by Interest Rate', 'Loans by Fico',
          'Defaults by Interest Rate', 'Defaults by Fico']

var = ['int_rate', 'fico', 'int_rate', 'fico']

for i, ax in enumerate(axes):
    ax.hist(df[df['loan_status'] == 0][var[i]], bins = 25, color = 'blue',
            label = 'Fully Paid', alpha = .5)
    ax.hist(df[df['loan_status'] == 1][var[i]], bins = 25, color = 'red',
            label = 'Defaulted', alpha = .5)
    ax.legend()
    ax.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)
    ax.set_ylabel('Loan Count')
    if i == 0:
        ax.annotate('Share of defaulted loans\nincreases with interest rate',
                    xy = (21.5, 5000), xytext = (22, 10000),
                    arrowprops = dict(facecolor = 'Green',
                                      shrink = 0.05))
    if i == 1:
        ax.annotate('Share of defaulted loans\ndecreases with fico',
                    xy = (760, 7500), xytext = (780, 15000),
                    arrowprops = dict(facecolor = 'Green',
                                      shrink = 0.05))
for i in [2, 3]:
    sns.lmplot(var[i], 'loan_status', df, height = 5, aspect = 2, y_jitter = .04)
    h = plt.gca()
    h.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(to_percent))
    h.set(xlabel = None, ylabel = 'Default Rate', ylim = (-0.1, 1.19))
    h.set_title(label = titles[i], fontsize = 16, fontweight = 'bold', pad = 10)

fig.suptitle('Correlated Features', position = (.52, 1.06), fontsize = 30, fontweight = 'bold')
fig.tight_layout(h_pad = 4)

Observations

  • Fico: In general, the default rate decreases with fico. Note in the histogram how the ratio of defaulted loans to fully-paid loans decreases as fico increases. The lmplot also shows how defaults and fico are negatively correlated.
  • Interest Rate: In general, the default rate increases with interest rate. Note in the histogram how the ratio of defaulted loans to fully-paid loans increases as interest rate increases. The lmplot also shows how defaults and interest rate are positively correlated.

Additional Views

In this last section of visualizations, we'll explore relationships between a few different features, and we'll compare the behavior of defaults to fully paid loans across different subpopulations of the data.

Let's start by investigating the trend between int_rate and fico. In general, we would expect interest rate to decrease as fico increases. Not only does this make intuitive sense, but we see these features are negatively correlated (-0.54) in our correlation plot above. Let's see if this trend wavers at all within the four term by loan_status subpopulations.

In [16]:
g = sns.jointplot(x = 'fico', y = 'int_rate', data = df,
                  color = 'purple', kind = 'kde', height = 10)
g.fig.suptitle("Interest Rate by Fico", fontsize = 30, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.91)

h = sns.lmplot('fico', 'int_rate', df, row = 'loan_status', col = 'term',
               palette = 'Set1', height = 5)
h.fig.suptitle("Subpopulations: Term by Loan Status", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.9)

plt.show()

Observations

Okay cool, nothing out of the ordinary here. Across all term and loan status subpopulations, we see a consistent decrease in interest rate as fico increases.

Now, let's investigate the trend between int_rate and annual_inc. Assuming that borrowers who make more money are less likely to default, we would expect interest rate to decrease as income increases. However, per the correlation plot above, it appears that interest rate has almost no correlation with annual income whatsoever (-0.01). This seems odd, given it makes intuitive sense that higher income means less risk... Let's see if this lack of correlation is consistent within term by loan_status subpopulations as well.

Note: annual_inc is highly skewed. To best visualize trends in the data, we will use the log of annual income (log_annual_inc).

In [17]:
df['log_annual_inc'] = np.log(df['annual_inc'])

g = sns.lmplot('log_annual_inc', 'int_rate', df, height = 5,
               aspect = 2, palette = 'coolwarm', col = 'term')
g.fig.suptitle("Interest Rate by Log(Annual Income)", fontsize = 25, fontweight = 'bold')
g.fig.subplots_adjust(top = 0.75)


h = sns.lmplot('log_annual_inc', 'int_rate', df, hue = 'loan_status', height = 5,
               aspect = 2, palette = 'coolwarm', col = 'term')
h.fig.suptitle("Loan Status Breakout", fontsize = 20, fontweight = 'bold')
h.fig.subplots_adjust(top = 0.8)

plt.show()

df.drop(['log_annual_inc'], axis = 1, inplace = True)

Observations

  • For 36-month loans, the interest rate tends to decrease as income increases.
  • For 60-month loans the interest rate tends to increase as income increases.

I'm not sure why this would be... putting a pin in this for now.

Finally, let's take a look at loan_amnt by term.

In [18]:
plt.figure(figsize = (12.5, 2))
g = sns.boxplot(x = 'loan_amnt', y = 'term', data = df)
g.set_xlabel(None)
g.set_ylabel(None)
g.set(xticklabels=[], yticklabels=[])
g.set(xticks=[], yticks=[])
plt.suptitle("Distribution of Loan Amount", fontsize = 20, fontweight = 'bold', position = (.52, 1.2))

fig = sns.FacetGrid(df, hue = 'term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend()
plt.show()

print('\n')

df['loan_status_term'] =  df['term'] + df['loan_status'].apply(lambda x: ' default' if x==1 else ' fully paid')

plt.figure(figsize = (12.5, 2))
h = sns.boxplot(x = 'loan_amnt', y = 'loan_status_term', data = df, order = sorted(df['loan_status_term'].unique()))
h.set_xlabel(None)
h.set_ylabel(None)
h.set(xticklabels=[], yticklabels=[])
h.set(xticks=[], yticks=[])
plt.suptitle("Loan Status Breakout", fontsize = 18, fontweight = 'bold', position = (.51, 1.2))

fig = sns.FacetGrid(df, hue = 'loan_status_term', aspect = 2.5, height = 5)
fig.map(sns.kdeplot, 'loan_amnt', shade = True)
fig.set(xlim = (0, df['loan_amnt'].max()), yticks=[])
fig.add_legend(label_order = sorted(df['loan_status_term'].unique()))
plt.show()

df.drop(['loan_status_term'], axis = 1, inplace = True)

Observations

  • Overall, the loan amount for 60-month loans tends to be much higher than for 36-month loans. The average 60-month loan is for about $20K, whereas the average 36-month loan is about half that much.

  • Few 36-month loans exceed $25K.

  • The distribution of defaulted loans is very similar to fully paid loans, regardless of term.

Additional Preprocessing

Note: Tree-based algorithms are not sensitive to feature magnitudes, so standardizing our data (e.g. scaling and normalizing) is not necessary before fitting our three tree-based models. Here's a helpful article that explains this in more detail: When and Why to Standardize Your Data. However, at the end of this notebook I'd like to compare the results of my tree-based models to a neural network, for which standardizing data is recommended (link). We'll standardize the data now to keep everything consistent for all models.
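As a quick toy-data illustration of the first point, a decision tree's fit shouldn't change when we standardize its inputs, since scaling is a monotonic transformation of each feature. This is just a sketch on synthetic data, not part of our pipeline:

# Toy check: tree accuracy is unaffected by standardizing features (scaling is monotonic per feature)
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, random_state=13)
raw_score = DecisionTreeClassifier(max_depth=3, random_state=13).fit(X_toy, y_toy).score(X_toy, y_toy)
scaled_score = DecisionTreeClassifier(max_depth=3, random_state=13).fit(
    StandardScaler().fit_transform(X_toy), y_toy).score(X_toy, y_toy)
print(raw_score, scaled_score)   # the two training accuracies should match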

Box-Cox

First, we'll normalize our features with a Box-Cox transformation.

In [19]:
# Boxcox transform
from scipy.stats import boxcox

# Box-Cox requires strictly positive values, so only transform features with min > 0
numerical = df.columns[df.dtypes == 'float64']
for i in numerical:
    if df[i].min() > 0:
        transformed, lamb = boxcox(df.loc[df[i].notnull(), i])
        # Keep the transform only if the fitted lambda differs meaningfully from 1
        # (a lambda near 1 is essentially a linear shift, so the feature is left as-is)
        if np.abs(1 - lamb) > 0.02:
            df.loc[df[i].notnull(), i] = transformed

Dummy Variables

Next, let's create dummy variables for our categorical features.

In [20]:
df_final = pd.get_dummies(df, drop_first = True)
df_final.head(3)
Out[20]:
loan_status int_rate dti loan_amnt delinq_2yrs annual_inc fico cr_hist_days term_ 60 months home_ownership_NONE home_ownership_OTHER home_ownership_OWN home_ownership_RENT purpose_credit_card purpose_debt_consolidation purpose_educational purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding emp_length_10+ years emp_length_2 years emp_length_3 years emp_length_4 years emp_length_5 years emp_length_6 years emp_length_7 years emp_length_8 years emp_length_9 years emp_length_< 1 year emp_length_N/A
0 0 5.004289 27.65 66.423672 0 6.890360 0.103197 9830 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 1 6.546006 1.00 50.252940 0 6.988011 0.103197 4627 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
2 0 6.757855 8.72 49.428550 0 6.585215 0.103197 3682 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0

Scaling

Great! Now, let's define our train and test populations and scale our data.

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [22]:
X = df_final.drop('loan_status', axis = 1)
y = df_final['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 13)

# Scale data
sc = StandardScaler()
numerical = X_train.columns[(X_train.dtypes =='float64') | (X_train.dtypes == 'int64')].tolist()
X_train[numerical] = sc.fit_transform(X_train[numerical])
X_test[numerical] = sc.transform(X_test[numerical])

X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

Confusion Matrix Function

Before we proceed with training & testing our classifiers, let's create a function to cleanly plot a confusion matrix.

In [23]:
# Define function to plot confusion matrix

import itertools

def plot_cm(cm, classes, normalize = False, title = 'Confusion matrix', cmap = 'Blues'):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title, fontsize = 14)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')

    fmt = '.4f' if normalize else 'd'
    thresh = cm.max() / 1.5
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment = "center",
                 color = "white" if cm[i, j] > thresh else "black")
    plt.tight_layout()

And last, let's check our class imbalance.

In [24]:
df['loan_status'].value_counts()/len(df)
Out[24]:
0    0.84441
1    0.15559
Name: loan_status, dtype: float64

Tree-Based Classifiers

Alright, we're ready to build our tree-based classifiers! We'll build a simple Decision Tree, a Random Forest, and an XGBoosted Tree.

Let's start with our Decision Tree.

Decision Tree

First, a few implementation details:

SMOTE Oversampling: There is a significant class imbalance in our data (84% Fully Paid, 16% Defaulted). It is important that we create a 50/50 split in our training data because otherwise we risk the model simply learning to favor the majority class (always predicting "Fully Paid" would already give 84% accuracy). Another way to think about this is that we want to train our model to recognize the characteristics of defaults, rather than assume that most loans will pay in full.

To create this 50/50 split in our training data, we'll oversample the defaulted loans using the SMOTE technique. SMOTE increases the number of loans in the minority class (Defaulted) by creating synthetic samples interpolated between existing minority-class loans and their nearest neighbors. Synthetic loans continue to be created until the counts of the majority class (Fully Paid) and the minority class (Defaulted) are equal.
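To make this concrete, here's a minimal sketch (separate from the modeling pipeline below) of the class counts before and after SMOTE; X_train and y_train are the arrays defined in the Scaling step above:

# Class counts before and after SMOTE oversampling (illustration only)
from collections import Counter
from imblearn.over_sampling import SMOTE

print(Counter(y_train))   # imbalanced: ~84% class 0, ~16% class 1
X_bal, y_bal = SMOTE(sampling_strategy='minority', random_state=13).fit_resample(X_train, y_train)
print(Counter(y_bal))     # both classes now have equal counts
# Note: older versions of imbalanced-learn call this method fit_sample instead of fit_resample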

GridSearchCV: To optimize our model's performance, we'll tune our hyperparameters by looking at all combinations with GridSearchCV.

make_pipeline: We'll create a pipeline with make_pipeline to train our model on SMOTE oversampled data and cross-validate our results using GridSearchCV to ensure that we choose the best hyperparameters.

In [60]:
# Import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# Define hyperparameter values to grid search
tree_params = {"criterion" : ["gini", "entropy"],
               "max_depth" : [4],
               "min_samples_leaf" : [2],
               "min_samples_split" : [2]}

# Tune hyperparameters with GridSearchCV
grid_tree = GridSearchCV(estimator = DecisionTreeClassifier(),
                               param_grid = tree_params,
                               cv = 3,
                               verbose = 0)

# Train Decision Tree on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_tree)
pipeline.fit(X_train, y_train)

# Choose best hyperparameters and make predictions
grid_tree_best = grid_tree.best_estimator_
predictions = grid_tree_best.predict(X_test)

# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
tree_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']

plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", normalize = True)
plt.show()
              precision    recall  f1-score   support

           0       0.89      0.72      0.80     56027
           1       0.25      0.51      0.34     10402

    accuracy                           0.69     66429
   macro avg       0.57      0.62      0.57     66429
weighted avg       0.79      0.69      0.72     66429

============================================================
Normalized confusion matrix
In [61]:
plt.figure(figsize = (6, 5))
plot_cm(cm = tree_cm, classes = labels, title = "Decision Tree\nConfusion Matrix", cmap = 'Greens')
Confusion matrix, without normalization

Alright, looks like our Decision Tree model was able to recognize 51% of defaults and 72% of fully paid loans. Another way to look at this is that our model sacrificed 15,702 safe loans to identify 5,337 defaults.

Let's take a look at the optimal hyperparameters chosen by GridSearchCV:

In [62]:
grid_tree.best_params_
Out[62]:
{'criterion': 'entropy',
 'max_depth': 4,
 'min_samples_leaf': 2,
 'min_samples_split': 2}

Looks like entropy outperformed gini for measuring impurity. Since we supplied only a single value each for max_depth (4), min_samples_leaf (2), and min_samples_split (2), the impurity criterion was the only hyperparameter actually tuned in this grid.

Let's see which features were most important in our model. To do this, we'll use the attribute .feature_importances_, which returns the importance of each feature as its (normalized) total reduction in impurity. Here's more detail on how feature importances are measured.
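As a quick sanity check that these importances are indeed normalized, they should sum to 1 for the fitted tree:

# Feature importances are normalized, so they sum to 1
print(grid_tree_best.feature_importances_.sum())   # ~1.0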

In [63]:
tree_model = grid_tree_best

feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)

plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Greens_r',
                data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 30)
g.set_ylabel('Relative Importance')
g.set_title(label = "Decision Tree Feature Importance", fontsize = 18,
            fontweight = 'bold', pad = 20)
plt.show()

Interest rate was by far the most important feature for predicting defaults... which makes sense. Term also played a role in prediction, which seems consistent with the stark difference in default rate we observed between 36 and 60 month loans earlier in our analysis.

Let's visualize the decision tree to see exactly how our model is making decisions.

In [64]:
# Visualize grid_tree_best
from sklearn import tree

plt.figure(figsize = (30, 10))
tree.plot_tree(grid_tree.best_estimator_,
               feature_names = df_final.drop('loan_status', axis = 1).columns,
               class_names = ['Fully Paid', 'Defaulted'],
               filled = True)
plt.show()

Great! We can see what a significant role interest rate plays in reducing entropy throughout our model. For more on how to read/visualize decision trees, check out this article.

Finally, let's create a dataframe to track the results of our models to compare their performances later on.

In [80]:
model_df = pd.DataFrame(columns = ['model', 'false_negatives', 'false_positives'])
model_df.loc[len(model_df)] = ['Baseline', sum(y_test), 0]
model_df.loc[len(model_df)] = ['Decision Tree', tree_cm[1, 0], tree_cm[0, 1]]
model_df
Out[80]:
model false_negatives false_positives
0 Baseline 10402 0
1 Decision Tree 5065 15702

Random Forest

Awesome! Now let's train and test our Random Forest model.

A Random Forest classification model consists of many Decision Trees. It performs classification by choosing the most common class predicted by its trees. Understanding Random Forests

RandomizedSearchCV: To optimize the performance of our Random Forest, we'll tune our hyperparameters with RandomizedSearchCV instead of GridSearchCV. With GridSearchCV, every combination of hyperparameters is tried. RandomizedSearchCV, on the other hand, tries only a predefined number (n_iter) of random combinations of hyperparameters. Using RandomizedSearchCV is preferable here because testing every combination of hyperparameters would be too costly and take too much time. RandomizedSearchCV will also be preferable for training our XGBoost model, where we'll want to define certain hyperparameters as continuous distributions. This article gives a good, high-level comparison of GridSearch and RandomizedSearch.

In [78]:
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define hyperparameter values to random search
rf_params = {'n_estimators' : [100, 300, 600, 1000],
             'max_features' : ['auto', 'log2'],
             'max_depth' : randint(2, 5),
             'min_samples_split' : randint(2, 5),
             'min_samples_leaf' : randint(2, 5),
             'bootstrap' : [True, False]}

# Tune hyperparameters with RandomizedSearchCV
grid_rf = RandomizedSearchCV(estimator = RandomForestClassifier(),
                             param_distributions = rf_params,
                             n_iter = 200,
                             cv = 3,
                             verbose = 0,
                             random_state = 13,
                             n_jobs = -1)

# Train Random Forest on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_rf)
pipeline.fit(X_train, y_train)

# Choose best hyperparameters and make predictions
grid_rf_best = grid_rf.best_estimator_
predictions = grid_rf_best.predict(X_test)

# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
rf_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", normalize = True)
plt.show()
              precision    recall  f1-score   support

           0       0.90      0.65      0.76     56027
           1       0.24      0.61      0.35     10402

    accuracy                           0.64     66429
   macro avg       0.57      0.63      0.55     66429
weighted avg       0.80      0.64      0.69     66429

============================================================
Normalized confusion matrix
In [79]:
plt.figure(figsize = (6, 5))
plot_cm(cm = rf_cm, classes = labels, title = "Random Forest\nConfusion Matrix", cmap = 'Purples')
Confusion matrix, without normalization

Good, looks like our Random Forest model was able to recognize 61% of defaults and 65% of fully paid loans. Another way to look at this is that our model sacrificed 19,532 safe loans to identify 6,323 defaults.

Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV:

In [87]:
grid_rf.best_params_
Out[87]:
{'n_estimators': 600,
 'min_samples_split': 3,
 'min_samples_leaf': 4,
 'max_features': 'log2',
 'max_depth': 4,
 'bootstrap': True}

Looks like bootstrapping performed best, and that the log2 method of selecting features to consider at each split also performed best. The model chose our highest max_depth value (4), our highest min_samples_leaf value (4), and a min_samples_split value of 3. Finally, our Random Forest performed best with 600 decision trees.

Let's see which features were most important in our model.

In [34]:
tree_model = grid_rf_best

feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)

plt.figure(figsize = (10, 6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Purples_r',
                data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "Random Forest Feature Importance", fontsize = 18,
            fontweight = 'bold', pad = 20)
plt.show()

Once again, we see that interest rate and loan term are the two most important features. In our Random Forest, however, term seems to play a larger role than it did in our simple Decision Tree model. We also see a handful of additional features with lower importance (loan_amnt, home_ownership, dti, etc.).

In [81]:
model_df.loc[len(model_df)] = ['Random Forest', rf_cm[1, 0], rf_cm[0, 1]]
model_df
Out[81]:
model false_negatives false_positives
0 Baseline 10402 0
1 Decision Tree 5065 15702
2 Random Forest 4079 19532

XGBoost

Great! Now we're ready to train and test our XGBoost Tree model.

XGBoosted Trees are similar to Random Forests, in that both models combine the results of a set of Decision Trees. XGBoost differs from Random Forest in the way it builds those Decision Trees. Random Forests build each tree independently. XGBoost builds one tree at a time in a forward stage-wise manner, with each new weak learner added to correct the errors of the ensemble built so far. More on this.

By carefully tuning parameters, gradient boosting can result in better performance than random forests. However, gradient boosting may not be a good choice if you have a lot of noise, as it can result in overfitting. It also tends to be harder to tune than random forests.
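As an illustration of the stage-wise idea (using scikit-learn's GradientBoostingClassifier on toy data as a stand-in, since XGBoost follows the same principle), each added tree refines the fit of the ensemble built so far:

# Toy illustration of stage-wise boosting: training accuracy after each added tree
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_toy, y_toy = make_classification(n_samples=500, random_state=13)
gb = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=13).fit(X_toy, y_toy)
for stage, pred in enumerate(gb.staged_predict(X_toy), start=1):
    print(stage, round((pred == y_toy).mean(), 3))   # accuracy typically improves as trees are added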

In [36]:
# Import libraries
import xgboost as xgb
from scipy.stats import uniform

# Define hyperparameter values to random search
# Note: min_samples_split & min_samples_leaf are sklearn-style parameter names
# that XGBClassifier does not use (see the warning in the output below)
xgb_params = {'n_estimators' : randint(50, 300),
              'max_depth' : randint(2, 5),
              'min_samples_split' : randint(2, 5),
              'min_samples_leaf' : randint(2, 5),
              'min_child_weight' : uniform(loc = 1, scale = 0.5),
              'gamma' : uniform(loc = 0.6, scale = 0.4),
              'reg_lambda' : uniform(loc = 1, scale = 2),
              'reg_alpha' : uniform(loc = 0, scale = 1),
              'learning_rate' : uniform(loc = .001, scale = .009)}

# Tune hyperparameters with RandomizedSearchCV
grid_xgb = RandomizedSearchCV(estimator = xgb.XGBClassifier(),
                              param_distributions = xgb_params,
                              n_iter = 250,
                              cv = 3,
                              verbose = 0,
                              random_state = 13,
                              n_jobs = -1)

# Train XGBoost on data balanced with SMOTE oversampling
pipeline = make_pipeline(SMOTE(sampling_strategy = 'minority'), grid_xgb)
pipeline.fit(X_train, y_train)

# Choose best hyperparameters and make predictions
grid_xgb_best = grid_xgb.best_estimator_
predictions = grid_xgb_best.predict(X_test)

# Display results
print(classification_report(y_test, predictions))
print("=" * 60)
xgb_cm = confusion_matrix(y_test, predictions)
labels = ['Fully Paid', 'Defaulted']
plt.figure(figsize = (6, 5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", normalize = True)
plt.show()
[04:29:08] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { min_samples_leaf, min_samples_split } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


              precision    recall  f1-score   support

           0       0.89      0.68      0.77     56027
           1       0.25      0.57      0.34     10402

    accuracy                           0.66     66429
   macro avg       0.57      0.62      0.56     66429
weighted avg       0.79      0.66      0.70     66429

============================================================
Normalized confusion matrix
In [37]:
plt.figure(figsize = (6,5))
plot_cm(cm = xgb_cm, classes = labels, title = "XGBoost\nConfusion Matrix", cmap = 'Reds')
Confusion matrix, without normalization

Great, looks like our XGBoost model was able to recognize 57% of defaults and 68% of fully paid loans. Another way to look at this is that our model sacrificed 18,125 safe loans to identify 5,919 defaults.

Let's take a look at the optimal hyperparameters chosen by RandomizedSearchCV, as well as the features that were most important in our model:

In [38]:
grid_xgb.best_params_
Out[38]:
{'gamma': 0.9701785202144155,
 'learning_rate': 0.009272734287074282,
 'max_depth': 4,
 'min_child_weight': 1.344238993697839,
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 293,
 'reg_alpha': 0.9996842886452724,
 'reg_lambda': 1.6056878845383316}
In [39]:
tree_model = grid_xgb_best

feat = pd.DataFrame(columns = ['Feature', 'Importance'])
feat['Feature'] = X.columns
feat['Importance'] = tree_model.feature_importances_
feat.sort_values(by = 'Importance', ascending = False, inplace = True)

plt.figure(figsize = (10,6))
g = sns.barplot(x = 'Feature', y = 'Importance', palette = 'Reds_r',
                data = feat[feat['Importance'] != 0])
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.set_ylabel('Relative Importance')
g.set_title(label = "XGBoost Feature Importance", fontsize = 18,
            fontweight = 'bold', pad = 20)
plt.show()

Interesting! Term is now at the top of the list, followed by home ownership type and then interest rate. A purpose of debt consolidation also played a significant role.

In [82]:
model_df.loc[len(model_df)] = ['XGBoost', xgb_cm[1, 0], xgb_cm[0, 1]]
model_df
Out[82]:
model false_negatives false_positives
0 Baseline 10402 0
1 Decision Tree 5065 15702
2 Random Forest 4079 19532
3 XGBoost 4483 18125

Neural Network

Now let's implement a simple Neural Network to see how it performs against our tree-based models!

To create our Neural Network, let's use a first hidden layer with one node per feature (36 nodes), a second hidden layer with 38 nodes, and a two-node softmax output layer that classifies each loan as 0 (Fully Paid) or 1 (Defaulted).

In [41]:
import itertools
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
In [42]:
sm = SMOTE(sampling_strategy = 'minority', random_state = 13)
X_train_sm, y_train_sm = sm.fit_sample(X_train, y_train)

NN = Sequential([Dense(X_train.shape[1], input_shape = (X_train.shape[1], ), activation = 'relu'),
                 Dense(38, activation = 'relu'),
                 Dense(2, activation = 'softmax')])

NN.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 36)                1332
_________________________________________________________________
dense_1 (Dense)              (None, 38)                1406
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 78
=================================================================
Total params: 2,816
Trainable params: 2,816
Non-trainable params: 0
_________________________________________________________________
In [43]:
NN.compile(Adam(lr = 0.001), metrics = ['accuracy'], loss = 'sparse_categorical_crossentropy')

NN.fit(X_train_sm, y_train_sm, validation_split = 0.2, batch_size = 250, epochs = 25,
       shuffle = True, verbose = 2)

pred = NN.predict_classes(X_test)
Epoch 1/25
839/839 - 1s - loss: 0.6103 - accuracy: 0.6595 - val_loss: 0.8870 - val_accuracy: 0.3590
Epoch 2/25
839/839 - 1s - loss: 0.6020 - accuracy: 0.6682 - val_loss: 0.9421 - val_accuracy: 0.3353
Epoch 3/25
839/839 - 1s - loss: 0.5995 - accuracy: 0.6706 - val_loss: 0.8495 - val_accuracy: 0.4169
Epoch 4/25
839/839 - 1s - loss: 0.5974 - accuracy: 0.6716 - val_loss: 0.8592 - val_accuracy: 0.4213
Epoch 5/25
839/839 - 1s - loss: 0.5955 - accuracy: 0.6740 - val_loss: 0.8130 - val_accuracy: 0.4760
Epoch 6/25
839/839 - 1s - loss: 0.5942 - accuracy: 0.6744 - val_loss: 0.8410 - val_accuracy: 0.4393
Epoch 7/25
839/839 - 1s - loss: 0.5925 - accuracy: 0.6760 - val_loss: 0.8439 - val_accuracy: 0.4258
Epoch 8/25
839/839 - 1s - loss: 0.5910 - accuracy: 0.6773 - val_loss: 0.8496 - val_accuracy: 0.4275
Epoch 9/25
839/839 - 1s - loss: 0.5895 - accuracy: 0.6790 - val_loss: 0.8021 - val_accuracy: 0.4807
Epoch 10/25
839/839 - 1s - loss: 0.5880 - accuracy: 0.6799 - val_loss: 0.8205 - val_accuracy: 0.4562
Epoch 11/25
839/839 - 1s - loss: 0.5865 - accuracy: 0.6822 - val_loss: 0.9166 - val_accuracy: 0.3623
Epoch 12/25
839/839 - 1s - loss: 0.5850 - accuracy: 0.6826 - val_loss: 0.8066 - val_accuracy: 0.4792
Epoch 13/25
839/839 - 1s - loss: 0.5840 - accuracy: 0.6837 - val_loss: 0.8184 - val_accuracy: 0.4659
Epoch 14/25
839/839 - 1s - loss: 0.5823 - accuracy: 0.6852 - val_loss: 0.8772 - val_accuracy: 0.3973
Epoch 15/25
839/839 - 1s - loss: 0.5809 - accuracy: 0.6865 - val_loss: 0.8303 - val_accuracy: 0.4716
Epoch 16/25
839/839 - 1s - loss: 0.5800 - accuracy: 0.6877 - val_loss: 0.8190 - val_accuracy: 0.4653
Epoch 17/25
839/839 - 1s - loss: 0.5786 - accuracy: 0.6899 - val_loss: 0.8657 - val_accuracy: 0.4170
Epoch 18/25
839/839 - 1s - loss: 0.5773 - accuracy: 0.6903 - val_loss: 0.8040 - val_accuracy: 0.4834
Epoch 19/25
839/839 - 1s - loss: 0.5762 - accuracy: 0.6906 - val_loss: 0.7489 - val_accuracy: 0.5455
Epoch 20/25
839/839 - 1s - loss: 0.5751 - accuracy: 0.6924 - val_loss: 0.8648 - val_accuracy: 0.4243
Epoch 21/25
839/839 - 1s - loss: 0.5742 - accuracy: 0.6931 - val_loss: 0.8014 - val_accuracy: 0.4964
Epoch 22/25
839/839 - 1s - loss: 0.5730 - accuracy: 0.6933 - val_loss: 0.8173 - val_accuracy: 0.4535
Epoch 23/25
839/839 - 1s - loss: 0.5716 - accuracy: 0.6944 - val_loss: 0.8510 - val_accuracy: 0.4292
Epoch 24/25
839/839 - 1s - loss: 0.5709 - accuracy: 0.6960 - val_loss: 0.7756 - val_accuracy: 0.5166
Epoch 25/25
839/839 - 1s - loss: 0.5703 - accuracy: 0.6966 - val_loss: 0.7612 - val_accuracy: 0.5353
WARNING:tensorflow:From <ipython-input-43-7fbea7402709>:6: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).
In [44]:
NN_cm = confusion_matrix(y_test, pred)
labels = ['Fully Paid', 'Defaulted']

plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", normalize = True)
Normalized confusion matrix
In [45]:
plt.figure(figsize = (6, 5))
plot_cm(cm = NN_cm, classes = labels, title = "NN\nConfusion Matrix", cmap = 'Oranges')
Confusion matrix, without normalization

Awesome, looks like our Neural Network was able to recognize 39% of defaults and 80% of fully paid loans. Another way to look at this is that our model sacrificed 11,183 safe loans to identify 4,025 defaults.

In [83]:
model_df.loc[len(model_df)] = ['Neural Network', NN_cm[1, 0], NN_cm[0, 1]]
model_df
Out[83]:
model false_negatives false_positives
0 Baseline 10402 0
1 Decision Tree 5065 15702
2 Random Forest 4079 19532
3 XGBoost 4483 18125
4 Neural Network 6377 11183

Conclusion

We're ready to compare the results of our models.

Evaluating Performance: A perfect model would be able to predict all loans correctly, allowing us to avoid all future defaults and only accept loans that will pay in full. While that's a nice thought, none of our models are perfect. Instead, we want to find which of our models does the best job of minimizing total loss. Loss comes from misclassifications. There are two types:

  1. Misclassifying loans that will default: When our model approves a loan that will default in the future, we lose the remaining balance plus interest to be paid. These are false_negatives.

  2. Misclassifying loans that will pay in full: When our model denies loans that will pay in full, we lose the interest we could have made on that loan. These are false_positives.

We can calculate our total loss as:

Total Loss = (false_negatives * false_negative_avg_cost) + (false_positives * false_positive_avg_cost)

While we don't know the average cost of false negatives or the average cost of false positives, we can look at the ratio between the two to evaluate the above equation. Using this ratio allows us to simplify the above equation as follows:

loss_ratio = false_positive_avg_cost / false_negative_avg_cost

=>    Total Loss = (false_negatives * false_negative_avg_cost) +
                   (false_positives * false_negative_avg_cost * loss_ratio)

=>    Total Loss / false_negative_avg_cost = false_negatives + false_positives * loss_ratio

The best model is the one that minimizes false_negatives + false_positives * loss_ratio for a given loss_ratio.

It makes sense that false negatives are more costly than false positives (i.e. we lose more misclassifying defaults than safe loans). Thus, we'll just look at loss_ratios between 0 and 1 to evaluate model performance.
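To make the comparison concrete before plotting, here's a small sketch that evaluates the scaled total loss at one illustrative loss_ratio (0.25), using the counts in model_df above:

# Scaled total loss (in units of false_negative_avg_cost) at an illustrative loss_ratio of 0.25
example_ratio = 0.25
print(model_df.assign(scaled_loss = model_df['false_negatives'] + model_df['false_positives'] * example_ratio))
# At this ratio the Random Forest gives the lowest scaled loss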

In [147]:
loss_ratio = np.linspace(0, 1, 1000) #false_positive_avg_cost / false_negative_avg_cost

# Plot total loss for all models across loss_ratio 0-1
plt.figure(figsize = (16, 10))

for i in range(len(model_df)):
    x = loss_ratio
    y = model_df['false_negatives'][i] * np.ones(len(loss_ratio)) + model_df['false_positives'][i] * loss_ratio
    plt.plot(x, y)

plt.xticks(np.arange(0, 1, 0.02), rotation = 90)
plt.grid()
plt.xlabel('\nLoss Ratio\n\n(cost of misclassifying a safe loan) /\n(cost of misclassifying a default)')
plt.ylabel('Total Loss\n\n(divided by avg cost of misclassifying a default)')
plt.title('\nComparing Model Performance: Total Loss', fontsize = 20, pad = 20, fontweight = 'bold')
labels = model_df['model']
plt.legend(labels)


plt.show()

Great! Looking at the minimum values in the graph above, we should choose the following model under each loss_ratio condition.

  • loss_ratio < 27% -- Random Forest: Choose the Random Forest model if the expected loss from misclassifying a safe loan is less than 27% of the expected loss from misclassifying a default. Note that XGBoost closes the gap on the Random Forest as we approach the 27% loss_ratio, and outperforms it above this level.
  • 27% <= loss_ratio < 36% -- Neural Network: Choose the Neural Network model if the expected loss from misclassifying a safe loan is between 27% and 36% of the expected loss from misclassifying a default.
  • 36% <= loss_ratio -- Baseline: Choose the Baseline model (assume all loans are safe) if the expected loss from misclassifying a safe loan is at least 36% of the expected loss from misclassifying a default.

For no loss_ratio was the Decision Tree model the best choice. Why Random Forests Outperform Decision Trees

Final Thoughts

So what's a reasonable loss_ratio to assume? Well, that depends on how much we expect to make on a safe loan vs. how much we expect to lose on a default. While both of these values depend on a wide range of factors, we can (grossly) oversimplify our estimation as follows:

  • Assumptions

    • $1000 loan amount
    • 5-yr loan
    • Assume a 5% incremental annual return (LendingClub takes a cut, and we're comparing against alternative investments like a 5-yr CD)
    • Assume we only make half our money back if a default occurs.
  • Amount we expect to make if loan is fully paid (link to calculator) = $132.27

  • Amount we expect to lose on default = $566.14

loss_ratio = 132.27 / 566.14 = 23.4%
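Here's a quick sanity check of this arithmetic under the assumptions above (standard amortization, half recovery on default):

# Sanity check of the loss_ratio estimate ($1000 loan, 5-yr term, 5% incremental annual return)
principal, annual_rate, n_months = 1000, 0.05, 60
r = annual_rate / 12
monthly_payment = principal * r / (1 - (1 + r) ** -n_months)   # standard amortization formula
total_paid = monthly_payment * n_months
gain_if_fully_paid = total_paid - principal      # ~132.27
loss_if_default = total_paid / 2                 # assume we only make half our money back -> ~566.14
print(round(gain_if_fully_paid, 2), round(loss_if_default, 2),
      round(gain_if_fully_paid / loss_if_default, 3))          # ~0.234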

In this case, our best model would be the Random Forest.

Thank you!

Thanks for going on this journey with me! If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 🌲😊🌲

In [148]:
! jupyter nbconvert --to html Personal_Loans_and_Decision_Trees.ipynb
[NbConvertApp] Converting notebook Personal_Loans_and_Decision_Trees.ipynb to html
[NbConvertApp] Writing 4344996 bytes to Personal_Loans_and_Decision_Trees.html
In [ ]: