Credit Card Fraud Detection

First, I'd like to thank Janio Martinez & Pavan Sanagapati for inspiring me to take on this project. Both have provided excellent, in-depth analyses of this Kaggle Dataset, and I've learned a great deal from both.

I'd also like to thank Marco Altini for his article Dealing with Imbalanced Data: Undersampling, Oversampling, and Proper Cross-Validation, which helped me better understand many of the concepts in this notebook.

If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 😊

Introduction

In this kernel, our overall goal is to develop the best model for classifying both fraud and safe transactions correctly, working with Kaggle's Credit Card Fraud Detection dataset.

The Data

We are given a dataset of roughly 285K transactions to work with, for which we have the following features: Time, Amount, and 28 anonymized features (V1-V28) which are the result of a previously-performed PCA. We know that these anonymized features have already been scaled, but Time and Amount have not. We also have our target variable Class, which tells us if transactions are fraudulent.

Goals

  • Perform initial Exploratory Data Analysis.
  • Balance the data through random undersampling, use this balanced data to determine which features are most highly correlated with fraud, and address the most influential extreme outliers.
  • Train, cross-validate & test a variety of different baseline classifiers on our undersampled balanced data to determine which best recognizes fraud.
  • Cross-validate and test our best baseline model on the original data to determine its true performance.
  • Investigate if we can improve our baseline model by training it on oversampled (SMOTE) balanced data, rather than undersampled balanced data.
  • Train Neural Networks with both undersampling and oversampling approaches to test which performs better.
  • Identify which model performed best overall.

Our approach includes:

  1. Exploratory Data Analysis & Preprocessing
    • View imbalance between fraud & safe transactions
    • Identify null values
    • View feature distributions
    • Perform feature scaling where needed
  2. Random Undersampling
    • Identify true correlations
    • Remove extreme outliers
  3. Dimensionality Reduction Gutcheck
    • Perform PCA, T-SNE, Truncated SVD
    • Infer how well our classifiers might perform
  4. Baseline Classification Models
    • Training & cross-validation on undersampled data
    • Optimize hyperparameters with GridSearchCV
    • View learning curves to check if models are under/overfitting
    • Check ROC curve performance
    • Test models with undersampled data, view confusion matrices
    • Determine which model best classifies fraud (logistic regression)
  5. Logistic Regression: Undersampling (NearMiss) Performance
    • Train logistic regression using undersampled (NearMiss) data
    • Cross-Validate on the original data, compare performance to cross-validation on undersampled data
    • Test performance on final test data, look at precision-recall score
  6. Logistic Regression: Oversampling (SMOTE) Performance
    • Train logistic regression using oversampled (SMOTE) data
    • Cross-Validate on the original data, compare performance to undersampled model
    • Cross-Validate on the undersampled test data, compare performance to undersampled model
    • Test performance on final test data, compare both under & oversampled logistic regression models
    • Using accuracy, precision-recall score, and confusion matrices, determine whether undersampling (NearMiss) or oversampling (SMOTE) gives better results for logistic regression
  7. Neural Networks
    • Create two simple neural network classifiers
    • Train the first using NearMiss undersampling, test performance with original test data
    • Train the second using SMOTE oversampling, test performance with original test data
    • Using confusion matrices, compare results
    • Determine which model performed best

Let's begin by loading the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict, train_test_split
from sklearn.model_selection import StratifiedKFold, learning_curve, ShuffleSplit, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve, accuracy_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, average_precision_score
from scipy import stats
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

Exploratory Data Analysis & Preprocessing

First, let's take a high-level view of our data to get a good sense of what we're working with. In particular, let's look at summary statistics for each feature, the balance between fraud & safe transactions, and null values.

In [2]:
txn = pd.read_csv('creditcard.csv')
In [3]:
txn
Out[3]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284802 172786.0 -11.881118 10.071785 -9.834783 -2.066656 -5.364473 -2.606837 -4.918215 7.305334 1.914428 ... 0.213454 0.111864 1.014480 -0.509348 1.436807 0.250034 0.943651 0.823731 0.77 0
284803 172787.0 -0.732789 -0.055080 2.035030 -0.738589 0.868229 1.058415 0.024330 0.294869 0.584800 ... 0.214205 0.924384 0.012463 -1.016226 -0.606624 -0.395255 0.068472 -0.053527 24.79 0
284804 172788.0 1.919565 -0.301254 -3.249640 -0.557828 2.630515 3.031260 -0.296827 0.708417 0.432454 ... 0.232045 0.578229 -0.037501 0.640134 0.265745 -0.087371 0.004455 -0.026561 67.88 0
284805 172788.0 -0.240440 0.530483 0.702510 0.689799 -0.377961 0.623708 -0.686180 0.679145 0.392087 ... 0.265245 0.800049 -0.163298 0.123205 -0.569159 0.546668 0.108821 0.104533 10.00 0
284806 172792.0 -0.533413 -0.189733 0.703337 -0.506271 -0.012546 -0.649617 1.577006 -0.414650 0.486180 ... 0.261057 0.643078 0.376777 0.008797 -0.473649 -0.818267 -0.002415 0.013649 217.00 0

284807 rows × 31 columns

In [4]:
txn.describe().T
Out[4]:
count mean std min 25% 50% 75% max
Time 284807.0 9.481386e+04 47488.145955 0.000000 54201.500000 84692.000000 139320.500000 172792.000000
V1 284807.0 3.919560e-15 1.958696 -56.407510 -0.920373 0.018109 1.315642 2.454930
V2 284807.0 5.688174e-16 1.651309 -72.715728 -0.598550 0.065486 0.803724 22.057729
V3 284807.0 -8.769071e-15 1.516255 -48.325589 -0.890365 0.179846 1.027196 9.382558
V4 284807.0 2.782312e-15 1.415869 -5.683171 -0.848640 -0.019847 0.743341 16.875344
V5 284807.0 -1.552563e-15 1.380247 -113.743307 -0.691597 -0.054336 0.611926 34.801666
V6 284807.0 2.010663e-15 1.332271 -26.160506 -0.768296 -0.274187 0.398565 73.301626
V7 284807.0 -1.694249e-15 1.237094 -43.557242 -0.554076 0.040103 0.570436 120.589494
V8 284807.0 -1.927028e-16 1.194353 -73.216718 -0.208630 0.022358 0.327346 20.007208
V9 284807.0 -3.137024e-15 1.098632 -13.434066 -0.643098 -0.051429 0.597139 15.594995
V10 284807.0 1.768627e-15 1.088850 -24.588262 -0.535426 -0.092917 0.453923 23.745136
V11 284807.0 9.170318e-16 1.020713 -4.797473 -0.762494 -0.032757 0.739593 12.018913
V12 284807.0 -1.810658e-15 0.999201 -18.683715 -0.405571 0.140033 0.618238 7.848392
V13 284807.0 1.693438e-15 0.995274 -5.791881 -0.648539 -0.013568 0.662505 7.126883
V14 284807.0 1.479045e-15 0.958596 -19.214325 -0.425574 0.050601 0.493150 10.526766
V15 284807.0 3.482336e-15 0.915316 -4.498945 -0.582884 0.048072 0.648821 8.877742
V16 284807.0 1.392007e-15 0.876253 -14.129855 -0.468037 0.066413 0.523296 17.315112
V17 284807.0 -7.528491e-16 0.849337 -25.162799 -0.483748 -0.065676 0.399675 9.253526
V18 284807.0 4.328772e-16 0.838176 -9.498746 -0.498850 -0.003636 0.500807 5.041069
V19 284807.0 9.049732e-16 0.814041 -7.213527 -0.456299 0.003735 0.458949 5.591971
V20 284807.0 5.085503e-16 0.770925 -54.497720 -0.211721 -0.062481 0.133041 39.420904
V21 284807.0 1.537294e-16 0.734524 -34.830382 -0.228395 -0.029450 0.186377 27.202839
V22 284807.0 7.959909e-16 0.725702 -10.933144 -0.542350 0.006782 0.528554 10.503090
V23 284807.0 5.367590e-16 0.624460 -44.807735 -0.161846 -0.011193 0.147642 22.528412
V24 284807.0 4.458112e-15 0.605647 -2.836627 -0.354586 0.040976 0.439527 4.584549
V25 284807.0 1.453003e-15 0.521278 -10.295397 -0.317145 0.016594 0.350716 7.519589
V26 284807.0 1.699104e-15 0.482227 -2.604551 -0.326984 -0.052139 0.240952 3.517346
V27 284807.0 -3.660161e-16 0.403632 -22.565679 -0.070840 0.001342 0.091045 31.612198
V28 284807.0 -1.206049e-16 0.330083 -15.430084 -0.052960 0.011244 0.078280 33.847808
Amount 284807.0 8.834962e+01 250.120109 0.000000 5.600000 22.000000 77.165000 25691.160000
Class 284807.0 1.727486e-03 0.041527 0.000000 0.000000 0.000000 0.000000 1.000000
In [5]:
txn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

No nulls!

In [6]:
txn_type = txn['Class'].apply(lambda x: 'Fraud' if x==1 else 'Not Fraud').value_counts()
print('There are {} fraud transactions ({:.2%})'.format(txn_type['Fraud'], txn_type['Fraud']/txn.shape[0]))
print('There are {} safe transactions ({:.2%})'.format(txn_type['Not Fraud'], txn_type['Not Fraud']/txn.shape[0]))
There are 492 fraud transactions (0.17%)
There are 284315 safe transactions (99.83%)

Awesome! This gives us a helpful insight: with only 0.17% fraud rate, we have a very imbalanced dataset, which we will need to account for when developing our predictive models.

Let's take a closer look at the distribution of each feature.

In [7]:
fig, axes = plt.subplots(7,4,figsize=(14,14))
feats = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12', 'V13', 'V14', 'V15',
         'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
for i, ax in enumerate(axes.flatten()):
    ax.hist(txn[feats[i]], bins=25, color='green')
    ax.set_title(str(feats[i])+' Distribution', color='brown')
    ax.set_yscale('log')
plt.tight_layout()

max_val = np.max(txn[feats].values)
min_val = np.min(txn[feats].values)
print('All values range: ({:.2f}, {:.2f})'.format(min_val, max_val))
All values range: (-113.74, 120.59)

The anonymous V__ features tend to exhibit various distributions and ranges. Overall, values across all features fall within the range (-114, 121). Note we're using a logarithmic y-axis to help visualize the data more clearly.

Let's also look at the distributions for Time & Amount:

In [8]:
plt.figure(figsize=(14,6))
sns.distplot(txn['Time'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8135c30e10>
In [9]:
plt.figure(figsize=(14,6))
sns.distplot(txn['Amount'], hist=False, rug=True)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81360b4690>

Scaling

Our next step in preparing the data for predictive modeling is to scale our features appropriately. We can infer that the V__ features of this dataset are already scaled, as they have already undergone PCA (in which scaling would have occurred). However, we still need to scale Amount and Time. To do this, we'll use RobustScaler(), which performs better on datasets with significant outliers (note the outliers in the Amount distribution above).
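
For reference, with its default settings RobustScaler centers each feature on its median and scales by its interquartile range (IQR), which is why it is less sensitive to outliers than standard scaling. Here's a minimal sketch of the equivalent computation (illustrative only; `col` is assumed to be a pandas Series):

def robust_scale(col):
    # Roughly equivalent to RobustScaler's default behavior on one column:
    # subtract the median, then divide by the IQR (75th - 25th percentile).
    q25, q75 = col.quantile(0.25), col.quantile(0.75)
    return (col - col.median()) / (q75 - q25)

# e.g. robust_scale(txn['Amount']) should closely match the RobustScaler
# output below (up to the Series vs. array shape).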

In [10]:
txn['Amount'] = RobustScaler().fit_transform(txn['Amount'].values.reshape(-1,1))
txn['Time'] = RobustScaler().fit_transform(txn['Time'].values.reshape(-1,1))

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,4))
sns.distplot(txn['Time'], ax=ax1)
sns.distplot(txn['Amount'], hist=False, rug=True, ax=ax2)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81396a4350>

Success! (Note the change in scale on the x-axis)

Now let's split our data into our final train and test populations. While we will train & cross-validate our models on various samples of the training data (final_Xtrain, final_ytrain - more on this later), we will preserve this test data (final_Xtest, final_ytest) to evaluate the final performance of our models at the end of our analysis. Note that we make sure to stratify by our target variable, Class, in order to ensure both populations are representative.

In [11]:
X = txn.drop(['Class'], axis=1)
y = txn['Class']

final_Xtrain, final_Xtest, final_ytrain, final_ytest = train_test_split(X,
                                    y, test_size=0.2, stratify=y, random_state=42)

final_Xtrain = final_Xtrain.values
final_Xtest = final_Xtest.values
final_ytrain = final_ytrain.values
final_ytest = final_ytest.values

train_unique_label, train_counts_label = np.unique(final_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(final_ytest, return_counts=True)

print()
print('Proportions [Safe vs Fraud]')
print('Training %: '+ str(100*train_counts_label/len(final_ytrain)))
print('Testing %: '+ str(100*test_counts_label/len(final_ytest)))
Proportions [Safe vs Fraud]
Training %: [99.82707542  0.17292458]
Testing %: [99.82795548  0.17204452]

Awesome! Per our output, both our train & test populations have 99.83% safe and 0.17% fraud transactions.

Random Undersampling

Now that we've explored and scaled our data, we can focus on "balancing" our data.

Balancing our data means resampling our data such that we have a 50/50 split between fraud and safe transactions (as we just saw, we're pretty far from that at the moment with 99.83% safe vs 0.17% fraud). It is important that we create this 50/50 split before training our models because otherwise our models will overfit to the majority class, assuming that essentially all transactions are safe. We want to train our model to recognize the characteristics of fraud, not assume that most transactions aren't fraud. Thus we must create a 50/50 split to remove the effect of the imbalance during training.

Let's consider two ways to create this 50/50 split:

Random Undersampling: Here, we reduce the number of transactions in the majority class (i.e. safe transactions) randomly, until the counts of both the majority and minority classes are equal.

SMOTE Oversampling: Here, we increase the number of transactions in the minority class (i.e. fraud transactions) by creating synthetic transactions between existing fraud transactions that lie close to one another. Synthetic transactions continue to be created until the counts of the majority and minority classes are equal.

We will perform both sampling approaches to determine which performs better. Let's start with random undersampling:

In [12]:
# shuffle data first, so it's random
txn = txn.sample(frac=1)

txn_fraud = txn.loc[txn['Class']==1]
txn_safe = txn.loc[txn['Class']==0][:492]

txn_under = pd.concat([txn_fraud, txn_safe])
txn_under = txn_under.sample(frac=1, random_state=41)

txn_under.shape
Out[12]:
(984, 31)
In [13]:
txn_type = txn_under['Class'].apply(lambda x: 'Fraud' if x==1 else 'Not Fraud').value_counts()
print('Randomly Undersampled Dataset (txn_under):')
print('There are {} fraud transactions ({:.2%})'.format(txn_type['Fraud'], txn_type['Fraud']/txn_under.shape[0]))
print('There are {} safe transactions ({:.2%})'.format(txn_type['Not Fraud'], txn_type['Not Fraud']/txn_under.shape[0]))
Randomly Undersampled Dataset (txn_under):
There are 492 fraud transactions (50.00%)
There are 492 safe transactions (50.00%)

Awesome! Through random undersampling, we now have a balanced dataset, txn_under, with which we can predict fraudulent transactions without overfitting.

Next, let's create a correlation matrix to get a clear view of which features are most heavily correlated with fraud. Note that we should only use the balanced dataset (i.e. txn_under) to determine correlations; as discussed above, correlations computed on the imbalanced dataset (i.e. txn) are dominated by the majority class. Let's create correlation matrices for both to illustrate the difference (but make sure you only reference the balanced matrix to determine correlations!).

In [14]:
fig, axes = plt.subplots(2, 1, figsize=(20,16))

sns.heatmap(txn_under.corr(), annot=True, fmt=".2f", cmap = 'RdYlGn', ax=axes[0])
axes[0].set_title("Balanced Correlation Matrix (Reference This One)", fontsize=20, fontweight='bold')

sns.heatmap(txn.corr(), annot=True, fmt=".2f", cmap = 'RdYlGn', ax=axes[1])
axes[1].set_title('Imbalanced Correlation Matrix', fontsize=20, fontweight='bold')

plt.tight_layout()

Great! As we can see, balancing our data gives very different (true) correlation values. Referencing the balanced output, we see fraud is most highly correlated with the following features:

Negative Correlations: V14, V12, V10

Positive Correlations: V11, V4

Let's view these features a bit more closely...

In [15]:
fig, axes = plt.subplots(2,3,figsize=(14,8))

high_corr_feats = ['V14', 'V12', 'V10', 'V11', 'V4']

for i, ax in enumerate(axes.flatten()):
    if i == 5:
        ax.axis('off')
        break
    sns.boxplot(x='Class', y=high_corr_feats[i], data=txn_under, ax=ax, palette=sns.color_palette('magma_r', 2))
    ax.set_ylabel(None)
    ax.set_title(label=high_corr_feats[i], fontsize=16, fontweight='bold')
plt.tight_layout()

Awesome! Just as we should expect -- fraud transactions have lower values for features that are negatively correlated (V14, V12, V10) and higher values for features that are positively correlated (V11, V4).

Note the presence of outliers in some of these features. We want to remove extreme fraud outliers from the most highly correlated features. This is an important preprocessing step for maximizing model performance on recognizing fraud. Addressing outliers in features with the highest correlations will help ensure we are addressing only the most impactful outliers, and removing only the most extreme outliers will help us prevent unnecessary loss of information.

We will use the IQR to identify outliers. But first, let's get a better view of the distributions for each feature, to ensure they are roughly normal, and thus our outlier removal approach makes sense.

In [16]:
fig, axes = plt.subplots(2,3,figsize=(14,7))
fig.suptitle('    Fraud Transaction Distributions', fontsize=20, fontweight='bold')

for i, ax in enumerate(axes.flatten()):
    if i == 5:
        ax.axis('off')
        break
    v_fraud = txn_under[txn_under['Class']==1][high_corr_feats[i]].values
    sns.distplot(v_fraud, ax=ax, fit=stats.norm)
    ax.set_title(str(high_corr_feats[i]), fontsize=12)

These all look normal enough to proceed. We're ready to remove outliers:

In [17]:
len(txn_under)
Out[17]:
984
In [18]:
high_corr_feats2 = ['V14', 'V12', 'V10', 'V11', 'V4']

for i in high_corr_feats2:
    v_fraud = txn_under[txn_under['Class']==1][i]

    q75 = np.percentile(v_fraud, 75)
    q25 = np.percentile(v_fraud, 25)
    iqr = q75 - q25

    v_lower, v_upper = q25-1.5*iqr, q75+1.5*iqr
    outliers = [x for x in v_fraud if x > v_upper or x < v_lower]

    print(str(len(outliers))+' '+str(i)+' fraud outliers: '+str(outliers)+'\n')

    txn_under = txn_under.drop(txn_under.index[txn_under[i].isin(outliers) &
                                               (txn_under['Class']==1)])
4 V14 fraud outliers: [-18.049997689859396, -18.4937733551053, -19.2143254902614, -18.8220867423816]

4 V12 fraud outliers: [-18.4311310279993, -18.553697009645802, -18.047596570821604, -18.683714633344298]

27 V10 fraud outliers: [-22.1870885620007, -16.6011969664137, -16.6496281595399, -16.7460441053944, -15.2318333653018, -16.2556117491401, -20.949191554361104, -22.1870885620007, -24.403184969972802, -22.1870885620007, -18.2711681738888, -14.9246547735487, -15.124162814494698, -17.141513641289198, -16.3035376590131, -15.2399619587112, -24.5882624372475, -23.2282548357516, -15.346098846877501, -15.2399619587112, -18.9132433348732, -14.9246547735487, -19.836148851696, -15.563791338730098, -15.563791338730098, -15.1237521803455, -22.1870885620007]

13 V11 fraud outliers: [10.187587324166401, 11.0270590938161, 9.81570317447819, 10.446846814514, 9.691460982073188, 10.0637897462894, 9.567110295213972, 11.152490598583698, 9.939819741725689, 11.277920727806698, 10.8530116481991, 10.5452629545898, 10.2777688628065]

2 V4 fraud outliers: [11.805469210591301, 11.7861803616399]

In [19]:
len(txn_under)
Out[19]:
934
In [20]:
fig, axes = plt.subplots(2, 3, figsize=(20,12))
fig.suptitle('    Outlier Reduction', fontsize=20, fontweight='bold')

loc1 = [(0.98, -17.5), (0.98, -17.3), (0.98, -14.5), (0.98, 9.2), (0.98, 10.8)]
loc2 = [(0, -12), (0, -12), (0, -12), (0, 6), (0, 8)]

for i, ax in enumerate(axes.flatten()):
    if i == 5:
        ax.axis('off')
        break
    sns.boxplot(x="Class", y=high_corr_feats[i], data=txn_under, ax=ax, palette=sns.color_palette('magma_r', 2))
    ax.set_title(str(high_corr_feats[i]), fontsize=16, fontweight='bold')
    ax.annotate('Fewer extreme\n     outliers', xy=loc1[i], xytext=loc2[i],
                arrowprops=dict(facecolor='Red'), fontsize=14)
    ax.set_ylabel('')

Great! We successfully removed the most extreme outliers from our most highly-correlated features.

Gutcheck - Dimensionality Reduction

Before we proceed with creating classification models, let's first get a sense of how effective our models might be through performing dimensionality reduction on our data.

Specifically, we'll use three dimensionality reduction methods (t-SNE, PCA & Truncated SVD) to reduce the number of our features to just two. We will then graph the results on the xy-plane, highlighting fraud and safe transactions differently. If we're able to see a clear separation between classes in the graphs, that will give us an indication that further predictive models may perform well at classifying fraud.

In [21]:
X = txn_under.drop('Class', axis=1)
y = txn_under['Class']

# Implement dimensionality reductions
X_pca = PCA(n_components=2, random_state=38).fit_transform(X.values)
X_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=37).fit_transform(X.values)
X_tsne = TSNE(n_components=2, random_state=39).fit_transform(X.values)
In [22]:
f, axes = plt.subplots(1, 3, figsize=(24,6))
# labels = ['No Fraud', 'Fraud']
f.suptitle('    Dimensionality Reductions', fontsize=20, fontweight='bold')

green_patch = mpatches.Patch(color='darkgreen', label='No Fraud')
red_patch = mpatches.Patch(color='darkred', label='Fraud')

dim_red = [X_pca, X_svd, X_tsne]
titles = ['PCA', 'Truncated SVD', 't-SNE']

for i, ax in enumerate(axes):
    ax.scatter(dim_red[i][:,0], dim_red[i][:,1], c=(y == 0), cmap='RdYlGn', label='No Fraud', linewidths=2)
    ax.scatter(dim_red[i][:,0], dim_red[i][:,1], c=(y == 1), cmap='RdYlGn', label='Fraud', linewidths=2)
    ax.set_title(titles[i], fontsize=20)
    ax.grid(True)
    ax.legend(handles=[green_patch, red_patch])

Fantastic! We see a clear separation between fraud & safe transactions in all 3 graphs, especially in PCA and Truncated SVD! This gives us a good indication that our predictive models will be able to effectively classify fraud.

Baseline Classification Models

We're finally ready to create our classification models! We'll start with four types of classifiers: Logistic Regression, K Nearest Neighbors, SVC, and Decision Tree.

First, we'll train and cross-validate our models on the txn_under dataset to get a sense of which model does the best job of recognizing fraud. Once we have an opinion on which model does best, we'll cross-validate that model on the unsampled (original) data to get an objective view of its true performance.

A Note on Cross-Validating Imbalanced Data:

Before we proceed further, let's digress for a moment to talk about a common mistake made when dealing with imbalanced datasets. While we should train our models on balanced data, we should cross-validate on imbalanced (i.e. original) data to best evaluate the objective performance of our trained models. Cross-validating on the original data gives us the most accurate view of our model's performance in production, because the original data preserves and is representative of the imbalance we would expect to see in production.

Our initial goal here, however, isn't to determine objective performance, but rather to determine which model does the best job of identifying the characteristics of fraud. In this case, it is okay for us to cross-validate with our undersampled data, as long as we are consistent in our approach across all models. We will still be able to rank models' abilities to identify fraud by cross-validating on undersampled data, even if the scores we derive are not representative of what we would expect to see in production.
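
To make the distinction concrete, here is a minimal sketch of the objective pattern we'll use later in this notebook: resampling happens only inside each training fold (via an imblearn pipeline), while each validation fold is left untouched and imbalanced. For now, though, we'll rank our baseline models using the undersampled data.

# Sketch: undersample inside each training fold only; validate on untouched data.
# (Uses the imports and final_Xtrain/final_ytrain defined above; this is the
# approach applied in the NearMiss cross-validation section further below.)
pipe = imbalanced_make_pipeline(NearMiss(), LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(final_Xtrain, final_ytrain):
    pipe.fit(final_Xtrain[train_idx], final_ytrain[train_idx])        # resampling happens here
    print(pipe.score(final_Xtrain[val_idx], final_ytrain[val_idx]))   # scored on imbalanced data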

Let's start by splitting our undersampled data, txn_under, into train and test sets:

In [23]:
X = txn_under.drop('Class', axis=1)
y = txn_under['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [24]:
models = {"Log Reg": LogisticRegression(), "KNN": KNeighborsClassifier(), "SVC": SVC(),
          "D Tree": DecisionTreeClassifier()}

print('Mean cv accuracy on undersampled data. \n')
for name, model in models.items():
    training_acc = cross_val_score(model, X_train, y_train, cv=5)
    print(name+":", str(round(training_acc.mean()*100, 2))+"%")
Mean cv accuracy on undersampled data.

Log Reg: 92.77%
KNN: 91.04%
SVC: 91.84%
D Tree: 87.28%

Although these scores are not representative of true production accuracy, we can see that all four models do a very good job (~90% or more) of identifying fraud, which is in line with what we might expect given our dimensionality reduction results earlier. Let's see how much we can improve these scores by optimizing our hyperparameters with GridSearchCV. Again, we'll perform cross-validation on the balanced data so that we can compare to the baseline scores above:

In [25]:
# Use GridSearchCV to find the best parameters.

print('Mean cv scores on undersampled data after tuning hyperparameters. \n')

# Logistic Regression 
log_reg_params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
grid_log_reg = GridSearchCV(LogisticRegression(max_iter=10000), log_reg_params)
grid_log_reg.fit(X_train, y_train)
log_reg = grid_log_reg.best_estimator_
log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Log Reg: ', round(log_reg_score.mean() * 100, 2).astype(str) + '%')

# K Nearest Neighbors
knn_params = {"n_neighbors": list(range(2,6,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knn = GridSearchCV(KNeighborsClassifier(), knn_params)
grid_knn.fit(X_train, y_train)
knn = grid_knn.best_estimator_
knn_score = cross_val_score(knn, X_train, y_train, cv=5)
print('KNN:     ', round(knn_score.mean() * 100, 2).astype(str) + '%')

# SVC
svc_params = {'C': [0.5, 0.6, 0.7, 0.8, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)
svc = grid_svc.best_estimator_
svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('SVC:     ', round(svc_score.mean() * 100, 2).astype(str) + '%')

# Decision Tree
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)),
              "min_samples_leaf": list(range(3,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)
tree = grid_tree.best_estimator_
tree_score = cross_val_score(tree, X_train, y_train, cv=5)
print('D Tree:  ', str(round(tree_score.mean() * 100, 2)) + '%')
Mean cv scores on undersampled data after tuning hyperparameters.

Log Reg:  93.17%
KNN:      91.44%
SVC:      92.5%
D Tree:   90.76%

Awesome! We were able to improve our cross-validation scores.

Let's plot learning curves to get a sense of whether our models are over/underfitting. Note that the wider the gap between our training and cross-validation scores, the more likely we are overfitting:

In [26]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

fig, axes = plt.subplots(2,2, figsize=(18,12), sharey=True)
classifier = [log_reg, knn, svc, tree]
titles = ["Logistic Regression Learning Curve", "K Nearest Neighbors Learning Curve",
         "Support Vector Classifier Learning Curve", "Decision Tree Classifier Learning Curve"]

for i, ax in enumerate(axes.flatten()):
    train_sizes, train_acc, test_acc = learning_curve(
        classifier[i], X_train, y_train, cv=cv, n_jobs=4, train_sizes=np.linspace(.1, 1.0, 10))
    train_acc_mean = np.mean(train_acc, axis=1)
    train_acc_std = np.std(train_acc, axis=1)
    test_acc_mean = np.mean(test_acc, axis=1)
    test_acc_std = np.std(test_acc, axis=1)
    ax.fill_between(train_sizes, train_acc_mean - train_acc_std, train_acc_mean + train_acc_std, alpha=0.3, color="#b2b8b7")
    ax.fill_between(train_sizes, test_acc_mean - test_acc_std, test_acc_mean + test_acc_std, alpha=0.3, color="#46d448")
    ax.plot(train_sizes, train_acc_mean, 'o-', color="#b2b8b7", label="Training accuracy")
    ax.plot(train_sizes, test_acc_mean, 'o-', color="#46d448", label="Cross-validation accuracy")
    ax.set_title(titles[i], fontsize=14)
    ax.set_xlabel('Training size')
    ax.set_ylabel('Accuracy')
    ax.grid(True)
    ax.legend(loc='upper right')

plt.ylim(0.86, 1.01);

Great! We don't appear to be overfitting given the lack of a large gap between curves as training size increases, and we don't appear to be underfitting given how high our scores are.

Let's take a look at the Receiver Operating Characteristic (ROC) curve for each of our models to get a better sense of their performance. An ROC curve plots the trade-off between the False Positive Rate (the percentage of safe transactions our model incorrectly flags as fraud) and the True Positive Rate (the percentage of fraud transactions our model classifies correctly) across every possible decision threshold.

To evaluate model performance, we'll calculate the area under the ROC curve (AUC). This tells us which model does the best job of distinguishing between fraud and safe transactions overall.

In [27]:
print ('Model ROC AUC \n')

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5, method="decision_function")
print('Log Reg: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))

knn_pred = cross_val_predict(knn, X_train, y_train, cv=5)
print('KNN: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5, method="decision_function")
print('SVC: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))

tree_pred = cross_val_predict(tree, X_train, y_train, cv=5)
print('D Tree: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
Model ROC AUC

Log Reg: 0.9648
KNN: 0.9099
SVC: 0.9611
D Tree: 0.9078
In [28]:
log_fpr, log_tpr, log_threshold = roc_curve(y_train, log_reg_pred)
knn_fpr, knn_tpr, knn_threshold = roc_curve(y_train, knn_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)

plt.figure(figsize=(12,6))
plt.title('ROC Curves', fontsize=18)
plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
plt.plot(knn_fpr, knn_tpr, label='K Nearest Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([-0.01, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.annotate('Random Classifier ROC (AUC = 0.5)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
            arrowprops=dict(facecolor='Red', shrink=0.05))
plt.legend()
Out[28]:
<matplotlib.legend.Legend at 0x7f8135cf7110>

Awesome! With the highest ROC_AUC score and undersampled cross-validation accuracy, logistic regression looks like it could be our best model...

We're finally ready to test our models with our X_test/y_test data to determine which model does the best at recognizing fraud (keep in mind, this is still not the original/imbalanced data).

In [29]:
# Predict on X_test
log_reg_pred2 = log_reg.predict(X_test)
knn_pred2 = knn.predict(X_test)
svc_pred2 = svc.predict(X_test)
tree_pred2 = tree.predict(X_test)

log_reg_cf = confusion_matrix(y_test, log_reg_pred2)
knn_cf = confusion_matrix(y_test, knn_pred2)
svc_cf = confusion_matrix(y_test, svc_pred2)
tree_cf = confusion_matrix(y_test, tree_pred2)

fig, axes = plt.subplots(2, 2,figsize=(18,10))
titles = ['Logistic Regression', 'K Nearest Neighbors', 'Support Vector Classifier', 'Decision Tree Classifier']
conf_matrix = [log_reg_cf, knn_cf, svc_cf, tree_cf]

fig.suptitle('Confusion Matrices (NearMiss Undersampling)     ', fontsize=20, fontweight='bold')

for i, ax in enumerate(axes.flatten()):
    sns.heatmap(conf_matrix[i], ax=ax, annot=True, fmt='.0f', cmap='magma')
    ax.set_title(titles[i], fontsize=14)
    ax.set_xticklabels(['Predicted\nSafe', 'Predicted\nFraud'], fontsize=10)
    ax.set_yticklabels(['Safe', 'Fraud'], fontsize=10)
In [30]:
print('Logistic Regression:')
print(classification_report(y_test, log_reg_pred2))

print('K Nearest Neighbors:')
print(classification_report(y_test, knn_pred2))

print('Support Vector Classifier:')
print(classification_report(y_test, svc_pred2))

print('DecisionTree Classifier:')
print(classification_report(y_test, tree_pred2))
Logistic Regression:
              precision    recall  f1-score   support

           0       0.96      0.97      0.96        91
           1       0.97      0.96      0.96        96

    accuracy                           0.96       187
   macro avg       0.96      0.96      0.96       187
weighted avg       0.96      0.96      0.96       187

K Nearest Neighbors:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        91
           1       0.99      0.94      0.96        96

    accuracy                           0.96       187
   macro avg       0.96      0.96      0.96       187
weighted avg       0.96      0.96      0.96       187

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96        91
           1       0.97      0.95      0.96        96

    accuracy                           0.96       187
   macro avg       0.96      0.96      0.96       187
weighted avg       0.96      0.96      0.96       187

DecisionTree Classifier:
              precision    recall  f1-score   support

           0       0.87      0.98      0.92        91
           1       0.98      0.86      0.92        96

    accuracy                           0.92       187
   macro avg       0.92      0.92      0.92       187
weighted avg       0.93      0.92      0.92       187

Awesome! Based on our undersampled test data, logistic regression performed the best at recognizing fraud. We will continue our analysis with just Logistic Regression.

Logistic Regression: Undersampling (NearMiss) Performance

First, let's come back to our previous conversation on the proper way to cross-validate for estimating true performance. So far, we've only cross-validated using balanced/undersampled data. Let's now cross-validate logistic regression using the original, imbalanced data.

Recall that we still have to train our model on balanced data. To do this, we'll implement the NearMiss() algorithm, an undersampling technique that, rather than sampling at random, keeps the majority-class (safe) samples that are closest to the minority-class (fraud) samples, sharpening the boundary between the two classes and improving model classification.
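
As a quick standalone illustration (a sketch only; in the cross-validation below NearMiss is applied inside an imblearn pipeline so it runs within each fold), resampling the training data with NearMiss reduces the safe class down to the fraud count:

# Standalone sketch of NearMiss undersampling on the training data.
# (The cross-validation below applies the same step inside each fold instead.)
X_nm, y_nm = NearMiss().fit_sample(final_Xtrain, final_ytrain)
print(np.unique(y_nm, return_counts=True))   # expect both classes at ~394 samples each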

In [31]:
accuracy_undersample = []
precision_undersample = []
recall_undersample = []
f1_undersample = []
auc_undersample = []

# Cross-Validating correctly to determine real-world performance
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train, test in sss.split(final_Xtrain, final_ytrain):
    pipeline_undersample = imbalanced_make_pipeline(NearMiss(), log_reg)
    model_undersample = pipeline_undersample.fit(final_Xtrain[train], final_ytrain[train])
    prediction_undersample = model_undersample.predict(final_Xtrain[test])

    accuracy_undersample.append(pipeline_undersample.score(final_Xtrain[test], final_ytrain[test]))
    precision_undersample.append(precision_score(final_ytrain[test], prediction_undersample))
    recall_undersample.append(recall_score(final_ytrain[test], prediction_undersample))
    f1_undersample.append(f1_score(final_ytrain[test], prediction_undersample))
    auc_undersample.append(roc_auc_score(final_ytrain[test], prediction_undersample))

Now let's print our performance metrics for both cross-validation approaches (balanced vs imbalanced data) to see how they compare.

In [32]:
precision, recall, threshold = precision_recall_curve(y_train, log_reg_pred)

y_pred = log_reg.predict(X_train)

# Overfit
print('Cross-Validating on Undersampled/Balanced Data: \n')
print('Accuracy Score: {:.4f}'.format(accuracy_score(y_train, y_pred)))
print('Precision Score: {:.4f}'.format(precision_score(y_train, y_pred)))
print('Recall Score: {:.4f}'.format(recall_score(y_train, y_pred)))
print('F1 Score: {:.4f}'.format(f1_score(y_train, y_pred)))
print('---' * 20)

# True
print('Cross-Validating on Original/Imbalanced Data: \n')
print("Accuracy Score: {:.4f}".format(np.mean(accuracy_undersample)))
print("Precision Score: {:.4f}".format(np.mean(precision_undersample)))
print("Recall Score: {:.4f}".format(np.mean(recall_undersample)))
print("F1 Score: {:.4f}".format(np.mean(f1_undersample)))
Cross-Validating on Undersampled/Balanced Data:

Accuracy Score: 0.7189
Precision Score: 0.6303
Recall Score: 0.9509
F1 Score: 0.7581
------------------------------------------------------------
Cross-Validating on Original/Imbalanced Data:

Accuracy Score: 0.5360
Precision Score: 0.0036
Recall Score: 0.9492
F1 Score: 0.0072

Awesome! While we tend to see lower scores when cross-validating on the original/imbalanced data, these differences make sense.

For example, we would expect to see much lower precision when cross-validating on the original/imbalanced data than the undersampled/balanced data, simply because the imbalanced data consists of far more safe transactions. Thus, we would expect a higher number to be misclassified as fraud, lowering precision.

Note, however, that recall remains relatively unchanged. This also makes sense, as we're cross-validating on the same (i.e. unsampled) set of fraud transactions in both scenarios.
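
Some rough arithmetic makes the precision collapse concrete. Each validation fold holds roughly 45,600 transactions, of which only about 79 are fraud; with ~95% recall but only ~54% accuracy, tens of thousands of safe transactions get flagged, swamping the handful of true positives. The figures below are approximations back-solved from the scores printed above:

# Back-of-the-envelope check of the ~0.0036 precision above (approximate figures).
safe, fraud = 45_490, 79               # rough per-fold class counts
recall, accuracy = 0.95, 0.536         # roughly the scores printed above
tp = recall * fraud                    # ~75 fraud transactions caught
tn = accuracy * (safe + fraud) - tp    # correct predictions that aren't fraud
fp = safe - tn                         # ~21,000 safe transactions flagged as fraud
print(tp / (tp + fp))                  # ~0.0035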

Now we're ready to test our logistic regression model (trained on NearMiss undersampled data) with our final_Xtest and final_ytest data. We'll use both precision and recall to evaluate performance:

In [33]:
y_score_under = log_reg.decision_function(final_Xtest)
undersample_average_precision = average_precision_score(final_ytest, y_score_under)

fig = plt.figure(figsize=(14,5))

precision, recall, _ = precision_recall_curve(final_ytest, y_score_under)

plt.step(recall, precision, color='Green', alpha=0.3, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='Green')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('Undersampled Precision-Recall Curve (Avg Score: {0:0.4f})'.format(undersample_average_precision), fontsize=16);

Our average precision-recall score on our test data is objectively low. This may seem suspiciously low after seeing our ROC curves above, but remember, we're now evaluating our model on the original/imbalanced data. As we can see in the graph, our model allows many more safe transactions in (i.e. misclassifying them) as we relax our decision boundary.

Logistic Regression: Oversampling (SMOTE) Performance

Now let's proceed with our oversampling approach! We'll be using the SMOTE technique to oversample.

SMOTE

Unlike with the undersampling approach in which we randomly remove data from the majority class, with SMOTE we create new synthetic points for the minority class to achieve class balance. Specifically, SMOTE creates synthetic points between closest-neighbors of the minority class. This approach retains more information than undersampling, as we aren't removing any data from the original dataset. However, because we have to train on more data with SMOTE, the approach is more time/resource-intensive.
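
Conceptually, each synthetic point SMOTE adds is a random interpolation between a minority-class sample and one of its nearest minority-class neighbors. A minimal sketch of the idea with made-up points (imblearn's SMOTE handles the neighbor search and bookkeeping for us):

# Conceptual sketch of how one synthetic fraud sample is generated.
# x_i and x_nn are hypothetical fraud points used purely for illustration.
x_i  = np.array([-5.2, 3.1])            # a fraud point
x_nn = np.array([-4.8, 2.7])            # one of its nearest fraud neighbors
lam = np.random.rand()                  # random fraction in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)  # lies on the segment between the two points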

Cross-Validation

Again, while we should train our model on the oversampled/balanced data, it's important that we cross-validate on the original data. Overfitting can result if we cross-validate on the oversampled data, as this population contains additional synthetic samples and does not represent the true imbalance between classes. As with undersampling, we want to cross-validate on the original/imbalanced data to get an objective view of model performance.

In [34]:
accuracy_oversample = []
precision_oversample = []
recall_oversample = []
f1_oversample = []
auc_oversample = []

# Classifier with optimal parameters - we use RandomizedSearch instead of GridSearch, given large sample size.
log_reg_params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(random_state=4, max_iter=1000), log_reg_params, n_iter=4)

for train, test in sss.split(final_Xtrain, final_ytrain):
    # Apply SMOTE during training, not cross-validation.
    pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg)
    model = pipeline.fit(final_Xtrain[train], final_ytrain[train])
    log_reg_sm = rand_log_reg.best_estimator_
    prediction = log_reg_sm.predict(final_Xtrain[test])

    accuracy_oversample.append(pipeline.score(final_Xtrain[test], final_ytrain[test]))
    precision_oversample.append(precision_score(final_ytrain[test], prediction))
    recall_oversample.append(recall_score(final_ytrain[test], prediction))
    f1_oversample.append(f1_score(final_ytrain[test], prediction))
    auc_oversample.append(roc_auc_score(final_ytrain[test], prediction))

print('Cross-Validation on Original/Imbalanced Data (Correct Approach, SMOTE)')
print('')
print("accuracy: {:.4f}".format(np.mean(accuracy_oversample)))
print("precision: {:.4f}".format(np.mean(precision_oversample)))
print("recall: {:.4f}".format(np.mean(recall_oversample)))
print("f1: {:.4f}".format(np.mean(f1_oversample)))
print("auc: {:.4f}".format(np.mean(auc_oversample)))
Cross-Validation on Original/Imbalanced Data (Correct Approach, SMOTE)

accuracy: 0.9744
precision: 0.0589
recall: 0.9161
f1: 0.1106
auc: 0.9453

Interesting! We tend to see higher cross-validation scores (precision, accuracy, f1) when training with oversampled data than when training with undersampled data from earlier in our analysis... which makes sense.

For example, we see significantly higher precision when training on the oversampled data. The reason for this is that we don't lose any information on the safe transaction set during training (e.g. by randomly removing data) in our oversampling approach, which allows our model to more thoroughly learn the characteristics of safe transactions, and thus misclassify them less often, increasing precision.

Note, however, that recall remained relatively unchanged. This is a testament to how well SMOTE is able to add new synthetic fraud transactions to the training data while maintaining the original distribution/characteristics of fraud.

Out of curiosity, let's compare how our logistic regression models trained on undersampled data (log_reg) vs oversampled data (log_reg_sm) perform on the undersampled test data (X_test & y_test).

In [35]:
# Predict on X_test
log_reg_sm_pred = log_reg_sm.predict(X_test)

log_reg_sm_cf = confusion_matrix(y_test, log_reg_sm_pred)

fig, axes = plt.subplots(1, 2,figsize=(18,6))
titles = ['SMOTE Oversampling', 'NearMiss Undersampling']

conf_matrix = [log_reg_sm_cf, log_reg_cf]

fig.suptitle('Logistic Regression Confusion Matrices     ', fontsize=20, fontweight='bold')

for i, ax in enumerate(axes.flatten()):
    sns.heatmap(conf_matrix[i], ax=ax, annot=True, fmt='.0f', cmap='magma')
    ax.set_title(titles[i], fontsize=14)
    ax.set_xticklabels(['Predicted\nSafe', 'Predicted\nFraud'], fontsize=10)
    ax.set_yticklabels(['Safe', 'Fraud'], fontsize=10)
In [36]:
labels = ['No Fraud', 'Fraud']

print('Performance on Undersampled Test Data \n')
print('Logistic Regression, SMOTE Oversampling:')
print(classification_report(y_test, log_reg_sm_pred, target_names=labels))

print('Logistic Regression, NearMiss Undersampling:')
print(classification_report(y_test, log_reg_pred2, target_names=labels))
Performance on Undersampled Test Data

Logistic Regression, SMOTE Oversampling:
              precision    recall  f1-score   support

    No Fraud       0.98      0.97      0.97        91
       Fraud       0.97      0.98      0.97        96

    accuracy                           0.97       187
   macro avg       0.97      0.97      0.97       187
weighted avg       0.97      0.97      0.97       187

Logistic Regression, NearMiss Undersampling:
              precision    recall  f1-score   support

    No Fraud       0.96      0.97      0.96        91
       Fraud       0.97      0.96      0.96        96

    accuracy                           0.96       187
   macro avg       0.96      0.96      0.96       187
weighted avg       0.96      0.96      0.96       187

Awesome, SMOTE slightly improved logistic regression's performance on the undersampled test data!

Let's now see which model performs best on the original test data (final_Xtest, final_ytest). We'll start with accuracy:

In [37]:
print('Logistic Regression Performance, Final Testing:\n')
# Logistic regression trained on undersampled data
y_pred = log_reg.predict(final_Xtest)
undersample_accuracy = accuracy_score(final_ytest, y_pred)
print('Undersampling Accuracy: {:.4f}'.format(undersample_accuracy))

# Logistic regression trained on oversampled data
y_pred_sm = log_reg_sm.predict(final_Xtest)
oversample_accuracy = accuracy_score(final_ytest, y_pred_sm)
print('Oversampling Accuracy: {:.4f}'.format(oversample_accuracy))
Logistic Regression Performance, Final Testing:

Undersampling Accuracy: 0.5383
Oversampling Accuracy: 0.9751

Wow! Oversampling with SMOTE greatly improved accuracy on the original test data. Let's compare precision and recall as well:

In [38]:
labels = ['No Fraud', 'Fraud']

print('Logistic Regression Performance, Final Testing: \n')

print('Logistic Regression, NearMiss Undersampling:')
print(classification_report(final_ytest, y_pred, target_names=labels))

print('Logistic Regression, SMOTE Oversampling:')
print(classification_report(final_ytest, y_pred_sm, target_names=labels))
Logistic Regression Performance, Final Testing:

Logistic Regression, NearMiss Undersampling:
              precision    recall  f1-score   support

    No Fraud       1.00      0.54      0.70     56864
       Fraud       0.00      0.96      0.01        98

    accuracy                           0.54     56962
   macro avg       0.50      0.75      0.35     56962
weighted avg       1.00      0.54      0.70     56962

Logistic Regression, SMOTE Oversampling:
              precision    recall  f1-score   support

    No Fraud       1.00      0.98      0.99     56864
       Fraud       0.06      0.92      0.11        98

    accuracy                           0.98     56962
   macro avg       0.53      0.95      0.55     56962
weighted avg       1.00      0.98      0.99     56962

In [39]:
y_score_over = log_reg_sm.decision_function(final_Xtest)
oversample_average_precision = average_precision_score(final_ytest, y_score_over)

fig = plt.figure(figsize=(14,5))

precision, recall, _ = precision_recall_curve(final_ytest, y_score_over)

plt.step(recall, precision, color='Red', alpha=0.3, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='Orange')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('Oversampled Precision-Recall Curve (Avg Score: {0:0.4f})'.format(oversample_average_precision), fontsize=16);

Awesome! SMOTE oversampling gives much higher accuracy overall, as well as a higher precision and f1 score for fraud transactions. Additionally, as we can see in our precision-recall graph directly above, training with SMOTE resulted in a much better precision-recall score overall. In comparison to our undersampling precision-recall curve, we now let far fewer safe transactions in (i.e. misclassifying them) as we relax our decision boundary. Clearly, the information loss through undersampling negatively affected the model's ability to correctly classify safe transactions.

Neural Networks

Now let's implement two simple neural networks (NNs) to see how they perform!

To create our NNs, we'll use an input layer with as many nodes as there are features, a hidden layer with 32 nodes, and a two-node softmax output layer that classifies each transaction as safe (0) or fraud (1).

First, we'll build our undersampled model.

In [40]:
import itertools
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy

Undersampled NN

In [41]:
NN_undersample = Sequential([Dense(X_train.shape[1], input_shape=(X_train.shape[1], ), activation='relu'),
                             Dense(32, activation='relu'),
                             Dense(2, activation='softmax')])
NN_undersample.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 30)                930
_________________________________________________________________
dense_1 (Dense)              (None, 32)                992
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66
=================================================================
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
_________________________________________________________________

Great! Now, let's train this network on the undersampled data, then we'll predict on the final test data.

In [42]:
NN_undersample.compile(Adam(lr=0.001), metrics=['accuracy'], loss='sparse_categorical_crossentropy')

NN_undersample.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=2)

undersample_pred = NN_undersample.predict_classes(final_Xtest)
Epoch 1/20
24/24 - 0s - loss: 0.7087 - accuracy: 0.6114 - val_loss: 0.4544 - val_accuracy: 0.7400
Epoch 2/20
24/24 - 0s - loss: 0.4155 - accuracy: 0.8174 - val_loss: 0.3537 - val_accuracy: 0.8667
Epoch 3/20
24/24 - 0s - loss: 0.3353 - accuracy: 0.8710 - val_loss: 0.3053 - val_accuracy: 0.8800
Epoch 4/20
24/24 - 0s - loss: 0.2857 - accuracy: 0.8995 - val_loss: 0.2668 - val_accuracy: 0.9067
Epoch 5/20
24/24 - 0s - loss: 0.2472 - accuracy: 0.9112 - val_loss: 0.2446 - val_accuracy: 0.9067
Epoch 6/20
24/24 - 0s - loss: 0.2221 - accuracy: 0.9213 - val_loss: 0.2290 - val_accuracy: 0.9067
Epoch 7/20
24/24 - 0s - loss: 0.2008 - accuracy: 0.9263 - val_loss: 0.2163 - val_accuracy: 0.9200
Epoch 8/20
24/24 - 0s - loss: 0.1865 - accuracy: 0.9330 - val_loss: 0.2091 - val_accuracy: 0.9133
Epoch 9/20
24/24 - 0s - loss: 0.1744 - accuracy: 0.9380 - val_loss: 0.2045 - val_accuracy: 0.9133
Epoch 10/20
24/24 - 0s - loss: 0.1645 - accuracy: 0.9414 - val_loss: 0.2020 - val_accuracy: 0.9133
Epoch 11/20
24/24 - 0s - loss: 0.1556 - accuracy: 0.9464 - val_loss: 0.1985 - val_accuracy: 0.9133
Epoch 12/20
24/24 - 0s - loss: 0.1476 - accuracy: 0.9514 - val_loss: 0.1985 - val_accuracy: 0.9133
Epoch 13/20
24/24 - 0s - loss: 0.1413 - accuracy: 0.9497 - val_loss: 0.1969 - val_accuracy: 0.9133
Epoch 14/20
24/24 - 0s - loss: 0.1353 - accuracy: 0.9514 - val_loss: 0.1990 - val_accuracy: 0.9067
Epoch 15/20
24/24 - 0s - loss: 0.1304 - accuracy: 0.9531 - val_loss: 0.1974 - val_accuracy: 0.9133
Epoch 16/20
24/24 - 0s - loss: 0.1257 - accuracy: 0.9581 - val_loss: 0.2004 - val_accuracy: 0.9133
Epoch 17/20
24/24 - 0s - loss: 0.1194 - accuracy: 0.9581 - val_loss: 0.1990 - val_accuracy: 0.9133
Epoch 18/20
24/24 - 0s - loss: 0.1154 - accuracy: 0.9581 - val_loss: 0.2005 - val_accuracy: 0.9133
Epoch 19/20
24/24 - 0s - loss: 0.1099 - accuracy: 0.9615 - val_loss: 0.2027 - val_accuracy: 0.9133
Epoch 20/20
24/24 - 0s - loss: 0.1065 - accuracy: 0.9631 - val_loss: 0.2048 - val_accuracy: 0.9133
WARNING:tensorflow:From <ipython-input-42-611556489883>:5: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).

Now, let's create a function to plot the confusion matrix...

In [43]:
def plot_cm(cm, classes, normalize=False, title='Confusion matrix', cmap='Blues'):
    # Optionally normalize each row so cells show the fraction of each actual class
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    # Draw the matrix as a heatmap with labeled axes
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=14)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')

    # Annotate each cell with its count (or rate), flipping text color on dark cells
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
In [44]:
undersample_cm = confusion_matrix(final_ytest, undersample_pred)
labels = ['Safe', 'Fraud']

plt.figure(figsize=(6,5))
plot_cm(undersample_cm, labels, title="Random Undersample\nConfusion Matrix")
Confusion matrix, without normalization
In [45]:
print(classification_report(final_ytest, undersample_pred, target_names=labels, digits=4))
              precision    recall  f1-score   support

        Safe     0.9999    0.9677    0.9835     56864
       Fraud     0.0472    0.9286    0.0897        98

    accuracy                         0.9676     56962
   macro avg     0.5235    0.9481    0.5366     56962
weighted avg     0.9982    0.9676    0.9820     56962

Awesome! The undersampled NN recognizes fraud well, catching roughly 93% of fraud cases; its fraud precision, however, is very low, so it flags a large number of safe transactions — a pattern similar to what we saw earlier when evaluating our undersample-trained logistic regression model on the imbalanced data.

Let's see how our performance changes when we train our NN using oversampled data.

Oversampled NN (SMOTE)

In [46]:
sm = SMOTE(sampling_strategy='minority', random_state=49)
# Note: newer versions of imbalanced-learn rename fit_sample to fit_resample
Xsm_train, ysm_train = sm.fit_sample(final_Xtrain, final_ytrain)

NN_oversample = Sequential([Dense(Xsm_train.shape[1], input_shape=(Xsm_train.shape[1], ), activation='relu'),
                            Dense(32, activation='relu'),
                            Dense(2, activation='softmax')])
NN_oversample.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_3 (Dense)              (None, 30)                930
_________________________________________________________________
dense_4 (Dense)              (None, 32)                992
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 66
=================================================================
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
_________________________________________________________________
In [47]:
NN_oversample.compile(Adam(lr=0.001), metrics=['accuracy'], loss='sparse_categorical_crossentropy')

NN_oversample.fit(Xsm_train, ysm_train, validation_split=0.2, batch_size=300, epochs=20, shuffle=True, verbose=2)

oversample_pred = NN_oversample.predict_classes(final_Xtest)
Epoch 1/20
1214/1214 - 1s - loss: 0.0856 - accuracy: 0.9694 - val_loss: 0.0289 - val_accuracy: 0.9903
Epoch 2/20
1214/1214 - 1s - loss: 0.0180 - accuracy: 0.9951 - val_loss: 0.0096 - val_accuracy: 0.9993
Epoch 3/20
1214/1214 - 1s - loss: 0.0099 - accuracy: 0.9978 - val_loss: 0.0048 - val_accuracy: 0.9996
Epoch 4/20
1214/1214 - 1s - loss: 0.0068 - accuracy: 0.9985 - val_loss: 0.0024 - val_accuracy: 0.9998
Epoch 5/20
1214/1214 - 1s - loss: 0.0056 - accuracy: 0.9988 - val_loss: 0.0034 - val_accuracy: 0.9996
Epoch 6/20
1214/1214 - 1s - loss: 0.0048 - accuracy: 0.9990 - val_loss: 0.0014 - val_accuracy: 0.9998
Epoch 7/20
1214/1214 - 1s - loss: 0.0043 - accuracy: 0.9990 - val_loss: 0.0020 - val_accuracy: 0.9998
Epoch 8/20
1214/1214 - 1s - loss: 0.0037 - accuracy: 0.9992 - val_loss: 0.0042 - val_accuracy: 0.9995
Epoch 9/20
1214/1214 - 1s - loss: 0.0033 - accuracy: 0.9992 - val_loss: 0.0013 - val_accuracy: 0.9998
Epoch 10/20
1214/1214 - 1s - loss: 0.0031 - accuracy: 0.9993 - val_loss: 0.0141 - val_accuracy: 0.9954
Epoch 11/20
1214/1214 - 1s - loss: 0.0031 - accuracy: 0.9993 - val_loss: 0.0017 - val_accuracy: 1.0000
Epoch 12/20
1214/1214 - 1s - loss: 0.0023 - accuracy: 0.9995 - val_loss: 8.4097e-04 - val_accuracy: 0.9998
Epoch 13/20
1214/1214 - 1s - loss: 0.0024 - accuracy: 0.9995 - val_loss: 4.3279e-04 - val_accuracy: 1.0000
Epoch 14/20
1214/1214 - 1s - loss: 0.0023 - accuracy: 0.9995 - val_loss: 8.5136e-04 - val_accuracy: 0.9999
Epoch 15/20
1214/1214 - 1s - loss: 0.0021 - accuracy: 0.9995 - val_loss: 0.0022 - val_accuracy: 1.0000
Epoch 16/20
1214/1214 - 1s - loss: 0.0019 - accuracy: 0.9995 - val_loss: 2.8853e-04 - val_accuracy: 1.0000
Epoch 17/20
1214/1214 - 1s - loss: 0.0020 - accuracy: 0.9995 - val_loss: 5.6822e-04 - val_accuracy: 1.0000
Epoch 18/20
1214/1214 - 2s - loss: 0.0017 - accuracy: 0.9996 - val_loss: 6.3515e-04 - val_accuracy: 1.0000
Epoch 19/20
1214/1214 - 1s - loss: 0.0018 - accuracy: 0.9995 - val_loss: 3.6817e-04 - val_accuracy: 1.0000
Epoch 20/20
1214/1214 - 2s - loss: 0.0018 - accuracy: 0.9996 - val_loss: 0.0042 - val_accuracy: 0.9992
In [48]:
oversample_smote = confusion_matrix(final_ytest, oversample_pred)

plt.figure(figsize=(6,5))
plot_cm(oversample_smote, labels, title="SMOTE Oversample\nConfusion Matrix ", cmap=plt.cm.Greens)
Confusion matrix, without normalization
In [49]:
print(classification_report(final_ytest, oversample_pred, target_names=labels, digits=4))
              precision    recall  f1-score   support

        Safe     0.9998    0.9991    0.9994     56864
       Fraud     0.6296    0.8673    0.7296        98

    accuracy                         0.9989     56962
   macro avg     0.8147    0.9332    0.8645     56962
weighted avg     0.9991    0.9989    0.9990     56962

Awesome! The oversampled NN performed very well: its f1-score for safe transactions is near perfect, and its fraud f1-score (≈0.73) is much stronger than the undersampled NN's, driven by far higher fraud precision.

While the oversampled NN misclassified a handful more fraud transactions than the undersampled NN, the undersampled NN misclassified roughly 35x more safe transactions (about 1,800 versus ~50, based on the reports above). It's important to recognize that the cost of misclassifying a safe transaction is non-zero: flagging a safe transaction blocks the cardholder from making additional purchases until they can verify their account, which costs both the cardholder and the financial institution. Overall, it's probably safe to assume that the cost of missing 1 fraudulent transaction does not outweigh the cost of blocking ~300 safe transactions. Thus, we consider the oversampled NN the best model overall.
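As a quick sanity check on those numbers, we can back the approximate error counts out of the two classification reports above (recall × support, rounded; these are derived figures, not additional model output):

safe_support, fraud_support = 56864, 98

# Safe transactions incorrectly flagged as fraud = (1 - safe recall) * safe support
fp_undersample = round((1 - 0.9677) * safe_support)   # ~1,837
fp_oversample  = round((1 - 0.9991) * safe_support)   # ~51

# Missed fraud = (1 - fraud recall) * fraud support
fn_undersample = round((1 - 0.9286) * fraud_support)  # 7
fn_oversample  = round((1 - 0.8673) * fraud_support)  # 13

# Extra blocked safe transactions per additional missed fraud when switching to the oversampled NN
print((fp_undersample - fp_oversample) / (fn_oversample - fn_undersample))  # ≈ 300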

Findings / Conclusion

While the plan is to continue to iterate on this project, below is a summary of what we've learned so far.

EDA & Preprocessing

It's important to always explore the distributions of features in a dataset before beginning any predictive modeling. This helps with many things, including identifying which features need to be scaled, uncovering imbalances in our data, and locating outliers, nulls, and other important values.

When determining feature correlations in a highly imbalanced dataset, it's important to first balance the data through a method like random undersampling. If we don't, any correlations we derive will be dominated by the majority class, and we won't have a clear view of which features most influence the minority class.

In our analysis, we balanced our data through random undersampling of safe transactions. We were then able to accurately determine correlations and identify which features would most influence our predictions. From those highly correlated features, we were able to identify and remove the most impactful extreme outliers.
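For reference, a minimal sketch of that balancing-before-correlation step (assuming the full dataset lives in a DataFrame named df with the Class target, as in the notebook; the random_state is illustrative):

import pandas as pd

# Randomly undersample safe transactions so both classes are the same size
fraud = df[df['Class'] == 1]
safe = df[df['Class'] == 0].sample(n=len(fraud), random_state=42)
balanced_df = pd.concat([fraud, safe]).sample(frac=1, random_state=42)  # shuffle rows

# Correlations with the target computed on balanced data give a clearer picture
print(balanced_df.corr()['Class'].sort_values())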

Anticipating Model Effectiveness with Dimensionality Reduction

We can use dimensionality reduction techniques like T-SNE, PCA & Truncated SVD to get a sense of how well our classifiers might perform.

To do this, we first reduced the features of our dataset to just two and visualized the results in the xy-plane. Highlighting the different classes in this visualization, we saw how easily separable or "clustered" each class was, which gave us an early indication that our classification models would perform well down the line.
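A compact sketch of that gutcheck using scikit-learn's t-SNE (variable names like balanced_df are illustrative, carried over from the sketch above):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X_bal = balanced_df.drop('Class', axis=1).values
y_bal = balanced_df['Class'].values

# Project the features down to 2 dimensions and color points by class
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_bal)

plt.scatter(X_2d[y_bal == 0, 0], X_2d[y_bal == 0, 1], s=5, label='Safe')
plt.scatter(X_2d[y_bal == 1, 0], X_2d[y_bal == 1, 1], s=5, label='Fraud')
plt.legend()
plt.title('t-SNE projection of the balanced data')
plt.show()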

Cross-Validation & Imbalanced Data

In dealing with this imbalanced dataset, it was important that we first balance our data before training our classification models. Otherwise, our models would have overfit to the majority class, assuming that practically all transactions are safe.

When cross-validating, however, we took two approaches (sketched in code after the list below):

  1. First, we cross-validated on the undersampled (balanced) dataset. This allowed us to see which model best recognized fraud transactions. The performance metrics from this approach (accuracy, precision, recall, etc.) were not representative of production performance, however, because the balanced data does not reflect the class imbalance we would see in the real world.

  2. Second, to get a true view of our model's performances, we cross-validated on the original imbalanced data. This dataset was representative of the imbalance we would expect to see in production, thus cross-validating on this dataset gave an accurate view of our performance metrics.
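A minimal sketch of the two views (assuming X_undersample/y_undersample hold the balanced training data and original_Xtrain/original_ytrain the imbalanced training data; names and the scoring choice are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_reg = LogisticRegression(max_iter=1000)

# 1. Cross-validation on the balanced data: useful for comparing how well models recognize fraud
balanced_recall = cross_val_score(log_reg, X_undersample, y_undersample, cv=5, scoring='recall')

# 2. Cross-validation on the original, imbalanced data: representative of production performance
true_recall = cross_val_score(log_reg, original_Xtrain, original_ytrain, cv=5, scoring='recall')

print(balanced_recall.mean(), true_recall.mean())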

Baseline Models - Random Undersampling

We began by evaluating the performance of four different models: Logistic Regression, K Nearest Neighbors Classifier, Support Vector Classifier, and the Decision Tree Classifier. We made sure to optimize the hyperparameters for each model using GridSearchCV, trained and cross-validated our models on the undersampled (balanced) data, and checked learning curves to confirm our models were not over/underfitting.
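For example, the hyperparameter search for the logistic regression baseline might look roughly like this (the parameter grid shown is illustrative, not the exact one used in the notebook):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}

# Exhaustively search the grid with 5-fold cross-validation on the undersampled training data
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid.fit(X_undersample, y_undersample)

best_log_reg = grid.best_estimator_
print(grid.best_params_)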

To evaluate which model performed best at identifying fraud, we looked at both cross-validation accuracy and ROC_AUC on our undersampled (balanced) training data. We further evaluated each model on our undersampled test data, creating confusion matrices for each.

In the end, logistic regression performed the best at identifying fraud, thus we chose to focus on this model moving forward.

Logistic Regression Performance - Random Undersampling

Proceeding with our logistic regression model trained on undersampled data, we cross-validated the model using the original (imbalanced) training data to get a true view of its performance. We saw that our true performance metrics were lower than when we cross-validated on the undersampled data, especially precision (which made sense, as the imbalanced data consists of far more safe transactions with the potential to be misclassified as fraud).

Using our final test data, we looked at our model's precision-recall curve. We found relatively low scores overall, further indicating that our model did not do a great job of keeping safe transactions from being misclassified as the decision threshold was relaxed.
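A sketch of that evaluation (assuming the fitted undersample-trained model is named log_reg_under and final_Xtest/final_ytest are the held-out test split):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Use the decision-function scores to trace precision and recall across thresholds
y_scores = log_reg_under.decision_function(final_Xtest)
precision, recall, _ = precision_recall_curve(final_ytest, y_scores)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve (AP = {:.3f})'.format(average_precision_score(final_ytest, y_scores)))
plt.show()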

Logistic Regression Performance - Oversampling (SMOTE)

Next, we went back to the drawing board and trained logistic regression on oversampled data using the SMOTE technique. We optimized hyperparameters with RandomizedSearchCV this time, due to the large size of the training data. Upon cross-validating with the original (imbalanced) training data, we saw significantly higher precision and accuracy than from our previous logistic regression model trained on undersampled data. The reason is that we don't lose information about safe transactions when oversampling with SMOTE, so we can classify them more accurately.
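A minimal sketch of that setup (the parameter list is illustrative); wrapping SMOTE and the classifier in an imblearn Pipeline ensures oversampling only happens on the training folds during the search:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([('smote', SMOTE(sampling_strategy='minority')),
                     ('log_reg', LogisticRegression(solver='liblinear'))])

param_dist = {'log_reg__C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Sample a handful of candidate settings rather than exhaustively searching the grid
search = RandomizedSearchCV(pipeline, param_dist, n_iter=4, cv=5, scoring='recall', random_state=42)
search.fit(original_Xtrain, original_ytrain)
print(search.best_params_)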

We saw that while both models performed similarly on the undersampled (balanced) test data, the oversampled logistic regression model produced much higher accuracy on the final (imbalanced) test data. Furthermore, the oversampled logistic regression model far outperformed the undersampled model in terms of precision-recall.

Neural Network

Finally, we put our baseline model aside, and investigated the performance of neural networks. We built two simple NNs: the first was trained on the undersampled data, the second was trained on the oversampled data.

After evaluating both NNs on our original test data, we found that the oversampled NN also outperformed the undersampled NN (assuming 1 missed fraud costs less than ~300 blocked safe transactions).

Key Takeaways / Next Steps

Balancing our dataset using oversampling (SMOTE) helped us improve both our logistic regression and neural network models over random undersampling. While it's possible that undersampling could outperform oversampling in certain scenarios (e.g. depending on the relative cost of false positives and false negatives), oversampling generally preserves more of the original data, so models trained with oversampling tend to classify more accurately. Of course, the tradeoff is that oversampling requires more time and resources, since there is more training data to work with.

Next Steps: Recall that we removed outliers from our undersampled data before training our models. We should also do the same for our oversampled data to see if our performance improves further.

In [8]:
! jupyter nbconvert --to html Credit_Card_Fraud_Detection.ipynb
[NbConvertApp] Converting notebook Credit_Card_Fraud_Detection.ipynb to html
[NbConvertApp] Writing 1979390 bytes to Credit_Card_Fraud_Detection.html