First, I'd like to thank Janio Martinez & Pavan Sanagapati for inspiring me to take on this project. Both have provided excellent, in-depth analyses of this Kaggle Dataset, and I've learned a great deal from both.
I'd also like to thank Marco Altini for his article Dealing with Imbalanced Data: Undersampling, Oversampling, and Proper Cross-Validation, which helped me better understand many of the concepts in this notebook.
If you have any questions, suggestions for improving upon my approach, or just like my work, please don't hesitate to reach out at keilordykengilbert@gmail.com. Having a conversation is the best way for everyone involved to learn & improve 😊
In this kernel, our overall goal is to develop the best model for classifying both fraud and safe transactions correctly, working with Kaggle's Credit Card Fraud Detection dataset.
We are given a dataset of roughly 285K transactions to work with, for which we have the following features: Time, Amount, and 28 anonymized features (V1-V28) which are the result of a previously-performed PCA. We know that these anonymized features have already been scaled, but Time and Amount have not. We also have our target variable Class, which tells us if transactions are fraudulent.
Our approach includes: exploring and scaling the data; balancing the classes through random undersampling and SMOTE oversampling; analyzing correlations and removing extreme outliers; using dimensionality reduction to gauge how separable the classes are; training, tuning, and cross-validating several classifiers (Logistic Regression, K Nearest Neighbors, SVC, Decision Tree); and finally comparing simple neural networks trained on undersampled vs oversampled data.
Let's begin by loading the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict, train_test_split
from sklearn.model_selection import StratifiedKFold, learning_curve, ShuffleSplit, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve, accuracy_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, average_precision_score
from scipy import stats
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
First, let's take a high-level view of our data to get a good sense of what we're working with. In particular, let's look at summary statistics for each feature, the balance between fraud & safe transactions, and null values.
txn = pd.read_csv('creditcard.csv')
txn
txn.describe().T
txn.info()
No nulls!
txn_type = txn['Class'].apply(lambda x: 'Fraud' if x==1 else 'Not Fraud').value_counts()
print('There are {} fraud transactions ({:.2%})'.format(txn_type['Fraud'], txn_type['Fraud']/txn.shape[0]))
print('There are {} safe transactions ({:.2%})'.format(txn_type['Not Fraud'], txn_type['Not Fraud']/txn.shape[0]))
Awesome! This gives us a helpful insight: with only 0.17% fraud rate, we have a very imbalanced dataset, which we will need to account for when developing our predictive models.
Let's take a closer look at the distribution of each feature.
fig, axes = plt.subplots(7,4,figsize=(14,14))
feats = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12', 'V13', 'V14', 'V15',
'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
for i, ax in enumerate(axes.flatten()):
ax.hist(txn[feats[i]], bins=25, color='green')
ax.set_title(str(feats[i])+' Distribution', color='brown')
ax.set_yscale('log')
plt.tight_layout()
max_val = np.max(txn[feats].values)
min_val = np.min(txn[feats].values)
print('All values range: ({:.2f}, {:.2f})'.format(min_val, max_val))
The anonymous V__ features tend to exhibit various distributions and ranges. Overall, values across all features fall within the range (-114, 121). Note we're using a logarithmic y-axis to help visualize the data more clearly.
Let's also look at the distributions for Time & Amount:
plt.figure(figsize=(14,6))
sns.distplot(txn['Time'])
plt.figure(figsize=(14,6))
sns.distplot(txn['Amount'], hist=False, rug=True)
Our next step in preparing the data for predictive modeling is to scale our features appropriately. We can infer that the V__ features of this dataset are already scaled, as they have already undergone PCA (in which scaling would have occurred). However, we still need to scale Amount and Time. To do this, we'll use RobustScaler(), which performs better on datasets with significant outliers (note the outliers in the Amount distribution above).
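For intuition, RobustScaler's default behavior is to center each feature on its median and divide by its interquartile range (the 25th-to-75th-percentile spread), so a handful of huge values can't dominate the scale. A quick manual sketch of the same computation (for illustration only, not part of the pipeline):
# Illustration only: manual equivalent of RobustScaler's default behavior,
# i.e. (x - median) / IQR, shown here for the raw Amount column.
amt = txn['Amount'].values
amt_scaled_manually = (amt - np.median(amt)) / (np.percentile(amt, 75) - np.percentile(amt, 25))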
txn['Amount'] = RobustScaler().fit_transform(txn['Amount'].values.reshape(-1,1))
txn['Time'] = RobustScaler().fit_transform(txn['Time'].values.reshape(-1,1))
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,4))
sns.distplot(txn['Time'], ax=ax1)
sns.distplot(txn['Amount'], hist=False, rug=True, ax=ax2)
Success! (Note the change in scale on the x-axis)
Now let's split our data into our final train and test populations. While we will train & cross-validate our models on various samples of the training data (final_Xtrain, final_ytrain - more on this later), we will preserve this test data (final_Xtest, final_ytest) to evaluate the final performance of our models at the end of our analysis. Note that we make sure to stratify by our target variable, Class, to ensure both populations are representative.
X = txn.drop(['Class'], axis=1)
y = txn['Class']
final_Xtrain, final_Xtest, final_ytrain, final_ytest = train_test_split(X,
y, test_size=0.2, stratify=y, random_state=42)
final_Xtrain = final_Xtrain.values
final_Xtest = final_Xtest.values
final_ytrain = final_ytrain.values
final_ytest = final_ytest.values
train_unique_label, train_counts_label = np.unique(final_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(final_ytest, return_counts=True)
print()
print('Proportions [Safe vs Fraud]')
print('Training %: '+ str(100*train_counts_label/len(final_ytrain)))
print('Testing %: '+ str(100*test_counts_label/len(final_ytest)))
Awesome! Per our output, both our train & test populations have 99.83% safe and 0.17% fraud transactions.
Now that we've explored and scaled our data, we can focus on "balancing" our data.
Balancing our data means resampling it so that we have a 50/50 split between fraud and safe transactions (as we just saw, we're pretty far from that at the moment with 99.83% safe vs 0.17% fraud). It is important that we create this 50/50 split before training our models, because otherwise our models will overfit to the majority class, effectively assuming that essentially all transactions are safe. We want to train our models to recognize the characteristics of fraud, not to assume that most transactions aren't fraud. Thus we create a 50/50 split to remove the effect of the imbalance during training.
Let's consider two ways to create this 50/50 split:
Random Undersampling: Here, we reduce the number of transactions in the majority class (i.e. safe transactions) randomly, until the counts of both the majority and minority classes are equal.
SMOTE Oversampling: Here, we increase the number of transactions in the minority class (i.e. fraud transactions) by creating synthetic transactions between existing fraud transactions that lie close to one another. Synthetic transactions continue to be created until the counts of the majority and minority classes are equal.
We will perform both sampling approaches to determine which performs better. Let's start with random undersampling:
# shuffle data first, so it's random
txn = txn.sample(frac=1)
txn_fraud = txn.loc[txn['Class']==1]
txn_safe = txn.loc[txn['Class']==0][:492]
txn_under = pd.concat([txn_fraud, txn_safe])
txn_under = txn_under.sample(frac=1, random_state=41)
txn_under.shape
txn_type = txn_under['Class'].apply(lambda x: 'Fraud' if x==1 else 'Not Fraud').value_counts()
print('Randomly Undersampled Dataset (txn_under):')
print('There are {} fraud transactions ({:.2%})'.format(txn_type['Fraud'], txn_type['Fraud']/txn_under.shape[0]))
print('There are {} safe transactions ({:.2%})'.format(txn_type['Not Fraud'], txn_type['Not Fraud']/txn_under.shape[0]))
Awesome! Through random undersampling, we now have a balanced dataset, txn_under, with which we can predict fraudulent transactions without overfitting.
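As an aside, imblearn also provides a RandomUnderSampler that should yield an equivalent 50/50 sample to the manual slicing above; a minimal sketch (not used elsewhere in this notebook):
# Sketch only: imblearn's RandomUnderSampler as an alternative to manual slicing.
# (Older imblearn versions expose .fit_sample() instead of .fit_resample().)
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=41)  # 1 fraud : 1 safe
X_rus, y_rus = rus.fit_resample(txn.drop('Class', axis=1), txn['Class'])
print(np.unique(y_rus, return_counts=True))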
Next, let's create a correlation matrix to get a clear view of which features are most heavily correlated with fraud. Note that we should only use the balanced dataset (i.e. txn_under) to determine correlations; as discussed above, correlations computed on the imbalanced dataset (i.e. txn) are dominated by the majority class. Let's create correlation matrices for both to illustrate the difference (but make sure you only reference the balanced matrix to determine correlations!).
fig, axes = plt.subplots(2, 1, figsize=(20,16))
sns.heatmap(txn_under.corr(), annot=True, fmt=".2f", cmap = 'RdYlGn', ax=axes[0])
axes[0].set_title("Balanced Correlation Matrix (Reference This One)", fontsize=20, fontweight='bold')
sns.heatmap(txn.corr(), annot=True, fmt=".2f", cmap = 'RdYlGn', ax=axes[1])
axes[1].set_title('Imbalanced Correlation Matrix', fontsize=20, fontweight='bold')
plt.tight_layout()
Great! As we can see, balancing our data gives very different (true) correlation values. Referencing the balanced output, we see fraud is most highly correlated with the following features:
Negative Correlations: V14, V12, V10
Positive Correlations: V11, V4
Let's view these features a bit more closely...
fig, axes = plt.subplots(2,3,figsize=(14,8))
high_corr_feats = ['V14', 'V12', 'V10', 'V11', 'V4']
for i, ax in enumerate(axes.flatten()):
if i == 5:
ax.axis('off')
break
sns.boxplot(x='Class', y=high_corr_feats[i], data=txn_under, ax=ax, palette=sns.color_palette('magma_r', 2))
ax.set_ylabel(None)
ax.set_title(label=high_corr_feats[i], fontsize=16, fontweight='bold')
plt.tight_layout()
Awesome! Just as we should expect -- fraud transactions have lower values for features that are negatively correlated (V14, V12, V10) and higher values for features that are positively correlated (V11, V4).
Note the presence of outliers in some of these features. We want to remove extreme fraud outliers from the most highly correlated features, an important preprocessing step for maximizing model performance at recognizing fraud. Focusing on the most highly correlated features ensures we address only the most impactful outliers, and removing only the most extreme values helps prevent unnecessary loss of information.
We will use the interquartile range (IQR) to identify outliers: for each feature, fraud values falling outside the fences (Q25 - 1.5*IQR, Q75 + 1.5*IQR), where IQR = Q75 - Q25, will be treated as outliers. But first, let's get a better view of the distributions for each feature, to make sure they are roughly normal and thus that this outlier removal approach makes sense.
fig, axes = plt.subplots(2,3,figsize=(14,7))
fig.suptitle(' Fraud Transaction Distributions', fontsize=20, fontweight='bold')
for i, ax in enumerate(axes.flatten()):
if i == 5:
ax.axis('off')
break
v_fraud = txn_under[txn_under['Class']==1][high_corr_feats[i]].values
sns.distplot(v_fraud, ax=ax, fit=stats.norm)
ax.set_title(str(high_corr_feats[i]), fontsize=12)
These all look normal enough to proceed. We're ready to remove outliers:
len(txn_under)
high_corr_feats2 = ['V14', 'V12', 'V10', 'V11', 'V4']
for i in high_corr_feats2:
v_fraud = txn_under[txn_under['Class']==1][i]
q75 = np.percentile(v_fraud, 75)
q25 = np.percentile(v_fraud, 25)
iqr = q75 - q25
v_lower, v_upper = q25-1.5*iqr, q75+1.5*iqr
outliers = [x for x in v_fraud if x > v_upper or x < v_lower]
print(str(len(outliers))+' '+str(i)+' fraud outliers: '+str(outliers)+'\n')
txn_under = txn_under.drop(txn_under.index[txn_under[i].isin(outliers) &
(txn_under['Class'] == 1)])
len(txn_under)
fig, axes = plt.subplots(2, 3, figsize=(20,12))
fig.suptitle(' Outlier Reduction', fontsize=20, fontweight='bold')
loc1 = [(0.98, -17.5), (0.98, -17.3), (0.98, -14.5), (0.98, 9.2), (0.98, 10.8)]
loc2 = [(0, -12), (0, -12), (0, -12), (0, 6), (0, 8)]
for i, ax in enumerate(axes.flatten()):
if i == 5:
ax.axis('off')
break
sns.boxplot(x="Class", y=high_corr_feats[i], data=txn_under, ax=ax, palette=sns.color_palette('magma_r', 2))
ax.set_title(str(high_corr_feats[i]), fontsize=16, fontweight='bold')
ax.annotate('Fewer extreme\n outliers', xy=loc1[i], xytext=loc2[i],
arrowprops=dict(facecolor='Red'), fontsize=14)
ax.set_ylabel('')
Great! We successfully removed the most extreme outliers from our most highly-correlated features.
Before we proceed with creating classification models, let's first get a sense of how effective our models might be through performing dimensionality reduction on our data.
Specifically, we'll use three dimensionality reduction methods (t-SNE, PCA & Truncated SVD) to reduce the number of our features to just two. We will then graph the results on the xy-plane, highlighting fraud and safe transactions differently. If we're able to see a clear separation between classes in the graphs, that will give us an indication that further predictive models may perform well at classifying fraud.
X = txn_under.drop('Class', axis=1)
y = txn_under['Class']
# Implement dimensionality reductions
X_pca = PCA(n_components=2, random_state=38).fit_transform(X.values)
X_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=37).fit_transform(X.values)
X_tsne = TSNE(n_components=2, random_state=39).fit_transform(X.values)
f, axes = plt.subplots(1, 3, figsize=(24,6))
# labels = ['No Fraud', 'Fraud']
f.suptitle(' Dimensionality Reductions', fontsize=20, fontweight='bold')
green_patch = mpatches.Patch(color='darkgreen', label='No Fraud')
red_patch = mpatches.Patch(color='darkred', label='Fraud')
dim_red = [X_pca, X_svd, X_tsne]
titles = ['PCA', 'Truncated SVD', 't-SNE']
for i, ax in enumerate(axes):
ax.scatter(dim_red[i][y.values == 0, 0], dim_red[i][y.values == 0, 1], c='darkgreen', label='No Fraud', linewidths=2)
ax.scatter(dim_red[i][y.values == 1, 0], dim_red[i][y.values == 1, 1], c='darkred', label='Fraud', linewidths=2)
ax.set_title(titles[i], fontsize=20)
ax.grid(True)
ax.legend(handles=[green_patch, red_patch])
Fantastic! We see a clear separation between fraud & safe transactions in all 3 graphs, especially in PCA and Truncated SVD! This gives us a good indication that our predictive models will be able to effectively classify fraud.
We're finally ready to create our classification models! We'll start with four types of classifiers: Logistic Regression, K Nearest Neighbors, SVC, and Decision Tree.
First, we'll train and cross-validate our models on the txn_under dataset to get a sense of which model does the best job of recognizing fraud. Once we have an opinion on which model does best, we'll cross-validate that model on the unsampled (original) data to get an objective view of its true performance.
Before we proceed further, let's digress for a moment to talk about a common mistake made when dealing with imbalanced datasets. While we should train our models on balanced data, we should cross-validate on imbalanced (i.e. original) data to best evaluate the objective performance of our trained models. Cross-validating on the original data gives us the most accurate view of our model's performance in production, because the original data preserves and is representative of the imbalance we would expect to see in production.
Our initial goal here, however, isn't to determine objective performance, but rather to determine which model does the best job of identifying the characteristics of fraud. In this case, it is okay for us to cross-validate with our undersampled data, as long as we are consistent in our approach across all models. We will still be able to rank models' abilities to identify fraud by cross-validating on undersampled data, even if the scores we derive are not representative of what we would expect to see in production.
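To make this concrete, here is a minimal sketch of the "resample only inside the training folds" pattern we'll use later when we want objective scores (cv_with_resampling is a hypothetical helper name of our own, not a library function, and it assumes X and y are numpy arrays):
# Sketch of resampling inside cross-validation: the sampler only ever sees the
# training portion of each fold, and scoring happens on the untouched
# (still imbalanced) validation fold.
def cv_with_resampling(X, y, sampler, clf, n_splits=5):
    scores = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        pipe = imbalanced_make_pipeline(sampler, clf)   # resampling happens only in .fit()
        pipe.fit(X[train_idx], y[train_idx])
        scores.append(pipe.score(X[val_idx], y[val_idx]))
    return np.mean(scores)
# e.g. cv_with_resampling(final_Xtrain, final_ytrain, NearMiss(), LogisticRegression(max_iter=10000))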
Let's start by splitting our undersampled data, txn_under, into train and test sets:
X = txn_under.drop('Class', axis=1)
y = txn_under['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
models = {"Log Reg": LogisticRegression(), "KNN": KNeighborsClassifier(), "SVC": SVC(),
"D Tree": DecisionTreeClassifier()}
print('Mean cv accuracy on undersampled data. \n')
for name, model in models.items():
training_acc = cross_val_score(model, X_train, y_train, cv=5)
print(name+":", str(round(training_acc.mean()*100, 2))+"%")
Despite the fact that these scores are not representative of true production accuracy, we see that all four models do a very good job (~90% or more) of identifying fraud, which is good, and in line with what we might expect given what we saw from our dimensionality reduction earlier. Let's see how much we can improve these scores by optimizing our hyperparameters with GridSearchCV. Again, we'll perform cross-validation on the balanced data so that we can compare to the baseline scores above:
# Use GridSearchCV to find the best parameters.
print('Mean cv scores on undersampled data after tuning hyperparameters. \n')
# Logistic Regression
log_reg_params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
grid_log_reg = GridSearchCV(LogisticRegression(max_iter=10000), log_reg_params)
grid_log_reg.fit(X_train, y_train)
log_reg = grid_log_reg.best_estimator_
log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Log Reg: ', round(log_reg_score.mean() * 100, 2).astype(str) + '%')
# K Nearest Neighbors
knn_params = {"n_neighbors": list(range(2,6,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knn = GridSearchCV(KNeighborsClassifier(), knn_params)
grid_knn.fit(X_train, y_train)
knn = grid_knn.best_estimator_
knn_score = cross_val_score(knn, X_train, y_train, cv=5)
print('KNN: ', round(knn_score.mean() * 100, 2).astype(str) + '%')
# SVC
svc_params = {'C': [0.5, 0.6, 0.7, 0.8, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)
svc = grid_svc.best_estimator_
svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('SVC: ', round(svc_score.mean() * 100, 2).astype(str) + '%')
# DescisionTree
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)),
"min_samples_leaf": list(range(3,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)
tree = grid_tree.best_estimator_
tree_score = cross_val_score(tree, X_train, y_train, cv=5)
print('D Tree: ', str(round(tree_score.mean() * 100, 2)) + '%')
Awesome! We were able to improve our cross-validation scores.
Let's plot learning curves to get a sense of whether our models are over/underfitting. Note that the wider the gap between our training and cross-validation scores, the more likely we are overfitting:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
fig, axes = plt.subplots(2,2, figsize=(18,12), sharey=True)
classifier = [log_reg, knn, svc, tree]
titles = ["Logistic Regression Learning Curve", "K Nearest Neighbors Learning Curve",
"Support Vector Classifier Learning Curve", "Decision Tree Classifier Learning Curve"]
for i, ax in enumerate(axes.flatten()):
train_sizes, train_acc, test_acc = learning_curve(
classifier[i], X_train, y_train, cv=cv, n_jobs=4, train_sizes=np.linspace(.1, 1.0, 10))
train_acc_mean = np.mean(train_acc, axis=1)
train_acc_std = np.std(train_acc, axis=1)
test_acc_mean = np.mean(test_acc, axis=1)
test_acc_std = np.std(test_acc, axis=1)
ax.fill_between(train_sizes, train_acc_mean - train_acc_std, train_acc_mean + train_acc_std, alpha=0.3, color="#b2b8b7")
ax.fill_between(train_sizes, test_acc_mean - test_acc_std, test_acc_mean + test_acc_std, alpha=0.3, color="#46d448")
ax.plot(train_sizes, train_acc_mean, 'o-', color="#b2b8b7", label="Training accuracy")
ax.plot(train_sizes, test_acc_mean, 'o-', color="#46d448", label="Cross-validation accuracy")
ax.set_title(titles[i], fontsize=14)
ax.set_xlabel('Training size')
ax.set_ylabel('Accuracy')
ax.grid(True)
ax.legend(loc='upper right')
plt.ylim(0.86, 1.01);
Great! We don't appear to be overfitting given the lack of a large gap between curves as training size increases, and we don't appear to be underfitting given how high our scores are.
Let's take a look at the Receiver Operating Characteristic (ROC) curve for each of our models to get a better sense of their performance. An ROC curve graphs the trade-off between the False Positive Rate (the percentage of safe transactions our model classifies incorrectly) and the True Positive Rate (the percentage of fraud transactions our model classifies correctly) across every possible decision threshold. In other words, it shows how much incorrect classification of safe transactions we must accept in order to correctly classify more fraud transactions as we move the decision boundary.
To evaluate model performance, we'll calculate the area under the ROC curve (AUC). This tells us which model does the best job of distinguishing between fraud and safe transactions overall.
print ('Model ROC AUC \n')
log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5, method="decision_function")
print('Log Reg: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
knn_pred = cross_val_predict(knn, X_train, y_train, cv=5)
print('KNN: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))
svc_pred = cross_val_predict(svc, X_train, y_train, cv=5, method="decision_function")
print('SVC: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
tree_pred = cross_val_predict(tree, X_train, y_train, cv=5)
print('D Tree: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
log_fpr, log_tpr, log_thresold = roc_curve(y_train, log_reg_pred)
knn_fpr, knn_tpr, knn_threshold = roc_curve(y_train, knn_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)
plt.figure(figsize=(12,6))
plt.title('ROC Curves', fontsize=18)
plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
plt.plot(knn_fpr, knn_tpr, label='K Nearest Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knn_pred)))
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([-0.01, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.annotate('Random Classifier Baseline (AUC = 0.5)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
arrowprops=dict(facecolor='Red', shrink=0.05))
plt.legend()
Awesome! With the highest ROC_AUC score and undersampled cross-validation accuracy, logistic regression looks like it could be our best model...
We're finally ready to test our models with our X_test/y_test data to determine which model does the best at recognizing fraud (keep in mind, this is still not the original/imbalanced data).
# Predict on X_test
log_reg_pred2 = log_reg.predict(X_test)
knn_pred2 = knn.predict(X_test)
svc_pred2 = svc.predict(X_test)
tree_pred2 = tree.predict(X_test)
log_reg_cf = confusion_matrix(y_test, log_reg_pred2)
knn_cf = confusion_matrix(y_test, knn_pred2)
svc_cf = confusion_matrix(y_test, svc_pred2)
tree_cf = confusion_matrix(y_test, tree_pred2)
fig, axes = plt.subplots(2, 2,figsize=(18,10))
titles = ['Logistic Regression', 'K Nearest Neighbors', 'Support Vector Classifier', 'Decision Tree Classifier']
conf_matrix = [log_reg_cf, knn_cf, svc_cf, tree_cf]
fig.suptitle('Confusion Matrices (Random Undersampling)', fontsize=20, fontweight='bold')
for i, ax in enumerate(axes.flatten()):
sns.heatmap(conf_matrix[i], ax=ax, annot=True, fmt='.0f', cmap='magma')
ax.set_title(titles[i], fontsize=14)
ax.set_xticklabels(['Predicted\nSafe', 'Predicted\nFraud'], fontsize=10)
ax.set_yticklabels(['Safe', 'Fraud'], fontsize=10)
print('Logistic Regression:')
print(classification_report(y_test, log_reg_pred2))
print('K Nearest Neighbors:')
print(classification_report(y_test, knn_pred2))
print('Support Vector Classifier:')
print(classification_report(y_test, svc_pred2))
print('DecisionTree Classifier:')
print(classification_report(y_test, tree_pred2))
Awesome! Based on our undersampled test data, logistic regression performed the best at recognizing fraud. We will continue our analysis with just Logistic Regression.
First, let's come back to our previous conversation on the proper way to cross-validate for estimating true performance. So far, we've only cross-validated using balanced/undersampled data. Let's now cross-validate logistic regression using the original, imbalanced data.
Recall that we still have to train our model on balanced data. To do this, we'll implement the NearMiss() algorithm, an undersampling technique which, rather than discarding majority-class samples at random, keeps the majority (safe) samples that lie closest to the minority (fraud) samples and removes the rest, focusing training on the boundary region between the two classes.
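Before running it, here's a tiny toy illustration (synthetic 2-D data, a sketch only, not our transaction set) of what NearMiss does to the class counts:
# Toy sketch (illustration only): NearMiss keeps the majority points lying
# closest to the minority class and drops the rest, down to the minority count.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, size=(95, 2)),   # 95 'safe'-like points
                   rng.normal(3, 1, size=(5, 2))])   # 5 'fraud'-like points
y_toy = np.array([0]*95 + [1]*5)
X_nm, y_nm = NearMiss().fit_resample(X_toy, y_toy)
print(np.bincount(y_toy), '->', np.bincount(y_nm))   # [95  5] -> [5 5]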
accuracy_undersample = []
precision_undersample = []
recall_undersample = []
f1_undersample = []
auc_undersample = []
# Cross-Validating correctly to determine real-world performance
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
for train, test in sss.split(final_Xtrain, final_ytrain):
pipeline_undersample = imbalanced_make_pipeline(NearMiss(), log_reg)
model_undersample = pipeline_undersample.fit(final_Xtrain[train], final_ytrain[train])
prediction_undersample = model_undersample.predict(final_Xtrain[test])
accuracy_undersample.append(pipeline_undersample.score(final_Xtrain[test], final_ytrain[test]))
precision_undersample.append(precision_score(final_ytrain[test], prediction_undersample))
recall_undersample.append(recall_score(final_ytrain[test], prediction_undersample))
f1_undersample.append(f1_score(final_ytrain[test], prediction_undersample))
auc_undersample.append(roc_auc_score(final_ytrain[test], prediction_undersample))
Now let's compare performance metrics from both approaches: evaluating on the undersampled/balanced data versus cross-validating on the original/imbalanced data.
precision, recall, threshold = precision_recall_curve(y_train, log_reg_pred)
y_pred = log_reg.predict(X_train)
# Optimistic view: scores computed on the undersampled/balanced training data
print('Evaluated on Undersampled/Balanced Data: \n')
print('Accuracy Score: {:.4f}'.format(accuracy_score(y_train, y_pred)))
print('Precision Score: {:.4f}'.format(precision_score(y_train, y_pred)))
print('Recall Score: {:.4f}'.format(recall_score(y_train, y_pred)))
print('F1 Score: {:.4f}'.format(f1_score(y_train, y_pred)))
print('---' * 20)
# Realistic view: scores from cross-validation on the original/imbalanced data
print('Cross-Validating on Original/Imbalanced Data: \n')
print("Accuracy Score: {:.4f}".format(np.mean(accuracy_undersample)))
print("Precision Score: {:.4f}".format(np.mean(precision_undersample)))
print("Recall Score: {:.4f}".format(np.mean(recall_undersample)))
print("F1 Score: {:.4f}".format(np.mean(f1_undersample)))
Awesome! While we tend to see lower scores when cross-validating on the original/imbalanced data, these differences make sense.
For example, we would expect much lower precision on the original/imbalanced data than on the undersampled/balanced data, simply because the imbalanced data contains far more safe transactions, so far more of them can be misclassified as fraud. For illustration, if a validation fold held roughly 45,000 safe and 80 fraud transactions, even a 2% false-positive rate would generate around 900 false alarms against at most 80 true positives, capping precision below 10%.
Note, however, that recall remains relatively unchanged. This also makes sense: undersampling never removes fraud transactions, so the model is evaluated against essentially the same fraud population in both scenarios.
Now we're ready to test our logistic regression model (trained on NearMiss undersampled data) with our final_Xtest and final_ytest data. We'll use both precision and recall to evaluate performance:
y_score_under = log_reg.decision_function(final_Xtest)
undersample_average_precision = average_precision_score(final_ytest, y_score_under)
fig = plt.figure(figsize=(14,5))
precision, recall, _ = precision_recall_curve(final_ytest, y_score_under)
plt.step(recall, precision, color='Green', alpha=0.3, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='Green')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('Undersampled Precision-Recall Curve (Avg Score: {0:0.4f})'.format(undersample_average_precision), fontsize=16);
Our average precision-recall score on the test data is objectively low. This may seem suspicious after seeing our ROC curves above, but remember, we're now evaluating our model on the original/imbalanced data. As the graph shows, our model misclassifies many more safe transactions as fraud as we relax the decision boundary.
Now let's proceed with our oversampling approach! We'll be using the SMOTE technique to oversample.
Unlike with the undersampling approach in which we randomly remove data from the majority class, with SMOTE we create new synthetic points for the minority class to achieve class balance. Specifically, SMOTE creates synthetic points between closest-neighbors of the minority class. This approach retains more information than undersampling, as we aren't removing any data from the original dataset. However, because we have to train on more data with SMOTE, the approach is more time/resource-intensive.
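As a tiny toy illustration of the interpolation idea (synthetic 2-D data again, a sketch only, not our transaction set):
# Toy sketch (illustration only): SMOTE picks a minority point, chooses one of
# its k nearest minority neighbors, and adds a synthetic point on the segment
# between them, repeating until the classes are balanced.
rng = np.random.RandomState(1)
X_toy2 = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(3, 1, size=(5, 2))])
y_toy2 = np.array([0]*95 + [1]*5)
X_sm_toy, y_sm_toy = SMOTE(sampling_strategy='minority', k_neighbors=3, random_state=1).fit_resample(X_toy2, y_toy2)
print(np.bincount(y_toy2), '->', np.bincount(y_sm_toy))   # [95  5] -> [95 95]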
Again, while we should train our model on the oversampled/balanced data, it's important that we cross-validate on the original data. Overfitting can result if we cross-validate on the oversampled data, as this population contains additional synthetic samples and does not represent the true imbalance between classes. As with undersampling, we want to cross-validate on the original/imbalanced data to get an objective view of model performance.
accuracy_oversample = []
precision_oversample = []
recall_oversample = []
f1_oversample = []
auc_oversample = []
# Classifier with optimal parameters - we use RandomizedSearch instead of GridSearch, given large sample size.
log_reg_params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(random_state=4, max_iter=1000), log_reg_params, n_iter=4)
for train, test in sss.split(final_Xtrain, final_ytrain):
# Apply SMOTE during training, not cross-validation.
pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg)
model = pipeline.fit(final_Xtrain[train], final_ytrain[train])
log_reg_sm = rand_log_reg.best_estimator_
prediction = log_reg_sm.predict(final_Xtrain[test])
accuracy_oversample.append(pipeline.score(final_Xtrain[test], final_ytrain[test]))
precision_oversample.append(precision_score(final_ytrain[test], prediction))
recall_oversample.append(recall_score(final_ytrain[test], prediction))
f1_oversample.append(f1_score(final_ytrain[test], prediction))
auc_oversample.append(roc_auc_score(final_ytrain[test], prediction))
print('Cross-Validation on Original/Imbalanced Data (Correct Approach, SMOTE)')
print('')
print("accuracy: {:.4f}".format(np.mean(accuracy_oversample)))
print("precision: {:.4f}".format(np.mean(precision_oversample)))
print("recall: {:.4f}".format(np.mean(recall_oversample)))
print("f1: {:.4f}".format(np.mean(f1_oversample)))
print("auc: {:.4f}".format(np.mean(auc_oversample)))
Interesting! We tend to see higher cross-validation scores (precision, accuracy, f1) when training with oversampled data than when training with undersampled data from earlier in our analysis... which makes sense.
For example, we see significantly higher precision when training on the oversampled data. The reason is that our oversampling approach doesn't lose any information about the safe transactions during training (e.g. by randomly removing data), which allows our model to learn the characteristics of safe transactions more thoroughly, misclassify them less often, and therefore achieve higher precision.
Note, however, that recall remained relatively unchanged. This is a testament to how well SMOTE is able to add new synthetic fraud transactions to the training data while maintaining the original distribution/characteristics of fraud.
Out of curiosity, let's compare how our logistic regression models trained on undersampled data (log_reg) vs oversampled data (log_reg_sm) perform on the undersampled test data (X_test & y_test).
# Predict on X_test
log_reg_sm_pred = log_reg_sm.predict(X_test)
log_reg_sm_cf = confusion_matrix(y_test, log_reg_sm_pred)
fig, axes = plt.subplots(1, 2,figsize=(18,6))
titles = ['SMOTE Oversampling', 'Random Undersampling']
conf_matrix = [log_reg_sm_cf, log_reg_cf]
fig.suptitle('Logistic Regression Confusion Matrices ', fontsize=20, fontweight='bold')
for i, ax in enumerate(axes.flatten()):
sns.heatmap(conf_matrix[i], ax=ax, annot=True, fmt='.0f', cmap='magma')
ax.set_title(titles[i], fontsize=14)
ax.set_xticklabels(['Predicted\nSafe', 'Predicted\nFraud'], fontsize=10)
ax.set_yticklabels(['Safe', 'Fraud'], fontsize=10)
labels = ['No Fraud', 'Fraud']
print('Performance on Undersampled Test Data \n')
print('Logistic Regression, SMOTE Oversampling:')
print(classification_report(y_test, log_reg_sm_pred, target_names=labels))
print('Logistic Regression, Random Undersampling:')
print(classification_report(y_test, log_reg_pred2, target_names=labels))
Awesome, SMOTE slightly improved logistic regression's performance on the undersampled test data!
Let's now see which model performs best on the original test data (final_Xtest, final_ytest). We'll start with accuracy:
print('Logistic Regression Performance, Final Testing:\n')
# Logistic regression trained on undersampled data
y_pred = log_reg.predict(final_Xtest)
undersample_accuracy = accuracy_score(final_ytest, y_pred)
print('Undersampling Accuracy: {:.4f}'.format(undersample_accuracy))
# Logistic regression trained on oversampled data
y_pred_sm = log_reg_sm.predict(final_Xtest)
oversample_accuracy = accuracy_score(final_ytest, y_pred_sm)
print('Oversampling Accuracy: {:.4f}'.format(oversample_accuracy))
Wow! Oversampling with SMOTE greatly improved accuracy on the original test data. Let's compare precision and recall as well:
labels = ['No Fraud', 'Fraud']
print('Logistic Regression Performance, Final Testing: \n')
print('Logistic Regression, NearMiss Undersampling:')
print(classification_report(final_ytest, y_pred, target_names=labels))
print('Logistic Regression, SMOTE Oversampling:')
print(classification_report(final_ytest, y_pred_sm, target_names=labels))
y_score_over = log_reg_sm.decision_function(final_Xtest)
oversample_average_precision = average_precision_score(final_ytest, y_score_over)
fig = plt.figure(figsize=(14,5))
precision, recall, _ = precision_recall_curve(final_ytest, y_score_over)
plt.step(recall, precision, color='Red', alpha=0.3, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.3, color='Orange')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('Oversampled Precision-Recall Curve (Avg Score: {0:0.4f})'.format(oversample_average_precision), fontsize=16);
Awesome! SMOTE oversampling gives much higher accuracy overall, as well as higher precision and f1 scores for fraud transactions. Additionally, as we can see in the precision-recall graph directly above, training with SMOTE resulted in a much better average precision-recall score. In comparison to our undersampling precision-recall curve, we now misclassify far fewer safe transactions as fraud as we relax the decision boundary. Clearly, the information loss from undersampling hurt that model's ability to correctly classify safe transactions.
Now let's implement two simple neural networks (NNs) to see how they perform!
To create our NNs, we'll use a first hidden layer with the same number of nodes as features (plus a bias term), a second hidden layer with 32 nodes, and an output layer with two softmax nodes giving the predicted probability that a transaction is safe (0) or fraudulent (1).
First, we'll build our undersampled model.
import itertools
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
NN_undersample = Sequential([Dense(X_train.shape[1], input_shape=(X_train.shape[1], ), activation='relu'),
Dense(32, activation='relu'),
Dense(2, activation='softmax')])
NN_undersample.summary()
Great! Now, let's train this network on the undersampled data, then we'll predict on the final test data.
NN_undersample.compile(Adam(lr=0.001), metrics=['accuracy'], loss='sparse_categorical_crossentropy')
NN_undersample.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=2)
undersample_pred = NN_undersample.predict_classes(final_Xtest)
Now, let's create a function to plot the confusion matrix...
def plot_cm(cm, classes, normalize=False, title='Confusion matrix', cmap='Blues'):
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=14)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
plt.ylabel('Actual')
plt.xlabel('Predicted')
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
undersample_cm = confusion_matrix(final_ytest, undersample_pred)
labels = ['Safe', 'Fraud']
plt.figure(figsize=(6,5))
plot_cm(undersample_cm, labels, title="Random Undersample\nConfusion Matrix")
print(classification_report(final_ytest, undersample_pred, target_names=labels, digits=4))
Awesome! The undersampled NN performed pretty well, with recall and precision similar to what we saw from our oversampled logistic regression model earlier.
Let's see how our performance changes when we train our NN using oversampled data.
sm = SMOTE(sampling_strategy='minority', random_state=49)
Xsm_train, ysm_train = sm.fit_sample(final_Xtrain, final_ytrain)
NN_oversample = Sequential([Dense(Xsm_train.shape[1], input_shape=(Xsm_train.shape[1], ), activation='relu'),
Dense(32, activation='relu'),
Dense(2, activation='softmax')])
NN_oversample.summary()
NN_oversample.compile(Adam(lr=0.001), metrics=['accuracy'], loss='sparse_categorical_crossentropy')
NN_oversample.fit(Xsm_train, ysm_train, validation_split=0.2, batch_size=300, epochs=20, shuffle=True, verbose=2)
oversample_pred = NN_oversample.predict_classes(final_Xtest)
oversample_smote = confusion_matrix(final_ytest, oversample_pred)
plt.figure(figsize=(6,5))
plot_cm(oversample_smote, labels, title="SMOTE Oversample\nConfusion Matrix ", cmap=plt.cm.Greens)
print(classification_report(final_ytest, oversample_pred, target_names=labels, digits=4))
Awesome! The oversampled NN performed very well, with high f1 scores on both safe and fraud transactions, and exceptionally high fraud precision.
While the oversampled NN misclassified a handful more fraud transactions than the undersampled NN, the undersampled NN misclassified almost 200x more safe transactions. It's important to recognize that the cost of misclassifying a safe transaction is non-zero: a blocked safe transaction prevents the cardholder from making additional purchases until they can verify their account, which costs both the cardholder and the financial institution. Overall, it's probably safe to assume that the cost of missing one additional fraud transaction does not outweigh the cost of blocking roughly 300 safe transactions. Thus, we consider the oversampled NN the best model overall.
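To make that trade-off explicit, here's a back-of-the-envelope comparison using purely hypothetical per-error costs (the figures below are assumptions for illustration, not estimates from the data):
# Hypothetical cost comparison; both cost figures are made-up assumptions.
def expected_cost(false_negatives, false_positives,
                  cost_per_missed_fraud=500, cost_per_blocked_safe=10):
    return false_negatives * cost_per_missed_fraud + false_positives * cost_per_blocked_safe
# Using the confusion matrices above: cm[1, 0] = frauds missed, cm[0, 1] = safe blocked.
print('Undersampled NN cost:', expected_cost(undersample_cm[1, 0], undersample_cm[0, 1]))
print('Oversampled NN cost: ', expected_cost(oversample_smote[1, 0], oversample_smote[0, 1]))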
While the plan is to continue to iterate on this project, below is a summary of what we've learned so far.
It's important to always explore the distributions of features in a dataset before beginning any predictive modeling. This helps with many things, including identifying which features need to be scaled, uncovering imbalances in our data, and locating outliers, nulls, and other important values.
When determining feature correlations in a highly imbalanced dataset, it's important to first balance the data through a method like random undersampling. If we don't, any correlations we derive will be dominated by the majority class, and we won't have a clear view of which features most influence the minority class.
In our analysis, we balanced our data through random undersampling of safe transactions. We were then able to accurately determine correlations and identify which features would most influence our predictions. From those highly correlated features, we were able to identify and remove the most impactful extreme outliers.
We can use dimensionality reduction techniques like t-SNE, PCA & Truncated SVD to get a sense of how well our classifiers might perform.
To do this, we first reduced the features of our dataset to just two and visualized the results in the xy-plane. Highlighting the different classes in this visualization, we saw how easily separable or "clustered" each class was, which gave us an indication that our classification models would perform well down the line.
In dealing with this imbalanced dataset, it was important that we first balance our data before training our classification models. Otherwise, our models would have overfit to the majority class, assuming that practically all transactions are safe.
When cross-validating, however, we took two approaches:
First, we cross-validated on the undersampled (balanced) dataset. This allowed us to see which model best recognized fraud transactions. The performance metrics from this cross-validation approach (accuracy, precision, recall, etc.) were not representative of what we would expect to see in production, however, because the balanced data did not represent the imbalance we would expect to see in production.
Second, to get a true view of our model's performance, we cross-validated on the original imbalanced data. This dataset was representative of the imbalance we would expect to see in production, so cross-validating on it gave an accurate view of our performance metrics.
We began by evaluating the performance of four different models: Logistic Regression, K Nearest Neighbors Classifier, Support Vector Classifier, and the Decision Tree Classifier. We made sure to optimize the hyperparameters for each model using GridSearchCV, trained and cross-validated our models on the undersampled (balanced) data, and checked learning curves to confirm our models were not over/underfitting.
To evaluate which model performed best at identifying fraud, we looked at both cross-validation accuracy and ROC_AUC on our undersampled (balanced) training data. We further evaluated each model on our undersampled test data, creating confusion matrices for each.
In the end, logistic regression performed the best at identifying fraud, thus we chose to focus on this model moving forward.
Proceeding with our logistic regression model trained on undersampled data, we cross-validated the model using the original (imbalanced) training data to get a true view of its performance. We saw that our true performance metrics were lower than when we cross-validated on the undersampled data, especially precision (which made sense, as the imbalanced data consists of far more safe transactions with the potential to be misclassified as fraud).
Using our final test data, we looked at our model's precision-recall curve. We found relatively low scores overall, further indicating that our model did not do a great job of keeping safe transactions from being misclassified as the decision boundary was relaxed.
Next, we went back to the drawing board and trained logistic regression on oversampled data using the SMOTE technique. We optimized hyperparameters with RandomizedSearchCV this time, due to the large size of the training data. Upon cross-validating with the original (imbalanced) training data, we saw significantly higher precision and accuracy than from our previous logistic regression model trained on undersampled data. The reason is that we don't lose information about safe transactions when oversampling with SMOTE, so the model can classify them better.
We saw that while both models performed similarly on the undersampled (balanced) test data, the oversampled logistic regression model produced much higher accuracy on the final (imbalanced) test data. Furthermore, the oversampled logistic regression model far outperformed the undersampled model in terms of precision-recall.
Finally, we put our baseline model aside, and investigated the performance of neural networks. We built two simple NNs: the first was trained on the undersampled data, the second was trained on the oversampled data.
After evaluating both NNs on our original test data, we found that the oversampled NN also outperformed the undersampled NN (assuming one missed fraud transaction costs less than roughly 300 blocked safe transactions).
Balancing our dataset using oversampling (SMOTE) helped us improve both our logistic regression and neural network models over random undersampling. While it's possible that undersampling could outperform oversampling in certain scenarios (e.g. depending on the trade-off between false positives and false negatives), in general oversampling preserves more of the original data, and thus models trained with oversampling tend to classify more accurately. Of course, the trade-off with oversampling is that it requires more time and resources, given the larger training set.
Next Steps: Recall that we removed outliers from our undersampled data before training our models. We should also do the same for our oversampled data to see if our performance improves further.
! jupyter nbconvert --to html Credit_Card_Fraud_Detection.ipynb