Class Imbalance and Problem Statement

Class imbalance is a common problem when building classifiers in the machine learning world, and our awesome previously-scraped Crocs reviews data is, unfortunately, not so awesome from a class balance standpoint. Soon, we’ll assign binary class labels based on the rating a customer gave with their review: ratings of 2 stars (out of 5) or less will be considered negative sentiment, and the remaining reviews positive sentiment. As you’ll see in a moment, the vast majority of reviews belong to the positive sentiment class, and I think that’s great!

However, I don’t believe Crocs reached the top of the shoe game by mistake. I’d be willing to bet the creators behind Crocs are more than willing to confront their flaws and improve upon them. Let’s pretend the good people behind Crocs have asked us to build an ML model to effectively classify the rare negative review for their product despite the severe class imbalance. They don’t mind a few misclassifications of the positive reviews here and there but would prefer there aren’t a ton of these instances.

Quick Note on Positive / Negative Lingo

Traditionally, the positive class in a binary labeled dataset is the minority class of most interest / importance, and that will hold true in this project. Apologies for any confusion, but going forward when I reference the positive class, I will be referencing the set of negative sentiment reviews.

  • Positive Class (1) - Negative Sentiment Review :(
  • Negative Class (0) - Positive Sentiment Review :)

Below is a look at the class imbalance of our data.

import pandas as pd
from collections import Counter
import seaborn as sns

df = pd.read_csv('/content/drive/MyDrive/croc_reviews.csv')

# Ratings of 2 stars or less are negative sentiment, i.e. the positive class (1)
df['label'] = [0 if each > 2 else 1 for each in df['rating']]
sns.countplot(x = df['label'], hue = df['label'])

# Per-class counts for the imbalance calculation
label_count = Counter(df['label'])
print("Only", "{:.0%}".format(label_count[1] / (label_count[1] + label_count[0])), "of our data belongs to the positive class.")
Only 6% of our data belongs to the positive class.

[Figure: countplot of the label column, showing the heavy class imbalance]

Data Clean Up

Before we start addressing the class imbalance issue, let’s clean up the reviews using the text cleaning function we’ve used before.

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the NLTK resources for stopword removal and tokenization once, up front
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    # Remove punctuation
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize and remove stopwords
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # Join the tokens back into a single string
    return ' '.join(tokens)

df['clean_review'] = [clean_text(text_to_clean) for text_to_clean in df['review']]
df
| | review | date | rating | label | clean_review |
|---|---|---|---|---|---|
| 0 | !!!!!! E X C E L L E N T!!!!!!!!!! | April 7, 2022 | 5.0 | 0 | e x c e l l e n |
| 1 | "They're crocs; people know what crocs are." | April 3, 2021 | 5.0 | 0 | theyr croc peopl know croc |
| 2 | - Quick delivery and the product arrived when ... | March 19, 2023 | 5.0 | 0 | quick deliveri product arriv compani said woul... |
| 3 | ...amazing "new" color!! who knew?? love - lov... | July 17, 2022 | 5.0 | 0 | amaz new color knew love love love |
| 4 | 0 complaints from me; this is the 8th pair of ... | June 4, 2021 | 5.0 | 0 | complaint th pair croc ive bought like two mon... |
| ... | ... | ... | ... | ... | ... |
| 9233 | I will definitely be buying again in many colo... | August 25, 2021 | 4.0 | 0 | definit buy mani color reason materi feel thin... |
| 9234 | I wish I would have bought crocs a long time ago. | April 8, 2021 | 5.0 | 0 | wish would bought croc long time ago |
| 9235 | wonderful. Gorgeous blue; prettier in person! | April 27, 2022 | 5.0 | 0 | wonder gorgeou blue prettier person |
| 9236 | Wonerful. Very comfy, and there are no blister... | April 8, 2021 | 5.0 | 0 | woner comfi blister feet unlik brand one |
| 9237 | Work from home - high arch need good support a... | May 22, 2023 | 5.0 | 0 | work home high arch need good support comfort ... |

9238 rows × 5 columns

Strategy: Performance Metrics and Dealing with Imbalanced Data

Because we’re dealing with imbalanced data and are most concerned with identifying the minority / positive class, we will focus on improving the recall score of our models on the test set. We will also watch the F2 score, a modified version of the F1 score that increases the importance of recall in its calculation. Why F2? We want to maximize recall, but a model that predicts the minority class 100% of the time would achieve a perfect recall score. That doesn’t help us much on its own, so the F2 score gives us a sense of how well the model differentiates between the two classes along with how well it identifies positive samples. Below are the formulas for these performance metrics, followed by a quick sanity check in code.

  • Recall - True Positive / (True Positive + False Negative)
  • Precision - True Positive / (True Positive + False Positive)
  • F2 Score - (5 * Precision * Recall) / (4 * Precision + Recall)
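
As a quick sanity check on these formulas, here is a minimal sketch (using a small made-up set of labels rather than our review data) that computes recall, precision, and F2 by hand and confirms the F2 value against Scikit-Learn's fbeta_score with beta = 2.

from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up predictions: 10 true positives, 2 false negatives, 5 false positives, 83 true negatives
y_true = [1]*10 + [1]*2 + [0]*5 + [0]*83
y_pred = [1]*10 + [0]*2 + [1]*5 + [0]*83

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 10 / 12 ≈ 0.833
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 10 / 15 ≈ 0.667
f2_by_hand = (5 * precision * recall) / (4 * precision + recall)

print(round(f2_by_hand, 3))                           # 0.794
print(round(fbeta_score(y_true, y_pred, beta=2), 3))  # matches the hand calculation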

There are a lot of ways to address the imbalanced data problem when training a classifier. In this project we’re going to adopt the following strategy:

  1. Implement multiple methods for resampling the data
  2. Train multiple baseline models using each of the resampling methods to see which resampling / model combo performs the best out of the box based on F2 and recall score
  3. Use GridSearchCV to tune the best baseline model for recall score whilst continuing to use the best resampling technique
  4. Alter the model’s decision threshold to maximize recall score

Step 1: Implement Multiple Resampling Methods

Thank goodness for the Imbalanced-Learn library because it makes data resampling much easier. The term resampling refers to the practice of balancing data by either selecting a subset of the majority class equal in size to the minority class (known as under-sampling) or artificially making the minority class as large as the majority (known as over-sampling). There are many different techniques for doing these processes, but we’ll try out the following:

  • Random Over-sampling - Randomly selects samples from the minority class and replaces them in the dataset.
  • Random Under-sampling - Randomly selects and removes samples from the majority class.
  • SMOTE Over-sampling - Synthetic Minority Over-sampling Technique (SMOTE) generates new samples for the minority class by imitating its features. The original paper explaining the technique can be found here.
  • Cluster Under-sampling - Under-samples the majority class by replacing clusters of majority samples with the cluster centroids of a KMeans algorithm. Keeps N majority samples by fitting the KMeans algorithm with N clusters to the majority class and keeping the centroids. The original paper explaining the technique can be found here.

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import fbeta_score, recall_score, accuracy_score, confusion_matrix, precision_recall_curve
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
import numpy as np

def rand_oversample(X, y):
  ros = RandomOverSampler(random_state=42)
  X_res, y_res = ros.fit_resample(X, y)
  return {'X_train': X_res, 'y_train':y_res}

def rand_undersample(X, y):
  rus = RandomUnderSampler(random_state=42)
  X_res, y_res = rus.fit_resample(X, y)
  return {'X_train': X_res, 'y_train':y_res}

def smote_oversample(X,y):
  sm = SMOTE(random_state=42)
  X_res, y_res = sm.fit_resample(X, y)
  return {'X_train': X_res, 'y_train':y_res}

def cluster_undersample(X, y):
  cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42)
  X_res, y_res = cc.fit_resample(X, y)
  return {'X_train': X_res, 'y_train':y_res}
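
To make the over-sampling distinction above a bit more concrete, here is a small illustrative sketch (on a synthetic dataset from Scikit-Learn's make_classification, not our TF-IDF features): random over-sampling only duplicates minority rows that already exist, while SMOTE synthesizes minority points that didn't exist before.

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy imbalanced dataset: roughly 95% negative / 5% positive
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y_toy))

X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_toy, y_toy)
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_toy, y_toy)
print(Counter(y_ros), Counter(y_sm))  # both resampled sets are now balanced

# Every random over-sampled row already exists in the original data; SMOTE's mostly don't
original_rows = {tuple(row) for row in X_toy}
print(all(tuple(row) in original_rows for row in X_ros))  # True: only duplicates
print(all(tuple(row) in original_rows for row in X_sm))   # False: new synthetic points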

Holdout Test Set

Let’s split our data into a train and test set. We will not alter the balance of the test set because we won’t be able to do that in the real world. However, our train set will undergo resampling as we find the best baseline model and resampling combo.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def show_imb(label):
  imb = Counter(label)
  return str(imb[0]) + ' Negative Samples | ' + str(imb[1]) + ' Positive Samples'

# We'll be using TF-IDF for this project
vectorizer = TfidfVectorizer()
vect_df = pd.DataFrame(vectorizer.fit_transform(df['clean_review']).todense(), columns=vectorizer.get_feature_names_out())
vect_df['label'] = df['label'].copy()

# Train / test split
df_train, df_test = train_test_split(vect_df, test_size = .1, random_state = 0)

y_train, y_test = df_train['label'].tolist(), df_test['label'].tolist()
X_train, X_test = df_train.drop(['label'], axis = 1), df_test.drop(['label'], axis = 1)

print('Holdout Test Set Class Balance: ' + show_imb(y_test))
Holdout Test Set Class Balance: 872 Negative Samples | 52 Positive Samples

Step 2: Train Multiple Baseline Classifiers Using Resampling Methods

Let’s resample our training data using each of the methods above and train a model on each resampled set. We’ll also train on the original, unresampled training data as a baseline.

sampled_data_dict = {'no_sample':           {'X_train': X_train, 'y_train':y_train},
                     'random_oversample':   rand_oversample(X_train, y_train),
                     'random_undersample':  rand_undersample(X_train, y_train),
                     'smote_oversample':    smote_oversample(X_train, y_train),
                     'cluster_undersample': cluster_undersample(X_train, y_train)}

# A look at the class balance following resampling
for s in sampled_data_dict.keys():
  txt = 'Class Balance with {:<25}' + show_imb(sampled_data_dict[s]['y_train'])
  print(txt.format(s + ':'))
Class Balance with no_sample:               7767 Negative Samples | 547 Positive Samples
Class Balance with random_oversample:       7767 Negative Samples | 7767 Positive Samples
Class Balance with random_undersample:      547 Negative Samples | 547 Positive Samples
Class Balance with smote_oversample:        7767 Negative Samples | 7767 Positive Samples
Class Balance with cluster_undersample:     547 Negative Samples | 547 Positive Samples

As you can see, all the resampled training sets are now balanced except for the baseline with no resampling. Let’s train some baseline classifiers with each resampling method and view the results on the unaltered test set.

from tqdm.notebook import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Baseline Models
models_dict = {'Naive Bayes':         GaussianNB(),
               'Random Forest':       RandomForestClassifier(max_depth=10, random_state=0),
               'Logistic Regression': LogisticRegression(random_state=0)}

# Dataframe to view test results
results_df = pd.DataFrame(columns = ['model_name', 'sampling_type', 'model', 'fbeta', 'recall'])

# Iterate over all models and all resampling techniques
for m in tqdm(models_dict.keys()):
  for d in tqdm(sampled_data_dict.keys()):
    # Fit model on resampled data
    model = models_dict[m].fit(sampled_data_dict[d]['X_train'], sampled_data_dict[d]['y_train'])

    # Predict on holdout test set
    preds = model.predict(X_test)

    # Add test results to dataframe
    results = [m, d, model, fbeta_score(y_test, preds, beta = 2), recall_score(y_test, preds)]
    results_df.loc[-1] = results
    results_df.index = results_df.index + 1
    results_df = results_df.sort_index()

results_df.sort_values(by = 'recall', ascending = False)
| | model_name | sampling_type | model | fbeta | recall |
|---|---|---|---|---|---|
| 5 | Random Forest | cluster_undersample | (DecisionTreeClassifier(max_depth=10, max_feat... | 0.442708 | 0.980769 |
| 0 | Logistic Regression | cluster_undersample | LogisticRegression(random_state=0) | 0.634518 | 0.961538 |
| 2 | Logistic Regression | random_undersample | LogisticRegression(random_state=0) | 0.664820 | 0.923077 |
| 7 | Random Forest | random_undersample | (DecisionTreeClassifier(max_depth=10, max_feat... | 0.574413 | 0.846154 |
| 3 | Logistic Regression | random_oversample | LogisticRegression(random_state=0) | 0.684713 | 0.826923 |
| 8 | Random Forest | random_oversample | (DecisionTreeClassifier(max_depth=10, max_feat... | 0.598291 | 0.807692 |
| 1 | Logistic Regression | smote_oversample | LogisticRegression(random_state=0) | 0.665584 | 0.788462 |
| 10 | Naive Bayes | cluster_undersample | GaussianNB() | 0.288462 | 0.750000 |
| 11 | Naive Bayes | smote_oversample | GaussianNB() | 0.286344 | 0.750000 |
| 13 | Naive Bayes | random_oversample | GaussianNB() | 0.283843 | 0.750000 |
| 14 | Naive Bayes | no_sample | GaussianNB() | 0.283843 | 0.750000 |
| 6 | Random Forest | smote_oversample | (DecisionTreeClassifier(max_depth=10, max_feat... | 0.523649 | 0.596154 |
| 12 | Naive Bayes | random_undersample | GaussianNB() | 0.341463 | 0.538462 |
| 4 | Logistic Regression | no_sample | LogisticRegression(random_state=0) | 0.115207 | 0.096154 |
| 9 | Random Forest | no_sample | (DecisionTreeClassifier(max_depth=10, max_feat... | 0.000000 | 0.000000 |

In terms of recall, the random forest / cluster undersampling combo is the best, but it is quite weak in terms of F2 score, which means it will likely give us a lot of false positives in its predictions. The logistic regression / cluster undersample combo performs similarly in terms of recall but maintains a much better F2 score, so we’ll move forward with that model! Let’s take a closer look at its out-of-the-box performance using a confusion matrix.

def show_confusion(y_test, pred):
  cf = confusion_matrix(y_test, pred)
  group_names = ["True Negatives","False Positives","False Negatives","True Positives"]
  # Percentage of each cell relative to its true-class (row) total
  group_perc = [cf.flatten()[0] / (cf.flatten()[0] + cf.flatten()[1]), cf.flatten()[1] / (cf.flatten()[0] + cf.flatten()[1]),
                cf.flatten()[2] / (cf.flatten()[2] + cf.flatten()[3]), cf.flatten()[3] / (cf.flatten()[2] + cf.flatten()[3])]
  group_perc_str = ["{:.0%}".format(each) for each in group_perc]
  group_counts = ["{0:0.0f}".format(value) for value in cf.flatten()]
  labels = [f"{v1} {v2}\n{v3} of class total" for v1, v2, v3 in zip(group_counts, group_names, group_perc_str)]
  labels = np.asarray(labels).reshape(2,2)
  group_perc = np.asarray(group_perc).reshape(2,2)
  sns.heatmap(group_perc, annot=labels, fmt = '', cmap='Greens')

best_baseline_model = results_df.loc[(results_df['model_name'] == 'Logistic Regression') & (results_df['sampling_type'] == 'cluster_undersample')]['model'].iloc[0]
show_confusion(y_test, best_baseline_model.predict(X_test))

[Figure: confusion matrix heatmap for the baseline logistic regression / cluster undersampling model]

We’re already in a really good spot as we are correctly identifying the positive class over 96% of the time. We have more false positives than I’d like, but maybe some hyperparameter tuning can help here.

Step 3: Finding the Best Hyperparameters with GridSearchCV

For finding the best logistic regression hyperparameters, we will use Scikit-Learn’s GridSearchCV in combination with an Imbalanced-Learn Pipeline. GridSearchCV performs an exhaustive search over the provided parameters, running cross-validation for every parameter combination. Imblearn’s Pipeline lets us apply the resampling technique we chose in step two inside GridSearchCV: within each fold, the pipeline resamples the training split using cluster undersampling but does not touch the balance of the test split. We’ll set up GridSearchCV to optimize for recall and have it return the model that performed best in terms of average recall across all folds.

from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

model = Pipeline([
        # Perform resampling on train split
        ('sampling', ClusterCentroids(estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42)),
        ('classification', LogisticRegression(random_state = 42))
    ])

parameters = {'classification__solver':('liblinear', 'lbfgs'), 'classification__C':[1, 5, 10]}

# Returns best model in terms of recall score
best_grid_model = GridSearchCV(model, param_grid = parameters, cv = 5, refit = True, scoring = 'recall')
best_grid_model.fit(X_train, y_train)

# A look at how well all hyperparameter combinations perform
pd.DataFrame(best_grid_model.cv_results_).sort_values(by = 'rank_test_score')
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classification__C | param_classification__solver | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 29.715823 | 3.306074 | 0.068921 | 0.016334 | 5 | liblinear | {'classification__C': 5, 'classification__solv... | 0.908257 | 0.935780 | 0.936364 | 0.909091 | 0.889908 | 0.915880 | 0.017857 | 1 |
| 3 | 29.644415 | 3.598677 | 0.095378 | 0.026857 | 5 | lbfgs | {'classification__C': 5, 'classification__solv... | 0.908257 | 0.935780 | 0.936364 | 0.909091 | 0.889908 | 0.915880 | 0.017857 | 1 |
| 0 | 29.187842 | 2.767306 | 0.057547 | 0.019141 | 1 | liblinear | {'classification__C': 1, 'classification__solv... | 0.899083 | 0.935780 | 0.927273 | 0.900000 | 0.889908 | 0.910409 | 0.017804 | 3 |
| 1 | 28.689645 | 3.572646 | 0.104958 | 0.027626 | 1 | lbfgs | {'classification__C': 1, 'classification__solv... | 0.899083 | 0.935780 | 0.927273 | 0.900000 | 0.889908 | 0.910409 | 0.017804 | 3 |
| 4 | 28.605801 | 2.047266 | 0.052858 | 0.005150 | 10 | liblinear | {'classification__C': 10, 'classification__sol... | 0.908257 | 0.926606 | 0.918182 | 0.890909 | 0.889908 | 0.906772 | 0.014572 | 5 |
| 5 | 32.235926 | 3.819624 | 0.099514 | 0.020922 | 10 | lbfgs | {'classification__C': 10, 'classification__sol... | 0.908257 | 0.926606 | 0.918182 | 0.890909 | 0.889908 | 0.906772 | 0.014572 | 5 |

A look at the confusion matrix following hyperparameter tuning.

show_confusion(y_test, best_grid_model.predict(X_test))

[Figure: confusion matrix heatmap after hyperparameter tuning]

This is a bit better. Our false positive rate has dropped slightly, meaning this new model has an easier time differentiating between the two classes than before. In many cases this would suffice, but what if we tried to let no positive samples slip past our classifier at all? This is where the fourth step comes in: altering the decision threshold of the model.

Step 4: Altering the Decision Threshold

Altering the decision threshold will allow us to catch more true positive samples by increasing the sensitivity of the model. This increase in sensitivity comes at a cost, however: a decrease in the precision of our model. But that’s OK! We are most concerned with catching the true positives in this data and aren’t nearly as worried about false positives. Below is a look at our model’s precision-recall curve to get a better understanding of the tradeoff between the two measures before we change the decision threshold.

from sklearn.metrics import precision_recall_curve

# Get class probability scores
y_scores = best_grid_model.predict_proba(X_test)[:, 1]

# Returns precision-recall pairs for different probability thresholds
p, r, thresholds = precision_recall_curve(y_test, y_scores)

viz_df = pd.DataFrame({'Precision':p, 'Recall':r})
sns.lineplot(data = viz_df, x = "Recall", y = "Precision")

[Figure: precision-recall curve for the tuned model]

For us to achieve perfect recall, the model’s precision is going to suffer quite a bit. Again, that’s OK because this project’s main goal is to effectively catch positive samples. Let’s iteratively lower the decision threshold until we achieve perfect recall.

def adjusted_classes(y_scores, t):
    # If pred probability is greater than or equal to threshold it is positive
    return [1 if y >= t else 0 for y in y_scores]

# Default sklearn logistic regression threshold
init_threshold = 0.5
pred = adjusted_classes(y_scores, init_threshold)
recall = recall_score(y_test, pred)

# Stops when recall is perfect
while recall != 1.0:
  init_threshold = init_threshold - 0.001
  pred = adjusted_classes(y_scores, init_threshold)
  recall = recall_score(y_test, pred)

print("Maximum Threshold for Perfect Recall:", init_threshold)
show_confusion(y_test, pred)
Maximum Threshold for Perfect Recall:  0.3679999999999999

[Figure: confusion matrix heatmap at the lowered decision threshold]

There you have it - a very sensitive model that doesn’t always spit out the positive class!
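
As a final check, here is a short sketch (reusing y_test, pred, and the precision_recall_curve outputs r and thresholds from above) that reports precision and F2 alongside the perfect recall, so the cost of the extra sensitivity shows up in numbers. It also sketches an alternative to the stepwise loop: reading the largest threshold with a recall of 1.0 straight off the precision-recall curve.

from sklearn.metrics import precision_score

# Scores at the lowered decision threshold found above
print("Recall:   ", recall_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("F2 Score: ", fbeta_score(y_test, pred, beta = 2))

# Alternative to the loop: the largest threshold whose recall is still perfect
# (recall[i] from precision_recall_curve corresponds to predicting positive when score >= thresholds[i])
alt_threshold = max(t for t, rec in zip(thresholds, r[:-1]) if rec == 1.0)
print("Threshold read off the PR curve:", alt_threshold)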