AI & Python #25: Let's Build Your First Machine Learning Model in Python
A complete guide to build a basic ML model.
If you're not into coding, go to settings and turn off notifications for "AI & Python" (leave the rest the same to keep receiving my other emails).
If you're learning Python and would like to develop a machine learning model, then a library you should seriously consider is scikit-learn. Scikit-learn (also known as sklearn) is a Python machine learning library that provides many unsupervised and supervised learning algorithms.
In this simple guide, we're going to create a machine learning model that predicts whether a movie review is positive or negative. This is known as binary text classification, and it will help us explore the scikit-learn library while building a basic machine learning model from scratch. Below are the concepts we're going to learn in this guide.
Table of Contents
1. The Dataset and The Problem to Solve
2. Preparing The Data
- Reading the dataset
- Dealing with Imbalanced Classes
- Splitting data into train and test set
3. Text Representation (Bag of Words)
- CountVectorizer
- Term Frequency, Inverse Document Frequency (TF-IDF)
- Turning our text data into numerical vectors
4. Model Selection
- Supervised vs Unsupervised learning
- Support Vector Machines (SVM)
- Decision Tree
- Naive Bayes
- Logistic Regression
5. Model Evaluation
- Mean Accuracy
- F1 Score
- Classification report
- Confusion Matrix
6. Tuning the Model
- GridSearchCV
The Dataset and The Problem to Solve
Dataset: In this guide, we'll use an IMDB dataset of 50k movie reviews available on Kaggle. The dataset contains 2 columns (review and sentiment) that will help us identify whether a review is positive or negative.
Problem formulation: Our goal is to find which machine learning model is best suited to predict sentiment (output) given a movie review (input).
Preparing The Data
Reading the dataset
After you download the dataset, make sure the file is in the same folder as your Python script. Then, we'll read the file using the Pandas library.
import pandas as pd
df_review = pd.read_csv('IMDB Dataset.csv')
df_review
Note: If you don't have some of the libraries used in this guide, you can install them with pip from your terminal or command prompt (e.g., pip install scikit-learn).
The dataset looks like the picture below.
This dataset contains 50000 rows; however, to train our model faster in the following steps, we're going to take a smaller sample of 10000 rows. This small sample will contain 9000 positive and 1000 negative reviews to make the data imbalanced (so I can teach you undersampling and oversampling techniques in the next step).
We're going to create this small sample with the following code. The name of this imbalanced dataset will be df_review_imb.
df_positive = df_review[df_review['sentiment']=='positive'][:9000]
df_negative = df_review[df_review['sentiment']=='negative'][:1000]
df_review_imb = pd.concat([df_positive, df_negative])
Dealing with Imbalanced Classes
In many real-world datasets, you'll have a large amount of data for one class and far fewer observations for the other classes. This is known as imbalanced data because the observations are not equally distributed across classes.
Let's take a look at how our df_review_imb dataset is distributed.
As we can see, there are more positive than negative reviews in df_review_imb, so we have imbalanced data.
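One way to put numbers on the imbalance is to count the reviews per class. The snippet below is a minimal sketch on a toy DataFrame (df_toy is a hypothetical stand-in for df_review_imb, built with the same 9:1 ratio); it is not the author's original plotting code.

```python
import pandas as pd

# Hypothetical toy stand-in for df_review_imb: 9 positive reviews, 1 negative.
df_toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 9 + ["negative"],
})

# Absolute number of reviews per class.
counts = df_toy["sentiment"].value_counts()

# Share of each class; normalize=True returns fractions that sum to 1.
shares = df_toy["sentiment"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data you would call the same methods on df_review_imb['sentiment'].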
To resample our data, we use the imblearn library. You can either undersample positive reviews or oversample negative reviews (you need to choose based on the data you're working with). In this case, we'll use the RandomUnderSampler.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
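# fit_resample returns the resampled features (x) and the matching labels (y)
x_bal, y_bal = rus.fit_resample(df_review_imb[['review']], df_review_imb['sentiment'])
# put the balanced reviews and labels back into a single dataframe
df_review_bal = x_bal.assign(sentiment=y_bal)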
df_review_bal
First, we create a new instance of RandomUnderSampler (rus) and pass random_state=0 to control the randomization of the algorithm. Then we resample the imbalanced dataset df_review_imb by calling rus.fit_resample(x, y), where "x" contains the data to be sampled and "y" holds the label for each sample in "x".
After this, x and y are balanced, and we'll store them in a new dataset named df_review_bal.
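To make the idea behind random undersampling concrete, here is a minimal pure-Python sketch. It is an illustration of the concept, not imblearn's actual implementation: the function random_undersample and its seed parameter are hypothetical names introduced for this example.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Randomly keep rows from each class until all classes match the smallest one."""
    rng = random.Random(seed)  # fixed seed, playing the role of random_state

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data with the same 9:1 imbalance as df_review_imb.
rows = [f"review {i}" for i in range(10)]
labels = ["positive"] * 9 + ["negative"]

bal_rows, bal_labels = random_undersample(rows, labels)
print(Counter(bal_labels))
```

The minority class is kept whole while the majority class loses rows, which is why undersampling is cheap but discards data.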
We can compare the imbalanced and balanced dataset with the following code.
IN [0]: print(df_review_imb.value_counts('sentiment'))
        print(df_review_bal.value_counts('sentiment'))
OUT [0]: positive    9000
         negative    1000
         negative    1000
         positive    1000
As we can see, now our dataset is equally distributed.
Note 1: If you get the following error when using the RandomUnderSampler:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
you can use an alternative to RandomUnderSampler. Try the code below:
# option 2
length_negative = len(df_review_imb[df_review_imb['sentiment']=='negative'])
df_review_positive = df_review_imb[df_review_imb['sentiment']=='positive'].sample(n=length_negative)
df_review_non_positive = df_review_imb[~(df_review_imb['sentiment']=='positive')]
df_review_bal = pd.concat([
df_review_positive, df_review_non_positive
])
df_review_bal.reset_index(drop=True, inplace=True)
df_review_bal['sentiment'].value_counts()
The df_review_bal dataframe should now have 1000 positive and 1000 negative reviews, as shown above.
Splitting data into train and test set
Before we work with our data, we need to split it into a train and a test set. The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of the final model fit on the training dataset.
We'll use sklearn's train_test_split to do the job. In this case, we assign 33% of the data to the test set.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)
Now we can set the independent and dependent variables within our train and test set.
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
Let's see what each of them means:
- train_x: Independent variables (review) that will be used to train the model. Since we specified test_size = 0.33, 67% of observations from the data will be used to fit the model.
- train_y: Dependent variables (sentiment), or target labels, that need to be predicted by this model.
- test_x: The remaining 33% of independent variables that will be used to make predictions to test the accuracy of the model.
- test_y: Category labels that will be used to test the accuracy between actual and predicted categories.
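The 67/33 proportions can be checked on a small example. The sketch below is a variation on the code above rather than the author's exact approach: it passes the reviews and labels to train_test_split as two separate lists (toy data invented for illustration) and adds the optional stratify argument, which asks scikit-learn to keep the class ratio similar in both splits.

```python
from sklearn.model_selection import train_test_split

# Toy balanced data: 100 hypothetical reviews, half positive and half negative.
reviews = [f"review {i}" for i in range(100)]
sentiments = ["positive"] * 50 + ["negative"] * 50

# With two inputs, train_test_split returns four outputs in this order.
train_x, test_x, train_y, test_y = train_test_split(
    reviews,
    sentiments,
    test_size=0.33,       # 33% of the rows go to the test set
    random_state=42,      # fixed seed so the split is reproducible
    stratify=sentiments,  # keep classes roughly balanced in both splits
)

print(len(train_x), len(test_x))
```

Stratifying is optional here because the dataset is already balanced, but it guards against a random split that happens to favor one class.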
Text Representation (Bag of Words)