Multiclass text classification using BERT
A tutorial on multi-class text classification using a pretrained BERT model from Hugging Face
- Loading data
- Tokenization
- Creating Datasets and DataLoaders
- Bert For Sequence Classification Model
- Fine-tuning
- Performance Metrics
- Error Analysis
- Prediction
In this post, we'll do a simple text classification task using the pretrained BERT model from Hugging Face.
The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
Loading data
We'll use the emotion dataset from the Hugging Face Hub.
from datasets import load_dataset
emotion = load_dataset('emotion')
The emotion dataset consists of three splits: train, validation, and test, and covers six emotions: sadness, joy, love, anger, fear, and surprise.
emotion
label_names = emotion["train"].features['label'].names
label_names
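Since the labels are stored as integers, it can be handy to keep a mapping from label id to label name around. This is just a small convenience helper, not something the dataset requires:
# convenience mapping from label id to label name (a helper added for readability)
id2label = dict(enumerate(label_names))
id2label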
Let's take a look at what the text is like:
emotion.set_format(type="pandas")
train_df = emotion['train'][:]
valid_df = emotion['validation'][:]
test_df = emotion['test'][:]
train_df.head()
To keep training fast, in this post we'll use only 350 samples per class for training, 70 per class for validation, and 50 per class for testing.
train_df = train_df.groupby('label').apply(lambda x: x.sample(350)).reset_index(drop=True)
valid_df = valid_df.groupby('label').apply(lambda x: x.sample(70)).reset_index(drop=True)
test_df = test_df.groupby('label').apply(lambda x: x.sample(50)).reset_index(drop=True)
train_df['label'].value_counts()
valid_df['label'].value_counts()
test_df['label'].value_counts()
Tokenization
Tokenization is the process of splitting raw text into tokens and encoding those tokens as numeric IDs.
To do this, we first initialize a BertTokenizer:
from transformers import BertTokenizer
PRETRAINED_LM = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_LM, do_lower_case=True)
tokenizer
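To get a feel for what the tokenizer produces, here is a quick check on a made-up sentence (the sentence itself is only an example):
sample_text = "i feel incredibly happy today"   # a made-up example sentence
print(tokenizer.tokenize(sample_text))   # WordPiece tokens
print(tokenizer.encode(sample_text))     # token ids, with [CLS] and [SEP] added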
Then we define a function for encoding a batch of documents:
def encode(docs):
    '''
    Takes a list of texts and returns the input_ids and attention_mask tensors.
    '''
    encoded_dict = tokenizer.batch_encode_plus(docs, add_special_tokens=True, max_length=128, padding='max_length',
                                               return_attention_mask=True, truncation=True, return_tensors='pt')
    input_ids = encoded_dict['input_ids']
    attention_masks = encoded_dict['attention_mask']
    return input_ids, attention_masks
Use the encode function to get the input ids and attention masks of the datasets:
train_input_ids, train_att_masks = encode(train_df['text'].values.tolist())
valid_input_ids, valid_att_masks = encode(valid_df['text'].values.tolist())
test_input_ids, test_att_masks = encode(test_df['text'].values.tolist())
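As a quick sanity check, the encoded tensors should have shape (number of samples, 128), and decoding one row shows the special [CLS], [SEP], and [PAD] tokens the tokenizer added:
# optional sanity check on the encoded training set
print(train_input_ids.shape, train_att_masks.shape)
print(tokenizer.decode(train_input_ids[0]))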
Creating Datasets and DataLoaders
We'll use PyTorch Dataset and DataLoader to split the data into batches. For more details, you can check out another post on DataLoader.
Turn the labels into tensors:
import torch
train_y = torch.LongTensor(train_df['label'].values.tolist())
valid_y = torch.LongTensor(valid_df['label'].values.tolist())
test_y = torch.LongTensor(test_df['label'].values.tolist())
train_y.size(),valid_y.size(),test_y.size()
Create DataLoaders for the training, validation, and test sets:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
BATCH_SIZE = 16
train_dataset = TensorDataset(train_input_ids, train_att_masks, train_y)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
valid_dataset = TensorDataset(valid_input_ids, valid_att_masks, valid_y)
valid_sampler = SequentialSampler(valid_dataset)
valid_dataloader = DataLoader(valid_dataset, sampler=valid_sampler, batch_size=BATCH_SIZE)
test_dataset = TensorDataset(test_input_ids, test_att_masks, test_y)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)
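Before training, it can be reassuring to pull one batch from the training DataLoader and confirm the shapes (purely a sanity check):
# each batch contains input ids, attention masks, and labels
batch_input_ids, batch_att_masks, batch_labels = next(iter(train_dataloader))
batch_input_ids.shape, batch_att_masks.shape, batch_labels.shape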
Bert For Sequence Classification Model
We will initialize the BertForSequenceClassification model from Hugging Face, which makes it easy to fine-tune the pretrained BERT model for a classification task.
You will see a warning that some parts of the model are randomly initialized. This is normal since the classification head has not yet been trained.
from transformers import BertForSequenceClassification
N_labels = len(train_df.label.unique())
model = BertForSequenceClassification.from_pretrained(PRETRAINED_LM,
                                                      num_labels=N_labels,
                                                      output_attentions=False,
                                                      output_hidden_states=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
model = model.to(device)
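Since the classification head is the part that was randomly initialized (the warning mentioned above refers to it), it can be helpful to inspect it and count the trainable parameters. This is optional inspection code:
# look at the newly initialized classification head and the total parameter count
print(model.classifier)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable:,}")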
Fine-tuning
The optimizer updates the model's parameters during training and is configured with a learning rate. The choice of learning rate matters; in practice, a scheduler is commonly used to decrease the learning rate over the course of training.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
EPOCHS = 30
LEARNING_RATE = 2e-6
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(train_dataloader)*EPOCHS)
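To get an intuition for what get_linear_schedule_with_warmup does here, you can trace the schedule on a throwaway optimizer built around a dummy parameter, so the optimizer used for training is left untouched. A rough sketch:
# throwaway optimizer/scheduler just to inspect the learning-rate schedule
_dummy_opt = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=LEARNING_RATE)
_dummy_sched = get_linear_schedule_with_warmup(_dummy_opt, num_warmup_steps=0,
                                               num_training_steps=len(train_dataloader)*EPOCHS)
lrs = []
for _ in range(len(train_dataloader)*EPOCHS):
    _dummy_opt.step()
    _dummy_sched.step()
    lrs.append(_dummy_sched.get_last_lr()[0])
# with no warmup, the learning rate decays linearly from LEARNING_RATE to 0
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])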
The training loop is where the magic of deep learning happens. The model will be fine-tuned on the emotion dataset for the classification task.
from torch.nn.utils import clip_grad_norm_
from tqdm.notebook import tqdm
import numpy as np
import math
train_loss_per_epoch = []
val_loss_per_epoch = []
for epoch_num in range(EPOCHS):
    print('Epoch: ', epoch_num + 1)
    '''
    Training
    '''
    model.train()
    train_loss = 0
    for step_num, batch_data in enumerate(tqdm(train_dataloader, desc='Training')):
        input_ids, att_mask, labels = [data.to(device) for data in batch_data]
        # forward pass; passing labels makes the model return the cross-entropy loss
        output = model(input_ids=input_ids, attention_mask=att_mask, labels=labels)
        loss = output.loss
        train_loss += loss.item()

        # backward pass, gradient clipping, and parameter update
        model.zero_grad()
        loss.backward()
        del loss
        clip_grad_norm_(parameters=model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

    train_loss_per_epoch.append(train_loss / (step_num + 1))

    '''
    Validation
    '''
    model.eval()
    valid_loss = 0
    valid_pred = []
    with torch.no_grad():
        for step_num_e, batch_data in enumerate(tqdm(valid_dataloader, desc='Validation')):
            input_ids, att_mask, labels = [data.to(device) for data in batch_data]
            output = model(input_ids=input_ids, attention_mask=att_mask, labels=labels)
            loss = output.loss
            valid_loss += loss.item()
            valid_pred.append(np.argmax(output.logits.cpu().detach().numpy(), axis=-1))

    val_loss_per_epoch.append(valid_loss / (step_num_e + 1))
    valid_pred = np.concatenate(valid_pred)

    '''
    Loss message
    '''
    print("{0}/{1} train loss: {2} ".format(step_num + 1, math.ceil(len(train_df) / BATCH_SIZE), train_loss / (step_num + 1)))
    print("{0}/{1} val loss: {2} ".format(step_num_e + 1, math.ceil(len(valid_df) / BATCH_SIZE), valid_loss / (step_num_e + 1)))
You can see in the output that the training and validation losses steadily decrease with each epoch.
from matplotlib import pyplot as plt
epochs = range(1, EPOCHS +1 )
fig, ax = plt.subplots()
ax.plot(epochs,train_loss_per_epoch,label ='training loss')
ax.plot(epochs, val_loss_per_epoch, label = 'validation loss' )
ax.set_title('Training and Validation loss')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend()
plt.show()
Performance Metrics
from sklearn.metrics import classification_report
print('classification report')
print(classification_report(valid_df['label'].to_numpy(), valid_pred, target_names=label_names))
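If you only need the headline numbers, accuracy and macro-averaged F1 can also be computed directly (note the y_true, y_pred argument order):
from sklearn.metrics import accuracy_score, f1_score
accuracy = accuracy_score(valid_df['label'].to_numpy(), valid_pred)
macro_f1 = f1_score(valid_df['label'].to_numpy(), valid_pred, average='macro')
print(f"accuracy: {accuracy:.3f}  macro F1: {macro_f1:.3f}")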
Error Analysis
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
def plot_confusion_matrix(y_preds, y_true, labels=None):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
plot_confusion_matrix(valid_pred,valid_df['label'].to_numpy(),labels=label_names)
You can see that sadness is more likely to be misclassified as anger or fear, which leads to a lower F1 score for that class.
Prediction
Now let's use the trained model to make predictions on the test set.
model.eval()
test_pred = []
test_loss = 0
with torch.no_grad():
    for step_num, batch_data in enumerate(tqdm(test_dataloader)):
        input_ids, att_mask, labels = [data.to(device) for data in batch_data]
        output = model(input_ids=input_ids, attention_mask=att_mask, labels=labels)
        loss = output.loss
        test_loss += loss.item()
        test_pred.append(np.argmax(output.logits.cpu().detach().numpy(), axis=-1))
test_pred = np.concatenate(test_pred)
print('classification report')
print(classification_report(test_df['label'].to_numpy(), test_pred, target_names=label_names))
With the predictions, we can plot the confusion matrix again:
plot_confusion_matrix(test_pred,test_df['label'].to_numpy(),labels=label_names)
Output the misclassified text:
test_df['pred'] = test_pred
test_df.reset_index(level=0)
print(test_df[test_df['label']!=test_df['pred']].shape)
test_df[test_df['label']!=test_df['pred']][['text','label','pred']].head(10)
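Finally, here is a minimal sketch of how you might run the fine-tuned model on a new piece of text; the example sentence is made up:
model.eval()
new_text = "i cant stop smiling after hearing the good news"   # a made-up example
new_ids, new_mask = encode([new_text])
with torch.no_grad():
    logits = model(input_ids=new_ids.to(device), attention_mask=new_mask.to(device)).logits
# map the highest-scoring logit back to its emotion label
print(label_names[int(logits.argmax(dim=-1))])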