How to save the best model during model training and load it for inference in PyTorch
- Load data
- Tokenization
- Creating Datasets and DataLoaders
- Bert For Sequence Classification Model
- Fine-tuning
- Performance Metrics
- Error Analysis
- Inference
- Reference
In this post, we'll demonstrate how to save the best model during model training.
Saving the best model is a good technique when we're not sure about the optimal number of epochs we should use for training.
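The core pattern is simple: track the lowest validation loss seen so far, and snapshot the model whenever it improves. A minimal sketch of the idea (train_one_epoch and validate are hypothetical placeholders for the real loops we define later in this post):
import copy
best_val_loss = float('inf')
for epoch in range(num_epochs):
    train_one_epoch(model)         # hypothetical helper: one pass over the training set
    val_loss = validate(model)     # hypothetical helper: returns the average validation loss
    if val_loss < best_val_loss:   # new best so far: keep a copy of the weights
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)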
We'll demonstrate this within a full pipeline that fine-tunes a BERT model on a text classification task.
If you're already familiar with the training process, you can skip ahead to the Fine-tuning and Inference sections.
Load data
We'll use the emotion dataset from the Hugging Face Hub.
The emotion dataset comes in three splits: train, validation, and test, and covers six emotions: sadness, joy, love, anger, fear, and surprise.
from datasets import load_dataset
emotion = load_dataset("emotion")
emotion
label_names = emotion["train"].features['label'].names
label_names
Let's take a look at what the text is like:
emotion.set_format(type="pandas")
train_df = emotion['train'][:]
valid_df = emotion['validation'][:]
test_df = emotion['test'][:]
train_df.head()
In this post, we'll use only 350 samples per class for training, 70 per class for validation, and 50 per class for testing.
train_df = train_df.groupby('label').apply(lambda x: x.sample(350)).reset_index(drop=True)
valid_df = valid_df.groupby('label').apply(lambda x: x.sample(70)).reset_index(drop=True)
test_df = test_df.groupby('label').apply(lambda x: x.sample(50)).reset_index(drop=True)
train_df['label'].value_counts()
valid_df['label'].value_counts()
test_df['label'].value_counts()
Tokenization
Tokenization is the process of splitting raw text into tokens and encoding the tokens as numeric data.
To do this, we first initialize a BertTokenizer:
from transformers import BertTokenizer
PRETRAINED_LM = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_LM, do_lower_case=True)
tokenizer
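As a quick illustration (the sample sentence is made up), we can look at how the tokenizer splits a text into WordPiece tokens and maps them to vocabulary IDs:
sample = "i feel incredibly happy today"
tokens = tokenizer.tokenize(sample)            # split into WordPiece tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to its vocabulary id
print(tokens)
print(ids)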
Next, define a function for encoding:
def encode(docs):
    '''
    This function takes a list of texts and returns the input_ids and attention_mask of the texts.
    '''
    encoded_dict = tokenizer.batch_encode_plus(docs, add_special_tokens=True, max_length=128, padding='max_length',
                                               return_attention_mask=True, truncation=True, return_tensors='pt')
    input_ids = encoded_dict['input_ids']
    attention_masks = encoded_dict['attention_mask']
    return input_ids, attention_masks
Use the encode function to get the input IDs and attention masks of the datasets:
train_input_ids, train_att_masks = encode(train_df['text'].values.tolist())
valid_input_ids, valid_att_masks = encode(valid_df['text'].values.tolist())
test_input_ids, test_att_masks = encode(test_df['text'].values.tolist())
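As a sanity check, each encoded tensor should have shape (number of samples, max_length); with the sampling above that's 2100 training rows (350 × 6 classes), 420 validation rows, and 300 test rows:
print(train_input_ids.shape, train_att_masks.shape)  # expected: torch.Size([2100, 128]) each
print(valid_input_ids.shape)                         # expected: torch.Size([420, 128])
print(test_input_ids.shape)                          # expected: torch.Size([300, 128])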
Creating Datasets and DataLoaders
We'll use PyTorch Dataset and DataLoader to split the data into batches. For more details, you can check out another post on DataLoader.
Turn the labels into tensors:
import torch
train_y = torch.LongTensor(train_df['label'].values.tolist())
valid_y = torch.LongTensor(valid_df['label'].values.tolist())
test_y = torch.LongTensor(test_df['label'].values.tolist())
train_y.size(),valid_y.size(),test_y.size()
Create DataLoaders for the training, validation, and test sets:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
BATCH_SIZE = 16
train_dataset = TensorDataset(train_input_ids, train_att_masks, train_y)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
valid_dataset = TensorDataset(valid_input_ids, valid_att_masks, valid_y)
valid_sampler = SequentialSampler(valid_dataset)
valid_dataloader = DataLoader(valid_dataset, sampler=valid_sampler, batch_size=BATCH_SIZE)
test_dataset = TensorDataset(test_input_ids, test_att_masks, test_y)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)
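To verify the batching, we can peek at a single batch (an illustrative check; with the settings above, each full batch holds 16 sequences of length 128):
batch_input_ids, batch_att_masks, batch_labels = next(iter(train_dataloader))
print(batch_input_ids.shape)  # expected: torch.Size([16, 128])
print(batch_att_masks.shape)  # expected: torch.Size([16, 128])
print(batch_labels.shape)     # expected: torch.Size([16])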
Bert For Sequence Classification Model
We will initialize the BertForSequenceClassification model from Hugging Face, which makes it easy to fine-tune the pretrained BERT model for a classification task.
You will see a warning that some parts of the model are randomly initialized. This is normal since the classification head has not yet been trained.
from transformers import BertForSequenceClassification
N_labels = len(train_df.label.unique())
model = BertForSequenceClassification.from_pretrained(PRETRAINED_LM,
num_labels=N_labels,
output_attentions=False,
output_hidden_states=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
model = model.to(device)
Fine-tuning
An optimizer updates the parameters of the model during training.
The learnable parameters (i.e. weights and biases) of a torch.nn.Module model are contained in the model's parameters, which we can access with model.parameters().
Hence, we initialize an AdamW optimizer with the model parameters and a learning rate:
from torch.optim import AdamW
LEARNING_RATE = 2e-6
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
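As an aside, we can confirm what the optimizer will be updating by counting the trainable parameters (roughly 110 million for bert-base-uncased):
# Count the learnable parameters handed to the optimizer
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")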
Selection of the learning rate is important. In practice, it's common to use a scheduler to decrease the learning rate during training.
from transformers import get_linear_schedule_with_warmup
EPOCHS = 30
scheduler = get_linear_schedule_with_warmup(optimizer,
num_warmup_steps=0,
num_training_steps=len(train_dataloader)*EPOCHS )
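To see what this schedule does: with num_warmup_steps=0, the linear schedule scales the base learning rate by (1 - step / total_steps), decaying it to zero by the final step. A quick sketch of the resulting values:
total_steps = len(train_dataloader) * EPOCHS
for step in [0, total_steps // 2, total_steps - 1]:
    # learning rate the scheduler would apply at this step
    print(step, LEARNING_RATE * (1 - step / total_steps))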
The training loop is where the magic of deep learning happens. The model will be fine-tuned on the emotion dataset for the classification task.
from tqdm.notebook import tqdm
from torch.nn.utils import clip_grad_norm_
def train(model, train_dataloader):
    model.train()
    train_loss = 0
    for step_num, batch_data in enumerate(tqdm(train_dataloader, desc='Training')):
        # get the inputs
        input_ids, att_mask, labels = [data.to(device) for data in batch_data]
        # zero the parameter gradients
        model.zero_grad()
        # forward + backward + optimize
        output = model(input_ids=input_ids, attention_mask=att_mask, labels=labels)
        loss = output.loss
        loss.backward()
        clip_grad_norm_(parameters=model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        train_loss += loss.item()
    return train_loss / len(train_dataloader)
Define a validation loop to see how the model performs:
from tqdm.notebook import tqdm
import numpy as np
def evaluate(model, dataloader, desc='Validation'):
    model.eval()
    valid_loss = 0
    valid_pred = []
    with torch.no_grad():
        for step_num_e, batch_data in enumerate(tqdm(dataloader, desc=desc)):
            input_ids, att_mask, labels = [data.to(device) for data in batch_data]
            output = model(input_ids=input_ids, attention_mask=att_mask, labels=labels)
            loss = output.loss
            valid_loss += loss.item()
            valid_pred.append(np.argmax(output.logits.cpu().detach().numpy(), axis=-1))
    valid_pred = np.concatenate(valid_pred)
    return valid_loss / len(dataloader), valid_pred
The best model is the one whose parameters yield the lowest validation loss. During training, we keep a copy of the model whenever the validation loss improves:
import copy
train_loss_per_epoch = []
val_loss_per_epoch = []
best_val_loss = float('inf')
best_model = None
for epoch_num in range(EPOCHS):
    print('Epoch: ', epoch_num + 1)
    # Training
    train_loss = train(model, train_dataloader)
    train_loss_per_epoch.append(train_loss)
    # Validation
    valid_loss, valid_pred = evaluate(model, valid_dataloader)
    val_loss_per_epoch.append(valid_loss)
    # Loss message
    print(f"train loss: {train_loss} | val loss: {valid_loss}")
    # save best model
    if valid_loss < best_val_loss:
        best_val_loss = valid_loss
        best_model = copy.deepcopy(model)
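A related refinement, not used in this post, is early stopping: stop training once the validation loss hasn't improved for a set number of epochs. A minimal sketch of how the loop above could be extended, with patience as an assumed hyperparameter:
patience = 5          # assumed: how many epochs to wait for an improvement
epochs_no_improve = 0
best_val_loss = float('inf')
for epoch_num in range(EPOCHS):
    train(model, train_dataloader)
    valid_loss, _ = evaluate(model, valid_dataloader)
    if valid_loss < best_val_loss:
        best_val_loss = valid_loss
        epochs_no_improve = 0
        best_model = copy.deepcopy(model)
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print('Stopping early: no improvement for', patience, 'epochs')
            break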
The benefit of saving the best model is not evident in this post: over 30 epochs, the validation loss is still decreasing steadily at every epoch, so the best model is simply the latest one. If we trained for, say, 100 epochs, the validation loss would likely start rising at some point, and keeping the best checkpoint would pay off.
Save the best model's state_dict to disk:
torch.save(best_model.state_dict(), 'best_model.pt')
You can see that we're saving the model's state_dict.
A state_dict is a Python dictionary object that maps each layer to its parameter tensor.
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
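Here we saved only the model weights. If you also want to resume training later, a common pattern (a sketch, not used further in this post) is to save a fuller checkpoint that includes the optimizer state:
# Save model weights plus optimizer state so training can be resumed
torch.save({
    'model_state_dict': best_model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'best_val_loss': best_val_loss,
}, 'best_checkpoint.pt')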
Performance Metrics
Plot the training and validation loss per epoch:
from matplotlib import pyplot as plt
epochs = range(1, EPOCHS +1 )
fig, ax = plt.subplots()
ax.plot(epochs,train_loss_per_epoch,label ='training loss')
ax.plot(epochs, val_loss_per_epoch, label = 'validation loss' )
ax.set_title('Training and Validation loss')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend()
plt.show()
from sklearn.metrics import classification_report
print('classification report')
print(classification_report(valid_df['label'].to_numpy(), valid_pred, target_names=label_names))
Error Analysis
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
def plot_confusion_matrix(y_preds, y_true, labels=None):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
plot_confusion_matrix(valid_pred,valid_df['label'].to_numpy(),labels=label_names)
You can see that sadness is more likely to be misclassified as anger or fear, which leads to a lower F1 score.
Inference
Now let's use the best model to make predictions on the test set.
First, initialize a new model with the same architecture:
from transformers import BertForSequenceClassification
N_labels = len(train_df.label.unique())
model = BertForSequenceClassification.from_pretrained(PRETRAINED_LM,
num_labels=N_labels,
output_attentions=False,
output_hidden_states=False)
And then, we use:
- torch.load() to load the best_model.pt file
- load_state_dict() to load the learned parameters into the newly initialized model
model.load_state_dict(torch.load('best_model.pt'))
model = model.to(device)
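One caveat: if the checkpoint was saved on a GPU machine and is later loaded where CUDA is unavailable, torch.load needs a map_location argument to remap the tensors. A minimal sketch:
# Remap the saved tensors to whatever device is available
state_dict = torch.load('best_model.pt', map_location=device)
model.load_state_dict(state_dict)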
Evaluate the restored model on the test set:
test_loss, test_pred = evaluate(model, test_dataloader, desc='Testing')
Output the classification report:
print('classification report')
print(classification_report(test_df['label'].to_numpy(), test_pred, target_names=label_names))
With the predictions, we can plot the confusion matrix again:
plot_confusion_matrix(test_pred,test_df['label'].to_numpy(),labels=label_names)
Output the misclassified texts:
test_df['pred'] = test_pred
print(test_df[test_df['label'] != test_df['pred']].shape)
test_df[test_df['label'] != test_df['pred']][['text', 'label', 'pred']].head(10)
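Finally, here's a quick sketch of running the loaded model on a single raw sentence (the example text is made up):
model.eval()
text = "i feel like everything is finally falling into place"
# Tokenize and move the tensors to the same device as the model
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(dim=-1))
print(label_names[pred_id])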