In the last post, we talked about the Transformer pipeline and the inner workings of the all-important tokenizer module, and at the end we made predictions using existing pre-trained models.
During fine-tuning, we can adjust the weights of the model in the following two ways:
- Update the weights of the pre-trained BERT model along with the classification layer.
- Update only the weights of the classification layer and not the pre-trained BERT model. This amounts to using the pre-trained BERT model as a feature extractor (a small sketch of this is shown below).
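To make the difference concrete, here is a minimal sketch (not used in the rest of this post) of the second option: freezing the pre-trained encoder so that only the classification head is updated, assuming a DistilBertForSequenceClassification model like the one we load later.

# Illustrative sketch only: freeze the pre-trained encoder so that just the
# classification head gets trained (the feature-extractor setup).
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
for param in model.distilbert.parameters():   # the pre-trained DistilBERT encoder
    param.requires_grad = False               # the classifier weights stay trainable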
In this tutorial we will follow the first approach, where we update the weights of the pre-trained BERT model along with the classification layer. The dataset we will use is the Kaggle Sentiment140 tweet sentiment analysis dataset. Dataset details are given below:
It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:
- target: the polarity of the tweet (0 = negative, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)
Next, we will start writing the code:
# Importing the required libraries
import transformers
import torch
import numpy as np
from torch.nn import functional as F
import pandas as pd
import tqdm
# Reading the dataset with no column titles and with latin encoding
df_raw = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',
                     encoding="ISO-8859-1", header=None)

# As the data has no column titles, we will add our own
df_raw.columns = ["label", "time", "date", "query", "username", "text"]

# Show the first 5 rows of the dataframe.
# You can specify the number of rows to be shown, e.g. df_raw.head(10)
df_raw.head()
# Checking the label column distribution:
# '4' denotes positive sentiment and '0' denotes negative sentiment
df_raw['label'].value_counts()
0    800000
4    800000
Name: label, dtype: int64
# Keeping only the text and the label, as we won't need any of the other columns
df = df_raw[['label', 'text']].copy()   # .copy() avoids pandas SettingWithCopyWarning
label_dict = {4: 1, 0: 0}               # mapping label 4 to class 1
df.loc[:, 'label'] = df['label'].map(label_dict)
df.head()
# Doing the train/test split
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].values, df['label'].values, test_size=0.2)

# Importing the pre-trained tokenizer
from transformers import DistilBertTokenizerFast, AutoTokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Next we will create the train/validation input ids and attention masks and convert them to torch tensors. If you are wondering why we loop through the data and tokenize each text individually, the reason is that tokenizing everything in one go may cause an out-of-memory error.
# Creating train input ids and attention masks
train_input_ids = []
train_attention_mask = []
for text in tqdm.tqdm(train_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    train_input_ids.append(encoding['input_ids'])
    train_attention_mask.append(encoding['attention_mask'])
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_mask = torch.cat(train_attention_mask, dim=0)
100%|██████████| 1280000/1280000 [04:32<00:00, 4693.16it/s]
# Creating validation input ids and attention masks
val_input_ids = []
val_attention_mask = []
for text in tqdm.tqdm(val_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    val_input_ids.append(encoding['input_ids'])
    val_attention_mask.append(encoding['attention_mask'])
val_input_ids = torch.cat(val_input_ids, dim=0)
val_attention_mask = torch.cat(val_attention_mask, dim=0)
100%|██████████| 320000/320000 [01:04<00:00, 4981.57it/s]
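As an aside, if your machine has enough memory, chunked batch tokenization can be faster than encoding one tweet per call. A sketch (assuming chunks of 10,000 tweets fit in memory; not what we use in this post):

# Sketch: tokenize the training texts in chunks instead of one tweet at a time
chunk_size = 10000
ids_chunks, mask_chunks = [], []
for start in range(0, len(train_texts), chunk_size):
    chunk = list(train_texts[start:start + chunk_size])
    enc = tokenizer(chunk, add_special_tokens=True, max_length=64,
                    padding='max_length', truncation=True, return_tensors='pt')
    ids_chunks.append(enc['input_ids'])
    mask_chunks.append(enc['attention_mask'])
train_input_ids = torch.cat(ids_chunks, dim=0)
train_attention_mask = torch.cat(mask_chunks, dim=0)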
Next we will create train and validation torch TensorDatasets by combining the input ids, attention masks and labels. Keep in mind that the labels have to be of the torch.long datatype, which the transformers library requires. After that we will create torch DataLoaders for batch processing.
train_dataset = torch.utils.data.TensorDataset(train_input_ids, train_attention_mask,
                                               torch.tensor(train_labels, dtype=torch.long))
val_dataset = torch.utils.data.TensorDataset(val_input_ids, val_attention_mask,
                                             torch.tensor(val_labels, dtype=torch.long))
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=32)
val_loader = torch.utils.data.DataLoader(val_dataset, shuffle=False, batch_size=32)
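Before training, a quick sanity check (illustrative, not part of the original notebook) is to pull one batch from the loader and confirm the shapes and the label dtype:

# Inspect a single batch: 32 sequences of length 64, with long (int64) labels
sample_input_ids, sample_attention_mask, sample_labels = next(iter(train_loader))
print(sample_input_ids.shape)       # torch.Size([32, 64])
print(sample_attention_mask.shape)  # torch.Size([32, 64])
print(sample_labels.dtype)          # torch.int64 (i.e. torch.long)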
Next we will import the pre-trained DistilBERT model and the AdamW optimizer from the transformers library.
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=5e-5)
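Since we are following the first approach, every DistilBERT parameter plus the classification head is trainable. A quick check (illustrative, not in the original code):

# Count trainable parameters; for distilbert-base-uncased plus its classification
# head this should come to roughly 66-67 million.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {trainable:,}')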
# Function to print the model performance
from sklearn.metrics import f1_score, accuracy_score

def calculate_model_performance(labels, prediction):
    print('F1 Score:', f1_score(labels, prediction))
    print('Accuracy :', accuracy_score(labels, prediction))
Next we run the training loop for one epoch:
batch_labels = []
batch_prediction = []
for batch in tqdm.tqdm(train_loader):
    optimizer.zero_grad()
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs[0]
    preds = torch.argmax(outputs['logits'], dim=1)
    loss.backward()
    optimizer.step()
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 40000/40000 [1:09:36<00:00, 9.58it/s]
# Check the model performance on the training dataset
calculate_model_performance(batch_labels, batch_prediction)
F1 Score: 0.8456842439708853
Accuracy : 0.84594453125
# Validation loop
batch_labels = []
batch_prediction = []
model.eval()
for batch in tqdm.tqdm(val_loader):
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = torch.argmax(outputs['logits'], dim=1)
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 10000/10000 [05:13<00:00, 31.90it/s]
# Check the model performance on the validation dataset
calculate_model_performance(batch_labels, batch_prediction)
F1 Score: 0.8497619467466055
Accuracy : 0.856225
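With the fine-tuned model in hand, classifying a new tweet is straightforward. A minimal inference sketch (the example text and output directory below are made up for illustration):

# Classify a single new tweet with the fine-tuned model
model.eval()
enc = tokenizer("this new phone is amazing", return_tensors='pt',
                truncation=True, max_length=64, padding='max_length').to(device)
with torch.no_grad():
    logits = model(**enc).logits
print(torch.argmax(logits, dim=1).item())  # 1 = positive, 0 = negative

# Optionally save the model and tokenizer for later reuse
model.save_pretrained('./distilbert-sentiment140')      # hypothetical output path
tokenizer.save_pretrained('./distilbert-sentiment140')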
Running the training loop for more epochs can improve the performance on the validation set. In the next post we will talk about the transformers library's built-in fine-tuning utilities, and about how to update only the weights of the classification layer and not the pre-trained BERT model, which amounts to using the pre-trained BERT model as a feature extractor.
Thanks for reading and please comment if you have any questions.