In the last post, we talked about the Transformer pipeline and the inner workings of the all-important tokenizer module, and at the end we made predictions using existing pre-trained models.
During fine-tuning, we can adjust the weights of the model in the following two ways:
- Update the weights of the pre-trained BERT model along with the classification layer.
- Update only the weights of the classification layer and not the pre-trained BERT model. This amounts to using the pre-trained BERT model as a feature extractor (a small sketch of this is shown below).
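To make the difference concrete, here is a minimal sketch (not used in the rest of this post) of the second option: freezing the pre-trained encoder so that only the classification head is updated, assuming a DistilBertForSequenceClassification model like the one we load later.

# Illustrative sketch only: freeze the pre-trained encoder so that just the
# classification head gets trained (the feature-extractor setup).
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
for param in model.distilbert.parameters():   # the pre-trained DistilBERT encoder
    param.requires_grad = False               # the classifier weights stay trainable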
In this tutorial we will follow the first approach, where we update the weights of the pre-trained BERT model along with the classification layer. The dataset we will use is the Kaggle Sentiment140 tweet sentiment analysis dataset. Dataset details are given below:
It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:
- target: the polarity of the tweet (0 = negative, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)
Next, we will start writing the code:
# Importing the required libraries
import transformers
import torch
import numpy as np
from torch.nn import functional as F
import pandas as pd
import tqdm
# Reading the dataset with no column titles and with latin encoding
df_raw = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',
                     encoding="ISO-8859-1", header=None)

# As the data has no column titles, we will add our own
df_raw.columns = ["label", "time", "date", "query", "username", "text"]

# Show the first 5 rows of the dataframe.
# You can specify the number of rows to be shown, e.g. df_raw.head(10)
df_raw.head()
# Checking the label column distribution:
# '4' denotes positive sentiment and '0' denotes negative sentiment
df_raw['label'].value_counts()
0    800000
4    800000
Name: label, dtype: int64
# Keeping only the text and the label, as we won't need any of the other columns
df = df_raw[['label', 'text']].copy()   # .copy() avoids pandas SettingWithCopyWarning
label_dict = {4: 1, 0: 0}               # mapping label 4 to class 1
df.loc[:, 'label'] = df['label'].map(label_dict)
df.head()
# Doing the train/test split
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].values, df['label'].values, test_size=0.2)

# Importing the pre-trained tokenizer
from transformers import DistilBertTokenizerFast, AutoTokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Next we will create the train/validation input ids and attention masks and convert them to torch tensors. If you are wondering why we loop through the data and tokenize each text individually, the reason is that tokenizing everything in one go may cause an out-of-memory error.
# Creating train input ids and attention masks
train_input_ids = []
train_attention_mask = []
for text in tqdm.tqdm(train_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    train_input_ids.append(encoding['input_ids'])
    train_attention_mask.append(encoding['attention_mask'])
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_mask = torch.cat(train_attention_mask, dim=0)
100%|██████████| 1280000/1280000 [04:32<00:00, 4693.16it/s]
# Creating validation input ids and attention masks
val_input_ids = []
val_attention_mask = []
for text in tqdm.tqdm(val_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    val_input_ids.append(encoding['input_ids'])
    val_attention_mask.append(encoding['attention_mask'])
val_input_ids = torch.cat(val_input_ids, dim=0)
val_attention_mask = torch.cat(val_attention_mask, dim=0)
100%|██████████| 320000/320000 [01:04<00:00, 4981.57it/s]
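As an aside, if your machine has enough memory, chunked batch tokenization can be faster than encoding one tweet per call. A sketch (assuming chunks of 10,000 tweets fit in memory; not what we use in this post):

# Sketch: tokenize the training texts in chunks instead of one tweet at a time
chunk_size = 10000
ids_chunks, mask_chunks = [], []
for start in range(0, len(train_texts), chunk_size):
    chunk = list(train_texts[start:start + chunk_size])
    enc = tokenizer(chunk, add_special_tokens=True, max_length=64,
                    padding='max_length', truncation=True, return_tensors='pt')
    ids_chunks.append(enc['input_ids'])
    mask_chunks.append(enc['attention_mask'])
train_input_ids = torch.cat(ids_chunks, dim=0)
train_attention_mask = torch.cat(mask_chunks, dim=0)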
Next we will create train and validation torch TensorDatasets by combining the input ids, attention masks and labels. Keep in mind that the labels have to be of the torch.long datatype, which the transformers library requires. After that we will create torch DataLoaders for batch processing.
train_dataset = torch.utils.data.TensorDataset(train_input_ids, train_attention_mask,
                                               torch.tensor(train_labels, dtype=torch.long))
val_dataset = torch.utils.data.TensorDataset(val_input_ids, val_attention_mask,
                                             torch.tensor(val_labels, dtype=torch.long))
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=32)
val_loader = torch.utils.data.DataLoader(val_dataset, shuffle=False, batch_size=32)
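Before training, a quick sanity check (illustrative, not part of the original notebook) is to pull one batch from the loader and confirm the shapes and the label dtype:

# Inspect a single batch: 32 sequences of length 64, with long (int64) labels
sample_input_ids, sample_attention_mask, sample_labels = next(iter(train_loader))
print(sample_input_ids.shape)       # torch.Size([32, 64])
print(sample_attention_mask.shape)  # torch.Size([32, 64])
print(sample_labels.dtype)          # torch.int64 (i.e. torch.long)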
Next we will import the pre-trained DistilBERT model and the AdamW optimizer from the transformers library.
from transformers import DistilBertForSequenceClassification, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=5e-5)
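Since we are following the first approach, every DistilBERT parameter plus the classification head is trainable. A quick check (illustrative, not in the original code):

# Count trainable parameters; for distilbert-base-uncased plus its classification
# head this should come to roughly 66-67 million.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {trainable:,}')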
# Function to print the model performance
from sklearn.metrics import f1_score, accuracy_score

def calculate_model_performance(labels, prediction):
    print('F1 Score:', f1_score(labels, prediction))
    print('Accuracy :', accuracy_score(labels, prediction))
Next we run the training loop for one epoch:
batch_labels = []
batch_prediction = []
for batch in tqdm.tqdm(train_loader):
    optimizer.zero_grad()
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs[0]
    preds = torch.argmax(outputs['logits'], dim=1)
    loss.backward()
    optimizer.step()
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 40000/40000 [1:09:36<00:00, 9.58it/s]
# Check the model performance on the training dataset
calculate_model_performance(batch_labels, batch_prediction)
F1 Score: 0.8456842439708853
Accuracy : 0.84594453125
# Validation loop
batch_labels = []
batch_prediction = []
model.eval()
for batch in tqdm.tqdm(val_loader):
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = torch.argmax(outputs['logits'], dim=1)
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 10000/10000 [05:13<00:00, 31.90it/s]
# Check the model performance on the validation dataset
calculate_model_performance(batch_labels, batch_prediction)
F1 Score: 0.8497619467466055
Accuracy : 0.856225
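With the fine-tuned model in hand, classifying a new tweet is straightforward. A minimal inference sketch (the example text and output directory below are made up for illustration):

# Classify a single new tweet with the fine-tuned model
model.eval()
enc = tokenizer("this new phone is amazing", return_tensors='pt',
                truncation=True, max_length=64, padding='max_length').to(device)
with torch.no_grad():
    logits = model(**enc).logits
print(torch.argmax(logits, dim=1).item())  # 1 = positive, 0 = negative

# Optionally save the model and tokenizer for later reuse
model.save_pretrained('./distilbert-sentiment140')      # hypothetical output path
tokenizer.save_pretrained('./distilbert-sentiment140')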
Running the training loop for more epochs can improve the performance on the validation set. In the next post we will talk about the transformers library's built-in fine-tuning utilities, and about how to update only the weights of the classification layer and not the pre-trained BERT model, which amounts to using the pre-trained BERT model as a feature extractor.
Thanks for reading and please comment if you have any questions.