In this post we will walk through a hands-on implementation of a few simple NLP tasks with the Hugging Face transformers library. The focus here is on the hands-on part; in case you are interested in learning more about transformers and the attention mechanism, below are a few resources –
- Getting started with Google Bert
- Neural Machine Translation
- Illustrated Transformer
A few prerequisites for this post are –
- Basic knowledge of the transformer architecture
- Deep learning fundamentals
- Basics of PyTorch implementation
If you have PyTorch or TensorFlow installed on your system, you can install the Hugging Face transformers library straight away with the command below –
! pip install transformers
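To confirm the installation worked, you can print the installed library version (a quick sanity check; the version you see will depend on when you install):
import transformers
print(transformers.__version__)  # prints the installed transformers version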
First we will start with the very basics of transformers: what a pipeline is, the inner workings of tokenizers, how to use pre-trained models, and how to do sentence classification with a few lines of code using pre-trained models. Then, in the next post, we will fine-tune a pre-trained BERT model in PyTorch for Twitter sentiment analysis.
Let’s first import the required libraries
import transformers
import torch
import numpy as np
from torch.nn import functional as F
import pandas as pd
import tqdm
Pipelines
We will go through the pipeline component of transformers. Pipelines are a great and easy way to use models for inference.
Pipelines are made of:
- A tokenizer in charge of mapping raw textual input to tokens.
- A model to make predictions from the inputs.
- Some (optional) post-processing for enhancing the model's output.
Next we will try sentiment classification with a pipeline –
classifier = transformers.pipeline('sentiment-analysis')  # mention the pipeline task name
result = classifier(['We are learning tranformers'])
print(result)
[{'label': 'POSITIVE', 'score': 0.9731776118278503}]
In the above snippet, the pipeline abstraction lets us perform a task with minimal coding just by passing the task name. The classifier output returns the label along with a probability score. Below are a few of the currently supported tasks; a quick sketch of a second task follows the list –
"audio-classification"
"automatic-speech-recognition"
"image-classification"
"question-answering"
"text-classification" (alias "sentiment-analysis" available)
Coming back to the sentiment classifier, we can also pass multiple sentences at once, as shown below –
results = classifier(['We are learning tranformers', 'I am not happy'])
for result in results:
    print(result)
{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}
We can also pass a pre-trained model name as an argument to the pipeline for the sentiment analysis task. Here we get the same result as before, because under the hood the previous step and this step use the same model.
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
classifier = transformers.pipeline('sentiment-analysis', model=model_name)
results = classifier(['We are learning tranformers', 'I am not happy'])
for result in results:
    print(result)
{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}
Based on the task we want to perform, we can select different types of models. Using this documentation link you can choose the model or checkpoint name; a sketch of plugging a different checkpoint into the same pipeline follows.
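A minimal sketch of swapping in another checkpoint. The checkpoint name 'nlptown/bert-base-multilingual-uncased-sentiment' is used only as an assumed example (it rates sentiment on a 1–5 star scale), so its labels and scores will differ from the model above:
star_classifier = transformers.pipeline('sentiment-analysis',
                                        model='nlptown/bert-base-multilingual-uncased-sentiment')  # assumed example checkpoint
print(star_classifier(['We are learning tranformers']))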
We can also pass a custom tokenizer along with the model to the pipeline to do the classification task, as shown below. We again get the same result, because both the tokenizer and the model are the same as in the previous steps. Generally, to match the architecture and other properties, we use the same model_name for both the tokenizer and the model. The from_pretrained method is used to fetch the pre-trained weights and configuration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = transformers.pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
results = classifier(['We are learning tranformers', 'I am not happy'])
for result in results:
    print(result)
{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}
Tokenizer:
Now we will discuss a few of the inner workings of the Hugging Face tokenizer. Below are a few things to keep in mind –
1. Basic tokens as output from the tokenizer. We will use the previously created tokenizer object for the steps below –
tokens = tokenizer.tokenize('We are learning tranformers')
print(tokens)
['we', 'are', 'learning', 'tran', '##form', '##ers']
2. We get token_ids from tokens with the tokenizer object; in this process each token gets mapped to a unique token id (for example, 'we' -> 2057).
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
[2057, 2024, 4083, 25283, 14192, 2545]
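The mapping also works in the other direction. A small sketch of going from ids back to subword tokens and then to a readable string (note that the decoded string will reflect the 'tranformers' typo in the input):
print(tokenizer.convert_ids_to_tokens(token_ids))  # back to subword tokens
print(tokenizer.decode(token_ids))                 # merges subwords into a plain string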
3. Now we look at tokenizing a complete sentence and printing the corresponding output. If you look closely at the output below, you can see two things are printed: input_ids, which are the same as the token_ids above apart from two extra tokens that mark the beginning and end of the sentence, and attention_mask, which tells us whether a token is padding or not. If an input_id is a real token the corresponding value is 1, and if it is padding the value is 0.
print(tokenizer('We are learning tranformers'))
{'input_ids': [101, 2057, 2024, 4083, 25283, 14192, 2545, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
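To see what those two extra ids are, we can convert the input_ids back to tokens. For BERT-style tokenizers such as this one, 101 and 102 are the special [CLS] and [SEP] tokens (a quick sanity-check sketch):
encoded = tokenizer('We are learning tranformers')
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'we', 'are', 'learning', 'tran', '##form', '##ers', '[SEP]']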
4. When we work with batches of sentences, we need to be mindful that the sentences may not all be of the same length. To handle this, we enable a few options during tokenization –
padding=True – pads the shorter sentences to match the longest sentence in the batch.
truncation=True – truncates sentences to the given max_length.
return_tensors='pt' – returns the tokens converted to PyTorch tensors.
data = ['We are learning tranformers', 'I am not happy', 'We are happy to learn transformers']
batch = tokenizer(data, padding=True, truncation=True, max_length=512, return_tensors='pt')
print(batch)
{'input_ids': tensor([[  101,  2057,  2024,  4083, 25283, 14192,  2545,   102],
        [  101,  1045,  2572,  2025,  3407,   102,     0,     0],
        [  101,  2057,  2024,  3407,  2000,  4553, 19081,   102]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
Making Predictions with pre-trained models for a batch
We use the model defined earlier to make predictions on the batch of data we just created in the previous step; inline comments are provided for explanation.
with torch.no_grad():  # disable gradient computation
    results = model(**batch, labels=torch.tensor([1, 0, 1]))  # without labels, the loss won't be returned
    print(results)
    predictions = torch.softmax(results.logits, dim=1)  # normalize logits into probabilities
    print(predictions)
    classes = torch.argmax(predictions, dim=1)  # take argmax to select the class with the highest probability
    print(classes)
    labels = [model.config.id2label[c] for c in classes.tolist()]  # pre-trained models have an id2label mapping to get class names
    print(labels)
SequenceClassifierOutput(loss=tensor(0.0092), logits=tensor([[-1.8314,  1.7600],
        [ 4.7199, -3.7467],
        [-4.1593,  4.4165]]), hidden_states=None, attentions=None)
tensor([[2.6822e-02, 9.7318e-01],
        [9.9979e-01, 2.1033e-04],
        [1.8858e-04, 9.9981e-01]])
tensor([1, 0, 1])
['POSITIVE', 'NEGATIVE', 'POSITIVE']
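As a quick cross-check (a small sketch, assuming the classifier pipeline built earlier is still in scope), we can run the same batch of sentences through the pipeline. Since the pipeline wraps the same tokenize, model, and softmax steps, the labels should match the classes computed above:
print(classifier(data))  # expected labels: POSITIVE, NEGATIVE, POSITIVE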
In the next post we will put everything together and fine-tune a pre-trained BERT model for Twitter sentiment analysis, both in plain PyTorch and with Hugging Face's built-in training process.
Thanks for reading and please comment if you have any questions.