In the last post in this Generative AI with LLMs series, we talked about different types of LLM models and how they are generally pre-trained. These deep learning language models with large numbers of parameters are typically trained on openly available data such as Common Crawl, The Pile, MassiveText, blogs, Wikipedia, and GitHub. These datasets span many domains and topics and are generic in nature, so a pre-trained LLM may not perform well on the specific task at hand without finetuning. For example, if you want to use a pretrained LLM for NLP tasks on bio-medical documents, finetuning it (or using in-context learning) on a corpus of bio-medical documents can significantly improve the model’s performance.
This blog post will show and discuss the different approaches available for finetuning an LLM for the task at hand. We will discuss each topic briefly here, and in future posts we will go into each one in depth with code.
Let’s get started with a brief discussion of each of the approaches shown above:
In-context learning:
In-context learning is the way to go when we don’t have access to the full LLM and instead use it through an API, for example when we make calls to OpenAI’s gpt-3.5-turbo model. With the recent developments around GPT-3 / ChatGPT, we have seen that zero-shot or few-shot prompting can produce much better results for the task at hand. In few-shot prompting we provide one or more examples of the task embedded in the input prompt to the model, as in the sketch below. In the next blog post tutorial, we will go through how we can do this in Python.
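To make the idea concrete, here is a minimal sketch of few-shot prompting, assuming the official openai Python client (v1-style API), an OPENAI_API_KEY set in the environment, and gpt-3.5-turbo as the model; the exact setup will be covered in the next post.

```python
# Minimal few-shot sentiment prompt using the OpenAI Python client (v1-style API).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

few_shot_prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n\n"
    "Review: The battery lasts all day, love it!\nSentiment: Positive\n\n"
    "Review: Stopped working after a week.\nSentiment: Negative\n\n"
    "Review: The screen is gorgeous and setup was painless.\nSentiment:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```

The two labelled examples in the prompt are what makes this "few-shot": the model infers the task format from them without any weight updates.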
Feature-based finetuning:
In feature-based finetuning we need access to the full LLM, such as BERT, which can then be finetuned in one of two common ways to get much better performance on a domain-specific downstream task like sentiment classification. We can attach a new classification (task-specific) head and train only that newly added head, or we can tune all the layers of the LLM along with the newly added head; a sketch of the first option follows. A more in-depth discussion will come in future blog posts.
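As a rough illustration, here is a sketch using Hugging Face transformers, assuming bert-base-uncased and a toy two-example batch; it attaches a two-class head and trains only the head by freezing the encoder (remove the freezing loop to tune all layers instead).

```python
# Sketch: add a classification head on top of BERT and train only the new head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the pre-trained encoder so only the newly added head is trained.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

batch = tokenizer(["great movie", "terrible plot"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)  # outputs.loss is a cross-entropy loss
outputs.loss.backward()
optimizer.step()
```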
Quantization (Model run time/space optimization):
Quantization is generally a model optimization technique. While feature-based finetuning plays with the number of parameters to fine-tune, quantization takes a different approach: it represents the weights, biases and gradients of the LLM with low-precision data types such as 8-bit integers (INT8) instead of the usual 32-bit floating point (FP32). By reducing the number of bits, we reduce the size of the model to be finetuned, which in turn helps the finetuning process by reducing memory usage and run time. A small illustration follows.
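As an illustration of the idea (not a full finetuning workflow), here is a sketch using PyTorch's post-training dynamic quantization to convert a model's Linear layers to INT8 and compare the serialized size before and after; for 8-bit finetuning itself, libraries such as bitsandbytes are typically used instead.

```python
# Sketch: post-training dynamic quantization of a model's Linear layers to INT8,
# with a rough comparison of the serialized model size before and after.
import os
import torch
from transformers import AutoModel

def size_mb(m, path="tmp.pt"):
    # Serialize the state dict to disk and report its size in megabytes.
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

model = AutoModel.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(f"FP32: {size_mb(model):.1f} MB  ->  INT8: {size_mb(quantized):.1f} MB")
```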
Multitask instruction finetuning:
So far, all the methods mentioned above talk about finetuning the whole LLM for a single downstream task. Tuning the model for a single downstream task can lead to a phenomenon called ‘catastrophic forgetting’, where the model learns to do the task it was finetuned for but performs very poorly on other tasks. For example, an LLM finetuned for sentiment classification can start performing very poorly on other tasks like text summarization or named entity recognition. To avoid catastrophic forgetting we can fine-tune the model on a mixture of instruction prompts. The FLAN family of models, such as FLAN-T5, was trained this way; the sketch below shows one model handling several task types.
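To illustrate why instruction tuning on a task mixture is attractive, here is a small sketch using Hugging Face transformers and google/flan-t5-base, running a mixture of instruction prompts through the same model; actual multitask instruction finetuning would train on a large dataset of such prompt/response pairs rather than just running inference.

```python
# Sketch: one instruction-tuned model (FLAN-T5) handling several different tasks.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

instruction_mixture = [
    "Classify the sentiment of this review: The food was cold and the service slow.",
    "Summarize: The meeting covered quarterly revenue, hiring plans, and the product roadmap.",
    "Translate English to German: Where is the train station?",
]

for prompt in instruction_mixture:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```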
Parameter Efficient Fine Tuning (PEFT):
PEFT reuses the pre-trained model and introduces only a minimal number of new parameters to be trained during the fine-tuning process. Go for PEFT when you want to optimize the criteria below:
- Computational costs and hardware (fewer GPUs and less GPU time)
- Minimal training time
- Better modeling performance by reducing overfitting
- Less storage, since the newly added or trained parameters are very small in size
At a high level we can categorize PEFT into the approaches below, which we will discuss in detail with code implementations in future blog posts.
1. LoRA (reparameterization):
In Low-Rank Adaptation (LoRA) fine-tuning, two new low-rank matrices (obtained via matrix decomposition) are introduced into the fine-tuning process while the pre-trained model weights are kept unchanged. These low-rank matrices can be specific to different tasks. At inference time the task-specific low-rank matrices can be added back to the pre-trained model weights to get better performance on individual tasks; a minimal sketch follows.
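A minimal sketch of LoRA using the Hugging Face peft library, assuming gpt2 as the base model and its fused attention projection (c_attn) as the target module; only the low-rank matrices are marked trainable.

```python
# Sketch: wrap a frozen base model with trainable low-rank adapters via peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the LoRA update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```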
2. Adapters:
According to research, a BERT model trained with the adapter method reaches modeling performance comparable to a fully finetuned BERT model while only requiring the training of 3.6% of the parameters. The method adds additional parameters/layers to each transformer block and trains only these additional parameters, keeping the original parameters frozen; a minimal version is sketched below.
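Below is a small, illustrative adapter module written from scratch in PyTorch (hidden size 768 to match BERT-base); it shows the typical bottleneck structure, a down-projection, non-linearity, up-projection and residual connection, that gets inserted into each transformer block, and is not an exact reproduction of any particular adapter implementation.

```python
# Sketch of a bottleneck adapter block; in the adapter method only these
# small modules are trained while the original transformer weights stay frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down to a small dimension
        self.activation = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up

    def forward(self, hidden_states):
        # Residual connection keeps the original representation intact.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

adapter = Adapter()
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden) activations from a transformer block
print(adapter(x).shape)       # torch.Size([2, 16, 768])
```

The residual connection means the adapter can start out close to an identity function, so the pre-trained behaviour is preserved at the beginning of training.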
3. Soft prompt tuning:
Soft prompt tuning is different from traditional (hard) prompting: instead of handcrafting text prompts, soft prompt tuning (and the related prefix tuning) prepends tunable tensors to the input embeddings and learns them during fine-tuning while keeping the model weights frozen. A small sketch follows.
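A minimal sketch of soft prompt tuning using the peft library, assuming gpt2 as the base model and 20 virtual tokens; the base weights stay frozen and only the virtual-token embeddings are learned.

```python
# Sketch: prepend trainable "virtual tokens" to the input via peft's prompt tuning support.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # length of the learned soft prompt prepended to every input
)

model = get_peft_model(base_model, prompt_config)
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
```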
Reinforcement Learning from Human Feedback (RLHF):
In this process we include human feedback in the fine-tuning loop. The first step is to generate a human-labelled dataset in which humans rank the LLM’s outputs based on criteria like toxicity, relevance or quality. Using this labelled dataset, a reward model is trained that assigns a reward to the outputs generated by the LLM. Based on the reward, an RL algorithm (Proximal Policy Optimization, PPO) fine-tunes the weights of the LLM. This technique is one of the main reasons behind the success of ChatGPT; the reward step is sketched below.
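Here is a conceptual sketch of just the reward step, assuming Hugging Face transformers, gpt2 as a stand-in policy model, and a placeholder path for a reward model checkpoint (you would train or download a real one); the actual PPO weight update is typically handled by a library such as TRL and is not shown here.

```python
# Conceptual sketch of the reward step in RLHF: the policy LLM generates a response
# and a separately trained reward model scores it; that scalar drives the PPO update.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

policy_name = "gpt2"                       # stand-in policy model
reward_name = "path/to/your-reward-model"  # hypothetical reward model checkpoint

policy_tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)

reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)

prompt = "Explain why recycling matters:"
inputs = policy_tokenizer(prompt, return_tensors="pt")
generated = policy.generate(**inputs, max_new_tokens=40)
response = policy_tokenizer.decode(generated[0], skip_special_tokens=True)

# Score the generated response with the reward model.
reward_inputs = reward_tokenizer(response, return_tensors="pt")
reward = reward_model(**reward_inputs).logits[0, 0]
print(response, float(reward))
```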
In future blog posts we will go into each of these techniques in depth.
Thanks for your time, hope you have enjoyed reading, do share if you like the post.