
Generative AI: LLMs: Finetuning Llama2 with QLoRA on custom dataset 1.5

Posted on July 27, 2023 (updated August 17, 2023) by Aritra Sen

In the last post in this series, we went through the inner workings of the LoRA fine-tuning process. In this blog post we will combine the concepts of LoRA with a quantization method. We will use the newly launched Llama-2, one of the biggest launches in the history of open-source models. Below are the steps followed in the accompanying notebook, with details about each one:

  1. Install the required packages.
  2. Prepare the dataset for instruction fine-tuning.
  3. Define the quantization_config using BitsAndBytes.
  4. Load the sharded Llama-2 model with the quantization_config.
  5. Create the Llama-2 tokenizer.
  6. Create the peft_config to fine-tune LoRA adapters for the q, v attention matrices.
  7. Define the training arguments.
  8. Create the trainer with SFTTrainer.
  9. Train the model.
  10. Run inference.

Before we start coding the whole process, let's understand a few concepts which we have not yet covered in this blog post series.

Data Preparation:
In this post we will use the dialogsum dataset from the Hugging Face datasets module. The dataset has 4 features and is split into train, test and validation sets. The features are ['id', 'dialogue', 'summary', 'topic']; the ones we care about are dialogue and summary. Essentially, we are fine-tuning our model for a text summarization task using this dataset. We will prepare the data so that it can be used for instruction fine-tuning. Instruction fine-tuning uses a set of labeled examples in the form of {prompt, instruction, input, output} pairs to further train the pre-trained model on a particular task. The function shown below handles the data preparation step.

Preparation of the instruction-based dataset (Credit: Author)
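The figure above shows the author's preparation function; below is a minimal sketch of the same idea, assuming the "knkarthick/dialogsum" dataset id on the Hugging Face Hub and an illustrative prompt template (the exact template in the author's notebook may differ).

# A minimal data-preparation sketch; dataset id and prompt wording are assumptions.
from datasets import load_dataset

def build_instruction(example):
    # Wrap each dialogue/summary pair into a single instruction-style text field.
    prompt = (
        "Instruction: Summarize the following conversation.\n\n"
        f"Input:\n{example['dialogue']}\n\n"
        f"Summary:\n{example['summary']}"
    )
    return {"text": prompt}

dataset = load_dataset("knkarthick/dialogsum")
train_data = dataset["train"].map(build_instruction)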

Quantization:
With the rapid development of LLMs, it feels like every other day we get new models that are indeed very large, with a huge number of parameters. The most challenging aspect is fitting these models on minimal hardware, such as a single GPU. For example, just running inference on BLOOM-176B requires 8x 80GB A100 GPUs, and fine-tuning it would need around 72 of them. A lot of research is going into finding ways to fit these models on easily accessible hardware, and one such way is quantization. To understand this process, let's first look at the data types being used and how they are represented. The size of a model depends heavily on the number of parameters and the precision (float32, float16 or bfloat16) of those parameters. The idea is to reduce the model size by using lower precision without hurting model performance, as shown below.

Credit: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
  • FP32: 8 bits are reserved for the exponent, 23 bits for the mantissa and 1 bit for the sign. With this data type a huge range of numbers can be represented.
  • FP16: 5 bits are reserved for the exponent and 10 bits for the mantissa. Due to the reduced precision, a smaller range of numbers can be represented, which exposes FP16 to the risk of overflow (trying to represent a very large number) and underflow (representing a very small number).
  • BF16: To tackle the problems of FP16, BF16 was introduced, where 8 bits are reserved for the exponent (the same as FP32) and 7 bits for the fraction.
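To see why precision matters for model size, here is a quick back-of-envelope calculation of the weight memory of a 7B-parameter model at different precisions (weights only, ignoring activations and optimizer state):

# Approximate weight memory of a 7B-parameter model at different precisions.
params = 7_000_000_000
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16/BF16: 14.0 GB, INT8: 7.0 GB, 4-bit: 3.5 GB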

I hope this gives you an idea of how quantization can reduce the size of a model. There are several quantization techniques; for more details please refer to this wonderfully written Hugging Face blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes. We will use the bitsandbytes library to load the Llama-2 model with quantization parameters.

Quantization Config (Credit: Author)
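A minimal sketch of a 4-bit BitsAndBytesConfig along the lines of the figure above; the specific flags (NF4, double quantization, bfloat16 compute) are typical QLoRA settings and may differ from the author's notebook.

# A minimal 4-bit quantization config; flag values are typical QLoRA choices.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # nested quantization of the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)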

Llama-2:
Abstract from the Llama-2 paper by Meta:

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
You can find details of all the available models here: Models – Hugging Face

Sharded Model:
A sharded model is helpful for distributed training of large pretrained models like LLMs. Sharding splits the model parameters, gradients, and optimizer states across data-parallel processes, and sharded model parameters can also be offloaded to the CPU. In this coding exercise we use a sharded version of Llama-2 so that it fits on a single GPU. You can see that 14 shards are downloaded when the model is initialized for the first time.

Sharded Llama-2 model (Credit: Author)
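A minimal sketch of loading a sharded checkpoint with the quantization config defined above; the checkpoint id "TinyPixel/Llama-2-7B-bf16-sharded" is an assumption (a commonly used community re-shard of the 7B weights), so substitute whichever sharded checkpoint you use.

# Load a sharded Llama-2 checkpoint in 4-bit; checkpoint id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyPixel/Llama-2-7B-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # bnb_config from the quantization step above
    device_map="auto",               # place shards on the available GPU automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default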

Peft config:
In the last blog post we discussed in detail that in LoRA we train task-specific low-rank adapters, generally for the q, v matrices of the attention layers, while keeping all of the pretrained model weights frozen. Using the peft library we create new low-rank adapters for q_proj and v_proj, as shown below, with rank r=8.

Configuring LoRA (Credit: Author)
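A minimal sketch of the peft configuration described above, targeting q_proj and v_proj with rank r=8; the remaining hyperparameters (lora_alpha, dropout) are illustrative assumptions.

# Create LoRA adapters for the q and v projection matrices.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training

peft_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapters
    lora_alpha=16,                        # scaling factor (assumption)
    lora_dropout=0.05,                    # adapter dropout (assumption)
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # only the q, v attention matrices get adapters
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()        # only the adapter weights are trainable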

Training Arguments:
Using the Supervised Fine-tuning Trainer (huggingface.co) (SFTTrainer) we fine-tune the Llama-2 model on our custom dataset. To keep the article short, please refer to Trainer (huggingface.co) for details of the training arguments.
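A minimal sketch of the training setup with TRL's SFTTrainer, reusing the model, tokenizer and train_data from the earlier sketches; all hyperparameter values are illustrative assumptions, and the keyword arguments follow the 2023-era trl API rather than the author's exact notebook.

# Supervised fine-tuning with SFTTrainer; hyperparameters are illustrative.
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="llama2-dialogsum-qlora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=20,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,                  # already wrapped with the LoRA adapters above
    args=training_args,
    train_dataset=train_data,
    dataset_text_field="text",    # the field created in the data-preparation step
    max_seq_length=1024,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("llama2-dialogsum-qlora")  # saves only the LoRA adapter weights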

Inference using QLoRA Adapters:
Once the adapter is trained, you can load the saved adapter weights on top of the original model to get the new LoRA fine-tuned model.

Llama2 + QLoRA = Finetuned model (Credit: Author)
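Below is a minimal inference sketch that reloads the quantized base model and attaches the trained adapter via PeftModel.from_pretrained, the usual peft call for a saved adapter (the author's notebook may take a slightly different route). The repetition_penalty value is an assumption, added only to illustrate one way of discouraging repeating text.

# Inference with the trained QLoRA adapter; model_id, bnb_config and
# tokenizer are the ones defined in the earlier sketches.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
finetuned_model = PeftModel.from_pretrained(base_model, "llama2-dialogsum-qlora")

prompt = (
    "Instruction: Summarize the following conversation.\n\n"
    "Input:\n#Person1#: Hi, did you finish the report?\n#Person2#: Yes, I sent it this morning.\n\n"
    "Summary:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
outputs = finetuned_model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))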

This should give you an idea of the whole process of fine-tuning a Llama-2 model using QLoRA. I ran the notebook in a Kaggle kernel with a single GPU and the whole process works fine. I will keep refining the process to improve the QLoRA outputs.
Update: Code changes have been made to fix the repeating text output. Text summarization now works properly.

Do like, share and comment if you have any questions or suggestions.
