
Understanding LoRA: Low-Rank Adaptation of Large Language Models

The most powerful technique for finetuning LLMs! 🚀

LoRA (Low-Rank Adaptation) is one of the most powerful techniques for fine-tuning Large Language Models (LLMs).

Today I’ll clearly explain:

  • What is LoRA❓

  • How does it work❓

  • Followed by a hands-on coding tutorial❗️

But before we do that, let's set the stage with a brief overview of Fine-Tuning.

Fine-Tuning: The Traditional Approach

Fine-tuning is a well-established method in machine learning where a pre-trained model is further trained (or "fine-tuned") on a specific task. This approach leverages the general knowledge the model has learned during its initial training (often on a large and diverse dataset) and adapts it to a particular use case.

Here’s a typical representation of traditional Fine-Tuning 👇

A typical finetuning process

The Need for Parameter Efficiency

While fine-tuning is powerful, it comes with a significant drawback: resource intensity. Large models require substantial computational resources, not just for training but also for adapting to new tasks.

Fine-tuning in the manner described above becomes a very memory-intensive task: during backpropagation you need to hold in GPU memory all the weights W as well as the gradients of the loss with respect to those weights (∂L/∂W), since the gradients are needed to update the original weights. And we are talking about LLMs whose trainable parameters run into the hundreds of billions (GPT-3 has 175 billion trainable parameters).
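To get a sense of the scale, here is a rough back-of-the-envelope calculation, a sketch that counts only fp16 weights and gradients (optimizer states and activations push the requirement even higher):

num_params = 175e9       # GPT-3 scale
bytes_per_value = 2      # fp16

weights_gb = num_params * bytes_per_value / 1e9   # ~350 GB just to hold the weights
grads_gb = num_params * bytes_per_value / 1e9     # ~350 GB more for the gradients
print(f"weights: ~{weights_gb:.0f} GB, gradients: ~{grads_gb:.0f} GB")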

This is where parameter-efficient fine-tuning becomes very important; otherwise we can simply forget about fine-tuning LLMs on consumer hardware.

Now I want you to think of an alternative representation of the finetuning process as shown below👇

We are setting the stage for LoRA here. If we could update only ∆W without touching the original pre-trained weights W (frozen, i.e. requires_grad = False), it would save us a lot of GPU memory, especially if these update weights were only a small fraction of the size of the original W.
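Here is a minimal sketch of that idea in plain PyTorch (illustrative only; note that at this point ∆W is still the same size as W, which is exactly the problem LoRA solves next):

import torch

W = torch.randn(100, 100)                             # pre-trained weights, frozen (requires_grad=False by default)
delta_W = torch.zeros(100, 100, requires_grad=True)   # the only trainable tensor

x = torch.randn(1, 100)
h = x @ (W + delta_W)    # forward pass uses W + delta_W; gradients flow only into delta_W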

Enter LoRA: A Step-by-Step Explanation:

The LoRA paper came out in 2021, and one of its key insights was that pre-trained models have a very low intrinsic dimension (low rank), which basically means they can be represented with matrices of much lower dimension.

(The rank of a matrix is the number of linearly independent rows or columns it has.)

They argue that during adaptation/fine-tuning, the update ∆W (of dimension A×B) also has a low rank and can therefore be decomposed into two matrices, WA (of dimension A×r) and WB (of dimension r×B), where r is the rank of ∆W.

Check this out👇

Low rank decomposition

So, let's say ∆W has a dimension of 100×100 and a rank of just 5; then we can decompose it into two matrices of dimension 100×5 and 5×100, which brings the number of trainable parameters down from 10,000 to just 1,000.

Now r becomes a hyperparameter that we need to tune during training. The key point is that WA and WB can effectively represent ∆W while having a significantly lower rank and, therefore, fewer parameters to fine-tune.
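The same parameter count, worked out in code (a toy example using the 100×100, r = 5 numbers above):

import torch

d, r = 100, 5
W_A = torch.zeros(d, r)      # 100 * 5  = 500 trainable values
W_B = torch.zeros(r, d)      # 5 * 100  = 500 trainable values
delta_W = W_A @ W_B          # still a full 100x100 matrix, but of rank <= 5

print(delta_W.numel(), W_A.numel() + W_B.numel())   # 10000 vs 1000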

This is how it looks now:

Diagram of LoRA
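To make this concrete, here is a simplified sketch of what a LoRA-adapted linear layer computes (an illustration, not the actual peft implementation; the zero-initialization of WB and the alpha/r scaling follow the paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights W
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)  # WA: small random init
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))        # WB: zeros, so delta_W starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # h = frozen path + low-rank trainable path, scaled by alpha / r
        return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scaling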

Fine-tuning using LoRA: A hands-on coding tutorial:

Without further ado, let's start our coding tutorial.

(I have provided a link to the Colab notebook at the end.)

  1. We start by installing all the dependencies:

    • bitsandbytes: Enhances neural network training efficiency, especially for NLP, through optimized operations like 8-bit optimizers that reduce memory usage.

    • datasets: Part of the Hugging Face ecosystem, this library simplifies accessing and handling a wide range of ML and NLP datasets.

    • accelerate: Streamlines the training and deployment of ML models across various hardware (CPUs, GPUs, TPUs), optimizing performance aspects like mixed precision.

    • loralib: the Python library for LoRA (we also install peft and transformers from source in the next cell, which provide the LoRA integration we actually use).

!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git
  2. Next we import all the necessary modules and load the bigscience/bloom-560m model in float16 precision, along with its tokenizer:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    torch_dtype=torch.float16,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")
  3. Once the model is loaded, we print the model summary to see which layers are present in the model:

print(model)
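For example, to confirm the module name we will target with LoRA later, you can list the modules whose names contain "query_key_value" (for BLOOM these sit inside the self-attention blocks):

# list module names containing the attention projection we will adapt with LoRA
print([name for name, _ in model.named_modules() if "query_key_value" in name][:3])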
  4. In this step we freeze the model's original parameters by setting requires_grad to False, and cast the small parameters (e.g. LayerNorm weights) to float32 for stability. Since we loaded the model weights in float16 precision, we also cast the output of the language-model head (lm_head) back to float32 for numerical stability, as these outputs are used for the loss calculation. We also enable gradient checkpointing to reduce the number of stored activations:

for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
  5. Next is a helper function that prints how many parameters are trainable, as a fraction of the total number of parameters:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
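If you call it right now, before adding any LoRA adapters, it should report zero trainable parameters, since we froze everything in the previous step:

print_trainable_parameters(model)   # expect: trainable params: 0 at this point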
  6. Now this is the most important step, where we specify the LoRA configuration, choose which LLM modules (weights) to fine-tune, and wrap the base model with get_peft_model. Here we target "query_key_value", i.e. the attention blocks of the LLM (you can pick the module name from the model summary we printed earlier):

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)  # wrap the base model and inject the LoRA adapters
print_trainable_parameters(model)      # only the adapter weights are trainable now
  7. Next we load the "squad_v2" dataset, i.e. the Stanford Question Answering Dataset (v2):

from datasets import load_dataset

qa_dataset = load_dataset("squad_v2")
  8. Now, this is a crucial step where we define a prompt template, fill it with the squad_v2 examples, and feed the result to our LLM for fine-tuning. Here is how the dataset is formatted (it's up to us how we define the format; it just needs to be consistent between fine-tuning and inference):

    ### CONTEXT
    {context}
    
    ### QUESTION
    {question}
    
    ### ANSWER
    {answer}</s>
def create_prompt(context, question, answer):
  if len(answer["text"]) < 1:
    answer = "Cannot Find Answer"
  else:
    answer = answer["text"][0]
  prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"
  return prompt_template

mapped_qa_dataset = qa_dataset.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))
  9. And we train it just like a regular transformer model:

import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
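Because only the LoRA adapter weights were trained, the checkpoint we care about is tiny. If you prefer not to push to the Hub, you can also save just the adapter locally with peft's save_pretrained (the directory name below is only an example):

model.save_pretrained("bloom-560m-qa-lora")  # writes only the small adapter files (adapter_config.json + adapter weights)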
  10. Push the model to the Hugging Face Hub 🤗. The interesting thing here is that we only need to push ∆W, the adapter weights, so the upload takes no time:

HUGGING_FACE_USER_NAME = "pachaar"

model_name = "bloom-560m-qa"

model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)
  11. Let's fetch the model back from the HF 🤗 Hub and load it:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=False, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
qa_model = PeftModel.from_pretrained(model, peft_model_id)
  12. Finally, we run inference with our fine-tuned model:

from IPython.display import display, Markdown

def make_inference(context, question):
  batch = tokenizer(f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n", return_tensors='pt')

  device = next(qa_model.parameters()).device
  batch = {k: v.to(device) for k, v in batch.items()}

  with torch.cuda.amp.autocast():
    output_tokens = qa_model.generate(**batch, max_new_tokens=200)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))
context = "Nadal is the best Tennis player of all time, the GOAT."

question = "Who is the GOAT of Tennis?"

make_inference(context, question)

Here's what it outputs: Rafael Nadal. We just biased our LLM! 😉
This was just a simple example; you can try out more complex ones yourself.

A big shout-out to AbacusAI for supporting my work.

The world's first end-to-end ML and LLM Ops platform where AI, not humans, build end-to-end Applied AI agents and systems at scale.

Check this out: https://abacus.ai/

Here’s the colab Notebook!

Next time, we discuss QLoRA: efficient fine-tuning of quantized LLMs!

See you! 🙂 

References:

  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", 2021. https://arxiv.org/abs/2106.09685
