LoRA Fine-Tuning LLMs for Text Classification

Background

At SUSE, we build enterprise Linux distributions used around the world. Our success is deeply rooted in the Open Source community and we leverage thousands of Open Source projects to build our offerings. While this collaborative approach fosters innovation, it also presents significant legal and compliance challenges. Navigating the complex landscape of licenses, intellectual property rights, and export restrictions requires careful attention and proactive measures.

To address these challenges, we developed Cavil: an in-house tool designed to automatically scan source code for potential legal pitfalls. Cavil’s comprehensive scans cover a wide range of considerations, including license compliance (identifying the specific licenses governing each component), intellectual property concerns, export restrictions, and more. By proactively identifying these aspects early on, we can protect both SUSE and our customers from unexpected legal issues down the line.

Evolving Cavil

Currently, Cavil uses Convolutional Neural Networks (CNNs) for legal text classification, determining whether a given code snippet or documentation file contains legally relevant information (e.g., license headers). While effective, CNNs require frequent retraining to maintain accuracy as our codebase and the Open Source landscape evolve. This retraining process can be resource-intensive, demanding significant computational power and time.

To improve Cavil’s capabilities and reduce the need for constant retraining, we’re exploring the use of Large Language Models (LLMs). LLMs demonstrate a remarkable ability to understand context and nuance in natural language, making them well-suited for classifying complex legal text with greater accuracy. However, simply deploying an off-the-shelf LLM isn’t enough; it needs to be tailored to our specific use case, identifying legally relevant information within source code.

Introducing LoRA

This is where Low-Rank Adaptation (LoRA) comes into play. LLMs are massive, containing billions of parameters. Full fine-tuning (adjusting all those parameters) is computationally expensive and requires a vast amount of training data. LoRA offers a more efficient alternative. It freezes the original LLM’s weights and introduces a small number of trainable parameters (adapters) that learn to adapt the model’s behavior for our specific task. This significantly reduces the computational cost and data requirements while still achieving impressive results. Think of it as customizing an existing, powerful tool rather than building one from scratch.
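
To make this concrete, here is a minimal, purely illustrative PyTorch sketch of the idea: a frozen linear layer gains a small trainable low-rank update. torchtune, which we use later in this post, provides the real LoRA implementation, so this is only meant to build intuition:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        # Only these two small matrices are trained
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank adapter path
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)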

The Cavil Dataset

To train our LoRA adapters, we need a labeled dataset: examples of code snippets or documentation marked as either containing legal text (“yes”) or not (“no”). Cavil’s architecture is designed to facilitate this process. The AI components are deployed as containerized microservices that communicate with the main application via a simple HTTP API:

$ curl --data '# SPDX-License-Identifier: GPL-2.0-only' http://127.0.0.1:5000
{"license": true, "confidence": 87.8}

This minimal API clearly defines the data requirements for training: input snippets and corresponding “yes” or “no” labels. We’ve built features into Cavil to collect this labeled data in two ways: when a human legal reviewer makes a decision, and when they correct an AI-generated classification.

[
  {
    "snippet": "# SPDX-License-Identifier: MIT",
    "is_legal_text": true
  },
  {
    "snippet": "const foo = 123;",
    "is_legal_text": false
  },
  ...
]

Our curated dataset, containing 150,000 samples, is Open Source and publicly available for download from HuggingFace.
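
For experimentation, the dataset can be pulled in with the HuggingFace datasets library. A minimal sketch (the split name and column layout on HuggingFace may differ slightly from the example above):

from datasets import load_dataset

# Download the Cavil legal-text dataset from HuggingFace
dataset = load_dataset("openSUSE/cavil-legal-text", split="train")
print(len(dataset), dataset[0])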

Fine-tuning Workflow

The fine-tuning process follows a standard workflow, starting with dataset preparation. We transform our data into the Alpaca format, a widely supported structure for instruction tuning LLMs. This format consists of three key components:

  1. instruction: A prompt guiding the model’s task.
  2. input: The code snippet or documentation to be analyzed.
  3. output: The expected “yes” or “no” answer.

Here’s an example:

[
  {
    "instruction": "Analyze the code or documentation snippet enclosed in [CODE] and [/CODE] tokens to determine if it contains legal text that was written with the intention of describing how the code should be used. Answer only with yes or no.",
    "input": "[CODE]# SPDX-License-Identifier: MIT[/CODE]",
    "output": "yes"
  },
  {
    "instruction": "Analyze the code or documentation snippet enclosed in [CODE] and [/CODE] tokens to determine if it contains legal text that was written with the intention of describing how the code should be used. Answer only with yes or no.",
    "input": "[CODE]const foo = 123;[/CODE]",
    "output": "no"
  },
  ...
]

We repeat the same prompt with different inputs and outputs to create a large training set.
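
Conceptually, the conversion is a simple mapping from our labeled records to Alpaca records. Here is a simplified Python sketch of that mapping; the actual conversion is handled by convert.py from the repository introduced below, and the file names here are only placeholders:

import json

PROMPT = (
    "Analyze the code or documentation snippet enclosed in [CODE] and [/CODE] tokens "
    "to determine if it contains legal text that was written with the intention of "
    "describing how the code should be used. Answer only with yes or no."
)

def to_alpaca(samples):
    # Map each labeled Cavil sample to an Alpaca-style instruction record
    return [
        {
            "instruction": PROMPT,
            "input": f"[CODE]{sample['snippet']}[/CODE]",
            "output": "yes" if sample["is_legal_text"] else "no",
        }
        for sample in samples
    ]

# "cavil_samples.json" is a hypothetical file holding records like the ones shown earlier
with open("cavil_samples.json") as f:
    alpaca = to_alpaca(json.load(f))
with open("alpaca_dataset.json", "w") as f:
    json.dump(alpaca, f, indent=2)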

torchtune, a powerful fine-tuning tool we will be using, can download datasets in Alpaca format directly from HuggingFace. Our converted dataset is available there as well.

The fine-tuning process involves several steps, including setting up a virtual environment and preparing validation datasets. Here’s a simplified overview:

# Verify GPU availability and configuration
$ nvidia-smi

# Clone repository with validation scripts and config files
$ git clone https://github.com/kraih/llm-lawyer.git
$ cd llm-lawyer

# Install dependencies
$ python -m venv .venv
$ ./.venv/bin/python -m pip install -r requirements.txt

A smaller validation dataset is created to assess the model’s performance:

# Use HF to download full dataset
$ ./.venv/bin/huggingface-cli download --repo-type dataset openSUSE/cavil-legal-text --local-dir /home/sles/llm-lawyer/cavil-legal-text

# Prepare validation dataset with 500 good and bad samples
$ ./.venv/bin/python revert.py -i cavil-legal-text/legal_text.jsonl -o legaldb-ml-data
$ ./.venv/bin/python convert.py -i legaldb-ml-data -o validation.jsonl -f datasets -l 500

# Download a model and validate its base accuracy
$ ./.venv/bin/huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir /home/sles/Llama-3.2-3B-Instruct
$ ./.venv/bin/python test.py -i validation.jsonl -m /home/sles/Llama-3.2-3B-Instruct

For best results, use a manually curated validation dataset covering a wide range of samples that the model has not seen during training.
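
Under the hood, validation simply means prompting the model with every sample and comparing its yes/no answer to the label. The real logic lives in test.py; a simplified sketch using the transformers library, assuming validation.jsonl carries Alpaca-style records, might look like this:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/home/sles/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto", torch_dtype=torch.float16)

correct = total = 0
with open("validation.jsonl") as f:
    for line in f:
        sample = json.loads(line)  # assumed to carry the Alpaca-style fields shown earlier
        prompt = f"{sample['instruction']}\n{sample['input']}\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generated = model.generate(**inputs, max_new_tokens=3, do_sample=False)
        answer = tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(answer.strip().lower().startswith(sample["output"]))
        total += 1

print(f"Accuracy: {correct / total:.1%}")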

Key LoRA Hyperparameters

Fine-tuning involves adjusting several key hyperparameters, including the learning rate, batch size, weight decay, and LoRA adapter sizes. These parameters significantly impact training speed, model accuracy, and memory footprint. We experimented with a range of values to optimize for our specific use case. Here’s a summary of what we found:

  1. Learning Rate: Controls the step size taken when updating the LoRA adapter weights, often higher than in full fine-tuning due to fewer trainable parameters. We observed that using a learning rate between 1e-4 and 5e-5 often yielded the best results.
  2. Batch Size: A batch size of 8 to 64 worked well on our hardware. Larger batch sizes can accelerate training but require more GPU memory. We found that smaller batches often led to better generalization performance.
  3. LoRA Rank (r): Determines the dimensionality of the LoRA adapters. We tested values between 8 and 64. Higher ranks generally increase model capacity, potentially improving accuracy but also increasing the number of trainable parameters (see the sketch below).
  4. LoRA Alpha (α): Scaling factor that adjusts how much the LoRA adapters influence the original frozen LLM’s output during training. We followed the common practice and set α = 2r.
  5. Weight Decay: Regularization technique that penalizes large weights, preventing overfitting. Not required, but most models in our experiment responded well to values between 0.01 and 0.1.
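
To get a feeling for how small these adapters really are, a quick back-of-the-envelope calculation helps: for a single weight matrix of shape d_out × d_in, a rank-r adapter adds r × (d_in + d_out) trainable parameters. A tiny sketch with a hypothetical 4096 × 4096 projection matrix:

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A rank-r adapter adds two matrices: A (r x d_in) and B (d_out x r)
    return r * (d_in + d_out)

# Example: a hypothetical 4096 x 4096 projection matrix (~16.8M frozen weights)
for r in (8, 16, 32, 64):
    print(f"rank {r:2}: {lora_params(4096, 4096, r):,} trainable parameters")
# Even at rank 64 this is only about 0.5M parameters, roughly 3% of the frozen matrix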

The torchtune configuration files used in our successful runs are included with the validation scripts. Depending on available hardware, a single fine-tuning run can take anywhere from a few hours to several days:

# LoRA fine-tune Llama-3.2-3B-Instruct
$ ./.venv/bin/tune run lora_finetune_single_device --config experiment2/torchtune-llama-3.2-3b-instruct.yaml

# Validate accuracy of fine-tuned model snapshot
$ ./.venv/bin/python test.py -i validation.jsonl -m /home/sles/torchtune/llama-3.2-3b-instruct/lora_single_device/epoch_0/

We are actively exploring techniques like gradient accumulation and mixed precision training to further accelerate this process.

Results

Our goal for this round was to determine if smaller LLMs (1-4 billion parameters) could effectively handle our legal text classification task. Previous experiments demonstrated that LoRA fine-tuning has the potential to significantly improve accuracy, while quantization offers a path to further reduced model size. We prioritized models with licenses permitting commercial use, even if they weren’t strictly Open Source.

The following table summarizes our key findings:

Model | Accuracy | Size (GB) | License
Llama-3.2-1B-Instruct (Baseline) | 53% | 2.9 | Llama
Llama-3.2-1B-Instruct + LoRA (FP16) | 92% | 2.9 | Llama
Llama-3.2-1B-Instruct + LoRA (Q8) | 91% | 2.0 | Llama
Llama-3.2-1B-Instruct + LoRA (Q4) | 73% | 1.6 | Llama
Llama-3.2-3B-Instruct (Baseline) | 68% | 6.9 | Llama
Llama-3.2-3B-Instruct + LoRA (FP16) | 95% | 6.9 | Llama
Llama-3.2-3B-Instruct + LoRA (Q8) | 95% | 4.2 | Llama
Llama-3.2-3B-Instruct + LoRA (Q4) | 93% | 3.1 | Llama
Qwen-2.5-1.5B-Instruct (Baseline) | 64% | 3.7 | Apache-2.0
Qwen-2.5-1.5B-Instruct + LoRA (FP16) | 92% | 3.7 | Apache-2.0
Qwen-2.5-1.5B-Instruct + LoRA (Q8) | 92% | 2.4 | Apache-2.0
Qwen-2.5-1.5B-Instruct + LoRA (Q4) | 92% | 1.7 | Apache-2.0
Qwen-2.5-Coder-1.5B-Instruct (Baseline) | 46% | 3.7 | Apache-2.0
Qwen-2.5-Coder-1.5B-Instruct + LoRA (FP16) | 94% | 3.7 | Apache-2.0
Qwen-2.5-Coder-1.5B-Instruct + LoRA (Q8) | 93% | 2.4 | Apache-2.0
Qwen-2.5-Coder-1.5B-Instruct + LoRA (Q4) | 89% | 1.7 | Apache-2.0
Phi-3-mini-4k-instruct (Baseline) | 75% | 9.8 | MIT
Phi-3-mini-4k-instruct + LoRA (FP16) | 93% | 9.8 | MIT
Phi-3-mini-4k-instruct + LoRA (Q8) | 93% | 6.7 | MIT
Phi-3-mini-4k-instruct + LoRA (Q4) | 91% | 4.4 | MIT
Gemma-2-2b-it (Baseline) | 64% | 5.9 | Gemma
Gemma-2-2b-it + LoRA (FP16) | 78% | 5.9 | Gemma
Gemma-2-2b-it + LoRA (Q8) | 78% | 4.1 | Gemma
Gemma-2-2b-it + LoRA (Q4) | 76% | 3.1 | Gemma

Fine-tuning with LoRA yielded a substantial accuracy increase compared to the baseline, validating the use of smaller models for this task. Quantization to 8-bit provided a good balance between size reduction and performance, while quantization to 4-bit caused an unacceptable drop in accuracy for some models.

We are especially happy to see Qwen-2.5-Coder-1.5B-Instruct, a model with an OSI-approved Open Source license, reach the best size-to-performance ratio.

Looking Ahead

Our LoRA fine-tuning experiments demonstrate the potential to significantly improve legal compliance automation, opening doors for wider applications. We believe this approach democratizes access to sophisticated legal expertise, empowering developers and organizations to navigate the complexities of Open Source compliance with greater confidence and efficiency.

Sebastian Riedel, Master Software Engineer at SUSE