Fine-tuning Llama3-8B with SFT and ORPO

By: Kevin Vegda
Posted: April 20, 2024

Now that Llama 3 is out, I went straight to instruction-tuning it to see what kind of performance I could get. Along the way, I tried a new alignment method called ORPO.

Like most “open-weights” models from the big AI companies, Llama 3 comes with some of its training details published but not much about the data it was trained on (other than the fact that it was trained on 15T tokens, as opposed to the now-measly 2T of Llama 2).

The benchmarks look good as well - Llama 3 8B matches or beats Llama 2 13B, and even Llama 2 70B in some cases, which is crazy! This reminds me of a blog post I recently read (I can’t find the link, I’ll put it here if I find it) about the maximum amount of information you can pack into each weight of a neural network (around 2 bits/weight), and why quantization below 8-bit gives worse performance.

ORPO (Odds Ratio Preference Optimization) is an alignment method that lets you fine-tune a pretrained model on human preferences without a reward model (kinda like DPO), but also without a separate SFT stage. The authors do this by folding an odds-ratio-based contrastive term into the standard fine-tuning loss.
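To make that a bit more concrete, here’s a minimal sketch of the objective as I understand it from the paper’s description (this is my own code, not the reference implementation), assuming you already have average per-token log-probabilities for the chosen and rejected responses:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, beta=0.1):
    """Sketch of the ORPO objective: the usual SFT loss on the chosen
    response plus an odds-ratio penalty that pushes the model to prefer
    the chosen completion over the rejected one.

    chosen_logps / rejected_logps: average per-token log-probabilities
    of the chosen and rejected responses under the policy model.
    chosen_nll: the standard SFT negative log-likelihood on the chosen response.
    beta: weight of the odds-ratio term (lambda in the paper).
    """
    # log-odds of each response: log(p / (1 - p)), computed in log space
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # odds-ratio loss: -log sigmoid of the log-odds ratio
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # total loss = SFT loss on the chosen response + weighted odds-ratio term
    return chosen_nll + beta * or_loss.mean()
```

The appealing part is that everything comes from the policy model itself - there’s no frozen reference model or separate reward model to keep in memory.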

For a start, I instruction-tuned (SFT) Llama 3 8B on 52k samples of Alpaca data generated with GPT-4. Here are the results and here is the model. The results look great compared to the base model, but not compared to Meta’s instruction-tuned variant. This could be because SFT sometimes aligns models in a way that makes them produce more of the outputs humans reject as well as the ones they select, and Meta obviously has far more alignment data and compute than we do. ORPO is an (alleged) solution to this, so right now I’m trying to see how much better an ORPO-trained Llama3-8B can do.
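For reference, the SFT run looked roughly like the sketch below, using TRL’s SFTTrainer. The dataset id and hyperparameters here are illustrative placeholders rather than my exact settings:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# ~52k Alpaca-style (instruction, input, output) samples; swap in whichever
# GPT-4 Alpaca dataset you're using (this id is a placeholder)
dataset = load_dataset("your-org/alpaca-gpt4-52k", split="train")

def to_text(example):
    # Flatten each sample into a single prompt + response string
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    return {"text": prompt + f"### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama3-8b-alpaca-sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        num_train_epochs=3,
        bf16=True,
    ),
)
trainer.train()
```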

Eval

The base model is already quite good; on similar evals, our SFT model improves most of the scores, but it still doesn’t match Meta’s Instruct variant (as expected - they have far more alignment data and compute than we do). We could probably close some of that gap with alignment techniques like DPO, or mixed SFT-and-alignment techniques like ORPO, as sketched below.
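Recent versions of TRL ship an ORPOTrainer that works directly on preference pairs, so the ORPO run should look something like this sketch. Again, the dataset id and hyperparameters are placeholders, not numbers from an actual run:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# ORPO needs preference pairs: "prompt", "chosen", and "rejected" text columns.
# Substitute whatever preference dataset you're using (this id is a placeholder).
dataset = load_dataset("your-org/your-preference-dataset", split="train")

config = ORPOConfig(
    output_dir="llama3-8b-orpo",
    beta=0.1,                      # weight of the odds-ratio term
    max_length=2048,
    max_prompt_length=1024,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Since ORPO folds the preference signal into the fine-tuning loss itself, this single run replaces the usual SFT-then-DPO pipeline, which is exactly what I want to test against the SFT-only model above.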