Reinforcement learning from human feedback (RLHF) is a method for fine-tuning a model to better reflect human preferences. Abstractly, it consists of two parts: the target model we want to fine-tune and a preference model trained to produce a reward from the target model's inputs and outputs.

The preference model should accurately reflect human preferences, assigning high scalars to good outputs and low scalars to bad ones. To train this model, we commonly follow an Elo-style comparison scheme: given an input and two outputs (with a label of which one is better), update the preference model to assign a higher scalar to the better output.
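As a minimal sketch of this comparison update (assuming PyTorch; `preference_model`, the feature tensors, and the hyperparameters below are illustrative stand-ins, not anything from the original):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_better: torch.Tensor, score_worse: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss: maximize the modeled probability that the
    human-preferred output receives the higher scalar score."""
    # sigmoid(score_better - score_worse) is the probability that the
    # labeled-better output beats the labeled-worse one.
    return -F.logsigmoid(score_better - score_worse).mean()

# Stand-in preference model: a tiny MLP over some fixed-size encoding of an
# (input, output) pair; a real preference model would be far larger.
preference_model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
optimizer = torch.optim.Adam(preference_model.parameters(), lr=1e-3)

features_better = torch.randn(8, 16)  # encodings of (input, better output)
features_worse = torch.randn(8, 16)   # encodings of (input, worse output)

loss = preference_loss(preference_model(features_better),
                       preference_model(features_worse))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Minimizing this loss pushes the scalar for the preferred output above the scalar for the rejected one, which is exactly the comparison update described above.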

Then, we fine-tune the target model using ♟️ Reinforcement Learning algorithms (like 📪 Proximal Policy Optimization), updating its weights by treating the preference model's output as our reward. To prevent the target model from "gaming" the reward by exploiting unintended outputs, we can also incorporate a divergence loss that keeps the fine-tuned output close to the original output; this gives us a combined reward function.
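A common form of this combined reward (notation assumed here rather than taken from the original: $r_\theta$ is the preference model's score, $\pi^{\text{RL}}$ is the model being fine-tuned, $\pi^{\text{orig}}$ is the original model, and $\beta$ controls the strength of the penalty) is

$$
r(x, y) = r_\theta(x, y) - \beta \, \log \frac{\pi^{\text{RL}}(y \mid x)}{\pi^{\text{orig}}(y \mid x)}
$$

The log-ratio term penalizes outputs that the fine-tuned model assigns much higher probability than the original model would, keeping the policy from drifting too far from its starting point.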

LLMs

One of the most impactful applications of RLHF is in training 🎤 Large Language Models. Both the target and preference models are usually initialized from a pre-trained language model; the input is the prompt, and the output is the generated text.
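As a rough sketch of the LLM setup (assuming the Hugging Face transformers library, with gpt2 as a stand-in base model; in practice the reward head would be trained on comparison data as described above rather than left randomly initialized):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A pre-trained language model with a single-output head acts as the
# preference (reward) model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.eos_token_id

prompt = "Explain RLHF in one sentence."
response = "RLHF fine-tunes a model against a learned reward trained on human preferences."

# The preference model reads the prompt plus the generated text and
# returns a single scalar used as the RL reward.
inputs = tokenizer(prompt + " " + response, return_tensors="pt")
reward = reward_model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```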