- What it is
- Why it matters
- How to do it
- Stage 1. Supervised fine-tuning
- Stage 2. Reward modeling
- Stage 3. Policy updating
- Reward hacking
- Alternative alignment approaches
- References
What it is
Alignment in AI describes the actions Large Language Model (LLM) developers take to help models that were optimized for next-token prediction behave more in line with human expectations. This might mean producing less toxic content and fewer factually incorrect statements. Alignment efforts usually come with a tax in the form of reduced predictive accuracy, but in return we get models that are better aligned with human values.
Alignment is important enough to attract mainstream media attention. In this Financial Times interview, Geoffrey Hinton discusses the existential threat posed by the prospect of AI superintelligence. This Guardian article discusses the lengths model providers go to in order to ensure their models are aligned, including collaborations between OpenAI and Anthropic in which they scrutinise each other’s models and reach concerning conclusions.
Why it matters
While this topic is usually the domain of the big AI labs, there are two reasons psychometrics practitioners should be aware of how alignment works.
First, alignment has psychometric applications that could work in modest-resource contexts, such as bias mitigation in the scoring of constructed responses. Human feedback might highlight that the intentions behind constructed responses to a situational judgment test are the same, even though majority and minority groups use different language to express them.
Second, psychometric practitioners may be able to propose new alignment methods on smaller models that, if shown to be systematically effective at small scale, would be worth scaling to larger models with hundreds of billions of parameters, despite not reaching commercial grade themselves.
How to do it
One of the most common approaches in the LLM context is Reinforcement Learning from Human Feedback (RLHF), adapted to LLMs by Ouyang et al. (2022). The aim is to align LLM outputs with human preferences by altering model weights in response to human feedback, using a process that scales efficiently.
At the same time, we want to minimise the divergence between the updated model’s output token distribution and that of the supervised fine-tuned (SFT) reference model, so that it still predicts tokens accurately. The backbone of RLHF involves three stages: supervised fine-tuning, reward modelling, and policy updating.
Stage 1. Supervised fine-tuning
SFT starts with a pre-trained LLM and continues training from that non-zero starting point. The training inputs are (prompt, response) tuples, trained with cross-entropy loss; the earlier Seedling case study shows this loss in the pre-training process.
The data used for SFT often comes from specific domains and should contain high-quality prompt-response pairs. This stage is important: it moves the LLM from a fluent text generator to one that can also follow prompts, giving responses that are in the right ballpark.
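To make the loss concrete, here is a minimal SFT sketch assuming PyTorch and the Hugging Face transformers library, with a small public checkpoint ("gpt2") standing in for the pre-trained model and an invented prompt-response pair. Prompt tokens are masked so that cross-entropy is computed only on the response.

```python
# Minimal SFT sketch. Assumptions: PyTorch + Hugging Face transformers,
# "gpt2" as a stand-in for the pre-trained model, invented example data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarise the candidate's response in one sentence:"
response = " The candidate prioritised the customer's safety over short-term sales."

# Tokenise the prompt alone and the concatenated (prompt, response) pair.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Cross-entropy is computed only on the response tokens;
# prompt positions are masked out with the ignore index -100.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)
loss = outputs.loss   # cross-entropy over the response tokens
loss.backward()       # one gradient step of supervised fine-tuning
```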
Stage 2. Reward modeling
In an ideal world, we would have unlimited examples of responses that humans judge to be strong, which we could use to fine-tune the LLM. In practice, we rarely have enough examples for fine-tuning alone to align the model to human preferences. In addition, using humans to evaluate LLM outputs directly is too costly to scale economically.
A clever workaround is to build and apply a reward model to update the LLM parameters. We get the LLM (now referred to as the ‘Policy’) to generate pairs of responses to a set of AI- or human-generated prompts (the process generalises to more responses per prompt). Humans are then asked to rank their preferences between the options. These preferences are the ‘Rewards’ for the different responses to a given prompt.
Next, a model is built to predict human preferences for newly generated prompt-response pairs whose human preferences are unknown. There are two important parts to the reward model: building and freezing its parameters, and applying it in subsequent updates to the LLM parameters (i.e., the ‘Policy’). The inputs for the reward model build are tuples consisting of a prompt, two responses, and a binary preference indicator.
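As an illustration of that input format, the snippet below shows a hypothetical preference record; the field names and data are invented for the example.

```python
# Illustrative preference records for reward-model training (hypothetical data).
# Each record holds a prompt, two candidate responses, and a binary indicator
# marking which response the human rater preferred (1 = first, 0 = second).
preference_data = [
    {
        "prompt": "Describe what you would do if a colleague missed a deadline.",
        "response_a": "I would speak with them privately and offer to help.",
        "response_b": "I would report them to management immediately.",
        "a_preferred": 1,
    },
    # ... more (prompt, response_a, response_b, preference) records
]
```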
The reward model is a neural network, often based on the reference LLM with a special head attached. It takes a prompt-response pair as input and produces a scalar reward as output. It is trained with a pairwise loss based on the Bradley-Terry model, so that it assigns higher values to human-preferred responses.
The reward model processes each prompt-response pair as a sequence of hundreds of token embeddings that are passed through transformer layers before outputting a single scalar reward score. The reward model is then ‘frozen’ so it can act as a stable proxy for human preferences; without freezing, the optimisation target would drift.
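The sketch below illustrates the scalar head, the Bradley-Terry pairwise loss, and the freezing step, assuming PyTorch; random tensors stand in for the pooled transformer representations that a real reward model would compute.

```python
# Minimal reward-model sketch. Assumptions: PyTorch; random tensors stand in
# for the transformer representations of each prompt-response pair.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps a pooled sequence representation to a single scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        return self.value_head(pooled_hidden).squeeze(-1)

reward_model = RewardHead()

# Stand-ins for representations of (prompt, preferred) and (prompt, non-preferred) pairs.
h_chosen = torch.randn(4, 768)
h_rejected = torch.randn(4, 768)

r_chosen = reward_model(h_chosen)
r_rejected = reward_model(h_rejected)

# Bradley-Terry pairwise loss: push the preferred response's reward above the other's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()

# After training, the reward model is frozen so it acts as a stable preference proxy.
for p in reward_model.parameters():
    p.requires_grad_(False)
```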
Stage 3. Policy updating
When a new response is generated, its predicted reward determines how the prompt-response pair is used to adjust the model weights. The most common method is Proximal Policy Optimization (PPO). PPO updates increase the log-probability of high-reward responses and decrease it for low-reward ones.
The inputs are tuples consisting of a prompt, a response, and a reward, with the reward coming from the frozen reward model, plus a constraint that prevents the update from moving the policy (the LLM parameters) too far from the original baseline LLM (the ‘reference policy’).
The constraint is based on Kullback-Leibler divergence, which measures the difference between the token output distributions before and after policy updates. This prevents the model from aggressively optimizing on the reward and losing its ability to predict tokens. Repeated updates shift the policy distribution toward responses that align with human preferences.
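The toy example below shows the shape of such an update, assuming PyTorch. The log-probabilities and rewards are illustrative numbers rather than outputs of a real policy and reward model, the KL-style penalty is folded into a simple advantage estimate, and the clipped surrogate loss is PPO’s mechanism for keeping each update small.

```python
# Toy PPO-style policy update. Assumptions: PyTorch; the log-probabilities and
# rewards are illustrative stand-ins, not outputs of a real LLM and reward model.
import torch

# Per-response log-probabilities under the current policy, the policy that
# generated the responses (old policy), and the frozen reference (SFT) model.
logp_policy = torch.tensor([-12.1, -9.8, -15.3], requires_grad=True)
logp_old = torch.tensor([-12.4, -9.5, -15.0])
logp_ref = torch.tensor([-11.9, -10.2, -14.8])

rewards = torch.tensor([0.9, -0.2, 0.4])  # from the frozen reward model
beta, clip_eps = 0.1, 0.2

# KL-style penalty keeps the policy's output distribution close to the reference model.
kl_penalty = beta * (logp_policy - logp_ref)
advantages = (rewards - kl_penalty).detach()  # simple advantage estimate for illustration

# PPO clipped surrogate objective: raise the log-probability of high-reward
# responses, but clip the update so the policy does not move too far in one step.
ratio = torch.exp(logp_policy - logp_old)
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
loss.backward()
```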
Reward hacking
Alignment is considered extremely important because of the potentially existential implications of AI systems behaving in ways misaligned with human values (Bostrom, 2014; Amodei et al., 2016). Reward hacking refers to the concern that AI systems will learn to exploit the reward function, or to secure the approval of humans in the RLHF loop, in ways that do not involve producing genuinely harmless outputs.
The model might discover, for instance, that sycophantic responses secure high human approval ratings regardless of the content of their messaging, and focus on flattering the human raters. In other words, the LLM learns to exploit human fallibilities instead of aligning its output to human values as intended. As the complexity and capability of models grow, the possibility that the reward model can be hacked increases dramatically.
The problems reward hacking causes mean that RLHF and related methods are only the beginning of what is required for AI alignment rather than the end of the road.
Alternative alignment approaches
As in many areas of psychometrics, possibilities abound within these three core stages. It is possible to use other preference scaling methods in the reward model, such as direct human ratings of response quality. It is also common to use smaller model variants in the construction of the reward model to balance accurate reward prediction against economic cost.
Emerging approaches replace the human feedback in RLHF with AI feedback, known as RLAIF. There are also alternative alignment methods altogether, such as Constitutional AI (Bai et al., 2022), where the LLM itself critiques its own responses against a set of criteria (the ‘Constitution’), and Direct Preference Optimization (DPO), which encodes the reward in the model formulation directly, allowing you to skip creating a separate reward model altogether.
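To make the DPO point concrete, here is a minimal sketch of its loss, assuming PyTorch; the log-probabilities are illustrative scalars that would normally come from the trainable policy and a frozen reference model.

```python
# Minimal DPO loss sketch. Assumptions: PyTorch; illustrative log-probabilities
# standing in for policy and frozen reference model outputs.
import torch
import torch.nn.functional as F

beta = 0.1  # controls how strongly the policy is pulled toward the preferences

# Log-probabilities of the preferred ("chosen") and non-preferred ("rejected")
# responses under the trainable policy and the frozen reference model.
policy_chosen = torch.tensor([-10.2], requires_grad=True)
policy_rejected = torch.tensor([-11.0], requires_grad=True)
ref_chosen = torch.tensor([-10.5])
ref_rejected = torch.tensor([-10.8])

# DPO folds the reward into the loss directly: no separate reward model is trained.
logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
loss = -F.logsigmoid(logits).mean()
loss.backward()
```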
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., ... (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).