How LLMs learn (3 of 3): optimization

We have computed gradients for all learnable parameters in the network. During optimization, these gradients are used to update the model's parameters to reduce the loss on the training data. The goal at every layer is the same: nudge each parameter in the direction that decreases the loss.
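As a minimal sketch of this idea, a vanilla gradient-descent step subtracts the (learning-rate-scaled) gradient from each parameter matrix. The shapes and values below are illustrative toy numbers, not taken from the actual GPT-2 model:

```python
import numpy as np

# Toy example: one weight matrix and the gradient of the loss w.r.t. it.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # parameter matrix (illustrative shape)
grad_W = rng.normal(size=(4, 3))   # gradient dL/dW from backpropagation

lr = 0.01                          # learning rate
W_new = W - lr * grad_W            # one vanilla gradient-descent step

# The update leaves the parameter's shape unchanged.
assert W_new.shape == W.shape
```

The same rule is applied to every learnable matrix and bias vector in the network, each with its own gradient.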

Optimization applies an update rule based on the gradient, the learning rate, and potentially other factors such as momentum or adaptive per-parameter learning rates. The result at each layer is an updated parameter matrix with the same shape as the original weight matrix, now adjusted to better fit the training data.
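One widely used rule that combines momentum with adaptive learning rates is Adam. The sketch below shows a single Adam step; the function name, hyperparameter defaults, and input values are illustrative, and the code is a simplified stand-in rather than the exact optimizer used in any particular training run:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step for a single parameter array w."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # scaled parameter update
    return w, m, v

# Toy usage: two parameters, their gradient, and zero-initialized moments.
w = np.array([0.5, -0.3])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.2, -0.1])
w, m, v = adam_step(w, grad, m, v, t=1)
```

Because the second moment rescales each coordinate by its own gradient magnitude, the first step moves every parameter by roughly the learning rate, regardless of the raw gradient size.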

💡 Note: Here I highlight the steps and computations of the optimization (weight-update) procedure with a numerical example to make the ideas concrete. The approach is consistent with our simplified GPT-2 model. These results are preliminary and will be validated with a formal mathematical verifier in due course.
