With the forward pass complete, the model either generates new tokens during inference (no backpropagation) or begins learning (with backpropagation). During backpropagation, the final error from the forward pass output is traced back through the network to determine how each layer contributed to the loss.
At each layer, the chain rule is applied to express the loss gradient (the partial derivative of the loss with respect to activations or parameters) in terms of that layer’s inputs, outputs, and parameters. This produces the gradients needed for the final stage, optimisation, where the weights are updated.
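As a minimal sketch of the chain rule at a single layer (with assumed shapes and values, not the actual model from this walkthrough), the snippet below shows how the gradient arriving from the layers above is turned into a gradient for this layer's weights and a gradient for the layer below.

```python
# A minimal sketch: backpropagation through one linear layer y = x @ W.
# Chain rule: dL/dW = x^T @ dL/dy and dL/dx = dL/dy @ W^T.
# All values below are illustrative assumptions.
import numpy as np

x = np.array([[1.0, 2.0]])           # input activations, shape (1, 2)
W = np.array([[0.5, -0.3],
              [0.1,  0.8]])          # layer weights, shape (2, 2)

# Forward pass for this layer
y = x @ W                            # layer output, shape (1, 2)

# Gradient of the loss w.r.t. this layer's output, assumed to have already
# been traced back from the layers above (placeholder value).
dL_dy = np.array([[0.2, -0.4]])

# Chain rule: express the loss gradient via this layer's inputs and parameters.
dL_dW = x.T @ dL_dy                  # gradient w.r.t. the weights (for the optimiser)
dL_dx = dL_dy @ W.T                  # gradient passed down to the layer below

print(dL_dW)
print(dL_dx)
```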
Target specification
We must first define the target to calculate the loss. The target is the input sequence shifted one position forward, so that each token is trained to predict the next token in the sequence. For our toy example, the target is "rocks" for the first position (predicting the next token after "AI") and nothing for the second position, since our vocabulary does not define an explicit end-of-sequence (EOS) token. In matrix form, the target sequence is specified as follows.
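As an illustrative sketch of this shift-by-one construction, assuming token ids "AI" → 0 and "rocks" → 1 and a hypothetical ignore marker for the position with no target:

```python
# A sketch of target construction, assuming the ids {"AI": 0, "rocks": 1}.
vocab = {"AI": 0, "rocks": 1}
input_ids = [vocab["AI"], vocab["rocks"]]   # the toy input "AI rocks" -> [0, 1]

IGNORE_INDEX = -1   # assumed sentinel for positions with no target (no EOS token)

# Shift one position forward: each token's target is the token that follows it.
target_ids = input_ids[1:] + [IGNORE_INDEX]
print(target_ids)   # [1, -1]: the first position predicts "rocks", the second is skipped
```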
Loss function
With the target specified, we define the loss function: it quantifies how far the model’s predicted next-token probabilities are from the true targets. In autoregressive language modelling we use cross-entropy (negative log-likelihood). This is the negative log probability of the correct token at each position, averaged over positions with a target (positions without a target are skipped).
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i(y_i)$$

where $N$ is the number of target tokens, $y_i$ is the true token at position $i$, and $p_i(y_i)$ is the model's predicted probability for that token. With the target specified and the loss function defined, we are ready to calculate the loss and work through the backpropagation process for our numerical worked example.
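A minimal sketch of this loss in code, assuming per-position probability rows and an ignore marker (here -1) for positions without a target:

```python
# A sketch of cross-entropy (negative log-likelihood) as defined above.
# The ignore_index sentinel (-1) is an assumed convention for skipped positions.
import math

def cross_entropy_loss(probs, target_ids, ignore_index=-1):
    """probs: per-position probability distributions over the vocabulary.
    target_ids: true next-token id per position, or ignore_index if no target."""
    log_losses = [
        -math.log(p[t]) for p, t in zip(probs, target_ids) if t != ignore_index
    ]
    # Average the negative log probability of the correct token over the N
    # positions that actually have a target.
    return sum(log_losses) / len(log_losses)
```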
Worked example
We will now calculate the loss for our toy example by applying this definition to the model’s predicted probabilities and the specified target sequence. The softmax probabilities from the forward pass were as follows.
$P$ is the matrix of predicted probabilities, where each row corresponds to a position in the input sequence and each column corresponds to a token in the vocabulary. In our toy example, the first row gives the probabilities for predicting "AI" or "rocks" after the token "AI", and the second row gives the probabilities for predicting "AI" or "rocks" after the token "rocks". The target is $y_1 =$ "rocks"; the second position has no target.
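As a sketch of how the loss evaluates for this setup, the snippet below plugs placeholder probabilities into the definition above; the values of $P$ are illustrative assumptions, not the actual softmax outputs from the forward pass.

```python
# A sketch of the loss calculation, assuming placeholder softmax probabilities.
# Substitute the real matrix P from the forward pass to reproduce the actual loss.
import math

P = [
    [0.40, 0.60],  # row 1: P("AI"), P("rocks") after "AI"    (placeholder values)
    [0.70, 0.30],  # row 2: P("AI"), P("rocks") after "rocks" (no target, skipped)
]
targets = {0: 1}   # position 1 -> token id 1 ("rocks"); position 2 has no target

# Cross-entropy: average -log p of the correct token over the N target positions.
losses = [-math.log(P[pos][tok]) for pos, tok in targets.items()]
loss = sum(losses) / len(losses)
print(round(loss, 4))   # -log(0.60) ≈ 0.5108 with these placeholder probabilities
```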
Remainder of this section is coming soon.