In this section, we continue exploring the transformer training process, using large language models (LLMs) as an expository vehicle and building on the tokenization chapter. Here we discuss a worked example of the forward pass; the next chapters cover backpropagation and optimization (weight updating).
Our objective is to manually work through a full GPT-2 style block to see exactly how embeddings, multi-head attention, residuals, layer norms, the MLP, and the projection to logits interact, and then to calculate the loss, compute the gradients via backpropagation, and update the weights using optimization methods.
The forward pass happens during inference and as the first stage of each training step. It processes the text input through the network architecture to produce probabilities for the next tokens. During training, the targets are the input sequence shifted by one position, so that each input token is trained to predict the token that follows it.
Alert: Here I highlight the steps and computations of the forward pass procedure with a numerical example to make the ideas concrete. The approach is consistent with our simplified GPT-2 model. These results are preliminary and will be validated with a formal mathematical verifier in due course.
We use a sequence of three tokens: two input tokens (‘AI’ and ‘rocks’) plus an EOS token for proper token handling. The sizes are chosen so that LayerNorm and other stages do not collapse the gradient during backpropagation, which keeps the toy example both realistic and pedagogically useful. We use a single transformer layer, $d_m = 4$ token embedding dimensions, and $h = 2$ heads, and otherwise stay faithful to GPT-2.
We’ll use the smallest meaningful vocabulary and input sequence that demonstrates the workings of the transformer model while also including EOS handling.
- Vocab size V=3
- Tokens: "AI" → ID 0, "rocks" → ID 1, "<EOS>" → ID 2
- Input sequence: [AI, rocks, <EOS>] → [0, 1, 2]
- Target sequence: [rocks, <EOS>, <nothing>] → [1, 2, –]
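The mapping from tokens to IDs and the one-position shift between inputs and targets can be sketched in a few lines of plain Python. This is only an illustration: the names `vocab`, `IGNORE`, and `pairs` are assumptions introduced here, and `None` is a hypothetical stand-in for the "<nothing>" target at the last position.

```python
# Token-to-ID mapping from the example vocabulary (V = 3).
vocab = {"AI": 0, "rocks": 1, "<EOS>": 2}

tokens = ["AI", "rocks", "<EOS>"]
input_ids = [vocab[t] for t in tokens]

# Targets are the inputs shifted by one position: position i predicts token i + 1.
# IGNORE is a hypothetical placeholder meaning "no target" for the last position.
IGNORE = None
target_ids = input_ids[1:] + [IGNORE]

# Keep only the (input, target) pairs that actually have a target.
pairs = [(x, y) for x, y in zip(input_ids, target_ids) if y is not IGNORE]

print("inputs :", input_ids)
print("targets:", target_ids)
print("pairs  :", pairs)
```

Each pair reads as "given this token, predict that token", which is what the loss will later score.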
The model learns an embedding matrix $E \in \mathbb{R}^{V \times d_m} = \mathbb{R}^{3 \times 4}$, which is randomly initialized.
- $E$ is the embedding matrix (size $V \times d_m$).
- Each row corresponds to one token in the vocabulary.
- Multiplying a one-hot vector $e_t$ for token $t$ (length $V$) by $E$ selects the matching row; equivalently, the row can be selected by a direct lookup. Here we select all three rows, which are the dense vector representations of the tokens.
- In our example, we set the initial values (before learning) to small, zero-mean decimals. The first row is for token 0 (AI), the second for token 1 (rocks), and the third for token 2 (<EOS>):

$$E = \begin{bmatrix} 0.120 & -0.050 & 0.030 & 0.025 \\ -0.080 & 0.090 & -0.020 & -0.015 \\ 0.050 & 0.020 & -0.070 & 0.040 \end{bmatrix}$$
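As a minimal sketch of the lookup described above, the following plain-Python snippet checks that multiplying a one-hot vector by $E$ gives the same result as indexing the row directly. The values of `E` are the ones this worked example uses for the tied output projection; the helper names `one_hot` and `vec_mat` are illustrative and not part of GPT-2.

```python
# Embedding matrix E (V = 3 rows, d_m = 4 columns), as used in this worked example.
E = [
    [ 0.120, -0.050,  0.030,  0.025],   # token 0: "AI"
    [-0.080,  0.090, -0.020, -0.015],   # token 1: "rocks"
    [ 0.050,  0.020, -0.070,  0.040],   # token 2: "<EOS>"
]

def one_hot(token_id, vocab_size):
    """Return a length-V list with 1.0 at token_id and 0.0 elsewhere."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def vec_mat(v, M):
    """Row vector times matrix: (1 x V) @ (V x d_m) -> (1 x d_m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

input_ids = [0, 1, 2]

# Route 1: explicit one-hot multiplication.
via_one_hot = [vec_mat(one_hot(t, len(E)), E) for t in input_ids]
# Route 2: direct row lookup, which is what implementations do in practice.
via_lookup = [E[t] for t in input_ids]

for a, b in zip(via_one_hot, via_lookup):
    print(a, b)
```

The lookup avoids building the one-hot vectors at all, which matters once the vocabulary has tens of thousands of entries.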
The model learns a positional embedding matrix $P \in \mathbb{R}^{L \times d_m}$.
- $P$ is the positional embedding matrix (size $L \times d_m$).
- Here $L$ is the maximum sequence length the model can handle.
- Each row corresponds to a position in the input sequence. Again, we use small, zero-mean decimals in our example.
- Element-wise addition of the token and positional embeddings, $x_i = E[\text{token}_i] + P[\text{position}_i]$, creates the tensor that enters the first transformer block.
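The sum $x_i = E[\text{token}_i] + P[\text{position}_i]$ can be sketched as follows. The values of `E` are the ones this worked example uses for the tied output projection, but the values of `P` below are hypothetical placeholders chosen only to illustrate the mechanics; they are not the positional values of the worked example.

```python
# Token embedding matrix E from the worked example (V = 3, d_m = 4).
E = [
    [ 0.120, -0.050,  0.030,  0.025],   # token 0: "AI"
    [-0.080,  0.090, -0.020, -0.015],   # token 1: "rocks"
    [ 0.050,  0.020, -0.070,  0.040],   # token 2: "<EOS>"
]

# HYPOTHETICAL positional embeddings (L = 3, d_m = 4): small, roughly zero-mean
# values picked for illustration only.
P = [
    [ 0.010, -0.020,  0.015,  0.005],   # position 0
    [-0.015,  0.010,  0.020, -0.010],   # position 1
    [ 0.020,  0.005, -0.010,  0.015],   # position 2
]

input_ids = [0, 1, 2]

# x_i = E[token_i] + P[position_i], computed element by element.
X = [
    [e + p for e, p in zip(E[token], P[position])]
    for position, token in enumerate(input_ids)
]

for row in X:
    print([round(v, 3) for v in row])
```

Because the positional row depends only on the index in the sequence, the same token gets a different input vector at a different position.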
LayerNorm rescales each token vector to have mean 0 and variance 1. This prevents arbitrary embedding magnitudes from dominating downstream calculations because of differences in scale.
- Compute the mean and variance of each vector and normalize its elements.
- After normalization, each element $\hat{x}_{ij}$ is rescaled and shifted by learnable parameters $\gamma_j$ (scale) and $\beta_j$ (shift). They are initialized as $\gamma_j = 1$ and $\beta_j = 0$, so LayerNorm starts as pure normalization. During training, $\gamma_j$ can grow or shrink each dimension and $\beta_j$ can move it up or down, giving the model flexibility to adapt features for the next layers. A small constant $\epsilon$ is always included in the denominator to prevent division by zero when the variance happens to be exactly zero (PyTorch uses $\epsilon = 10^{-5}$). In our worked example it does not affect the numbers at the precision shown.
- For our worked example, we now numerically demonstrate standardization, scale, and shift to produce the input tensor for self-attention.
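The normalize, scale, and shift steps above can be sketched as a small function. This is a minimal illustration, not GPT-2's implementation: the input vector `x` is a hypothetical token vector, and the function name `layer_norm` is an assumption introduced here. It uses the biased variance (dividing by $d$) and $\epsilon = 10^{-5}$, matching the description above.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to mean 0 / variance 1, then scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d      # biased variance (divide by d)
    inv_std = 1.0 / math.sqrt(var + eps)           # eps guards against var == 0
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# HYPOTHETICAL input vector with d_m = 4, for illustration only.
x = [0.5, -0.5, 0.0, 0.0]
gamma = [1.0, 1.0, 1.0, 1.0]   # initial scale: pure normalization
beta = [0.0, 0.0, 0.0, 0.0]    # initial shift: none

y = layer_norm(x, gamma, beta)
print([round(v, 3) for v in y])

# Sanity check: the output has (almost exactly) zero mean and unit variance.
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(round(mean_y, 6), round(var_y, 4))
```

With $\gamma = 1$ and $\beta = 0$ the output depends only on the direction of the centered input, not on its overall magnitude.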
Self-attention transforms the normalized input into query, key, and value vectors, uses dot products of queries and keys to calculate attention scores, normalizes these scores with softmax to produce attention weights, and then applies these weights to the values to produce the attention output. In the full GPT-2 model, each linear projection (Q, K, V, and the output) also adds a learned bias vector with the same output dimension as its weight matrix. These biases are initialized to zero and added immediately after the projection, which lets each output feature shift its activation up or down during training. Because they are per-feature offsets that do not affect the core tensor mechanics, we omit them in this worked example for clarity.
- The query, key, and value matrices $W_Q, W_K, W_V \in \mathbb{R}^{d_m \times d_h}$ are learned parameter matrices for each head. They are initialized with small values drawn from a zero-mean normal distribution and updated during optimization following backpropagation.
- We set the model dimension to $d_m = 4$. With $h = 2$ heads, the head dimension is $d_h = d_m / h = 2$. This means that for each head the weight matrices are $W_Q \in \mathbb{R}^{4 \times 2}$, $W_K \in \mathbb{R}^{4 \times 2}$, and $W_V \in \mathbb{R}^{4 \times 2}$.
- $W_Q$, $W_K$, and $W_V$ are usually initialized from a normal distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma \sim 1/\sqrt{d_m}$, but for our worked example we instead use small hand-chosen values.
- Compute the linear projections for each head by multiplying the normalized input $Y$ by that head's weight matrices: $Q = Y W_Q$, $K = Y W_K$, and $V = Y W_V$. Head 2 is computed in the same way with its own weight matrices. For head 1, with

$$Y = \begin{bmatrix} 1.414 & -1.414 & 0.000 & 0.000 \\ -1.291 & 1.514 & -0.016 & -0.207 \\ 0.942 & 0.287 & -1.680 & 0.451 \end{bmatrix}$$

the projections are:

$$Q^{(1)} = Y W_Q^{(1)} = Y \begin{bmatrix} 0.120 & -0.050 \\ 0.030 & 0.025 \\ -0.020 & 0.015 \\ 0.010 & -0.030 \end{bmatrix} \approx \begin{bmatrix} 0.1273 & -0.1061 \\ -0.1113 & 0.1084 \\ 0.1598 & -0.0787 \end{bmatrix}$$

$$K^{(1)} = Y W_K^{(1)} = Y \begin{bmatrix} 0.050 & -0.020 \\ -0.010 & 0.040 \\ 0.030 & 0.010 \\ -0.020 & 0.025 \end{bmatrix} \approx \begin{bmatrix} 0.0848 & -0.0848 \\ -0.0760 & 0.0810 \\ -0.0152 & -0.0129 \end{bmatrix}$$

$$V^{(1)} = Y W_V^{(1)} = Y \begin{bmatrix} 0.010 & 0.020 \\ 0.030 & -0.010 \\ -0.020 & 0.015 \\ 0.025 & -0.030 \end{bmatrix} \approx \begin{bmatrix} -0.0283 & 0.0424 \\ 0.0277 & -0.0350 \\ 0.0629 & -0.0228 \end{bmatrix}$$
- Next, compute the attention scores for each head, $S = QK^\top / \sqrt{d_h}$, which measure the similarity between each query and key, indicating how much focus one token should place on another before masking and normalization.
- Adding the causal mask $M$ to $S$ sets the scores of future tokens to $-\infty$, ensuring that each token can only attend to itself and earlier tokens. The mask is an $L \times L$ matrix where each row represents a query position and each column represents a key position. Zero entries allow attention to the token itself and to past positions, and $-\infty$ entries prevent attention to future positions. With $L = 3$:

$$M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}, \qquad S_{\text{masked}} = S + M$$

- Applying the softmax function row-wise to the masked scores, $A = \mathrm{softmax}(S_{\text{masked}})$, converts each row into a probability distribution. This ensures that the attention weights are non-negative and sum to 1, so each token's output is a weighted average of the values it is allowed to attend to.
- Multiplying the attention weights $A^{(h)}$ by the value matrix $V^{(h)}$ computes the attention output for each head, $Z^{(h)} = A^{(h)} V^{(h)}$, where each token representation becomes a weighted sum of the value vectors according to its learned attention distribution.
- Concatenating the outputs from all heads combines their information into a single matrix. Each head output $Z^{(h)}$ has shape $L \times d_h$, and placing them side by side along the feature axis produces $Z \in \mathbb{R}^{L \times d_m}$. The model dimension is $d_m = h \cdot d_h$. With $h = 2$ and $d_h = 2$, the result has shape $3 \times 4$.
- The output projection mixes the information from the different heads into a single representation. After concatenation, the heads are only placed side by side without interacting. The projection uses a learned weight matrix $W_O \in \mathbb{R}^{d_m \times d_m}$ to linearly combine the head outputs, producing $O = Z W_O$. In practice, $W_O$ is initialized with small random values, often from $\mathcal{N}(0, 1/d_m)$. For our toy example, we use small hand-chosen values for $W_O$ to keep the calculations simple and easy to follow.
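The steps in the list above (projections, scaled scores, causal mask, row-wise softmax, weighted sum of values) can be sketched for a single head in plain Python. The matrices `Y`, `W_Q`, `W_K`, and `W_V` are the head 1 values from this worked example; the helper names `matmul`, `transpose`, and `softmax` are illustrative assumptions. Head 2, the concatenation, and $W_O$ are left out because their values are not reproduced here.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked entries."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Normalized input Y (3 tokens x d_m = 4) and head 1 weights (4 x d_h = 2).
Y = [[ 1.414, -1.414,  0.000,  0.000],
     [-1.291,  1.514, -0.016, -0.207],
     [ 0.942,  0.287, -1.680,  0.451]]
W_Q = [[0.120, -0.050], [0.030, 0.025], [-0.020, 0.015], [0.010, -0.030]]
W_K = [[0.050, -0.020], [-0.010, 0.040], [0.030, 0.010], [-0.020, 0.025]]
W_V = [[0.010, 0.020], [0.030, -0.010], [-0.020, 0.015], [0.025, -0.030]]
d_h = 2

# Linear projections (biases omitted, as in the worked example).
Q, K, V = matmul(Y, W_Q), matmul(Y, W_K), matmul(Y, W_V)

# Scaled dot-product scores: S = Q K^T / sqrt(d_h).
S = [[s / math.sqrt(d_h) for s in row] for row in matmul(Q, transpose(K))]

# Causal mask: query position i may only see key positions j <= i.
NEG_INF = float("-inf")
S_masked = [[s if j <= i else NEG_INF for j, s in enumerate(row)]
            for i, row in enumerate(S)]

A = [softmax(row) for row in S_masked]   # attention weights, rows sum to 1
Z = matmul(A, V)                         # head output, shape 3 x d_h

for name, M in (("Q", Q), ("K", K), ("V", V), ("A", A), ("Z", Z)):
    print(name, [[round(v, 4) for v in row] for row in M])
```

Because the scores are small at initialization, the attention weights in each row are close to uniform over the positions the mask allows; the first token can only attend to itself.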
Adding the residual connection ensures that the original input to the block is preserved and combined with the transformed output. This helps stabilize training by allowing gradients to flow more easily and prevents important information from being lost through the attention and projection layers.
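As a small sketch, the residual addition is just an element-wise sum of two matrices of the same shape. The values of `O` and `Y` below are the ones used in the residual step of this worked example; the variable names are illustrative.

```python
# Attention output after the output projection (3 tokens x d_m = 4).
O = [[-0.004,  0.010,  0.002, -0.006],
     [ 0.001, -0.005,  0.002,  0.000],
     [-0.002,  0.000, -0.002,  0.001]]

# The matrix it is added back to in the worked example.
Y = [[ 1.414, -1.414,  0.000,  0.000],
     [-1.199,  1.417, -0.012, -0.207],
     [ 0.914,  0.278, -1.656,  0.464]]

# Residual connection: R = O + Y, element by element.
R = [[o + y for o, y in zip(o_row, y_row)] for o_row, y_row in zip(O, Y)]

for row in R:
    print([round(v, 3) for v in row])
```

Since the entries of `O` are tiny at initialization, `R` stays close to `Y`, which is the sense in which the residual path preserves the original signal.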
Residual connection:

$$R = O + Y = \begin{bmatrix} -0.004 & 0.010 & 0.002 & -0.006 \\ 0.001 & -0.005 & 0.002 & -0.000 \\ -0.002 & 0.000 & -0.002 & 0.001 \end{bmatrix} + \begin{bmatrix} 1.414 & -1.414 & 0.000 & 0.000 \\ -1.199 & 1.417 & -0.012 & -0.207 \\ 0.914 & 0.278 & -1.656 & 0.464 \end{bmatrix} \approx \begin{bmatrix} 1.410 & -1.404 & 0.002 & -0.006 \\ -1.198 & 1.412 & -0.010 & -0.207 \\ 0.912 & 0.278 & -1.658 & 0.465 \end{bmatrix}$$

The second LayerNorm works the same way as the first one: it normalizes each token's vector (now the output of the attention + residual step) so that its values have zero mean and unit variance, which helps stabilize the scale of features before sending them into the MLP.
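$$\mu_i = \frac{1}{d}\sum_{j=1}^{d} R_{ij}, \qquad \sigma_i^2 = \frac{1}{d}\sum_{j=1}^{d} (R_{ij} - \mu_i)^2$$

$$\hat{R}_{ij} = \frac{R_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad Y^{(2)}_{ij} = \gamma_j \hat{R}_{ij} + \beta_j \qquad (\gamma = [1, 1, 1, 1],\ \beta = [0, 0, 0, 0])$$

$$\begin{aligned}
\mu_0 &= \tfrac{1}{4}\bigl(1.410 + (-1.404) + 0.002 + (-0.006)\bigr) \approx 0.0005, & \sigma_0^2 &\approx 0.9905, & \hat{r}_0 &\approx [1.414, -1.412, 0.002, -0.006] \\
\mu_1 &= \tfrac{1}{4}\bigl(-1.198 + 1.412 + (-0.010) + (-0.207)\bigr) \approx -0.00075, & \sigma_1^2 &\approx 0.868, & \hat{r}_1 &\approx [-1.280, 1.517, -0.011, -0.223] \\
\mu_2 &= \tfrac{1}{4}\bigl(0.912 + 0.278 + (-1.658) + 0.465\bigr) \approx -0.00075, & \sigma_2^2 &\approx 0.9685, & \hat{r}_2 &\approx [0.923, 0.283, -1.687, 0.473]
\end{aligned}$$

$$Y^{(2)} \approx \begin{bmatrix} 1.414 & -1.412 & 0.002 & -0.006 \\ -1.414 & 1.666 & -0.012 & -0.244 \\ 0.952 & 0.288 & -1.728 & 0.485 \end{bmatrix}$$

The MLP sublayer takes the normalized residual output and passes it through two linear layers with a non-linear activation function in between (GPT-2 uses GELU). The first layer expands the feature dimension and the second projects it back down to $d_m$, producing a transformed $L \times d_m$ matrix with one row per token.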
$$H = \mathrm{GELU}\bigl(Y^{(2)} W_1 + b_1\bigr), \qquad U = H W_2 + b_2$$

- The parameters are as follows:
- For this worked example $d_m = 4$ and $d_{ff} = 4 d_m = 16$. We choose small, hand-picked values to keep the arithmetic explicit:
- The first linear transformation multiplies the normalized input $Y^{(2)}$ by the weight matrix $W_1$ and adds the bias $b_1$. This produces the pre-activation matrix $H_{\text{pre}}$ for the hidden layer.
- The activation function used in GPT-2 is the Gaussian Error Linear Unit (GELU), $\mathrm{GELU}(x) = x\,\Phi(x)$, which weights each input by $\Phi(x)$, the probability that a standard normal variable falls below $x$. This gives a smoother nonlinearity than ReLU and has been reported to work well in transformer models.
- Worked numbers (tanh approximation applied elementwise to Hpre and rounded to 3 d.p.):
- The activated hidden matrix $H$ is multiplied by the weight matrix $W_2$ and the bias $b_2$ is added. This projects the representation back down from the expanded dimension $d_{ff}$ to the model dimension $d_m$.
- Note that GPT-2 used dropout after attention and in the MLP, but we omit it here for clarity. Many recent large models train with little or no dropout, relying on scale and other regularization methods to avoid overfitting.
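The MLP steps above (expand, apply GELU, project back) can be sketched as follows. The input `Y2` is the $Y^{(2)}$ matrix of this worked example, and `gelu` implements the tanh approximation mentioned above. The weights `W1` and `W2` are hypothetical: they are filled by a simple deterministic pattern purely for illustration and are not the hand-picked values of the worked example, so the printed `U` will not match the example's numbers.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gelu(x):
    """GELU, tanh approximation, applied to a single number."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

d_m, d_ff = 4, 16

# HYPOTHETICAL weights: small values in [-0.03, 0.03] from a fixed pattern.
W1 = [[0.01 * ((i * d_ff + j) % 7 - 3) for j in range(d_ff)] for i in range(d_m)]
b1 = [0.0] * d_ff
W2 = [[0.01 * ((i * d_m + j) % 5 - 2) for j in range(d_m)] for i in range(d_ff)]
b2 = [0.0] * d_m

# Output of the second LayerNorm in the worked example (3 tokens x d_m).
Y2 = [[ 1.414, -1.412,  0.002, -0.006],
      [-1.414,  1.666, -0.012, -0.244],
      [ 0.952,  0.288, -1.728,  0.485]]

# Expand to d_ff, apply GELU element-wise, then project back to d_m.
H_pre = [[v + b for v, b in zip(row, b1)] for row in matmul(Y2, W1)]
H = [[gelu(v) for v in row] for row in H_pre]
U = [[v + b for v, b in zip(row, b2)] for row in matmul(H, W2)]

print(len(H), len(H[0]))   # hidden shape: tokens x d_ff
print(len(U), len(U[0]))   # output shape: tokens x d_m
```

Unlike ReLU, GELU lets small negative pre-activations pass through with a reduced (slightly negative) value instead of clipping them to zero.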
The MLP output is added back to its input Y(2), preserving the original signal while allowing the new transformation to affect it. This gives the final output of the transformer block.
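$$R^{(2)} = U + Y^{(2)}$$

$$U = \begin{bmatrix} 0.02279 & -0.01995 & 0.00886 & 0.00985 \\ -0.02222 & 0.02376 & -0.01032 & -0.01044 \\ 0.02966 & -0.03084 & -0.00157 & 0.01248 \end{bmatrix}, \qquad Y^{(2)} = \begin{bmatrix} 1.414 & -1.412 & 0.002 & -0.006 \\ -1.414 & 1.666 & -0.012 & -0.244 \\ 0.952 & 0.288 & -1.728 & 0.485 \end{bmatrix}$$

$$R^{(2)} = U + Y^{(2)} \approx \begin{bmatrix} 1.437 & -1.432 & 0.0109 & 0.0039 \\ -1.436 & 1.689 & -0.0223 & -0.2544 \\ 0.9817 & 0.2572 & -1.7296 & 0.4975 \end{bmatrix}$$

- The final hidden states are projected into the vocabulary space by multiplying with the transpose of the embedding matrix, producing logits for each token position:

$$\text{logits} = R^{(2)} E^\top$$

$$R^{(2)} \approx \begin{bmatrix} 1.437 & -1.432 & 0.0109 & 0.0039 \\ -1.436 & 1.689 & -0.0223 & -0.2544 \\ 0.9817 & 0.2572 & -1.7296 & 0.4975 \end{bmatrix}, \quad E = \begin{bmatrix} 0.120 & -0.050 & 0.030 & 0.025 \\ -0.080 & 0.090 & -0.020 & -0.015 \\ 0.050 & 0.020 & -0.070 & 0.040 \end{bmatrix}, \quad E^\top = \begin{bmatrix} 0.120 & -0.080 & 0.050 \\ -0.050 & 0.090 & 0.020 \\ 0.030 & -0.020 & -0.070 \\ 0.025 & -0.015 & 0.040 \end{bmatrix}$$

$$\text{logits} = R^{(2)} E^\top \approx \begin{bmatrix} 0.244465 & -0.244116 & 0.042603 \\ -0.263799 & 0.271152 & -0.046635 \\ 0.065493 & -0.028259 & 0.195201 \end{bmatrix} \in \mathbb{R}^{3 \times 3}$$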
The output projection ties weights with the input embedding matrix, so logits are computed as $\text{logits} = R^{(2)} E^\top$ instead of learning a separate output matrix.
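Weight tying can be sketched by reusing the rows of $E$ directly: the logit for vocabulary token $v$ at position $i$ is the dot product of row $i$ of $R^{(2)}$ with row $v$ of $E$. The values of `R2` and `E` are those of this worked example; the variable names are illustrative.

```python
# Final hidden states R^(2) (3 tokens x d_m = 4) from the worked example.
R2 = [[ 1.437,  -1.432,   0.0109,  0.0039],
      [-1.436,   1.689,  -0.0223, -0.2544],
      [ 0.9817,  0.2572, -1.7296,  0.4975]]

# Embedding matrix E (V = 3 x d_m = 4), reused as the output projection.
E = [[ 0.120, -0.050,  0.030,  0.025],   # token 0: "AI"
     [-0.080,  0.090, -0.020, -0.015],   # token 1: "rocks"
     [ 0.050,  0.020, -0.070,  0.040]]   # token 2: "<EOS>"

# logits = R2 @ E^T: dot every hidden row with every embedding row.
logits = [[sum(r * e for r, e in zip(r_row, e_row)) for e_row in E] for r_row in R2]

for row in logits:
    print([round(v, 6) for v in row])
```

No transpose needs to be materialized: iterating over the rows of `E` is equivalent to multiplying by $E^\top$.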
The softmax function converts the logits into probabilities over the vocabulary, ensuring each row sums to 1; this makes the model output interpretable as next-token probabilities for inference and usable in cross-entropy loss during training.
$$p_{ij} = \frac{\exp(\text{logits}_{ij})}{\sum_{k=1}^{3} \exp(\text{logits}_{ik})}, \qquad i = 1, 2, 3, \quad j = 1, 2, 3$$

$$p \approx \begin{bmatrix} 0.4114 & 0.2524 & 0.3362 \\ 0.2532 & 0.4323 & 0.3146 \\ 0.3280 & 0.2986 & 0.3734 \end{bmatrix}$$

- First token (AI) → next-token probabilities [0.41, 0.25, 0.34] correspond to ["AI", "rocks", "<EOS>"]. The model most likely predicts AI (0.411).
- Second token (rocks) → next-token probabilities [0.25, 0.43, 0.31] suggest the model is most confident that the next token is rocks (0.432).
- Third token (<EOS>) → next-token probabilities [0.33, 0.30, 0.37] indicate the next token is most likely <EOS> (0.373).
- During inference, sampling or argmax is applied to the softmax probabilities. During training, cross-entropy loss is computed between the softmax distribution and the target tokens.
- With the softmax probabilities computed, the forward pass is complete, and the next step is to introduce the loss function and trace gradients backward through the network during backpropagation.
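The softmax step and the argmax reading of each row can be reproduced with a short sketch. The `logits` are the values computed in this worked example; the helper name `softmax` and the list `id_to_token` are illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the worked example: one row per position, one column per vocab token.
logits = [[ 0.244465, -0.244116,  0.042603],
          [-0.263799,  0.271152, -0.046635],
          [ 0.065493, -0.028259,  0.195201]]

id_to_token = ["AI", "rocks", "<EOS>"]

probs = [softmax(row) for row in logits]

# Greedy (argmax) prediction for the next token at each position.
predicted = [id_to_token[max(range(len(p)), key=p.__getitem__)] for p in probs]

for p, tok in zip(probs, predicted):
    print([round(v, 4) for v in p], "->", tok)
```

With untrained weights the distributions are close to uniform, so the argmax mostly reflects each token's similarity to its own embedding rather than anything learned about the sequence.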
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).