Our objective is to manually work through a full GPT-2-style block to see exactly how embeddings, multi-head attention, residual connections, layer norms, the MLP, and the projection to logits interact, and then to calculate the loss, compute the gradients via backpropagation, and update the weights using optimization methods.
Here we cover the forward pass, which happens during inference and during the first stage of learning. The forward pass runs the input text through the network architecture to produce probabilities for the next token at each position; the targets are the same sequence shifted by one position, so that each input token is trained to predict the token that follows it. Later sections cover backpropagation and optimization (weight updating).
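To make the input/target relationship concrete, here is a minimal NumPy sketch of how the shifted targets are formed and how next-token probabilities feed a cross-entropy loss. The token IDs and logit values are made up purely to show the mechanics; they are not the numbers from the worked example.

```python
import numpy as np

# Token IDs for the sequence "AI rocks <eos>" (illustrative IDs, not the article's actual indices).
tokens = np.array([0, 1, 2])          # e.g. 0 = 'AI', 1 = 'rocks', 2 = '<eos>'

# Inputs and targets: each input token is trained to predict the token that follows it.
inputs  = tokens[:-1]                 # [0, 1]  -> 'AI', 'rocks'
targets = tokens[1:]                  # [1, 2]  -> 'rocks', '<eos>'

# Suppose the forward pass has produced one row of logits per input position
# (vocabulary size 3). These values are placeholders.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  2.2]])

# Softmax turns each row of logits into next-token probabilities.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy loss: mean negative log-probability assigned to each correct next token.
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(probs, loss)
```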
Alert: Here I highlight the steps and computations of the forward pass with a numerical example to make the ideas concrete. The approach is consistent with our simplified GPT-2 model. These results are preliminary and will be validated with a formal mathematical verifier in due course.
We will use three dimensions, which gives us room to represent the two input tokens (‘AI’ and ‘rocks’) plus an EOS token for proper token handling. It also ensures that LayerNorm and the other stages do not collapse the gradient during backpropagation. This makes the toy both realistic and pedagogically useful. We use a single layer, three token-embedding dimensions, and a correspondingly small number of attention heads, and otherwise stay faithful to GPT-2.
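As a preview of the steps worked through below, here is a minimal NumPy sketch of this toy configuration and of one forward pass through a single GPT-2-style block (pre-LayerNorm attention and MLP sublayers with residual connections, a final LayerNorm, and output weights tied to the embedding matrix). The random weight values, the single attention head, and the omission of LayerNorm scale/shift and bias parameters are my own simplifications, so the printed probabilities are placeholders rather than the values calculated in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy configuration: vocabulary {'AI', 'rocks', '<eos>'}, 3 embedding dimensions, 1 layer.
vocab = ['AI', 'rocks', '<eos>']
V, d = len(vocab), 3

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (learned scale/shift omitted)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialised parameters (placeholders for the article's hand-chosen values).
W_emb = rng.normal(size=(V, d))                                      # token embeddings
W_pos = rng.normal(size=(2, d))                                      # positional embeddings for 2 positions
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))     # attention projections
W_1, W_2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))  # MLP weights

ids = np.array([0, 1])                 # input sequence: 'AI rocks'
x = W_emb[ids] + W_pos                 # token + positional embeddings

# Pre-LayerNorm single-head causal self-attention with a residual connection.
h = layer_norm(x)
Q, K, Vv = h @ W_q, h @ W_k, h @ W_v
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(len(ids), k=1)] = -np.inf   # causal mask: no attending to future tokens
x = x + softmax(scores) @ Vv @ W_o

# Pre-LayerNorm MLP (GELU) with a residual connection.
h = layer_norm(x)
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
x = x + gelu(h @ W_1) @ W_2

# Final LayerNorm, then project to logits by tying the output weights to the embedding matrix.
logits = layer_norm(x) @ W_emb.T
print(softmax(logits))                 # next-token probabilities for each position
```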
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (Vol. 30).
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.