Our objective is to manually work through a full GPT-2-style block to see exactly how embeddings, multi-head attention, residual connections, layer norms, the MLP, and the projection to logits interact, and then to calculate the loss and compute its gradients via backpropagation.
Here we cover the forward pass, which happens both during inference and as the first stage of training. Later sections cover backpropagation and optimization (weight updates). We use smaller token embedding dimensions and fewer heads than GPT-2 but otherwise stay faithful to the architecture.
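As a companion to the hand-worked numbers, here is a minimal NumPy sketch of one such block's forward pass. The dimensions (d_model = 8, n_heads = 2, seq_len = 4, vocab = 50) and the random weights are assumptions chosen so every tensor fits on a page, not GPT-2's real sizes; ReLU stands in for GPT-2's GELU, and LayerNorm's learned gain and bias are omitted to keep the arithmetic simple. The overall structure, though, mirrors GPT-2: pre-LayerNorm, causal multi-head attention with a residual connection, a position-wise MLP with a residual connection, a final layer norm, and a weight-tied projection to logits followed by cross-entropy loss.

```python
# Minimal sketch of a GPT-2-style block forward pass (assumed tiny dimensions).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_ff, vocab = 4, 8, 2, 32, 50
d_head = d_model // n_heads

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance
    # (learned gain and bias omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Token + learned positional embeddings (randomly initialized stand-ins).
tok_ids = np.array([3, 17, 6, 42])
W_emb = rng.normal(0, 0.02, (vocab, d_model))
W_pos = rng.normal(0, 0.02, (seq_len, d_model))
x = W_emb[tok_ids] + W_pos                                   # (seq_len, d_model)

# --- Multi-head causal self-attention (pre-LayerNorm, as in GPT-2) ---
W_q, W_k, W_v, W_o = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(4))
h = layer_norm(x)
q, k, v = h @ W_q, h @ W_k, h @ W_v
split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
q, k, v = split(q), split(k), split(v)                       # (n_heads, seq_len, d_head)
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)    # block attention to future tokens
attn = softmax(scores + mask) @ v                            # (n_heads, seq_len, d_head)
attn = attn.transpose(1, 0, 2).reshape(seq_len, d_model)     # re-merge the heads
x = x + attn @ W_o                                           # residual connection

# --- Position-wise MLP (ReLU here; GPT-2 uses GELU) ---
W1 = rng.normal(0, 0.02, (d_model, d_ff))
W2 = rng.normal(0, 0.02, (d_ff, d_model))
x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2               # residual connection

# --- Final layer norm, projection to logits, cross-entropy loss ---
logits = layer_norm(x) @ W_emb.T                             # weight-tied output projection
targets = np.roll(tok_ids, -1)                               # toy next-token targets
probs = softmax(logits)
loss = -np.log(probs[np.arange(seq_len), targets]).mean()
print("cross-entropy loss:", loss)
```

Running this prints a single scalar loss; in the backpropagation section that scalar is what gets differentiated with respect to every weight matrix above.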