How LLMs learn part 1: forward pass

Our objective is to manually work through a full GPT-2-style block to see exactly how embeddings, multi-head attention, residuals, layer norms, the MLP, and the projection to logits interact, and then to calculate the loss and compute the gradients via backpropagation.

Here we cover the forward pass, which happens both during inference and as stage one of learning. Later sections cover backpropagation and optimization (weight updating). We use d_m = 2 token embedding dimensions and h = 2 heads, and otherwise stay faithful to GPT-2.
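
As a compact orientation, the steps listed below compose roughly as follows in GPT-2's pre-norm arrangement (writing LN for LayerNorm and, assuming GPT-2's weight tying, reusing E for the logit projection):

$$
\begin{aligned}
x &= E[\text{tokens}] + P\\
x &\leftarrow x + \operatorname{MultiHead}(\operatorname{LN}_1(x))\\
x &\leftarrow x + \operatorname{MLP}(\operatorname{LN}_2(x))\\
\text{logits} &= x E^{\top}, \qquad p = \operatorname{softmax}(\text{logits})
\end{aligned}
$$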

Vocabulary and tokenization
Embedding lookup from E
Positional embeddings P
LayerNorm 1
Multi-head self attention
Residual connection
LayerNorm 2
Multilayer perceptron
Residual connection (post MLP)
Logit projection
Softmax
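
Before walking through the numbers by hand, here is a minimal NumPy sketch of how these steps chain together, assuming a single pre-norm GPT-2-style block with d_m = 2 and h = 2. The three-word vocabulary, the randomly initialised weights, and names such as W_qkv are illustrative assumptions, not the values used in the worked example that follows.

```python
# A minimal sketch of the forward pass steps listed above, assuming a single
# pre-norm GPT-2-style block with d_m = 2 and h = 2. Vocabulary, weights, and
# the input sequence are invented for illustration; biases and the LayerNorm
# gain/shift parameters are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat"]              # toy vocabulary (hypothetical)
V, d_m, h = len(vocab), 2, 2               # vocab size, model width, heads
d_head = d_m // h

# Randomly initialised stand-ins for trained weights
E = rng.normal(size=(V, d_m))              # token embedding matrix E
P = rng.normal(size=(8, d_m))              # positional embeddings P (8 positions)
W_qkv = rng.normal(size=(d_m, 3 * d_m))    # fused query/key/value projection
W_o = rng.normal(size=(d_m, d_m))          # attention output projection
W_1 = rng.normal(size=(d_m, 4 * d_m))      # MLP expansion (4x width, as in GPT-2)
W_2 = rng.normal(size=(4 * d_m, d_m))      # MLP contraction

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    z = np.exp(x - x.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def gelu(x):                               # GPT-2's tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def attention(x):
    """Causal multi-head self-attention over a (T, d_m) input."""
    T = x.shape[0]
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)                 # each (T, d_m)
    q, k, v = (a.reshape(T, h, d_head).transpose(1, 0, 2)     # (h, T, d_head)
               for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (h, T, T)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)        # hide future tokens
    weights = softmax(np.where(causal, -1e9, scores))
    out = weights @ v                                         # (h, T, d_head)
    return out.transpose(1, 0, 2).reshape(T, d_m) @ W_o       # concat heads, project

# Forward pass for the toy sequence "the cat sat"
ids = np.array([vocab.index(t) for t in ["the", "cat", "sat"]])
x = E[ids] + P[: len(ids)]                 # embedding lookup + positional embeddings
x = x + attention(layer_norm(x))           # LayerNorm 1 -> attention -> residual
x = x + gelu(layer_norm(x) @ W_1) @ W_2    # LayerNorm 2 -> MLP -> residual
logits = x @ E.T                           # logit projection (tied to E, as in GPT-2)
probs = softmax(logits)                    # softmax over the vocabulary
print(probs[-1])                           # predicted distribution after "sat"
```

The pre-LayerNorm ordering, residual additions, GELU activation, and 4x MLP expansion match GPT-2; the full model also applies a final LayerNorm before the logit projection, which is left out here to keep the sketch aligned with the step list above.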
