- Complete GPT-2 reconstruction
- Tokenization and embedding layer
- Multi-head causal self-attention mechanism
- Layer normalization and residual connections
- Feed-forward networks
- Text generation and final outputs
To complement our explanation of encoder architectures, we've undertaken a similarly detailed analysis of decoder-only models, focusing on GPT-2. Unlike MiniLM, which processes full sequences bidirectionally, GPT-2 is an auto-regressive model: it predicts the next token based solely on the tokens it has seen so far. Decoder architectures are the foundation of modern generative language models such as GPT-2, GPT-3, and GPT-4. We chose GPT-2, an open-source model from OpenAI, as our generative decoder.
Because the model is open source, we can legally download all components, run the computations ourselves, and compare results to ensure accuracy. We continue with "Open-source LLMs rock." as our sentence to decode. We extract the necessary components from the open-source GPT-2 model and rebuild it block by block, up to the final stage where we reproduce the model's logits to within tolerable floating-point precision error. The Jupyter notebook showing this ground-up reconstruction of an operational version of GPT-2 is available now; the description follows below.
Bug reports and corrections are welcome!
Complete GPT-2 reconstruction
Our reconstruction covers the entire GPT-2 architecture from raw text input to final text generation. We manually implemented every component using extracted weights and achieved numerical precision within 1e-4 tolerance across all layers. GPT-2 is a 12-layer autoregressive decoder with the following specifications:
- Hidden dimension: 768 features per token
- Attention heads: 12 per layer (64 dimensions each)
- Vocabulary: 50,257 tokens using BPE tokenization
- Position encoding: learned embeddings (GPT-style)
- Feed-forward: GELU activation
- Causal masking: Future tokens masked during attention
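As a minimal sketch of where these hyperparameters come from, the values above can be read directly off the pretrained checkpoint. This assumes the Hugging Face transformers distribution of GPT-2; the loading code in our notebook may differ in detail.

```python
# Assumes the Hugging Face "transformers" packaging of GPT-2;
# the notebook's own loading code may differ.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
cfg = model.config

print(cfg.n_layer)      # 12 transformer blocks
print(cfg.n_embd)       # 768 hidden features per token
print(cfg.n_head)       # 12 attention heads (768 / 12 = 64 dims each)
print(cfg.vocab_size)   # 50,257 BPE tokens
print(cfg.n_positions)  # 1,024 learned position embeddings
```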
Tokenization and embedding layer
GPT-2 uses Byte Pair Encoding (BPE) tokenization without special tokens like [CLS] or [SEP]. Our input "Open-source LLMs rock." becomes seven tokens: "Open", "-", "source", "ĠLL", "Ms", "Ġrock", ".". We manually extracted and empirically verified two embedding types:
- Token embeddings: 50,257 × 768 matrix mapping vocabulary to vectors
- Position embeddings: Learned 1024 × 768 matrix for positional information
We achieved near-perfect numerical precision when summing both embedding types.
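The sketch below shows the tokenization and the manual embedding lookup, assuming the Hugging Face GPT2Tokenizer and the GPT2Model loaded above; the names wte and wpe follow that library's layout, and the comparison against the model's own output is only indicated, not reproduced from the notebook.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

text = "Open-source LLMs rock."
ids = tokenizer(text, return_tensors="pt")["input_ids"][0]  # 7 BPE token ids
print(tokenizer.convert_ids_to_tokens(ids))
# ['Open', '-', 'source', 'ĠLL', 'Ms', 'Ġrock', '.']

# Manual lookup: token embeddings (50,257 x 768) + position embeddings (1,024 x 768)
tok_emb = model.wte.weight[ids]                     # (7, 768)
pos_emb = model.wpe.weight[torch.arange(len(ids))]  # (7, 768)
hidden0 = tok_emb + pos_emb                         # input to the first transformer block

# hidden0 can then be checked against the model's own embedding output,
# e.g. with torch.allclose(hidden0, reference, atol=1e-4).
```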
Multi-head causal self-attention mechanism
Each attention layer splits the 768-dimensional input across 12 heads of 64 dimensions each. Unlike bidirectional encoders, GPT-2 uses causal attention where each token can only attend to previous tokens and itself. We manually performed the following steps:
- Computed Q, K, V projections using extracted weight matrices
- Implemented scaled dot-product attention with causal masking: Attention(Q,K,V) = softmax(QK^T/√d_k + mask)V
- Applied the causal mask to prevent attention to future positions
- Verified attention scores for each head individually (max difference < 1e-6)
- Concatenated multi-head outputs and applied output projection
We achieved numerical precision within 1e-4 tolerance across all attention computations.
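A minimal sketch of one causal attention pass over a single layer is shown below. It assumes Hugging Face's parameter layout, where c_attn is a fused Q/K/V projection with weight shape (768, 2304) and c_proj is the output projection; these names and shapes are that library's convention, not necessarily the notebook's, and the hidden_after_ln1 input in the usage line is hypothetical.

```python
import math
import torch

def causal_self_attention(x, block, n_head=12):
    """x: (seq_len, 768) hidden states entering one attention sub-layer."""
    T, C = x.shape
    hd = C // n_head                                   # 64 dims per head

    # Fused Q, K, V projection (Hugging Face Conv1D stores weight as (in, out))
    qkv = x @ block.attn.c_attn.weight + block.attn.c_attn.bias   # (T, 2304)
    q, k, v = qkv.split(C, dim=-1)

    # Split the 768 features into 12 heads of 64 dims each: (n_head, T, hd)
    q = q.view(T, n_head, hd).transpose(0, 1)
    k = k.view(T, n_head, hd).transpose(0, 1)
    v = v.view(T, n_head, hd).transpose(0, 1)

    # Scaled dot-product scores with a causal mask (no attention to future tokens)
    scores = q @ k.transpose(-2, -1) / math.sqrt(hd)               # (n_head, T, T)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)

    # Weighted sum of values, merge heads, apply output projection
    out = (weights @ v).transpose(0, 1).reshape(T, C)
    return out @ block.attn.c_proj.weight + block.attn.c_proj.bias

# Usage (hidden_after_ln1 is the output of the block's first LayerNorm):
# attn_out = causal_self_attention(hidden_after_ln1, model.h[0])
```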
Layer normalization and residual connections
GPT-2 follows a pre-LayerNorm pattern in every block: Input → LayerNorm → Attention → Residual connection → LayerNorm → Feed-forward → Residual connection.
We manually implemented the LayerNorm function and achieved close to perfect numerical alignment with GPT-2's implementation across all 12 layers.
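Below is a minimal sketch of the LayerNorm function and the pre-LayerNorm residual pattern, assuming Hugging Face's ln_1/ln_2 naming for the per-block norms. It reuses the causal_self_attention sketch above and the feed_forward sketch from the next section.

```python
import torch

def layer_norm(x, weight, bias, eps=1e-5):
    # Normalize each token's 768 features to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def transformer_block(x, block):
    # Pre-LN pattern: LayerNorm before each sub-layer, residual added after it.
    x = x + causal_self_attention(layer_norm(x, block.ln_1.weight, block.ln_1.bias), block)
    x = x + feed_forward(layer_norm(x, block.ln_2.weight, block.ln_2.bias), block)
    return x
```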
Feed-forward networks
Each layer contains a two-layer MLP that expands from 768 to 3,072 dimensions, then contracts back. We used PyTorch's GELU to match GPT-2's implementation, achieving numerical alignment within 1e-4 tolerance of the official model across all blocks.
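A minimal sketch of the per-block MLP follows, again assuming Hugging Face's c_fc/c_proj naming. GPT-2's reference implementation uses a tanh approximation of GELU; PyTorch exposes both forms.

```python
import torch
import torch.nn.functional as F

def feed_forward(x, block):
    # Expand 768 -> 3,072, apply GELU, contract 3,072 -> 768.
    h = x @ block.mlp.c_fc.weight + block.mlp.c_fc.bias          # (T, 3072)
    h = F.gelu(h)  # pass approximate="tanh" to mirror GPT-2's original "gelu_new"
    return h @ block.mlp.c_proj.weight + block.mlp.c_proj.bias   # (T, 768)
```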
Text generation and final outputs
GPT-2 converts the final hidden representations to next-token predictions through a language-modeling head that shares weights with the token embeddings. Our manual reconstruction produces token predictions identical to the official model's (for our input, the per-position predictions decode to ".source codeVM are."), giving us a fully functional, manually implemented version of GPT-2.
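Putting the pieces together, the sketch below runs the full manual forward pass and the weight-tied language-modeling head, then takes the greedy next-token prediction at every position. It assumes the model, tokenizer, and ids from the embedding sketch and the transformer_block and layer_norm sketches above; ln_f is GPT-2's final LayerNorm in Hugging Face's naming.

```python
import torch

def gpt2_logits(ids, model):
    # Embeddings -> 12 pre-LN blocks -> final LayerNorm -> weight-tied LM head.
    x = model.wte.weight[ids] + model.wpe.weight[torch.arange(len(ids))]
    for block in model.h:
        x = transformer_block(x, block)
    x = layer_norm(x, model.ln_f.weight, model.ln_f.bias)
    return x @ model.wte.weight.T             # (T, 50257) next-token logits

logits = gpt2_logits(ids, model)
next_tokens = logits.argmax(dim=-1)           # greedy prediction at every position
print(tokenizer.decode(next_tokens))          # the per-position predictions reported above
```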
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).