- Complete GPT-2 reconstruction
- Tokenization and embedding layer
- Multi-head causal self-attention mechanism
- Layer normalization and residual connections
- Feed-forward networks
- Text generation and final outputs
To complement our explanation of encoder architectures, we've undertaken a similarly detailed analysis of decoder-only models, focusing on GPT-2. Unlike MiniLM, which processes full sequences bidirectionally, GPT-2 is an auto-regressive model: it predicts the next token based solely on the tokens it has seen so far. Decoder architectures are the foundation of modern generative language models such as GPT-2, GPT-3, and GPT-4. We chose GPT-2, an open-source model from OpenAI, as our generative decoder.
Because the model is open source, we can legally download all components, run the computations ourselves, and compare results to ensure accuracy. We continue with "Open-source LLMs rock." as our sentence to decode. We extract the necessary components from the open-source GPT-2 model and rebuild it block by block, up to the final stage where we reproduce the model's logits to within tolerable floating-point precision error. The Jupyter notebook showing this ground-up reconstruction of an operational version of GPT-2 is available now; the description follows below.
Bug reports and corrections are welcome!
Complete GPT-2 reconstruction
Our reconstruction covers the entire GPT-2 architecture from raw text input to final text generation. We manually implemented every component using extracted weights and achieved numerical precision within 1e-4 tolerance across all layers. GPT-2 is a 12-layer autoregressive decoder with the following specifications:
- Hidden dimension: 768 features per token
- Attention heads: 12 per layer (64 dimensions each)
- Vocabulary: 50,257 tokens using BPE tokenization
- Position encoding: learned embeddings (GPT-style)
- Feed-forward: GELU activation
- Causal masking: Future tokens masked during attention
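As a minimal sketch of where these hyperparameters come from, the values above can be read directly off the pretrained checkpoint. This assumes the Hugging Face transformers distribution of GPT-2; the loading code in our notebook may differ in detail.

```python
# Assumes the Hugging Face "transformers" packaging of GPT-2;
# the notebook's own loading code may differ.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
cfg = model.config

print(cfg.n_layer)      # 12 transformer blocks
print(cfg.n_embd)       # 768 hidden features per token
print(cfg.n_head)       # 12 attention heads (768 / 12 = 64 dims each)
print(cfg.vocab_size)   # 50,257 BPE tokens
print(cfg.n_positions)  # 1,024 learned position embeddings
```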
Tokenization and embedding layer
GPT-2 uses Byte Pair Encoding (BPE) tokenization without special tokens like [CLS] or [SEP]. Our input "Open-source LLMs rock." becomes seven tokens: "Open", "-", "source", "ĠLL", "Ms", "Ġrock", ".". We manually extracted and empirically verified two embedding types:
- Token embeddings: 50,257 × 768 matrix mapping vocabulary to vectors
- Position embeddings: Learned 1024 × 768 matrix for positional information
We achieved near-perfect numerical precision when summing both embedding types.
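The sketch below shows the tokenization and the manual embedding lookup, assuming the Hugging Face GPT2Tokenizer and the GPT2Model loaded above; the names wte and wpe follow that library's layout, and the comparison against the model's own output is only indicated, not reproduced from the notebook.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

text = "Open-source LLMs rock."
ids = tokenizer(text, return_tensors="pt")["input_ids"][0]  # 7 BPE token ids
print(tokenizer.convert_ids_to_tokens(ids))
# ['Open', '-', 'source', 'ĠLL', 'Ms', 'Ġrock', '.']

# Manual lookup: token embeddings (50,257 x 768) + position embeddings (1,024 x 768)
tok_emb = model.wte.weight[ids]                     # (7, 768)
pos_emb = model.wpe.weight[torch.arange(len(ids))]  # (7, 768)
hidden0 = tok_emb + pos_emb                         # input to the first transformer block

# hidden0 can then be checked against the model's own embedding output,
# e.g. with torch.allclose(hidden0, reference, atol=1e-4).
```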
Multi-head causal self-attention mechanism
Each attention layer splits the 768-dimensional input across 12 heads of 64 dimensions each. Unlike bidirectional encoders, GPT-2 uses causal attention where each token can only attend to previous tokens and itself. We manually performed the following steps:
- Computed Q, K, V projections using extracted weight matrices
- Implemented scaled dot-product attention with causal masking: Attention(Q,K,V) = softmax(QK^T/√d_k + mask)V
- Applied the causal mask to prevent attention to future positions
- Verified attention scores for each head individually (max difference < 1e-6)
- Concatenated multi-head outputs and applied output projection
We achieved numerical precision within 1e-4 tolerance across all attention computations.
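A minimal sketch of one causal attention pass over a single layer is shown below. It assumes Hugging Face's parameter layout, where c_attn is a fused Q/K/V projection with weight shape (768, 2304) and c_proj is the output projection; these names and shapes are that library's convention, not necessarily the notebook's, and the hidden_after_ln1 input in the usage line is hypothetical.

```python
import math
import torch

def causal_self_attention(x, block, n_head=12):
    """x: (seq_len, 768) hidden states entering one attention sub-layer."""
    T, C = x.shape
    hd = C // n_head                                   # 64 dims per head

    # Fused Q, K, V projection (Hugging Face Conv1D stores weight as (in, out))
    qkv = x @ block.attn.c_attn.weight + block.attn.c_attn.bias   # (T, 2304)
    q, k, v = qkv.split(C, dim=-1)

    # Split the 768 features into 12 heads of 64 dims each: (n_head, T, hd)
    q = q.view(T, n_head, hd).transpose(0, 1)
    k = k.view(T, n_head, hd).transpose(0, 1)
    v = v.view(T, n_head, hd).transpose(0, 1)

    # Scaled dot-product scores with a causal mask (no attention to future tokens)
    scores = q @ k.transpose(-2, -1) / math.sqrt(hd)               # (n_head, T, T)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)

    # Weighted sum of values, merge heads, apply output projection
    out = (weights @ v).transpose(0, 1).reshape(T, C)
    return out @ block.attn.c_proj.weight + block.attn.c_proj.bias

# Usage (hidden_after_ln1 is the output of the block's first LayerNorm):
# attn_out = causal_self_attention(hidden_after_ln1, model.h[0])
```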
Layer normalization and residual connections
GPT-2 follows a pre-LayerNorm pattern in every block: Input → LayerNorm → Attention → Residual connection → LayerNorm → Feed-forward → Residual connection.
We manually implemented the LayerNorm function and achieved close to perfect numerical alignment with GPT-2's implementation across all 12 layers.
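Below is a minimal sketch of the LayerNorm function and the pre-LayerNorm residual pattern, assuming Hugging Face's ln_1/ln_2 naming for the per-block norms. It reuses the causal_self_attention sketch above and the feed_forward sketch from the next section.

```python
import torch

def layer_norm(x, weight, bias, eps=1e-5):
    # Normalize each token's 768 features to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def transformer_block(x, block):
    # Pre-LN pattern: LayerNorm before each sub-layer, residual added after it.
    x = x + causal_self_attention(layer_norm(x, block.ln_1.weight, block.ln_1.bias), block)
    x = x + feed_forward(layer_norm(x, block.ln_2.weight, block.ln_2.bias), block)
    return x
```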
Feed-forward networks
Each layer contains a two-layer MLP that expands from 768 to 3,072 dimensions, then contracts back. We used PyTorch's GELU to match GPT-2's implementation, achieving numerical alignment within 1e-4 tolerance of the official model across all blocks.
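A minimal sketch of the per-block MLP follows, again assuming Hugging Face's c_fc/c_proj naming. GPT-2's reference implementation uses a tanh approximation of GELU; PyTorch exposes both forms.

```python
import torch
import torch.nn.functional as F

def feed_forward(x, block):
    # Expand 768 -> 3,072, apply GELU, contract 3,072 -> 768.
    h = x @ block.mlp.c_fc.weight + block.mlp.c_fc.bias          # (T, 3072)
    h = F.gelu(h)  # pass approximate="tanh" to mirror GPT-2's original "gelu_new"
    return h @ block.mlp.c_proj.weight + block.mlp.c_proj.bias   # (T, 768)
```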
Text generation and final outputs
GPT-2 converts the final hidden representations to next-token predictions through a language-modeling head that shares weights with the token embeddings. Our manual reconstruction produces token predictions identical to the official model's (for our input, the per-position predictions decode to ".source codeVM are."), giving us a fully functional, manually implemented version of GPT-2.
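Putting the pieces together, the sketch below runs the full manual forward pass and the weight-tied language-modeling head, then takes the greedy next-token prediction at every position. It assumes the model, tokenizer, and ids from the embedding sketch and the transformer_block and layer_norm sketches above; ln_f is GPT-2's final LayerNorm in Hugging Face's naming.

```python
import torch

def gpt2_logits(ids, model):
    # Embeddings -> 12 pre-LN blocks -> final LayerNorm -> weight-tied LM head.
    x = model.wte.weight[ids] + model.wpe.weight[torch.arange(len(ids))]
    for block in model.h:
        x = transformer_block(x, block)
    x = layer_norm(x, model.ln_f.weight, model.ln_f.bias)
    return x @ model.wte.weight.T             # (T, 50257) next-token logits

logits = gpt2_logits(ids, model)
next_tokens = logits.argmax(dim=-1)           # greedy prediction at every position
print(tokenizer.decode(next_tokens))          # the per-position predictions reported above
```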
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).