Encoder architectures

As a vehicle for this explanation, we provide a layer-by-layer manual reconstruction of the internal workings of encoders using MiniLM-L6-H384-uncased. We choose MiniLM as our sentence encoder because it is an Apache-2.0 open-source model: we can legally download all of its components, run the computations ourselves, and compare results to ensure accuracy. We choose Hussain et al.'s (2024) "Open-source LLMs rock." as our sentence to encode. The description follows below; the Jupyter notebook with the ground-up reconstruction of an operational version of MiniLM is available now.

View the MiniLM notebook →

Bug reports and corrections are welcome!

Complete MiniLM reconstruction

Our reconstruction covers the entire MiniLM-L6-H384-uncased architecture from raw text input to final sentence embeddings. We manually implemented every component using extracted weights and achieved numerical precision within 1e-5 tolerance across all layers. MiniLM is a 6-layer bidirectional encoder with the following specifications:

  • Hidden dimension: 384 features per token
  • Attention heads: 12 per layer (32 dimensions each)
  • Vocabulary: 30,522 tokens using WordPiece tokenization
  • Position encoding: Learned embeddings (BERT-style)
  • Feed-forward: GELU activation
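
As a quick sanity check on these specifications, the sketch below loads the checkpoint and prints its configuration. It assumes the Hugging Face transformers library and a checkpoint ID such as nreimers/MiniLM-L6-H384-uncased; substitute whichever ID the notebook actually uses.

```python
# Sketch: load MiniLM and confirm the architecture specifications listed above.
# Assumption: transformers is installed and the checkpoint ID below is the one you use.
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nreimers/MiniLM-L6-H384-uncased"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()  # disable dropout so manual and official outputs can match

cfg = model.config
print(cfg.num_hidden_layers)        # 6 encoder layers
print(cfg.hidden_size)              # 384 features per token
print(cfg.num_attention_heads)      # 12 heads -> 384 / 12 = 32 dims per head
print(cfg.vocab_size)               # 30522 WordPiece tokens
print(cfg.intermediate_size)        # 1536 feed-forward dimension
print(cfg.hidden_act)               # "gelu"
print(cfg.max_position_embeddings)  # 512 learned positions
```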

Tokenization and embedding layer

MiniLM uses BERT's tokenization approach with special tokens [CLS] and [SEP]. Our input "Open-source LLMs rock." becomes: [CLS] open - source ll ##ms rock . [SEP]. We manually extracted and empirically verified three embedding types:

  • Token embeddings: 30,522 × 384 matrix mapping vocabulary to vectors
  • Position embeddings: Learned 512 × 384 matrix for positional information
  • Token type embeddings: 2 × 384 matrix distinguishing sentence A/B pairs

We achieved a close-to-perfect numerical match with MiniLM's official embedding output when summing all three embedding types.
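
A minimal sketch of this step follows, reusing the `tokenizer` and `model` loaded above. Note that BERT-style embedding modules also apply a LayerNorm (and dropout, inactive in eval mode) after the sum, so we include it when comparing against the official embedding output.

```python
# Sketch: tokenize the sentence and rebuild the embedding layer by hand.
# Assumes `tokenizer` and `model` from the loading sketch above.
import torch

text = "Open-source LLMs rock."
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"]          # [CLS] open - source ll ##ms rock . [SEP]
seq_len = input_ids.shape[1]

emb = model.embeddings
with torch.no_grad():
    tok = emb.word_embeddings(input_ids)                               # 30,522 x 384 lookup
    pos = emb.position_embeddings(torch.arange(seq_len).unsqueeze(0))  # 512 x 384 lookup
    typ = emb.token_type_embeddings(torch.zeros_like(input_ids))       # 2 x 384 lookup

    manual = emb.LayerNorm(tok + pos + typ)   # sum, then the embedding LayerNorm
    official = emb(input_ids)                 # reference embedding output
    print(torch.max(torch.abs(manual - official)))  # expect ~0 (dropout is off in eval mode)
```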

Multi-head self-attention mechanism

Each attention layer splits the 384-dimensional input across 12 heads of 32 dimensions each. Unlike GPT-2's causal attention, MiniLM uses bidirectional attention where each token can attend to all other tokens. We manually performed the following steps.

  • Manually computed Q, K, V projections using extracted weight matrices
  • Implemented scaled dot-product attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V
  • Verified attention scores for each head individually (max difference < 1e-8)
  • Concatenated multi-head outputs and applied output projection

We again achieved a close-to-perfect numerical match with MiniLM's official attention outputs.
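
A sketch of the manual attention computation for a single layer follows. It assumes `model` from the loading sketch and `hidden`, the hidden states entering that layer (for layer 0, the embedding output `manual` from the previous sketch).

```python
# Sketch: manual multi-head self-attention for encoder layer 0.
# Assumes `model` and `hidden` = the layer input, e.g. the embedding output.
import math
import torch

self_attn = model.encoder.layer[0].attention.self
out_proj = model.encoder.layer[0].attention.output.dense
num_heads, head_dim = 12, 32

def split_heads(x):
    # (batch, seq, 384) -> (batch, 12, seq, 32)
    b, s, _ = x.shape
    return x.view(b, s, num_heads, head_dim).transpose(1, 2)

with torch.no_grad():
    q = split_heads(self_attn.query(hidden))
    k = split_heads(self_attn.key(hidden))
    v = split_heads(self_attn.value(hidden))

    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # QK^T / sqrt(d_k)
    probs = torch.softmax(scores, dim=-1)                    # bidirectional: no causal mask
    context = probs @ v                                      # per-head weighted sums of V

    # Concatenate the 12 heads back to (batch, seq, 384) and apply the output projection
    b, _, s, _ = context.shape
    attn_out = out_proj(context.transpose(1, 2).reshape(b, s, num_heads * head_dim))
```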

Layer normalization and residual connections

MiniLM follows BERT's post-attention LayerNorm pattern:

  • Input → Attention → Residual connection → LayerNorm
  • Result → Feed-forward → Residual connection → LayerNorm

We manually implemented the LayerNorm function, and our LayerNorm outputs matched MiniLM's within a 1.9e-6 tolerance across all layers.
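
Below is a sketch of a hand-rolled LayerNorm applied in the post-LN order described above and compared against the layer's own LayerNorm module. It assumes `hidden` (the sublayer input) and `attn_out` (the attention output) from the previous sketches.

```python
# Sketch: residual connection followed by LayerNorm, implemented by hand.
# Assumes `model`, `hidden`, and `attn_out` from the sketches above.
import torch

def layer_norm(x, weight, bias, eps):
    # Normalize over the feature dimension (biased variance), then scale and shift
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

ln = model.encoder.layer[0].attention.output.LayerNorm
with torch.no_grad():
    manual = layer_norm(hidden + attn_out, ln.weight, ln.bias, ln.eps)  # residual, then norm
    official = ln(hidden + attn_out)
    print(torch.max(torch.abs(manual - official)))  # expect a difference on the order of 1e-6
```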

Feed-forward networks

Each layer contains a two-layer MLP that expands from 384 to 1,536 dimensions, then contracts back to 384. We used PyTorch's GELU activation to match MiniLM's implementation exactly, achieving near-perfect numerical alignment with the official model.
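
A sketch of the feed-forward sublayer for one layer, assuming `model` and `x`, the output of the attention sublayer (i.e. the post-attention, post-LayerNorm hidden states):

```python
# Sketch: feed-forward sublayer of encoder layer 0 (384 -> 1,536 -> 384 with GELU).
# Assumes `model` and `x` = output of the attention sublayer for that layer.
import torch
import torch.nn.functional as F

expand = model.encoder.layer[0].intermediate.dense  # 384 -> 1536
contract = model.encoder.layer[0].output.dense      # 1536 -> 384
ln = model.encoder.layer[0].output.LayerNorm

with torch.no_grad():
    h = F.gelu(expand(x))      # expansion + PyTorch's GELU
    y = ln(contract(h) + x)    # contraction, residual connection, LayerNorm
```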

Sentence pooling and final embeddings

MiniLM converts token-level representations into a sentence-level embedding by extracting the [CLS] token and passing it through the pooler transformation, producing the final output: a 384-dimensional sentence embedding. Our manual pooler achieved a cosine similarity of 1.0 with MiniLM's official output. The result is a fully functional, manually implemented version of MiniLM-L6-H384-uncased.
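
A sketch of this pooling step, assuming `model` and the tokenized input `enc` from the earlier sketches; the manual [CLS]-plus-pooler path is compared against the model's own pooler output.

```python
# Sketch: CLS extraction and pooler transformation to a 384-dim sentence embedding.
# Assumes `model` and `enc` (the tokenized sentence) from the sketches above.
import torch
import torch.nn.functional as F

with torch.no_grad():
    outputs = model(**enc)                    # full official forward pass
    last_hidden = outputs.last_hidden_state   # (batch, seq, 384) token representations

    cls = last_hidden[:, 0]                                  # [CLS] token from the final layer
    manual_sentence = torch.tanh(model.pooler.dense(cls))    # pooler: dense layer + tanh

    cos = F.cosine_similarity(manual_sentence, outputs.pooler_output)
    print(cos)                                # expect 1.0
```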


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
