- Complete MiniLM reconstruction
- Tokenization and embedding layer
- Multi-head self-attention mechanism
- Layer normalization and residual connections
- Feed-forward networks
- Sentence pooling and final embeddings
As a vehicle for this explanation, we provide a layer-by-layer manual reconstruction of the internal workings of encoders, using MiniLM-L6-H384-uncased as our sentence encoder. MiniLM is released under the Apache-2.0 open-source license, so we can legally download all of its components, run the computations ourselves, and compare results to ensure accuracy. As the sentence to encode, we choose Hussain et al.'s (2024) "Open-source LLMs rock." The description follows below; the Jupyter notebook showing the ground-up reconstruction of an operational version of MiniLM is available now.
Bug reports and corrections are welcome!
Complete MiniLM reconstruction
Our reconstruction covers the entire MiniLM-L6-H384-uncased architecture from raw text input to final sentence embeddings. We manually implemented every component using extracted weights and achieved numerical precision within 1e-5 tolerance across all layers. MiniLM is a 6-layer bidirectional encoder with the following specifications:
- Hidden dimension: 384 features per token
- Attention heads: 12 per layer (32 dimensions each)
- Vocabulary: 30,522 tokens using WordPiece tokenization
- Position encoding: Learned embeddings (BERT-style)
- Feed-forward: GELU activation
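As a concrete starting point, here is a minimal sketch of loading the checkpoint with the Hugging Face transformers library and confirming the numbers above. The hub ID nreimers/MiniLM-L6-H384-uncased and the variable names are our assumptions for illustration, not necessarily what the notebook uses.

```python
# Minimal loading sketch; the hub ID below is an assumption.
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL_ID = "nreimers/MiniLM-L6-H384-uncased"  # assumed Hugging Face Hub ID

config = AutoConfig.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(config.num_hidden_layers)     # 6 encoder layers
print(config.hidden_size)           # 384 features per token
print(config.num_attention_heads)   # 12 heads -> 384 / 12 = 32 dims each
print(config.vocab_size)            # 30,522 WordPiece tokens
print(config.intermediate_size)     # 1,536 feed-forward dimension
print(config.hidden_act)            # gelu

# Every learned weight is available for manual reconstruction:
state_dict = model.state_dict()
```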
Tokenization and embedding layer
MiniLM uses BERT's tokenization approach with special tokens [CLS] and [SEP]. Our input "Open-source LLMs rock." becomes: [CLS] open - source ll ##ms rock . [SEP]. We manually extracted and empirically verified three embedding types:
- Token embeddings: 30,522 × 384 matrix mapping vocabulary to vectors
- Position embeddings: Learned 512 × 384 matrix for positional information
- Token type embeddings: 2 × 384 matrix distinguishing sentence A/B pairs
We achieved a close-to-perfect numerical match against the model's embedding output when summing all three embedding types.
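The sketch below reproduces this step, assuming the `model` and `tokenizer` objects from the loading sketch above. Note that BERT-style embeddings also apply a LayerNorm after the three-way sum, so it is included here before comparing against the embedding module's output; variable names are ours.

```python
# Embedding reconstruction sketch (assumes `model` and `tokenizer` from above).
import torch

model.eval()
enc = tokenizer("Open-source LLMs rock.", return_tensors="pt")
input_ids = enc["input_ids"]            # [CLS] open - source ll ##ms rock . [SEP]
token_type_ids = enc["token_type_ids"]  # all zeros: a single sentence (type A)
seq_len = input_ids.shape[1]

emb = model.embeddings
with torch.no_grad():
    tok = emb.word_embeddings.weight[input_ids[0]]              # (seq_len, 384)
    pos = emb.position_embeddings.weight[:seq_len]              # (seq_len, 384)
    typ = emb.token_type_embeddings.weight[token_type_ids[0]]   # (seq_len, 384)

    manual = emb.LayerNorm(tok + pos + typ)                     # sum of the three types
    reference = emb(input_ids=input_ids, token_type_ids=token_type_ids)[0]
    print((manual - reference).abs().max())                     # expected to be tiny (~1e-6)
```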
Multi-head self-attention mechanism
Each attention layer splits the 384-dimensional input across 12 heads of 32 dimensions each. Unlike GPT-2's causal attention, MiniLM uses bidirectional attention where each token can attend to all other tokens. We manually performed the following steps.
- Manually computed Q, K, V projections using extracted weight matrices
- Implemented scaled dot-product attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V
- Verified attention scores for each head individually (max difference < 1e-8)
- Concatenated multi-head outputs and applied output projection
We again achieved a close-to-perfect numerical match between our manual attention outputs and MiniLM's.
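A sketch of these steps for the first encoder layer follows, assuming `model` from the loading sketch and the embedding output `reference` from the previous sketch; variable names are ours, not the notebook's.

```python
# Manual attention sketch for layer 0 (assumes `model` and `reference` from above).
import math
import torch

hidden = reference                      # embedding output, shape (seq_len, 384)
self_attn = model.encoder.layer[0].attention.self
num_heads, head_dim = 12, 32
seq_len = hidden.shape[0]

def split_heads(x):
    # (seq_len, 384) -> (num_heads, seq_len, head_dim)
    return x.view(seq_len, num_heads, head_dim).transpose(0, 1)

with torch.no_grad():
    # Q, K, V projections using the extracted weight matrices
    q = split_heads(self_attn.query(hidden))
    k = split_heads(self_attn.key(hidden))
    v = split_heads(self_attn.value(hidden))

    # Scaled dot-product attention; bidirectional, so no causal mask
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    weights = torch.softmax(scores, dim=-1)
    context = weights @ v                                   # (num_heads, seq_len, head_dim)

    # Concatenate the heads and apply the output projection
    context = context.transpose(0, 1).reshape(seq_len, num_heads * head_dim)
    attn_out = model.encoder.layer[0].attention.output.dense(context)
```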
Layer normalization and residual connections
MiniLM follows BERT's post-attention LayerNorm pattern:
- Input → Attention → Residual connection → LayerNorm
- Result → Feed-forward → Residual connection → LayerNorm
We manually implemented the LayerNorm function; our outputs matched MiniLM's within 1.9e-6 tolerance across all layers, once again a close-to-perfect numerical match with the official model.
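A minimal sketch of such a manual LayerNorm follows, cross-checked against one of the model's extracted LayerNorm modules; the layer choice and the random test input are illustrative.

```python
# Manual LayerNorm sketch: normalize over the last (384-dim) axis,
# then apply the extracted scale (weight) and shift (bias).
import torch

def manual_layer_norm(x, weight, bias, eps):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # population variance, as LayerNorm uses
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

# Cross-check against the post-attention LayerNorm of layer 0.
ln = model.encoder.layer[0].attention.output.LayerNorm
with torch.no_grad():
    x = torch.randn(8, 384)
    diff = (manual_layer_norm(x, ln.weight, ln.bias, ln.eps) - ln(x)).abs().max()
    print(diff)  # expected on the order of 1e-6 or smaller
```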
Feed-forward networks
Each layer contains a two-layer MLP that expands from 384 to 1,536 dimensions and then contracts back. We used PyTorch's GELU for exact matching with MiniLM's implementation, achieving near-perfect numerical alignment with the official model.
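The sketch below completes layer 0 under the same assumptions as the attention sketch above (`attn_out` and `hidden` already computed); the residual-plus-LayerNorm steps mirror the pattern from the previous section, and variable names are ours.

```python
# Feed-forward sketch for layer 0: 384 -> 1,536 -> GELU -> 384
# (assumes `model`, `hidden`, and `attn_out` from the sketches above).
import torch

block = model.encoder.layer[0]
gelu = torch.nn.GELU()  # erf-based GELU, matching the model's activation

with torch.no_grad():
    # Residual connection + post-attention LayerNorm (previous section)
    attn_normed = block.attention.output.LayerNorm(attn_out + hidden)

    expanded = gelu(block.intermediate.dense(attn_normed))    # (seq_len, 1536)
    ffn_out = block.output.dense(expanded)                     # (seq_len, 384)

    # Second residual connection + LayerNorm closes the layer
    layer_out = block.output.LayerNorm(ffn_out + attn_normed)

    # Cross-check against the full encoder layer run end to end
    reference_layer = block(hidden.unsqueeze(0))[0][0]
    print((layer_out - reference_layer).abs().max())           # expected ~1e-6
```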
Sentence pooling and final embeddings
MiniLM converts token-level representations into a sentence-level embedding by extracting the [CLS] token and passing it through the pooler transformation, producing the final output: a 384-dimensional sentence embedding. Our manual pooler achieved a cosine similarity of 1.0 with MiniLM's official output. This represents a fully functional, manually implemented version of MiniLM-L6-H384-uncased.
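As a final cross-check, here is a sketch of the pooling step. For brevity it takes the final token representations from the model's own forward pass rather than from the manual six-layer stack, and it assumes the BERT-style pooler (a dense layer followed by tanh applied to the [CLS] vector).

```python
# Pooling sketch (assumes `model` and `tokenizer` from the loading sketch).
import torch

model.eval()
enc = tokenizer("Open-source LLMs rock.", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
    last_hidden = out.last_hidden_state[0]                   # (seq_len, 384)

    cls_vec = last_hidden[0]                                  # [CLS] token at position 0
    manual_sentence_emb = torch.tanh(model.pooler.dense(cls_vec))

    cos = torch.nn.functional.cosine_similarity(
        manual_sentence_emb, out.pooler_output[0], dim=0
    )
    print(cos)                                                # expected: 1.0
```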