- Open source LLMs are examinable
- Part 1. Download a local copy of the model
- Part 2. Understand the model’s architecture
- Part 3. Build and verify interim outputs
- Part 4. Traps to avoid
Open source LLMs are examinable
All the King’s horses and all the King’s men, Couldn’t put the LLM together again.
As this twist on the famous nursery rhyme suggests, knowing the parts of an LLM conceptually is one thing; disassembling and rebuilding one accurately is another altogether. This section discusses a strategy for doing so, focusing on the three primary transformer architectures: encoders, decoders, and encoder-decoders. In the next sections we put the strategy into action with applied reconstructions of widely used LLMs: the MiniLM sentence encoder, the GPT-2 decoder, and the T5-small encoder-decoder.
Encoders, like MiniLM, create contextualised embeddings that are used in later analyses such as prediction and classification. They use bidirectional self-attention, meaning they attend to tokens before and after the current token simultaneously during inference, which makes them suited to tasks that need full-sentence understanding. Decoders, like GPT-2, use autoregressive (causal, masked) self-attention, meaning each token only sees previous tokens, which makes them suited to generative tasks like producing text. Encoder-decoders, like T5, are good for sequence-to-sequence tasks like translation; they use bidirectional self-attention in the encoder and causal attention in the decoder.
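A minimal sketch of the difference in attention patterns, using plain PyTorch tensors (the sequence length is an arbitrary choice for illustration):

```python
import torch

# For a sequence of 5 tokens, an encoder (e.g., MiniLM) lets every position
# attend to every other position, while a decoder (e.g., GPT-2) masks out
# future positions so each token only sees itself and earlier tokens.
seq_len = 5

bidirectional_mask = torch.ones(seq_len, seq_len)        # encoder: all positions visible
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # decoder: lower-triangular mask

print(bidirectional_mask)
print(causal_mask)  # row i attends only to columns 0..i
```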
Large Language Models (LLMs) are transformers (i.e., an encoder, decoder, or encoder-decoder) that have been trained on vast quantities of natural language data. ‘Large’ refers here to the amount of language data used and the number of model parameters.
LLMs are sophisticated tools that are commonly described as black boxes. This usually means their internal mechanisms can’t be scrutinized. Commercial models do indeed prevent scrutiny of their internal designs (model parameters are kept server side), but open source LLMs do not. Open source LLMs are black boxes only in a functional sense: reverse engineering them is impractical, not impossible. If a model is open source, it can be disassembled and reassembled in a manner that is decoupled from the official model.
As an example, in the upcoming demonstrations we use PyTorch’s register_forward_hook on each layer of several LLMs to capture matrix projections. The motivation is primarily instructional, but the understanding gained may also prove useful should the parameters of LLMs permit psychological interpretation in future. Reverse engineering here means replicating the inner workings of the LLM at inference, rather than training the LLM from the ground up. Training and fine-tuning are addressed in subsequent sections.
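As a minimal sketch of this hook-based capture, assuming GPT-2 as the target (the dictionary key names are our own choice, not part of any API):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy of the block's output hidden states.
        captured[name] = output[0].detach() if isinstance(output, tuple) else output.detach()
    return hook

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

# Register a forward hook on every transformer block.
handles = [block.register_forward_hook(make_hook(f"block_{i}"))
           for i, block in enumerate(model.h)]

with torch.no_grad():
    model(**tokenizer("The cat sat on the mat", return_tensors="pt"))

for h in handles:
    h.remove()

print({k: v.shape for k, v in captured.items()})
```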
Part 1. Download a local copy of the model
The first step in reverse engineering is choosing an open-source transformer; MiniLM, GPT-2, and T5-small are good examples. Hugging Face’s transformers package stores the model configuration in a config.json file and the weights in a pytorch_model.bin file. See Hugging Face’s save_pretrained / from_pretrained commands.
Our choices here are MiniLM as the encoder, GPT-2 as the decoder, and T5-small as the encoder-decoder. In each case, the important first step is to download an offline copy of the model and all its components, such as learned position encodings, token type embeddings where relevant, and the weight matrices Q_w, K_w, and V_w (see Vaswani et al., 2017).
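A hedged sketch of this step, using save_pretrained / from_pretrained as mentioned above (the local directory names and the specific MiniLM checkpoint id are our own assumptions, and newer versions of transformers may save weights as safetensors rather than pytorch_model.bin):

```python
from transformers import AutoModel, AutoTokenizer

# Save an offline copy of each model and its tokenizer.
for model_id, local_dir in [
    ("sentence-transformers/all-MiniLM-L6-v2", "./offline/minilm"),
    ("gpt2", "./offline/gpt2"),
    ("t5-small", "./offline/t5-small"),
]:
    AutoTokenizer.from_pretrained(model_id).save_pretrained(local_dir)
    AutoModel.from_pretrained(model_id).save_pretrained(local_dir)

# Later, load entirely from disk, decoupled from the Hugging Face Hub.
model = AutoModel.from_pretrained("./offline/gpt2")
```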
Part 2. Understand the model’s architecture
The next stage is to get a solid understanding of the architecture: how data flows through the model and how it transforms inputs into representations and responses. The architecture describes the format the LLM takes as input, its tokenization approach, any positional encodings, the feed-forward multilayer perceptron, the attention mechanisms it uses (self-attention or cross-attention), the number of layers and how they are configured, and the normalization strategies.
The model architecture also defines how outputs are produced, i.e., whether they take the form of numerical embeddings or token sequences for text generation. With this knowledge, we can begin the replication process one step at a time. Knowing where you are in the rebuild and how much remains to be mapped is reassuring. You will also likely discover that once you have the first transformer block coded correctly, the rest are identical clones.
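A minimal sketch of how the architecture can be read straight from the offline copy (the path assumes the directory layout from the earlier sketch):

```python
from transformers import AutoConfig, AutoModel

# The config holds the structural facts: number of layers, heads,
# hidden size, vocabulary size, positional encoding limits, and so on.
config = AutoConfig.from_pretrained("./offline/gpt2")
print(config)

# Walking the module tree shows how the blocks are actually arranged.
model = AutoModel.from_pretrained("./offline/gpt2")
for name, module in model.named_modules():
    if name.count(".") <= 1:   # keep the printout shallow
        print(name, type(module).__name__)
```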
Part 3. Build and verify interim outputs
From this point, the process involves running the official LLM and pausing it with a hook that captures interim outputs, such as the tokenized input text, the input tensor that enters the first transformer block, the projection of that tensor into different planes using the weight matrices, and so on.
The information to capture is the exact values in the matrices, as ground truth, and the dimensionality of every input matrix and every generated matrix. When you reimplement in Python using PyTorch, you’ll have checkpoints against which to compare your progress. This is helpful for debugging: if your output has a different dimensionality from the model’s at any checkpoint, you are likely further from the correct path than if only the weights differ while the matrix dimensions match.
If it’s working, you should expect agreement to at least 5 decimal places when making element-wise comparisons of floating-point values from the official and reimplemented blocks (noting that close matching requires identical hardware and software environments). Repeat this process for every block in your architecture, checking that your precision is acceptable at every stage, until you reach the end.
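A hedged sketch of such a checkpoint comparison: `captured["block_0"]` is the ground-truth tensor taken from the hook sketch above, and `my_block_0` is a hypothetical stand-in for the output of your own reimplementation (here it is just a clone so the sketch runs end to end).

```python
import torch

ground_truth = captured["block_0"]
my_block_0 = ground_truth.clone()  # replace with your reimplemented block's output

# Check the shape first, then the element-wise values to ~5 decimal places.
assert my_block_0.shape == ground_truth.shape, "dimension mismatch"
assert torch.allclose(my_block_0, ground_truth, atol=1e-5), "values diverge"
print("checkpoint matched:", tuple(ground_truth.shape))
```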
Part 4. Traps to avoid
I only realised a number of these steps part way through the process, so my reconstructions of MiniLM and GPT-2 do not follow every step. For example, I occasionally call the live model’s weight matrices instead of looking up their offline counterparts. This does not invalidate the reconstruction, but the decoupling in my notebooks would have been more complete had I used the offline versions of these inner components.
Another sticking point is that transformer models may depart from textbook methods. Perhaps they add positional embeddings or token type encodings learned during training instead of sinusoidal positional encodings or fixed type embeddings. Also, for a thoroughly decoupled reconstruction, you want to avoid accidentally dragging an interim model output matrix into the deconstructed pipeline, or accidentally comparing an interim output to itself during a ground-truth comparison.
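A minimal sketch of checking the first trap: if positional information is learned, it appears as a trained weight matrix in the parameters (in GPT-2, for example, `wpe` holds learned position embeddings alongside the `wte` token embeddings; MiniLM’s BERT-style encoder similarly stores `embeddings.position_embeddings` as trained parameters).

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")
for name, param in model.named_parameters():
    # Learned positional embeddings show up here; sinusoidal encodings would not.
    if "wpe" in name or "position" in name:
        print(name, tuple(param.shape))
```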
Finally, in some instances it may be impossible to discern how the output from one stage leads to the next. In these cases your options are limited. (So far, in the demonstrations that follow, we have not encountered this.) Faced with such a situation, you may decide to carry the interim model output forward in your decoupled pipeline, particularly if the goal is primarily instructional.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).