- Model inference versus model training
- Common training data sources (Liu et al., 2024)
- Little transparency exists (Bommasani et al., 2023, 2024)
- Training a small LLM from scratch
- Part one: PyTorch training code
- Part two: Amazon Machine Image with GPU
- AWS requires compute justification
- AWS launch rights approval for GPU AMIs
- Configuring your AMI for launch
- References
In this section we describe training a small LLM ourselves, including what to try, what might be learned, and the hurdles you might encounter. It is an interim report on how the training is going and will be updated as progress is made.
Model inference versus model training
When we looked at what happens inside MiniLM and GPT-2 at inference (i.e., during live operational use), we saw that transformers use advanced matrix operations (attention mechanisms, MLPs, residual connections, etc.) to create 'context-aware' representations or token completions. See the earlier sections for notebooks that decouple and reconstruct the MiniLM and GPT-2 inference processes. The process involves applying pre-trained weight matrices to matrices representing the input text. Framing GenAI as 'completions' reinforces that the model isn't aware it's talking; it's just predicting the next token.
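To make this concrete, the short sketch below treats inference as next-token prediction. It assumes the Hugging Face transformers library and the publicly released gpt2 checkpoint rather than the exact notebooks from the earlier sections:

```python
# A minimal sketch of inference as applying pre-trained weights to an input:
# load the public GPT-2 checkpoint, score every possible next token, and peek at
# one of the weight matrices doing the work.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The weights of a language model come from"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)

# The last position's logits score every vocabulary token as a possible continuation.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))        # the single most likely next token

# The 'knowledge' applied above lives in pre-trained matrices, e.g. the fused
# Q/K/V projection of the first attention block: 768-dim inputs -> stacked Q, K, V.
print(model.transformer.h[0].attn.c_attn.weight.shape)   # torch.Size([768, 2304])
```

Everything the model applies when scoring that continuation is stored in matrices like c_attn, which raises the question the rest of this section turns to: where do those numbers come from?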
But where do those weight matrices (e.g., Q, K, V) we discussed come from? We're encouraged to accept that they come from black boxes, beyond our reach in terms of the scale of training data needed, the compute power required, and the financial cost. The received wisdom is that it's worth fine-tuning these models but not training from the ground up, as we'll never reach commercial-level performance. But you can train a small LLM to a working level, and there are several reasons you might want to, e.g., if you have sensitive data that cannot be shared or need to build a model with no contamination.
Another reason is that when training an LLM from scratch, you can see what happens to the weights in key model matrices as a function of different model architectures, initialisation decisions, data sources, and training designs. Even though LLMs are complex systems with emergent properties that are not easily traceable to individual weights, knowing the origins of these weights might aid interpretative efforts or contribute to theory. Readers interested in entry points for exploring model interpretability further can refer to the recent review of LLM interpretability by Zhao et al. (2024). This section is primarily about introducing AI model training.
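As a flavour of what this looks like in practice, here is a minimal, self-contained sketch of tracking how key weight matrices drift during training. It uses a toy Q/K/V projection and an artificial training signal rather than a full LLM, so every name and number here is illustrative only:

```python
# A toy demonstration of watching weights change during training: three stand-in
# Q/K/V projection matrices are optimised against a fake target, and simple
# statistics are logged so the drift from initialisation is visible.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Hypothetical stand-ins for the Q, K and V projections of a single attention head.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
params = [*W_q.parameters(), *W_k.parameters(), *W_v.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-3)

def weight_summary(step):
    """Print statistics that show how each projection drifts from its initialisation."""
    for name, layer in [("W_q", W_q), ("W_k", W_k), ("W_v", W_v)]:
        w = layer.weight.detach()
        print(f"step {step:4d}  {name}: norm={w.norm().item():.3f}  std={w.std().item():.4f}")

x = torch.randn(32, d_model)        # a fake batch of token representations
target = torch.randn(32, d_model)   # a fake training signal, purely for illustration

for step in range(301):
    scores = (W_q(x) @ W_k(x).T) / d_model ** 0.5        # scaled dot-product attention
    out = torch.softmax(scores, dim=-1) @ W_v(x)
    loss = nn.functional.mse_loss(out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        weight_summary(step)
```

In a real training run the same idea applies: log norms, spectra, or snapshots of the matrices you care about at regular intervals and relate the changes to your architecture, initialisation, data, and training design choices.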
Common training data sources (Liu et al., 2024)
Training data sources are rarely disclosed by the large model providers. This may in part stem from concerns over copyright. Alex Reisner wrote a piece in The Atlantic describing how 191,000 books were used to train LLMs without their authors' consent (my first analytics book with Jonathan Ferrar and Sheri Feinzig, The Power of People, was one of them!).
While we do not know the precise details for all the proprietary models, Liu et al. (2024) surveyed the common data sources used in LLM training. They found these could be grouped into eight major categories, with Chinese and English reported as the most common languages. Most training involves mixed corpora drawn from across the following categories (a minimal sketch of assembling such a mixture follows the list):
- Webpages
- Constructed texts (e.g., American National Corpus and British National Corpus)
- Books
- Academic materials
- Code
- Parallel corpus data (i.e., texts in multiple languages)
- Social media
- Encyclopaedia data sets
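As a simple illustration of the mixing idea, the sketch below samples documents from several source categories according to chosen mixture weights. The file paths and weights are placeholders for illustration, not values reported by Liu et al. (2024):

```python
# A self-contained sketch of assembling a mixed training corpus: pick a source
# category according to mixture weights, then pick a document from that category.
import random

corpus_sources = {
    "webpages":  ["data/web/doc1.txt", "data/web/doc2.txt"],
    "books":     ["data/books/novel1.txt"],
    "code":      ["data/code/module1.py"],
    "academic":  ["data/papers/paper1.txt"],
}
mixture_weights = {"webpages": 0.6, "books": 0.2, "code": 0.1, "academic": 0.1}

def sample_documents(n, seed=0):
    """Yield (category, path) pairs drawn according to the mixture weights."""
    rng = random.Random(seed)
    categories = list(mixture_weights)
    weights = [mixture_weights[c] for c in categories]
    for _ in range(n):
        category = rng.choices(categories, weights=weights, k=1)[0]
        yield category, rng.choice(corpus_sources[category])

for category, path in sample_documents(5):
    print(f"{category:9s} -> {path}")
```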
Very importantly, as original human-generated data runs low, these models are increasingly trained on AI-generated data. This brings its own challenges, such as amplifying the errors of the original models, reducing the originality of the content AI models produce, and an overall degradation in model quality over time.
Little transparency exists (Bommasani et al., 2023, 2024)
The data strategies of many of the most widely used proprietary models are closely held secrets, which makes it difficult to assuage the concerns about bias referenced earlier. Bommasani et al. (2023, 2024) developed the Foundation Model Transparency Index, which evaluates the transparency of model providers regarding the upstream resources (e.g., data, labour) involved in building their models, the capabilities and limitations of the models themselves, and downstream monitoring and accountability.
Open source models (e.g. StarCoder by BigCode/HuggingFace/ServiceNow) were more forthcoming about these issues while closed source models (e.g. OpenAI) were, historically at least, less transparent. In all cases, the least transparency was shown in relation to the upstream resources used in the development of the models, such as the data used and the role of human labour. According to Bommasani et al. (2023), models at the time of publication could be ranked from most to least transparent with respect to data management as follows (see provider pages for their latest model information):
- StarCoder by BigCode/HuggingFace/ServiceNow
- Jurassic-2 by AI21 Labs
- Luminous by AlephAlpha
- Granite by IBM
- Phi-2 by Microsoft
- Llama 2 by Meta
- Stable Video Diffusion by stability.ai
- Palmyra-X by Writer
- Mistral 7B by Mistral AI
- Claude 3 by Anthropic
- GPT-4 by OpenAI
- Gemini by Google
- Titan Text Express by Amazon
- Fuyu-8B by Adept
Training a small LLM from scratch
The train-from-scratch challenge has two parts: writing the code and launching it on a commercial cloud computing vendor's servers. First, we want to highlight the logical inconsistency of training a 'small' Large Language Model! The point of using a smaller model, however, is to run all training requirements at a smaller scale. We have deliberately avoided the advanced ML operations here that are needed for larger models, such as Kubernetes orchestration and other commercial-scale training technologies.
Part one: PyTorch training code
First is writing the Python and PyTorch code to: sample a small subset of data (e.g., 500 MB, or roughly 125 million tokens, from 'The Pile', or dialogue data for a chat model); build a tokenizer (e.g., Byte Pair Encoding); tokenize the training data; split it into training and hold-out data; create training batches; specify an architecture (e.g., GPT-2-small with 117M parameters); hook it up to a training loop; and run the model. This will not produce anywhere near the experience users have with commercial LLMs, but the goal here is didactic.
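The sketch below compresses those steps into a single runnable script built around a toy corpus. It illustrates the shape of the pipeline rather than reproducing our exact training code; a real run would swap in roughly 500 MB of text, a GPT-2-small configuration, and a GPU:

```python
# A didactic end-to-end sketch: train a BPE tokenizer, tokenize, split, batch,
# define a small GPT-style model, and run a short training loop.
import torch
import torch.nn as nn
from tokenizers import ByteLevelBPETokenizer   # Hugging Face 'tokenizers' package

# 1. A tiny stand-in corpus (replace with text sampled from your chosen source).
corpus = ["The quick brown fox jumps over the lazy dog. "] * 200

# 2. Build a Byte Pair Encoding tokenizer and tokenize the training data.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=1000, min_frequency=2)
ids = [tid for text in corpus for tid in tokenizer.encode(text).ids]
data = torch.tensor(ids, dtype=torch.long)

# 3. Split into training and hold-out data.
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]

# 4. Create training batches of (input, next-token target) pairs.
block_size, batch_size = 64, 16
def get_batch(source):
    starts = torch.randint(0, len(source) - block_size - 1, (batch_size,))
    x = torch.stack([source[s:s + block_size] for s in starts])
    y = torch.stack([source[s + 1:s + block_size + 1] for s in starts])
    return x, y

# 5. Specify a (very) small GPT-style architecture.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_head=4, n_layer=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each position can only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=idx.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)

# 6. Hook it up to a training loop and run the model.
model = TinyGPT(tokenizer.get_vocab_size())
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(200):
    x, y = get_batch(train_data)
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: training loss {loss.item():.3f}")
```

Scaling this up mostly means swapping in a larger corpus and architecture, evaluating regularly on the hold-out split, checkpointing, and moving the loop onto a GPU.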
Part two: Amazon Machine Image with GPU
For part two, we need an NVIDIA GPU virtual machine, and AWS was my first choice. The AWS control panel is quite complex at first encounter, but you start to get the hang of it with repeated exposure. I'm only trying to launch a basic VM; building AI models with the sustained access, monitoring, and costs that come with operational deployment is a much tougher task. But even the small job of launching an EC2 instance from an Amazon Machine Image on a g4dn.xlarge (1 × T4 GPU) failed, perhaps because my account was too new to make such a request.
The online advice is to request a quota increase with a better justification and to use on-demand compute resources, which are instantly available at an hourly rate, rather than spot compute resources, which are offered at large discounts but can be interrupted when AWS reclaims the capacity. Some people suggested trying a smaller job first and trying different regions with better GPU availability (the US East Coast is competitive; Ohio is good, London is bad). Although an initial justification was rejected, a more thorough justification of the request for GPU processing power was approved. Readers may want to see the service quota justification that secured launch rights; it is reproduced below, followed by the approval.
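Before turning to the justification itself, for readers who prefer to script these steps, here is a hedged sketch using boto3 (the AWS SDK for Python). The quota code, AMI id, and key pair name are assumptions to verify against your own account and region, and a quota request made this way may still trigger the kind of written justification reproduced below:

```python
# A sketch of the three steps discussed above: check the relevant GPU quota,
# request an increase, and launch an on-demand g4dn.xlarge once it is granted.
# Running this makes real AWS API calls, so review values before executing.
import boto3

REGION = "us-east-2"    # e.g. Ohio, reported to have better GPU availability
quotas = boto3.client("service-quotas", region_name=REGION)

# Assumed quota code for "Running On-Demand G and VT instances" (a vCPU count);
# confirm the current code in the Service Quotas console for your account.
QUOTA_CODE = "L-DB2E81BA"
current = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=QUOTA_CODE)
print("Current G/VT on-demand vCPU quota:", current["Quota"]["Value"])

# Ask for enough vCPUs for one g4dn.xlarge (4 vCPUs).
quotas.request_service_quota_increase(
    ServiceCode="ec2", QuotaCode=QUOTA_CODE, DesiredValue=4.0
)

# Once approved: launch the instance from a GPU-ready AMI (id is a placeholder).
ec2 = boto3.client("ec2", region_name=REGION)
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Deep Learning AMI id for the region
    InstanceType="g4dn.xlarge",        # 1 x NVIDIA T4 GPU
    MinCount=1,
    MaxCount=1,
    KeyName="my-training-key",         # hypothetical key pair name
)
```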
AWS requires compute justification
Dear AWS, thank you for reviewing our request. We are a startup developing AI models to support psychological assessments. Our work involves training custom language models with deep learning that are designed to interpret and respond to user input in sensitive leadership assessment contexts. Training these deep learning models, especially modern neural networks like GPT architectures, requires significant computing power, particularly from GPUs. To manage costs while meeting these technical demands, we’re requesting an increase in our quota for GPU-based Spot Instances (such as g5.xlarge, g4dn.xlarge, and similar). Our expected usage will be under XX hours over XX weeks with estimated costs under XX. We understand Spot Instances can be interrupted and have designed our training process to handle this. Our planned usage will be short-term (XX weeks), and we’re actively monitoring our resource usage and costs with AWS tools. We very much appreciate your reconsideration of this request and are happy to provide any additional details. Best wishes, Nigel Guenole,
AWS launch rights approval for GPU AMIs
Configuring your AMI for launch
References
Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., ... & Liang, P. (2023). The foundation model transparency index. arXiv preprint arXiv:2310.12941.
Bommasani, R., Klyman, K., Longpre, S., Xiong, B., Kapoor, S., Maslej, N., ... & Liang, P. (2024, October). Foundation model transparency reports. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (Vol. 7, pp. 181-195).
Guo, Y., Guo, M., Su, J., Yang, Z., Zhu, M., Li, H., ... & Liu, S. S. (2024). Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915.
Liu, Y., Cao, J., Liu, C., Ding, K., & Jin, L. (2024). Datasets for large language models: A comprehensive survey. arXiv preprint arXiv:2402.18041.
Wang, Z., Zhong, W., Wang, Y., Zhu, Q., Mi, F., Wang, B., ... & Liu, Q. (2023). Data management for large language models: A survey. CoRR.