- Model inference versus model training
- Common training data sources (Liu et al., 2024)
- Little transparency exists (Bommasani et al., 2023, 2024)
- Training a small LLM from scratch
- Part one: PyTorch training code
- Part two: Amazon Machine Image with GPU
- AWS requires compute justification
- AWS launch rights approval for GPU AMIs
- Configuring your AMI for launch
- References
In this section we describe training a small LLM ourselves, including what to try, what might be learned, and the hurdles you might encounter. It is an interim report on how the training is going and will be updated as progress is made.
Model inference versus model training
When we looked at what happens inside MiniLM and GPT-2 at inference (i.e., during live operational use), we saw that transformers use advanced matrix operations (attention mechanisms, MLPs, residual connections, etc.) to create 'context-aware' representations or token completions. See the earlier sections for notebooks that decouple and reconstruct the MiniLM and GPT-2 inference processes. The process involves applying pre-trained weight matrices to matrices representing the input text. Framing GenAI as 'completions' reinforces that the model isn't aware it's talking; it's just predicting the next token.
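To make this concrete, the short sketch below treats inference as next-token prediction. It assumes the Hugging Face transformers library and the publicly released gpt2 checkpoint rather than the exact notebooks from the earlier sections:

```python
# A minimal sketch of inference as applying pre-trained weights to an input:
# load the public GPT-2 checkpoint, score every possible next token, and peek at
# one of the weight matrices doing the work.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The weights of a language model come from"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)

# The last position's logits score every vocabulary token as a possible continuation.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))        # the single most likely next token

# The 'knowledge' applied above lives in pre-trained matrices, e.g. the fused
# Q/K/V projection of the first attention block: 768-dim inputs -> stacked Q, K, V.
print(model.transformer.h[0].attn.c_attn.weight.shape)   # torch.Size([768, 2304])
```

Everything the model applies when scoring that continuation is stored in matrices like c_attn, which raises the question the rest of this section turns to: where do those numbers come from?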
But where do those weight matrices (e.g., Q, K, V) we discussed come from? We're encouraged to accept that they come from black boxes, beyond our reach in terms of the scale of training data needed, the compute power required, and the financial cost. The received wisdom is that it's worth fine-tuning these models but not training from the ground up, as we'll never reach commercial-level performance. But you can train a small LLM to a working level, and there are several reasons you might want to, e.g., if you have sensitive data that cannot be shared or need to build a model with no contamination.
Another reason is that when training an LLM from scratch, you can see what happens to the weights in key model matrices as a function of different model architectures, initialisation decisions, data sources, and training designs. Even though LLMs are complex systems with emergent properties that are not easily traceable to individual weights, knowing the origins of these weights might aid interpretative efforts or contribute to theory. Readers interested in entry points for exploring model interpretability further can refer to the recent review of LLM interpretability by Zhao et al. (2024). This section is primarily about introducing AI model training.
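As a flavour of what this looks like in practice, here is a minimal, self-contained sketch of tracking how key weight matrices drift during training. It uses a toy Q/K/V projection and an artificial training signal rather than a full LLM, so every name and number here is illustrative only:

```python
# A toy demonstration of watching weights change during training: three stand-in
# Q/K/V projection matrices are optimised against a fake target, and simple
# statistics are logged so the drift from initialisation is visible.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Hypothetical stand-ins for the Q, K and V projections of a single attention head.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
params = [*W_q.parameters(), *W_k.parameters(), *W_v.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-3)

def weight_summary(step):
    """Print statistics that show how each projection drifts from its initialisation."""
    for name, layer in [("W_q", W_q), ("W_k", W_k), ("W_v", W_v)]:
        w = layer.weight.detach()
        print(f"step {step:4d}  {name}: norm={w.norm().item():.3f}  std={w.std().item():.4f}")

x = torch.randn(32, d_model)        # a fake batch of token representations
target = torch.randn(32, d_model)   # a fake training signal, purely for illustration

for step in range(301):
    scores = (W_q(x) @ W_k(x).T) / d_model ** 0.5        # scaled dot-product attention
    out = torch.softmax(scores, dim=-1) @ W_v(x)
    loss = nn.functional.mse_loss(out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        weight_summary(step)
```

In a real training run the same idea applies: log norms, spectra, or snapshots of the matrices you care about at regular intervals and relate the changes to your architecture, initialisation, data, and training design choices.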
Common training data sources (Liu et al., 2024)
Training data sources are rarely disclosed by the large model providers. This may in part stem from concerns over copyright. Alex Reisner wrote a piece in The Atlantic describing how 191,000 books were used to train LLMs without their authors' consent (my first analytics book with Jonathan Ferrar and Sheri Feinzig, The Power of People, was one of them!).
While we do not know the precise details for all the proprietary models, Liu et al. (2024) surveyed the common data sources used in LLM training. They found these could be grouped into eight major categories, with Chinese and English reported as the most common languages. Most training involves mixed corpora drawn from across the following categories (a minimal sketch of assembling such a mixture follows the list):
- Webpages
- Constructed texts (e.g., American National Corpus and British National Corpus)
- Books
- Academic materials
- Code
- Parallel corpus data (i.e., texts in multiple languages)
- Social media
- Encyclopaedia data sets
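As a simple illustration of the mixing idea, the sketch below samples documents from several source categories according to chosen mixture weights. The file paths and weights are placeholders for illustration, not values reported by Liu et al. (2024):

```python
# A self-contained sketch of assembling a mixed training corpus: pick a source
# category according to mixture weights, then pick a document from that category.
import random

corpus_sources = {
    "webpages":  ["data/web/doc1.txt", "data/web/doc2.txt"],
    "books":     ["data/books/novel1.txt"],
    "code":      ["data/code/module1.py"],
    "academic":  ["data/papers/paper1.txt"],
}
mixture_weights = {"webpages": 0.6, "books": 0.2, "code": 0.1, "academic": 0.1}

def sample_documents(n, seed=0):
    """Yield (category, path) pairs drawn according to the mixture weights."""
    rng = random.Random(seed)
    categories = list(mixture_weights)
    weights = [mixture_weights[c] for c in categories]
    for _ in range(n):
        category = rng.choices(categories, weights=weights, k=1)[0]
        yield category, rng.choice(corpus_sources[category])

for category, path in sample_documents(5):
    print(f"{category:9s} -> {path}")
```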
Very importantly, as original human-generated data runs low, these models are increasingly trained on AI-generated data. This brings its own challenges, such as amplifying the errors of the original models, reducing the originality of the content AI models produce, and an overall degradation in model quality over time.
Little transparency exists (Bommasani et al., 2023, 2024)
The data strategies of many of the most widely used proprietary models are closely held secrets, which makes it difficult to assuage the concerns about bias referenced earlier. Bommasani et al. (2023, 2024) developed the Foundation Model Transparency Index, which evaluates the transparency of model providers regarding the upstream resources (e.g., data, labour) involved in building their models, the capabilities and limitations of the models themselves, and downstream monitoring and accountability.
Open source models (e.g. StarCoder by BigCode/HuggingFace/ServiceNow) were more forthcoming about these issues while closed source models (e.g. OpenAI) were, historically at least, less transparent. In all cases, the least transparency was shown in relation to the upstream resources used in the development of the models, such as the data used and the role of human labour. According to Bommasani et al. (2023), models at the time of publication could be ranked from most to least transparent with respect to data management as follows (see provider pages for their latest model information):
- StarCoder by BigCode/HuggingFace/ServiceNow
- Jurassic-2 by AI21 Labs
- Luminous by AlephAlpha
- Granite by IBM
- Phi-2 by Microsoft
- Llama 2 by Meta
- Stable Video Diffusion by stability.ai
- Palmyra-X by Writer
- Mistral 7B by Mistral AI
- Claude 3 by Anthropic
- GPT-4 by OpenAI
- Gemini by Google
- Titan Text Express by Amazon
- Fuyu-8B by Adept
Training a small LLM from scratch
The train-from-scratch challenge has two parts: writing the code and launching it on a commercial cloud computing vendor's servers. First, we want to highlight the logical inconsistency of training a 'small' Large Language Model! The point of using a smaller model, however, is to run all training requirements at a smaller scale. We have deliberately avoided the advanced ML operations here that are needed for larger models, such as Kubernetes orchestration and other commercial-scale training technologies.
Part one: PyTorch training code
First is writing the Python and PyTorch code to: sample a small subset of data (e.g., 500 MB, or roughly 125 million tokens, from 'The Pile', or dialogue data for a chat model); build a tokenizer (e.g., Byte Pair Encoding); tokenize the training data; split it into training and hold-out data; create training batches; specify an architecture (e.g., GPT-2-small with 117M parameters); hook it up to a training loop; and run the model. This will not produce anywhere near the experience users have with commercial LLMs, but the goal here is didactic.
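The sketch below compresses those steps into a single runnable script built around a toy corpus. It illustrates the shape of the pipeline rather than reproducing our exact training code; a real run would swap in roughly 500 MB of text, a GPT-2-small configuration, and a GPU:

```python
# A didactic end-to-end sketch: train a BPE tokenizer, tokenize, split, batch,
# define a small GPT-style model, and run a short training loop.
import torch
import torch.nn as nn
from tokenizers import ByteLevelBPETokenizer   # Hugging Face 'tokenizers' package

# 1. A tiny stand-in corpus (replace with text sampled from your chosen source).
corpus = ["The quick brown fox jumps over the lazy dog. "] * 200

# 2. Build a Byte Pair Encoding tokenizer and tokenize the training data.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=1000, min_frequency=2)
ids = [tid for text in corpus for tid in tokenizer.encode(text).ids]
data = torch.tensor(ids, dtype=torch.long)

# 3. Split into training and hold-out data.
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]

# 4. Create training batches of (input, next-token target) pairs.
block_size, batch_size = 64, 16
def get_batch(source):
    starts = torch.randint(0, len(source) - block_size - 1, (batch_size,))
    x = torch.stack([source[s:s + block_size] for s in starts])
    y = torch.stack([source[s + 1:s + block_size + 1] for s in starts])
    return x, y

# 5. Specify a (very) small GPT-style architecture.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_head=4, n_layer=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each position can only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=idx.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)

# 6. Hook it up to a training loop and run the model.
model = TinyGPT(tokenizer.get_vocab_size())
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(200):
    x, y = get_batch(train_data)
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: training loss {loss.item():.3f}")
```

Scaling this up mostly means swapping in a larger corpus and architecture, evaluating regularly on the hold-out split, checkpointing, and moving the loop onto a GPU.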
Part two: Amazon Machine Image with GPU
For part two, we need an NVIDIA GPU virtual machine, and AWS was my first choice. The AWS control panel is quite complex at first encounter, but you start to get the hang of it with repeated exposure. I'm only trying to launch a basic VM; building AI models with the sustained access, monitoring, and costs that come with operational deployment is a much tougher task. But even the small job of launching an EC2 instance from an Amazon Machine Image on a g4dn.xlarge (1 × T4 GPU) failed, perhaps because my account was too new to make such a request.
The online advice is to request a quota increase with a better justification and to use on-demand compute resources, which are instantly available at an hourly rate, rather than spot compute resources, which are offered at large discounts but can be interrupted when AWS reclaims the capacity. Some people suggested trying a smaller job first and trying different regions with better GPU availability (the US East Coast is competitive; Ohio is good, London is bad). Although an initial justification was rejected, a more thorough justification of the request for GPU processing power was approved. Readers may want to see the service quota justification that secured launch rights; it is reproduced below, followed by the approval.
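Before turning to the justification itself, for readers who prefer to script these steps, here is a hedged sketch using boto3 (the AWS SDK for Python). The quota code, AMI id, and key pair name are assumptions to verify against your own account and region, and a quota request made this way may still trigger the kind of written justification reproduced below:

```python
# A sketch of the three steps discussed above: check the relevant GPU quota,
# request an increase, and launch an on-demand g4dn.xlarge once it is granted.
# Running this makes real AWS API calls, so review values before executing.
import boto3

REGION = "us-east-2"    # e.g. Ohio, reported to have better GPU availability
quotas = boto3.client("service-quotas", region_name=REGION)

# Assumed quota code for "Running On-Demand G and VT instances" (a vCPU count);
# confirm the current code in the Service Quotas console for your account.
QUOTA_CODE = "L-DB2E81BA"
current = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=QUOTA_CODE)
print("Current G/VT on-demand vCPU quota:", current["Quota"]["Value"])

# Ask for enough vCPUs for one g4dn.xlarge (4 vCPUs).
quotas.request_service_quota_increase(
    ServiceCode="ec2", QuotaCode=QUOTA_CODE, DesiredValue=4.0
)

# Once approved: launch the instance from a GPU-ready AMI (id is a placeholder).
ec2 = boto3.client("ec2", region_name=REGION)
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Deep Learning AMI id for the region
    InstanceType="g4dn.xlarge",        # 1 x NVIDIA T4 GPU
    MinCount=1,
    MaxCount=1,
    KeyName="my-training-key",         # hypothetical key pair name
)
```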
AWS requires compute justification
Dear AWS, thank you for reviewing our request. We are a startup developing AI models to support psychological assessments. Our work involves training custom language models with deep learning that are designed to interpret and respond to user input in sensitive leadership assessment contexts. Training these deep learning models, especially modern neural networks like GPT architectures, requires significant computing power, particularly from GPUs. To manage costs while meeting these technical demands, we’re requesting an increase in our quota for GPU-based Spot Instances (such as g5.xlarge, g4dn.xlarge, and similar). Our expected usage will be under XX hours over XX weeks with estimated costs under XX. We understand Spot Instances can be interrupted and have designed our training process to handle this. Our planned usage will be short-term (XX weeks), and we’re actively monitoring our resource usage and costs with AWS tools. We very much appreciate your reconsideration of this request and are happy to provide any additional details. Best wishes, Nigel Guenole,
AWS launch rights approval for GPU AMIs
Configuring your AMI for launch
References
Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., ... & Liang, P. (2023). The foundation model transparency index. arXiv preprint arXiv:2310.12941.
Bommasani, R., Klyman, K., Longpre, S., Xiong, B., Kapoor, S., Maslej, N., ... & Liang, P. (2024, October). Foundation model transparency reports. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (Vol. 7, pp. 181-195).
Guo, Y., Guo, M., Su, J., Yang, Z., Zhu, M., Li, H., ... & Liu, S. S. (2024). Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915.
Liu, Y., Cao, J., Liu, C., Ding, K., & Jin, L. (2024). Datasets for large language models: A comprehensive survey. arXiv preprint arXiv:2402.18041.
Wang, Z., Zhong, W., Wang, Y., Zhu, Q., Mi, F., Wang, B., ... & Liu, Q. (2023). Data management for large language models: A survey. CoRR.