- Why we tokenize
- Tokens as model vocabulary
- Byte Pair Encoding (BPE)
- BPE for Twinkle twinkle little star
- Other methods
- References
Why we tokenize
To analyse text with an LLM, we must first convert the text into numbers. By breaking the text into smaller units, called tokens, we create a finite vocabulary for the LLM to operate on. Tokens matter because they determine how the corpus is represented to the model.
If the tokenizer segments relevant terminology poorly (e.g., by fragmenting key personality terms into obscure subparts, such as “ps” and “ychology”), the model may find it harder to learn fluent and coherent patterns in the psychological domain.
This is why it is critical to know what goes into training data, something we still know little about despite advocacy for greater transparency and some efforts by vendors to provide it. Corpus composition influences model understanding and fairness.
Tokens as model vocabulary
The collection of tokens is called the vocabulary. The vocabulary is kept to a limited size for efficiency, and there are different approaches to determining how many tokens a given corpus requires. For example, word-level tokenization is possible, but there are more efficient methods; one of them, which we discuss shortly, is Byte Pair Encoding (BPE).
Effective tokenization is important because tokens are the computational units of the model and affect its speed and memory use. The vocabulary also supports consistency: the tokenizer maps the same text to the same tokens every time, although variations in spacing, casing, or punctuation can still produce different tokenizations of the same word form.
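As a toy illustration (not any real model's vocabulary), the snippet below maps words to token IDs with a small hand-built vocabulary and shows how a change in casing or punctuation changes the resulting token sequence.

```python
# Toy word-level tokenizer with a hand-built vocabulary (not a real model's).
vocab = {"twinkle": 0, "Twinkle": 1, "little": 2, "star": 3, ",": 4, "<unk>": 5}

def tokenize(text):
    # Split on whitespace and look up each word, falling back to an
    # unknown token when a word is not in the vocabulary.
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(tokenize("Twinkle twinkle little star"))    # [1, 0, 2, 3]
print(tokenize("twinkle twinkle little star"))    # [0, 0, 2, 3]  casing changes the IDs
print(tokenize("Twinkle twinkle little star ,"))  # [1, 0, 2, 3, 4]
```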
Byte Pair Encoding (BPE)
Today's LLMs use a range of tokenizers, one common algorithm being Byte Pair Encoding (BPE). The original BPE algorithm was proposed by Gage (1994) as a simple technique for reducing file size: replace frequent byte pairs with single unused byte values. Doing this repeatedly compresses the data, which can later be decompressed with reference to a look-up table.
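A minimal sketch of that compression idea, assuming we work on a Python string and use a spare character ("X") as the stand-in for an unused byte:

```python
from collections import Counter

def compress_once(data, placeholder):
    # Find the most frequent adjacent pair of symbols and replace it with a
    # single unused placeholder symbol, recording the substitution in a table.
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return data, None  # nothing worth compressing
    return data.replace(a + b, placeholder), (placeholder, a + b)

data = "abababcabc"
data, entry = compress_once(data, "X")   # replaces "ab" with "X"
print(data, entry)                       # XXXcXc ('X', 'ab')
print(data.replace(entry[0], entry[1]))  # decompress via the look-up entry: abababcabc
```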
In natural language processing, Sennrich et al. (2015) adapted the BPE algorithm to instead build a deterministic sub-word vocabulary that represents the target corpus effectively. Once tokenization is complete, we can map the tokens to the other variables the model uses consistently during training and inference; in the case of LLMs, the key variable linked to each token ID is its embedding.
Embeddings are the learnable numeric representations of each token that are used during LLM training and inference. Tokens are assigned embeddings that are initialized with random normal values and updated during the learning process.
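A minimal sketch of this initialization, using NumPy and assumed (toy) values for the vocabulary size and hidden dimension:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size = 1000   # assumed toy vocabulary size
hidden_dim = 64     # assumed model hidden dimension

# One random-normal vector per token ID; these values are the learnable
# parameters that get updated during training.
embeddings = rng.normal(size=(vocab_size, hidden_dim))

token_ids = [17, 3, 42]                # token IDs produced by the tokenizer
token_vectors = embeddings[token_ids]  # embedding lookup: shape (3, 64)
print(token_vectors.shape)
```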
We will show this in the next section on learning where we run tokens forward through an LLM architecture. Before that, we show the result of BPE on a few lines of a familiar nursery rhyme to make the concepts clear.
Alert: Here I highlight the steps and computations of tokenization with Byte Pair Encoding using a nursery rhyme example to make the ideas concrete. These results are preliminary and will be validated carefully in due course.
BPE for Twinkle twinkle little star
Let’s tokenize the first lines of this nursery rhyme: ‘Twinkle twinkle little star, how I wonder what you are’. BPE starts with all individual characters, counts the adjacent pairs, and identifies the most frequent one. In this case the winner in step 1 is ‘e’ + ‘</w>’, where </w> is an end-of-word token BPE uses to mark word boundaries.
Early BPE implementations used </w> to mark word boundaries, while modern libraries like SentencePiece instead mark word beginnings with the prefix symbol ▁ (often rendered as an underscore). In the next step, the algorithm identifies the most frequent remaining pair; in our case it is ‘l’ and ‘e</w>’ at the end of twinkle and little.
After each merge, the result becomes a single token that can participate in the pair identification. BPE continues merging until a predefined number of merges is reached or no frequent pairs remain. In this case, the table shows it takes eight steps.
The final tokenized output becomes: T winkle</w> t winkle</w> l i tt le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w>. These tokens are then assigned embeddings, initialized with random numbers according to the model’s hidden dimension, and training begins.
Step | Pair | Freq | Token State |
0 | - | - | T w i n k l e </w> t w i n k l e </w> l i t t l e </w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e </w> |
1 | e + </w> | 4 | T w i n k l e</w> t w i n k l e</w> l i t t l e</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
2 | l + e</w> | 3 | T w i n k le</w> t w i n k le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
3 | w + i | 2 | T wi n k le</w> t wi n k le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
4 | n + k | 2 | T wi nk le</w> t wi nk le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
5 | nk + le</w> | 2 | T wi nkle</w> t wi nkle</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
6 | wi + nkle</w> | 2 | T winkle</w> t winkle</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
7 | a + r | 2 | T winkle</w> t winkle</w> l i t t le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w> |
8 | t + t | 1 | T winkle</w> t winkle</w> l i tt le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w> |
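The merge loop in the table can be reproduced with a few lines of Python. This is a toy sketch of BPE merge learning, not a production tokenizer, and when several pairs tie on frequency it may merge them in a different order than the table does:

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across all words in the corpus.
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts

def merge_pair(words, pair):
    # Rewrite each word, replacing occurrences of `pair` with one merged symbol.
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

text = "Twinkle twinkle little star, how I wonder what you are"
# Start from single characters, with </w> marking the end of each word.
words = [list(word) + ["</w>"] for word in text.split()]

for step in range(1, 9):  # a merge budget of eight, as in the table
    pair, freq = get_pair_counts(words).most_common(1)[0]
    words = merge_pair(words, pair)
    print(f"step {step}: merge {pair} (freq {freq})")

print(" ".join(" ".join(word) for word in words))
```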
Other methods
There are other tokenization methods in use. WordPiece tokenization (Schuster & Nakajima, 2012), used in BERT and many other Google models, adds to the vocabulary the pairing that increases the likelihood of the training data the most, rather than simply the most frequent pair.
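A common way to describe the WordPiece criterion is a score that divides the pair count by the counts of its two parts, which approximates the likelihood gain from merging them. The sketch below uses that approximation; the function name and toy setup are mine, not BERT's actual implementation:

```python
from collections import Counter

def wordpiece_scores(words):
    # Score each adjacent pair by count(ab) / (count(a) * count(b)): a pair of
    # rare symbols outranks a pair that is frequent only because its parts are.
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols in words:
        symbol_counts.update(symbols)
        pair_counts.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# BPE would merge the most frequent pair; this criterion merges the highest-scoring one.
words = [list(w) + ["</w>"] for w in "twinkle twinkle little star".split()]
scores = wordpiece_scores(words)
print(max(scores, key=scores.get))  # ('a', 'r'): rare parts beat merely frequent pairs
```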
The Unigram method (Kudo, 2018) instead works backward: it starts with a large set of candidate tokens and iteratively removes the tokens whose removal least decreases the likelihood of the corpus, until the desired vocabulary size is reached. All the methods we have discussed are available in SentencePiece, a fast, language-agnostic C++ tokenizer library that you can also access via a Python binding (wrapper).
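As an example of that Python binding, the following trains and applies a BPE model with SentencePiece; the corpus file name, model prefix, and vocabulary size are placeholder values you would replace with your own:

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder corpus file
    model_prefix="nursery",  # placeholder output name -> nursery.model, nursery.vocab
    vocab_size=200,          # placeholder vocabulary size
    model_type="bpe",
)

# Load the trained model and tokenize new text.
sp = spm.SentencePieceProcessor(model_file="nursery.model")
print(sp.encode("Twinkle twinkle little star", out_type=str))  # subword pieces, word starts marked with ▁
print(sp.encode("Twinkle twinkle little star"))                # integer token IDs
```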
Modern GPU matrix operations are most efficient when tensor dimensions are multiples of certain block sizes (commonly 64, 128, or 256), so vocabulary sizes are often padded up to such a multiple to avoid wasted work in the final block (e.g., a 30,000-token vocabulary does not divide evenly by 128, but 30,080 = 235 × 128 does).
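The rounding itself is simple arithmetic; a small helper, assuming 128 as the block multiple:

```python
import math

def padded_vocab_size(vocab_size, multiple=128):
    # Round the vocabulary size up to the nearest multiple (e.g. of 128)
    # so the embedding and output matrices align with GPU-friendly block sizes.
    return math.ceil(vocab_size / multiple) * multiple

print(padded_vocab_size(30_000))  # 30080, since 30080 = 235 * 128
```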
In practice, there are many battle-tested, pre-built tokenizer libraries that you can apply to your own data, but it is also possible to train your own.
References
Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.
Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).