- Why we tokenize
- Tokens as model vocabulary
- Byte Pair Encoding (BPE)
- BPE for Twinkle twinkle little star
- Other methods
- References
Why we tokenize
To analyse text with an LLM, we must first convert the text into numbers. By breaking the text into smaller units, called tokens, we create a finite vocabulary for the LLM to operate on. Tokens matter because they determine how the corpus is represented to the model.
If the tokenizer segments relevant terminology poorly (e.g., by fragmenting key personality terms into obscure subparts, such as “ps” and “ychology”), the model may find it harder to learn fluent and coherent patterns in the psychological domain.
This is why it is critical to know what goes into training data, something we still know little about despite advocacy for greater transparency and some efforts by vendors to provide it. Corpus composition influences model understanding and fairness.
Tokens as model vocabulary
The collection of tokens is called the vocabulary. The vocabulary is kept to a limited size for efficiency, and there are different approaches to determining how many tokens a given corpus requires. For example, word-level tokenization is possible, but there are more efficient methods; one of them, which we discuss shortly, is Byte Pair Encoding (BPE).
Effective tokenization is important because tokens are the computational units of the model and affect its speed and memory use. The vocabulary also supports consistency: the tokenizer maps the same text to the same tokens every time, although variations in spacing, casing, or punctuation can still produce different tokenizations of the same word form.
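As a toy illustration (not any real model's vocabulary), the snippet below maps words to token IDs with a small hand-built vocabulary and shows how a change in casing or punctuation changes the resulting token sequence.

```python
# Toy word-level tokenizer with a hand-built vocabulary (not a real model's).
vocab = {"twinkle": 0, "Twinkle": 1, "little": 2, "star": 3, ",": 4, "<unk>": 5}

def tokenize(text):
    # Split on whitespace and look up each word, falling back to an
    # unknown token when a word is not in the vocabulary.
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(tokenize("Twinkle twinkle little star"))    # [1, 0, 2, 3]
print(tokenize("twinkle twinkle little star"))    # [0, 0, 2, 3]  casing changes the IDs
print(tokenize("Twinkle twinkle little star ,"))  # [1, 0, 2, 3, 4]
```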
Byte Pair Encoding (BPE)
Today's LLMs use a range of tokenizers, one common algorithm being Byte Pair Encoding (BPE). The original BPE algorithm was proposed by Gage (1994) as a simple technique for reducing file size: replace frequent byte pairs with single unused byte values. Doing this repeatedly compresses the data, which can later be decompressed with reference to a look-up table.
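A minimal sketch of that compression idea, assuming we work on a Python string and use a spare character ("X") as the stand-in for an unused byte:

```python
from collections import Counter

def compress_once(data, placeholder):
    # Find the most frequent adjacent pair of symbols and replace it with a
    # single unused placeholder symbol, recording the substitution in a table.
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return data, None  # nothing worth compressing
    return data.replace(a + b, placeholder), (placeholder, a + b)

data = "abababcabc"
data, entry = compress_once(data, "X")   # replaces "ab" with "X"
print(data, entry)                       # XXXcXc ('X', 'ab')
print(data.replace(entry[0], entry[1]))  # decompress via the look-up entry: abababcabc
```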
In natural language processing, Sennrich et al. (2015) adapted the BPE algorithm to instead build a deterministic sub-word vocabulary that represents the target corpus effectively. Once tokenization is complete, we can map the tokens to the other variables the model uses consistently during training and inference; in the case of LLMs, the key variable linked to each token ID is its embedding.
Embeddings are the learnable numeric representations of each token that are used during LLM training and inference. Tokens are assigned embeddings that are initialized with random normal values and updated during the learning process.
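A minimal sketch of this initialization, using NumPy and assumed (toy) values for the vocabulary size and hidden dimension:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size = 1000   # assumed toy vocabulary size
hidden_dim = 64     # assumed model hidden dimension

# One random-normal vector per token ID; these values are the learnable
# parameters that get updated during training.
embeddings = rng.normal(size=(vocab_size, hidden_dim))

token_ids = [17, 3, 42]                # token IDs produced by the tokenizer
token_vectors = embeddings[token_ids]  # embedding lookup: shape (3, 64)
print(token_vectors.shape)
```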
We will show this in the next section on learning where we run tokens forward through an LLM architecture. Before that, we show the result of BPE on a few lines of a familiar nursery rhyme to make the concepts clear.
Alert: Here I highlight the steps and computations of tokenization with Byte Pair Encoding using a nursery rhyme example to make the ideas concrete. These results are preliminary and will be validated carefully in due course.
BPE for Twinkle twinkle little star
Let’s tokenize the first lines of this nursery rhyme: ‘Twinkle twinkle little star, how I wonder what you are’. BPE starts with all individual characters, counts the adjacent pairs, and identifies the most frequent one. In this case the winner in step 1 is ‘e’ + ‘</w>’, where </w> is an end-of-word token BPE uses to mark word boundaries.
Early BPE implementations used </w> to mark word boundaries, while modern libraries like SentencePiece instead mark word beginnings with the prefix symbol ▁ (often rendered as an underscore). In the next step, the algorithm identifies the most frequent remaining pair; in our case it is ‘l’ and ‘e</w>’ at the end of twinkle and little.
After each merge, the result becomes a single token that can participate in the pair identification. BPE continues merging until a predefined number of merges is reached or no frequent pairs remain. In this case, the table shows it takes eight steps.
The final tokenized output becomes: T winkle</w> t winkle</w> l i tt le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w>. These tokens are then assigned embeddings, initialized with random numbers according to the model’s hidden dimension, and training begins.
Step | Pair | Freq | Token State |
0 | - | - | T w i n k l e </w> t w i n k l e </w> l i t t l e </w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e </w> |
1 | e + </w> | 4 | T w i n k l e</w> t w i n k l e</w> l i t t l e</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
2 | l + e</w> | 3 | T w i n k le</w> t w i n k le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
3 | w + i | 2 | T wi n k le</w> t wi n k le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
4 | n + k | 2 | T wi nk le</w> t wi nk le</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
5 | nk + le</w> | 2 | T wi nkle</w> t wi nkle</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
6 | wi + nkle</w> | 2 | T winkle</w> t winkle</w> l i t t le</w> s t a r , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> a r e</w> |
7 | a + r | 2 | T winkle</w> t winkle</w> l i t t le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w> |
8 | t + t | 1 | T winkle</w> t winkle</w> l i tt le</w> s t ar , </w> h o w </w> I </w> w o n d e r </w> w h a t </w> y o u </w> ar e</w> |
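The merge loop in the table can be reproduced with a few lines of Python. This is a toy sketch of BPE merge learning, not a production tokenizer, and when several pairs tie on frequency it may merge them in a different order than the table does:

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across all words in the corpus.
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts

def merge_pair(words, pair):
    # Rewrite each word, replacing occurrences of `pair` with one merged symbol.
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

text = "Twinkle twinkle little star, how I wonder what you are"
# Start from single characters, with </w> marking the end of each word.
words = [list(word) + ["</w>"] for word in text.split()]

for step in range(1, 9):  # a merge budget of eight, as in the table
    pair, freq = get_pair_counts(words).most_common(1)[0]
    words = merge_pair(words, pair)
    print(f"step {step}: merge {pair} (freq {freq})")

print(" ".join(" ".join(word) for word in words))
```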
Other methods
There are other tokenization methods in use. WordPiece tokenization (Schuster & Nakajima, 2012), used in BERT and many other Google models, adds to the vocabulary the pairing that increases the likelihood of the training data the most, rather than simply the most frequent pair.
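A common way to describe the WordPiece criterion is a score that divides the pair count by the counts of its two parts, which approximates the likelihood gain from merging them. The sketch below uses that approximation; the function name and toy setup are mine, not BERT's actual implementation:

```python
from collections import Counter

def wordpiece_scores(words):
    # Score each adjacent pair by count(ab) / (count(a) * count(b)): a pair of
    # rare symbols outranks a pair that is frequent only because its parts are.
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols in words:
        symbol_counts.update(symbols)
        pair_counts.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# BPE would merge the most frequent pair; this criterion merges the highest-scoring one.
words = [list(w) + ["</w>"] for w in "twinkle twinkle little star".split()]
scores = wordpiece_scores(words)
print(max(scores, key=scores.get))  # ('a', 'r'): rare parts beat merely frequent pairs
```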
The Unigram method (Kudo, 2018) instead works backward: it starts with a large set of candidate tokens and iteratively removes the tokens whose removal least decreases the likelihood of the corpus, until the desired vocabulary size is reached. All the methods we have discussed are available in SentencePiece, a fast, language-agnostic C++ tokenizer library that you can also access via a Python binding (wrapper).
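As an example of that Python binding, the following trains and applies a BPE model with SentencePiece; the corpus file name, model prefix, and vocabulary size are placeholder values you would replace with your own:

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder corpus file
    model_prefix="nursery",  # placeholder output name -> nursery.model, nursery.vocab
    vocab_size=200,          # placeholder vocabulary size
    model_type="bpe",
)

# Load the trained model and tokenize new text.
sp = spm.SentencePieceProcessor(model_file="nursery.model")
print(sp.encode("Twinkle twinkle little star", out_type=str))  # subword pieces, word starts marked with ▁
print(sp.encode("Twinkle twinkle little star"))                # integer token IDs
```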
Modern GPU matrix operations are most efficient when tensor dimensions are multiples of certain block sizes (commonly 64, 128, or 256), so vocabulary sizes are often padded up to such a multiple to avoid wasted work in the final block (e.g., a 30,000-token vocabulary does not divide evenly by 128, but 30,080 = 235 × 128 does).
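The rounding itself is simple arithmetic; a small helper, assuming 128 as the block multiple:

```python
import math

def padded_vocab_size(vocab_size, multiple=128):
    # Round the vocabulary size up to the nearest multiple (e.g. of 128)
    # so the embedding and output matrices align with GPU-friendly block sizes.
    return math.ceil(vocab_size / multiple) * multiple

print(padded_vocab_size(30_000))  # 30080, since 30080 = 235 * 128
```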
In practice, there are many battle-tested, pre-built tokenizer libraries that you can apply to your own data, but it is also possible to train your own.
References
Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.
Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).