- Fine-tuning LLMs: Adapters, LoRA, QLoRA
- Fine-tuning and pre-training: core differences
- Parameter efficient fine-tuning (PEFT) methods
- Fine-tuning with adapters
- Low Rank Adaptation (LoRA) fine-tuning
- Quantized Low Rank Adaptation (QLoRA)
- Summary
Fine-tuning LLMs: Adapters, LoRA, QLoRA
Approaches to LLM training fall into two broad categories: pre-training and fine-tuning. Pre-training is the process by which LLMs learn broad language understanding from vast amounts of text data. Pre-training an LLM to the standard of widely used commercial models costs many millions of dollars and requires vast amounts of data and extensive compute. As a result, most practical work with LLMs focuses on fine-tuning: adapting existing pre-trained models rather than training new ones from scratch. Note that RLHF and RAG are treated separately in this book.
The ‘How LLMs learn’ section and the Seedling case study demonstrate the pre-training process on a small scale. However, the financial, compute, and data costs of pre-training, together with its technical requirements, place it beyond reach in many situations, so another route is needed to adapt LLMs for specialised tasks. Fine-tuning, the second major category of LLM learning, happens when a pre-trained LLM is adapted for a specific task or domain (e.g. psychological, legal, or medical text). Whereas pre-training datasets are commonly measured in terabytes, fine-tuning datasets are commonly measured in gigabytes. Depending on the approach adopted, fine-tuning can therefore be within the reach of businesses.
Fine-tuning and pre-training: core differences
The processes involved in fine-tuning are the same core stages as in pre-training, with only small changes such as the possible use of task-specific loss functions. The steps are covered under “How LLMs learn” and include tokenization, forward passes, backpropagation, and optimization. However, fine-tuning starts from the parameters obtained at the end of pre-training. Because the model has already undergone extensive training, the number of steps required for fine-tuning is much lower. Beyond this, there are three other key differences.
First, the data the model is trained on differs: fine-tuning datasets consist of high-quality, carefully curated language from a target domain. Second, the learning rate at which parameters are updated differs: learning rates are significantly lower in fine-tuning than in pre-training, to avoid overwriting what the model learned during pre-training, a failure known as catastrophic forgetting. Finally, the parameters that are updated can be all of the original parameters, a subset of them, or newly introduced parameters. Fine-tuning all parameters is costly with large models, so other solutions are often required; the latter two approaches are known as parameter-efficient fine-tuning (PEFT) methods. Let’s explore this point about parameters further in the next section.
Contrasting pre-training and fine-tuning.

| Aspect | Pre-training | Fine-tuning |
| --- | --- | --- |
| Data | Broad, general corpus | Narrow, domain-specific |
| Learning rate | Relatively higher | Much lower |
| Parameters updated | All parameters | Some or all, depending on approach |
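To make these contrasts concrete, the sketch below shows one simple fine-tuning setup in PyTorch with Hugging Face Transformers: pre-trained weights are loaded, most parameters are frozen so only a subset is updated, and the optimizer uses a much lower learning rate than is typical for pre-training. The model name ("gpt2"), the choice to train only the final transformer block, and the learning rate are illustrative assumptions, not prescriptions.

```python
# Minimal sketch of fine-tuning differences: start from pre-trained weights,
# freeze most parameters, and use a low learning rate. Values are illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # start from pre-trained parameters

# Keep only the last transformer block trainable (GPT-2 has blocks h.0 to h.11);
# everything else stays frozen, one simple parameter-efficient choice.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("transformer.h.11")

# Pre-training commonly uses learning rates in the 1e-4 to 6e-4 range;
# fine-tuning typically uses something one to two orders of magnitude lower.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```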
Parameter efficient fine-tuning (PEFT) methods
In contrast to adjusting all parameters in a neural network with specialised data and a lower learning rate, it is possible to insert new weight matrices at strategic points in the transformer architecture and update only these matrices during optimization, leaving the original parameter matrices fixed. These newly inserted weight matrices are called adapters. This allows models to specialise for new tasks without disturbing the original knowledge from pre-training.
Adapters introduce new learned parameters into the transformer architecture without changing the base model’s weights from pre-training. Alternative methods that interact more directly with the parameter matrices learned during pre-training, such as the attention matrices, are also possible. These include Low Rank Adaptation (LoRA) and Quantized Low Rank Adaptation (QLoRA). We discuss each of these approaches now. Taken together, these methods represent the main strategies for fine-tuning LLMs within a manageable computational budget.
Fine-tuning with adapters
Adapters are small additional layers inserted into the model, proposed by Houlsby et al. (2019). They are strategically inserted in multiple transformer layers, typically after both the attention and feed-forward sublayers and before their respective residual connections, to support task-specific learning without disrupting earlier representations. Adapters are not inserted in parts of the model where they might seriously disrupt earlier learning, such as the embeddings or positional encodings, or after components with no learned parameters (e.g. the final softmax).
Adapters implement a ‘bottleneck’ in the transformer architecture: they project the feed-forward output down into a lower-dimensional space using one matrix, pass it through a non-linear activation such as ReLU or GELU, and then project it back up to the original hidden dimension. The bottleneck size is chosen based on the complexity of the training task; the inner dimension often ranges between 8 and 64. Larger bottleneck sizes require more memory and are more computationally expensive. In practice, the size is chosen by examining the effect of different bottleneck sizes on validation loss.
These two projection matrices are often denoted W_down, which reduces the dimensionality of the feed-forward output, and W_up, which restores it to the model’s hidden size after the non-linearity. The down-projection matrix is initialized with small random values (e.g. drawn from roughly N(0, 10⁻³), with scaling varying with model size). The up-projection matrix is initialized with zeros, so that before any learning occurs the adapter makes no change to the forward pass.
The adapter’s down-projection matrix has dimensions hidden size × bottleneck size to compress the input, and the up-projection matrix has dimensions bottleneck size × hidden size to expand the output again. Adapters operate outside the main attention matrices, processing sublayer outputs rather than internal attention projections. During fine-tuning, only the adapter parameters are updated while the original model weights remain frozen, allowing efficient adaptation with minimal computation.
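The sketch below is a minimal PyTorch version of a Houlsby-style bottleneck adapter following the description above: a down-projection to the bottleneck dimension, a non-linearity, an up-projection back to the hidden size, and a residual connection, with the up-projection zero-initialized so the adapter initially leaves the forward pass unchanged. The hidden size (768), bottleneck size (32), and choice of GELU are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal Houlsby-style adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # maps hidden_size -> bottleneck_size
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_size, hidden_size)    # maps bottleneck_size -> hidden_size

        # Small random init for the down-projection, zeros for the up-projection,
        # so at the start of training the adapter contributes nothing.
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, sublayer_output: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter adds a learned correction to the sublayer output.
        return sublayer_output + self.up(self.act(self.down(sublayer_output)))
```

In a fine-tuning run, modules like this would be inserted after the attention and feed-forward sublayers, and only their parameters would be passed to the optimizer.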
In practice, the placement of adapters can be handled automatically by software packages that insert them at positions known to work well, though these defaults can be overridden to place adapters at any point in the model. Common options include the adapters library (formerly adapter-transformers) from the AdapterHub project and Hugging Face’s PEFT library; frameworks such as OpenDelta also support adapter-based fine-tuning. Adapter-based fine-tuning is a standard part of many modern transformer workflows.
Low Rank Adaptation (LoRA) fine-tuning
LoRA, proposed by Hu et al. (2022), is another parameter-efficient fine-tuning method that is conceptually similar to adapters. Like adapters, LoRA introduces parallel modules that adjust the outputs of linear projections while leaving the original weights frozen. The key difference is that while adapters insert learned matrices outside the core transformer operations to process sublayer outputs, LoRA applies additive low-rank corrections to the outputs of existing weight matrices during the forward pass, again leaving the original weights frozen. LoRA is commonly applied to the query and value projection matrices, i.e. the weight matrices W_Q and W_V inside the self-attention blocks, which project the hidden states into the query and value spaces used for attention calculations.
The method defines an implied correction matrix ΔW as the product of two learnable submatrices, A and B (ΔW = BA). As with adapters, one of these matrices (A) down-projects into a lower-dimensional space. The correction matrix ΔW is never explicitly formed or stored in LoRA; its effect is computed on the fly during the forward pass for computational efficiency. The down-projection matrix A is initialized with random normal values. The other matrix, B, expands the low-rank representation back up to the original hidden dimension and is zero-initialized, so the correction starts at zero.
Though not part of the original proposal by Hu et al., some recent LoRA variants add a non-linear activation such as ReLU or GELU between the two projections. This helps the model capture non-linear relationships in the learned correction, meaning the correction can adapt to the input rather than applying the same change uniformly across all inputs. After fine-tuning, the low-rank updates can either be merged into the base model weights or kept as separate parallel modules that can be swapped in for different tasks.
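A minimal PyTorch sketch of a LoRA-wrapped linear layer is shown below: the frozen base weight stays fixed, the low-rank matrices A (random-normal init) and B (zero init) supply an additive correction computed on the fly, and a merge method folds the correction into the base weights after training. The rank, scaling factor, and initialization scale are common choices assumed here, not the only valid ones.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = W x + (alpha / r) * B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the pre-trained projection
        self.scaling = alpha / r
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.normal_(self.lora_A.weight, std=0.02)        # random-normal init
        nn.init.zeros_(self.lora_B.weight)                   # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The correction matrix B @ A is never materialised; the two low-rank
        # factors are applied to x directly during the forward pass.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # After fine-tuning, the low-rank update can be folded into the base weights.
        self.base.weight += self.scaling * (self.lora_B.weight @ self.lora_A.weight)
        return self.base
```

In practice a wrapper like this would replace the query and value projections in each attention block (e.g. q_proj and v_proj in many model implementations), which is what libraries such as Hugging Face PEFT automate.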
Quantized Low Rank Adaptation (QLoRA)
QLoRA, proposed by Dettmers et al. (2023), extends the LoRA method to fine-tuning very large language models using significantly less GPU memory. With QLoRA, models with tens of billions of parameters can be fine-tuned on a single GPU. It achieves this by quantizing the frozen base model weights to 4-bit precision while keeping the low-rank LoRA adapters in higher precision (typically 16-bit).
A bit is the smallest unit of data, representing a 0 or 1. Storing a value in 16 bits means using 16 binary digits, allowing higher precision; storing it in 4 bits means using only 4 binary digits, giving less precision. QLoRA stores the base model weights in 4-bit precision to save memory and temporarily dequantizes them during the forward pass.
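To illustrate what lower-bit storage means, the toy example below quantizes a small weight tensor to signed 4-bit integers using simple symmetric (absmax) quantization and then dequantizes it, showing the small rounding error introduced. This is only an illustration of the general idea; QLoRA itself uses a 4-bit NormalFloat (NF4) data type rather than plain integer quantization.

```python
import torch

w = torch.tensor([0.812, -0.231, 0.054, -0.977])   # toy full-precision weights

# Symmetric absmax quantization to signed 4-bit integers in [-8, 7].
scale = w.abs().max() / 7                           # one scale constant per block of weights
q = torch.clamp(torch.round(w / scale), -8, 7)      # values now representable in 4 bits

w_dequant = q * scale                               # dequantized for use in the forward pass
print(q)          # tensor([ 6., -2.,  0., -7.])
print(w_dequant)  # approx [0.837, -0.279, 0.000, -0.977]; close to w, with rounding error
```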
The quantized weights remain frozen throughout training; only the low-rank LoRA adapters are updated during backpropagation and optimization. QLoRA also employs double quantization, quantizing the quantization constants themselves to 8-bit precision to further reduce memory usage, and uses paged optimizers that page optimizer states to CPU memory, allowing models with billions of parameters to be fine-tuned efficiently on a single GPU.
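A sketch of a typical QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes libraries is shown below: the base model is loaded in 4-bit NF4 with double quantization, LoRA adapters are attached to the attention projections, and a paged optimizer is selected. The model name, target module names, and hyperparameters are illustrative assumptions, and exact argument names can differ across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights, with double quantization
# of the quantization constants; computation is carried out in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Higher-precision LoRA adapters on the attention projections; the module names
# ("q_proj", "v_proj") follow Llama-style models and differ between architectures.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# A paged 8-bit AdamW optimizer keeps optimizer memory manageable on one GPU.
training_args = TrainingArguments(
    output_dir="qlora-out", learning_rate=2e-4, optim="paged_adamw_8bit",
)
```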
Summary
The methods discussed represent the main parameter-efficient fine-tuning strategies in current use, though other variants such as BitFit (bias-only tuning), LayerNorm tuning, and approaches that modify residual connections also exist. For examples of these methods applied in the psychological domain, readers can see Binz et al. (2025) or Hommel et al. (2022). Separate sections of this book cover RLHF and RAG.
| Comparison | Adapters | LoRA | QLoRA |
| --- | --- | --- | --- |
| Where | After sublayer outputs | Inside attention projections (Q, V) | Same as LoRA |
| What changes | The output representation | The projection’s output, via an additive low-rank correction | Same as LoRA |
| Technique | Down-project → nonlinearity → up-project | Down-project → up-project (optional nonlinearity) | Down-project → up-project (optional nonlinearity) on a 4-bit quantized base model |
References
Binz, M., Akata, E., Bethge, M., Brändle, F., Callaway, F., Coda-Forno, J., Dayan, P., Demircan, C., Eckstein, M. K., Éltető, N., Griffiths, T. L., Haridi, S., Jagadish, A. K., Ji-An, L., Kipnis, A., Kumar, S., Ludwig, T., Mathony, M., Mattar, M., Modirshanechi, A., Nath, S. S., Peterson, J. C., Rmus, M., Russek, E. M., Saanum, T., Schubert, J. A., Schulze Buschoff, L. M., Singhi, N., Sui, X., Thalmann, M., Theis, F., Truong, V., Udandarao, V., Voudouris, K., Wilson, R., Witte, K., Wu, S., Wulff, D. U., Xiong, H., & Schulz, E. (2025). Centaur: A foundation model of human cognition (Version 3). arXiv. https://arxiv.org/abs/2410.20268
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). https://arxiv.org/abs/2305.14314
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In International conference on machine learning (pp. 2790-2799). PMLR.
Hommel, B. E., Wollang, F.-J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749–772. https://doi.org/10.1007/s11336-021-09823-9
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, W., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR 2022). https://arxiv.org/abs/2106.09685
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).