- Reasoning LLMs
- Origins of reasoning models
- Four methods for improving LLM reasoning
- Chain of Thought prompting
- Fine-tuning on correct reasoning traces
- Reinforcement learning
- Compute scaling
- Optimization objectives
- Verification
- References
Reasoning LLMs
Reasoning (or thinking) models are LLMs designed to produce intermediate reasoning steps. They have become a central focus of commercial LLM vendors. In the context of psychometric work, their utility beyond that of the standard LLMs described in earlier sections of this book is unproven.
Nonetheless, in psychometrics, where imprecision has high-stakes consequences and where measurement accuracy, adverse impact, and performance prediction often trade off against one another, understanding the contribution of LLM reasoning processes to each of these outcomes is both under-explored and important.
Origins of reasoning models
An influential paper by Wei et al. (2022) showed that asking language models to show their reasoning improved performance on certain benchmark tasks, such as math and word problems, common-sense reasoning, and symbolic and logical reasoning. They named this chain-of-thought (CoT) prompting.
Subsequent work presented in DeepSeek-AI (2025) demonstrated that reinforcement learning can incentivize the emergence of these reasoning behaviors, even without supervised reasoning traces. These LLM enhancements may have implications for methods involving LLMs in other sections of the book, such as AI scoring.
Four methods for improving LLM reasoning
There are four distinct methods for improving reasoning in LLMs: prompting (inference-time), supervised fine-tuning (training-time), optimization via rewards (reinforcement learning), and scaling test-time compute. Each works in one of two ways: by modifying the model's parameters during training, or by altering how computation and token selection happen at inference time.
Chain of Thought prompting
Chain-of-thought prompting elicits intermediate reasoning steps at inference time. By decomposing a problem into multiple steps rather than producing an answer in a single pass, it effectively increases the compute spent per problem without requiring any changes to the underlying parameters. It is most effective in large models where such reasoning abilities are latent.
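The inference-time nature of CoT can be made concrete with a minimal sketch: the only difference between a direct prompt and a CoT prompt is the instruction text itself. The function names and prompt wording below are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of chain-of-thought prompting: the model and its
# parameters are untouched; only the prompt text changes.

def build_direct_prompt(question: str) -> str:
    """Baseline prompt that asks for the answer directly."""
    return f"Question: {question}\nAnswer:"

def build_cot_prompt(question: str) -> str:
    """Wrap a question with an instruction to reason step by step."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer "
        "on a line beginning with 'Answer:'."
    )

prompt = build_cot_prompt(
    "A train travels 60 miles in 1.5 hours. What is its average speed?"
)
print(prompt)
```

Both prompts would be sent to the same model; only the CoT variant elicits the intermediate steps that the benchmark results above depend on.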
Fine-tuning on correct reasoning traces
Fine-tuning on known-correct reasoning traces improves accuracy on mathematical reasoning benchmarks. This ability can also generalize to non-mathematical problems, but it relies on high-quality annotated reasoning traces, which may constrain the model to human-like reasoning patterns (as suggested in DeepSeek-AI, 2025).
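A sketch of what such supervision looks like in practice: each annotated trace is turned into a prompt/target pair, and the model is fine-tuned to imitate the target. The record layout and prompt/target split below are illustrative assumptions, not a specific library's data format.

```python
# Sketch: preparing supervised fine-tuning data from annotated
# reasoning traces. The target includes the full trace, so the model
# learns to imitate the human reasoning pattern, not just the answer.

annotated = [
    {
        "question": "What is 12 * 7?",
        "trace": "10 * 7 = 70 and 2 * 7 = 14, so 70 + 14 = 84.",
        "answer": "84",
    },
]

def to_sft_example(record):
    """Concatenate trace and answer into the text the model must imitate."""
    return {
        "prompt": f"Question: {record['question']}\nReasoning:",
        "target": f" {record['trace']}\nAnswer: {record['answer']}",
    }

examples = [to_sft_example(r) for r in annotated]
print(examples[0]["target"])
```

Because the target reproduces the annotator's trace verbatim, the dependence on high-quality annotation, and the constraint toward human-like reasoning, are both visible in the data format itself.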
Reinforcement learning
Reinforcement learning allows exploration of different reasoning paths: responses that lead to correct outcomes receive a reward signal (often verified either by a rule-based checker in math or coding domains, or by a learned reward model), and the policy model is updated accordingly. This lets the model discover reasoning strategies beyond the training data by optimizing for task success rather than imitating human reasoning traces.
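The reward-assignment step can be sketched as follows. The rollouts are hard-coded stand-ins for sampled model completions, and the policy update itself (e.g., via a policy-gradient method) is omitted; the point is that the only supervision signal is whether the final outcome checks out.

```python
import re

# Toy sketch of outcome-reward assignment for RL on a reasoning task.
# A real pipeline would sample rollouts from the policy model and use
# these rewards to update its parameters; that step is omitted here.

def extract_answer(completion):
    """Pull the final answer from a completion ending in 'Answer: <x>'."""
    match = re.search(r"Answer:\s*(\S+)", completion)
    return match.group(1) if match else None

def outcome_reward(completion, ground_truth):
    """Rule-based checker: 1.0 iff the final answer matches exactly."""
    return 1.0 if extract_answer(completion) == ground_truth else 0.0

rollouts = [
    "10 * 7 = 70, 2 * 7 = 14, 70 + 14 = 84. Answer: 84",
    "12 * 7 is about 80. Answer: 80",
]
rewards = [outcome_reward(r, "84") for r in rollouts]
print(rewards)  # only the correct rollout earns a reward
```

Note that the reward says nothing about *how* the first rollout reasoned, which is exactly the limitation of outcome supervision discussed below under optimization objectives.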
Compute scaling
A related line of work shows that reasoning can also be improved at inference time by allocating additional compute (Snell et al., 2024). This can be achieved by sampling several candidate solutions or by iteratively refining a response. Given comparable compute budgets, these methods can allow smaller models to outperform larger ones without additional training.
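One common form of test-time compute scaling, sampling several candidates and taking a majority vote over their final answers (often called self-consistency), can be sketched in a few lines. The sampled answers below are hard-coded stand-ins for stochastic model outputs.

```python
from collections import Counter

# Sketch: scaling test-time compute by sampling N candidate answers
# and taking a majority vote. More samples cost more inference
# compute but require no change to the model itself.

def majority_vote(candidate_answers):
    """Return the most common final answer among sampled candidates."""
    answer, _count = Counter(candidate_answers).most_common(1)[0]
    return answer

samples = ["84", "84", "80", "84", "82"]  # five sampled final answers
print(majority_vote(samples))  # → 84
```

The compute budget here is simply the number of samples drawn, which is why such methods trade inference cost for accuracy rather than model size.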
Optimization objectives
A central challenge is that models can arrive at correct answers through incorrect reasoning (Lightman et al., 2023), meaning that outcome-based evaluation may fail to detect errors in the reasoning process itself. A key design choice with reasoning LLMs is therefore how reasoning is evaluated.
Early verification approaches by Lightman et al. (2023) addressed this by verifying each reasoning step individually (process supervision), while later RL methods by DeepSeek relied on easier-to-implement outcome-based rewards that check only the final answer (outcome supervision). The choice trades step-level accuracy for scalability.
Lightman et al. (2023) found that models trained with outcome supervision may produce correct answers despite imperfect intermediate reasoning and that process supervision produces more reliable reward models. In practice, however, DeepSeek demonstrated that outcome rewards, albeit less precise, are significantly easier to scale. They can produce strong performance when combined with reinforcement learning.
In other words, these methods differ primarily in their training objective. Supervised fine-tuning maximizes likelihood over human-provided reasoning traces. Reinforcement learning instead optimizes a reward function defined over either final answers (outcome supervision) or intermediate steps (process supervision).
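The contrast between the two objectives can be written out in generic notation. The symbols below are illustrative (they do not come from the cited papers): $x$ is a question, $z$ a human-written reasoning trace, $a$ an answer, $y$ a model-sampled trajectory, and $R$ a reward function over final answers or intermediate steps.

```latex
% Supervised fine-tuning: maximize likelihood of human traces
\mathcal{L}_{\text{SFT}}(\theta)
  = \mathbb{E}_{(x, z, a) \sim \mathcal{D}}
    \left[ \log p_\theta(z, a \mid x) \right]

% Reinforcement learning: maximize expected reward over sampled trajectories
\mathcal{J}_{\text{RL}}(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}
    \left[ R(y) \right]
```

The first objective can only reproduce reasoning present in $\mathcal{D}$; the second is indifferent to how $y$ reaches a rewarded outcome, which is the source of both its flexibility and its verification problem.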
Verification
Across these methods, verification occurs by evaluating model outputs, either at the level of individual reasoning steps or final answers using rule-based checks or learned reward models. Rule-based checks provide deterministic signals by comparing outputs against ground-truth answers or executable tests (e.g., exact numeric answers in math or passing unit tests in code).
This makes rule-based checkers simple and reliable but limited to domains with verifiable solutions. Learned reward models, on the other hand, are trained to approximate human judgments of quality or correctness. This allows evaluation in open-ended settings, but it introduces noise and potential biases from the reward model's training data.
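The two verification styles can be contrasted in a short sketch. The rule-based checker is deterministic; the learned reward model is mocked here by a trivial scoring heuristic, since a real one would be a trained neural network.

```python
# Sketch: the two verification signals used across these methods.

def rule_based_check(predicted, ground_truth):
    """Deterministic verification: exact match on a canonical answer.
    Reliable, but only usable where a ground truth exists."""
    return predicted.strip() == ground_truth.strip()

def mock_reward_model(response):
    """Stand-in for a learned reward model, scoring in [0, 1].
    A real one would be trained on human quality judgments; this toy
    heuristic just rewards longer responses, illustrating how a
    learned proxy can encode arbitrary biases."""
    return min(len(response.split()) / 50.0, 1.0)

print(rule_based_check("84", " 84 "))      # deterministic: True
print(mock_reward_model("Short answer."))  # noisy proxy score
```

The deliberately crude heuristic makes the trade-off visible: the rule-based check cannot be gamed but only covers verifiable domains, while the learned proxy covers open-ended outputs at the cost of exploitable biases (here, verbosity).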
References
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s verify step by step. arXiv preprint arXiv:2305.20050. https://doi.org/10.48550/arXiv.2305.20050
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. https://doi.org/10.48550/arXiv.2408.03314
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).