- Reasoning LLMs
- Origins of reasoning models
- Four methods for improving LLM reasoning
- Chain of Thought prompting
- Fine-tuning on correct reasoning traces
- Reinforcement learning
- Compute scaling
- Optimization objectives
- Verification
- References
Reasoning LLMs
Reasoning (or thinking) models are LLMs designed to produce intermediate reasoning steps. They have become a central focus of commercial LLM vendors. In the context of psychometric work, their utility beyond that of the standard LLMs described in earlier sections of this book is unproven.
Nonetheless, in psychometrics, where imprecision has high-stakes consequences and where measurement accuracy, adverse impact, and performance prediction often trade off against one another, understanding the contribution of LLM reasoning processes to each of these outcomes is both under-explored and important.
Origins of reasoning models
An influential paper by Wei et al. (2022) showed that asking language models to show their reasoning improved performance on certain benchmark tasks, such as math and word problems, common-sense reasoning, and symbolic and logical reasoning. They named this chain-of-thought (CoT) prompting.
Subsequent work presented in DeepSeek-AI (2025) demonstrated that reinforcement learning can incentivize the emergence of these reasoning behaviors, even without supervised reasoning traces. These LLM enhancements may have implications for methods involving LLMs in other sections of the book, such as AI scoring.
Four methods for improving LLM reasoning
There are four distinct methods for improving reasoning in LLMs: prompting (inference-time), supervised fine-tuning (training-time), optimization via rewards (reinforcement learning), and scaling test-time compute. Each works in one of two ways: by modifying the model's parameters during training, or by altering how computation and token selection happen at inference time.
Chain of Thought prompting
Chain-of-thought prompting elicits intermediate reasoning steps at inference time. By decomposing a problem into multiple steps rather than producing an answer in a single pass, it effectively increases the compute spent per problem without requiring any changes to the underlying parameters. It is most effective in large models where such reasoning abilities are latent.
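The inference-time nature of CoT can be made concrete with a minimal sketch: the only difference between a direct prompt and a CoT prompt is the instruction text itself. The function names and prompt wording below are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of chain-of-thought prompting: the model and its
# parameters are untouched; only the prompt text changes.

def build_direct_prompt(question: str) -> str:
    """Baseline prompt that asks for the answer directly."""
    return f"Question: {question}\nAnswer:"

def build_cot_prompt(question: str) -> str:
    """Wrap a question with an instruction to reason step by step."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer "
        "on a line beginning with 'Answer:'."
    )

prompt = build_cot_prompt(
    "A train travels 60 miles in 1.5 hours. What is its average speed?"
)
print(prompt)
```

Both prompts would be sent to the same model; only the CoT variant elicits the intermediate steps that the benchmark results above depend on.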
Fine-tuning on correct reasoning traces
Fine-tuning on known-correct reasoning traces improves accuracy on mathematical reasoning benchmarks. This ability can also generalize to non-mathematical problems, but it relies on high-quality annotated reasoning traces, which may constrain the model to human-like reasoning patterns (as suggested in DeepSeek-AI, 2025).
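A sketch of what such supervision looks like in practice: each annotated trace is turned into a prompt/target pair, and the model is fine-tuned to imitate the target. The record layout and prompt/target split below are illustrative assumptions, not a specific library's data format.

```python
# Sketch: preparing supervised fine-tuning data from annotated
# reasoning traces. The target includes the full trace, so the model
# learns to imitate the human reasoning pattern, not just the answer.

annotated = [
    {
        "question": "What is 12 * 7?",
        "trace": "10 * 7 = 70 and 2 * 7 = 14, so 70 + 14 = 84.",
        "answer": "84",
    },
]

def to_sft_example(record):
    """Concatenate trace and answer into the text the model must imitate."""
    return {
        "prompt": f"Question: {record['question']}\nReasoning:",
        "target": f" {record['trace']}\nAnswer: {record['answer']}",
    }

examples = [to_sft_example(r) for r in annotated]
print(examples[0]["target"])
```

Because the target reproduces the annotator's trace verbatim, the dependence on high-quality annotation, and the constraint toward human-like reasoning, are both visible in the data format itself.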
Reinforcement learning
Reinforcement learning allows exploration of different reasoning paths: responses that lead to correct outcomes receive a reward signal (often verified either by a rule-based checker in math or coding domains, or by a learned reward model), and the policy model is updated accordingly. This lets the model discover reasoning strategies beyond the training data by optimizing for task success rather than imitating human reasoning traces.
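The reward-assignment step can be sketched as follows. The rollouts are hard-coded stand-ins for sampled model completions, and the policy update itself (e.g., via a policy-gradient method) is omitted; the point is that the only supervision signal is whether the final outcome checks out.

```python
import re

# Toy sketch of outcome-reward assignment for RL on a reasoning task.
# A real pipeline would sample rollouts from the policy model and use
# these rewards to update its parameters; that step is omitted here.

def extract_answer(completion):
    """Pull the final answer from a completion ending in 'Answer: <x>'."""
    match = re.search(r"Answer:\s*(\S+)", completion)
    return match.group(1) if match else None

def outcome_reward(completion, ground_truth):
    """Rule-based checker: 1.0 iff the final answer matches exactly."""
    return 1.0 if extract_answer(completion) == ground_truth else 0.0

rollouts = [
    "10 * 7 = 70, 2 * 7 = 14, 70 + 14 = 84. Answer: 84",
    "12 * 7 is about 80. Answer: 80",
]
rewards = [outcome_reward(r, "84") for r in rollouts]
print(rewards)  # only the correct rollout earns a reward
```

Note that the reward says nothing about *how* the first rollout reasoned, which is exactly the limitation of outcome supervision discussed below under optimization objectives.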
Compute scaling
A related line of work shows that reasoning can also be improved at inference time by allocating additional compute (Snell et al., 2024). This can be achieved by sampling several candidate solutions or by iteratively refining a response. Given comparable compute budgets, these methods can allow smaller models to outperform larger ones without additional training.
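One common form of test-time compute scaling, sampling several candidates and taking a majority vote over their final answers (often called self-consistency), can be sketched in a few lines. The sampled answers below are hard-coded stand-ins for stochastic model outputs.

```python
from collections import Counter

# Sketch: scaling test-time compute by sampling N candidate answers
# and taking a majority vote. More samples cost more inference
# compute but require no change to the model itself.

def majority_vote(candidate_answers):
    """Return the most common final answer among sampled candidates."""
    answer, _count = Counter(candidate_answers).most_common(1)[0]
    return answer

samples = ["84", "84", "80", "84", "82"]  # five sampled final answers
print(majority_vote(samples))  # → 84
```

The compute budget here is simply the number of samples drawn, which is why such methods trade inference cost for accuracy rather than model size.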
Optimization objectives
A central challenge is that models can arrive at correct answers through incorrect reasoning (Lightman et al., 2023), meaning that outcome-based evaluation may fail to detect errors in the reasoning process itself. A key design choice with reasoning LLMs is therefore how reasoning is evaluated.
Early verification approaches by Lightman et al. (2023) addressed this by verifying each reasoning step individually (process supervision), while later RL methods by DeepSeek relied on easier-to-implement outcome-based rewards that check only the final answer (outcome supervision). The choice trades step-level accuracy for scalability.
Lightman et al. (2023) found that models trained with outcome supervision may produce correct answers despite imperfect intermediate reasoning and that process supervision produces more reliable reward models. In practice, however, DeepSeek demonstrated that outcome rewards, albeit less precise, are significantly easier to scale. They can produce strong performance when combined with reinforcement learning.
In other words, these methods differ primarily in their training objective. Supervised fine-tuning maximizes likelihood over human-provided reasoning traces. Reinforcement learning instead optimizes a reward function defined over either final answers (outcome supervision) or intermediate steps (process supervision).
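The contrast between the two objectives can be written out in generic notation. The symbols below are illustrative (they do not come from the cited papers): $x$ is a question, $z$ a human-written reasoning trace, $a$ an answer, $y$ a model-sampled trajectory, and $R$ a reward function over final answers or intermediate steps.

```latex
% Supervised fine-tuning: maximize likelihood of human traces
\mathcal{L}_{\text{SFT}}(\theta)
  = \mathbb{E}_{(x, z, a) \sim \mathcal{D}}
    \left[ \log p_\theta(z, a \mid x) \right]

% Reinforcement learning: maximize expected reward over sampled trajectories
\mathcal{J}_{\text{RL}}(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}
    \left[ R(y) \right]
```

The first objective can only reproduce reasoning present in $\mathcal{D}$; the second is indifferent to how $y$ reaches a rewarded outcome, which is the source of both its flexibility and its verification problem.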
Verification
Across these methods, verification occurs by evaluating model outputs, either at the level of individual reasoning steps or final answers using rule-based checks or learned reward models. Rule-based checks provide deterministic signals by comparing outputs against ground-truth answers or executable tests (e.g., exact numeric answers in math or passing unit tests in code).
This makes rule-based checkers simple and reliable but limited to domains with verifiable solutions. Learned reward models, on the other hand, are trained to approximate human judgments of quality or correctness. This allows evaluation in open-ended settings, but it introduces noise and potential biases from the reward model's training data.
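The two verification styles can be contrasted in a short sketch. The rule-based checker is deterministic; the learned reward model is mocked here by a trivial scoring heuristic, since a real one would be a trained neural network.

```python
# Sketch: the two verification signals used across these methods.

def rule_based_check(predicted, ground_truth):
    """Deterministic verification: exact match on a canonical answer.
    Reliable, but only usable where a ground truth exists."""
    return predicted.strip() == ground_truth.strip()

def mock_reward_model(response):
    """Stand-in for a learned reward model, scoring in [0, 1].
    A real one would be trained on human quality judgments; this toy
    heuristic just rewards longer responses, illustrating how a
    learned proxy can encode arbitrary biases."""
    return min(len(response.split()) / 50.0, 1.0)

print(rule_based_check("84", " 84 "))      # deterministic: True
print(mock_reward_model("Short answer."))  # noisy proxy score
```

The deliberately crude heuristic makes the trade-off visible: the rule-based check cannot be gamed but only covers verifiable domains, while the learned proxy covers open-ended outputs at the cost of exploitable biases (here, verbosity).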
References
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s verify step by step. arXiv preprint arXiv:2305.20050. https://doi.org/10.48550/arXiv.2305.20050
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. https://doi.org/10.48550/arXiv.2408.03314
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).