Emergent LLM capabilities

What is emergence?

Emergence in AI refers to capabilities that appear in models like ChatGPT only once they pass a certain size, and that were absent at smaller sizes. These capabilities are claimed to represent a phase transition to qualitatively new behavior. A standard example is 3-digit addition, which is inaccurate at smaller model sizes and whose appearance is not predicted by extrapolating smaller-model performance.

LLM emergence was proposed in an influential 2022 paper by Wei et al. that has been cited over 4,000 times. The authors presented plots for a variety of capabilities, with model scale on the x-axis and accuracy on the y-axis. Across many tasks, such as word unscrambling, accuracy is near chance for small models and much higher for larger ones.

Why is this relevant to psychometrics practitioners?

Emergence is highly relevant to psychometrics. If an emergence threshold exists, it could create item-security risks when new item types are solved through unexpected LLM reasoning paths. It may also suggest which types of LLM are best suited to psychometric design tasks, and it could raise questions about assumptions of linear development of capabilities in humans.

Prevailing views of emergence

The lay public may find the concept plausible based on their experience with these impressive models. In the ML community, its existence appears to be the accepted view, if citation rates are taken as a proxy for acceptance. Among ML methodology researchers, however, the idea is controversial (e.g., Schaeffer et al., 2023). Psychologists would likely want to see more sophisticated capability modeling before accepting that qualitatively distinct capabilities have appeared.

Potential criticisms of evidence

The x-axis steps in the Wei et al. examples represent massive computational leaps, so the plots cannot rule out that these developments are enhancements of existing capabilities rather than genuinely new ones. It is important to check whether the capabilities are observable at intermediate sizes, and with training data saturated with examples relevant to the capability.
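This concern can be made concrete with a toy simulation (entirely illustrative; the numbers and functional form are invented, not taken from Wei et al.): if a per-token skill improves smoothly with scale but the benchmark scores only exact matches on multi-token answers, the accuracy curve can look discontinuous even though nothing qualitatively new has appeared.

```python
import numpy as np

def per_token_accuracy(log_compute):
    """Smooth, gradual improvement of a hypothetical per-token skill."""
    return 1 / (1 + np.exp(-(log_compute - 20) / 3))  # logistic in log-compute

log_compute = np.linspace(10, 30, 9)   # stand-in for FLOPs exponents
p = per_token_accuracy(log_compute)    # smooth underlying capability
exact_match = p ** 10                  # all 10 answer tokens must be correct

# The underlying curve never jumps much between neighbouring scales,
# but the exact-match metric concentrates its rise in a narrow band:
# the signature pattern often read as "emergence".
```

Under this toy model, the largest step between adjacent scales is far bigger for `exact_match` than for the smooth per-token curve, mimicking the enhancement-versus-new-capability ambiguity described above.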

Wei et al. presented 2-D plots because benchmark datasets are often of fixed size. Their x-axes show training compute (FLOPs: floating-point operations) and their y-axes show accuracy on the putatively emergent capabilities. Ideally, however, 3-D plots would be preferred, where x is parameter count, y is a measure of how saturated the training data is with examples relevant to the capability, and z is accuracy.
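As a sketch of what the data behind such a 3-D plot could look like (the functional form and all numbers below are assumptions chosen for illustration), one can tabulate a toy accuracy surface over parameter count and data saturation:

```python
import numpy as np

def accuracy_surface(log_params, saturation):
    """Toy logistic combining scale and data saturation (assumed form)."""
    return 1 / (1 + np.exp(-(1.5 * log_params + 4.0 * saturation - 18)))

log_params = np.linspace(6, 12, 7)     # e.g. 1e6 .. 1e12 parameters (log10)
saturation = np.linspace(0.0, 1.0, 5)  # 0 = no relevant examples, 1 = saturated
P, S = np.meshgrid(log_params, saturation, indexing="ij")
Z = accuracy_surface(P, S)             # the z-axis of the proposed 3-D plot

# Holding scale fixed, more capability-relevant training examples raise
# accuracy, so a jump along the scale axis alone can be confounded with data.
```

In a surface like this, a slice at one saturation level reproduces the familiar 2-D scale-vs-accuracy curve, which is why the 2-D view alone cannot separate scale effects from data effects.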

The evidence would also be more convincing if the suddenly emerging abilities reached high accuracy, but for many of the capabilities accuracy was far from perfect even in the largest models. It is also important to know the architectural similarities and differences between the model scales at which a capability is absent and those at which it appears.

Rigorous psychometric modeling of emergence

Even if, after all of these points are addressed, the emergence of a capability still looks sudden, psychologists would want more. There are unexplored foundational psychometric techniques that could help decide whether increased scale delivers genuinely distinct capabilities or simply enhances existing ones.

For example, we should check whether model scale is a (ideally continuous) moderator of the parameters of a factor model describing the structure of the LLM capability, or whether a factor mixture model favors more than one class with different measurement models across the classes. Such analyses must adjust for non-independence due to repeated observations from the same LLM, which arises because shared weights generate the responses.
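A minimal sketch of the moderation idea, using simulated data and ordinary least squares in place of full structural equation modeling software (every variable, number, and the known latent score below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation: item loadings that drift with model scale would mean
# scale moderates the measurement model, i.e. the construct may be changing.
n_llms, n_items = 200, 6
log_scale = rng.uniform(8, 12, n_llms)   # hypothetical log parameter count
theta = rng.normal(size=n_llms)          # latent capability (known only in simulation)
loadings = 0.4 + 0.1 * (log_scale - 8)   # loadings grow with scale by construction
y = loadings[:, None] * theta[:, None] + rng.normal(scale=0.3, size=(n_llms, n_items))

# Crude moderation check for one item: regress its response on theta,
# log-scale, and their interaction. A nonzero interaction coefficient
# means the loading depends on scale (real analyses would use SEM tools).
X = np.column_stack([np.ones(n_llms), theta, log_scale, theta * log_scale])
beta, *_ = np.linalg.lstsq(X, y[:, 0], rcond=None)
```

The recovered interaction coefficient should sit near the simulated loading-by-scale slope of 0.1; in real data, theta would itself have to be estimated, and the clustering of repeated observations within an LLM would need explicit adjustment.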

These more rigorous tests of emergence would require replacing ad hoc tasks with psychometric scales administered to LLMs of many different sizes. They would be significant research undertakings requiring well-resourced interdisciplinary research groups, and their impracticality at this stage may prevent their investigation. It is important, nonetheless, to be aware of the ways these issues could be most thoroughly investigated.

References

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., ... & Fedus, W. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 55565-55581.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).