References

Part 1: AI psychometric foundations

Black, S., Gao, L., Wang, P., Leahy, C., & Biderman, S. (2021). GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. https://github.com/EleutherAI/gpt-neo

Brickman, J., Gupta, M., & Oltmanns, J. R. (2025). Large language models for psychological assessment: A comprehensive overview. Advances in Methods and Practices in Psychological Science, 8(3), 25152459251343582.

Burkov, A. (2025). The hundred-page language models book: Hands-on with PyTorch. Andriy Burkov.

Canagasuriam, D., & Lukacik, E. R. (2025). ChatGPT, can you take my job interview? Examining artificial intelligence cheating in the asynchronous video interview. International Journal of Selection and Assessment, 33(1), e12491.

Casabianca, J. M., McCaffrey, D. F., Johnson, M. S., Alper, N., & Zubenko, V. (2025). Validity arguments for constructed response scoring using generative artificial intelligence applications. arXiv preprint arXiv:2501.02334.

Chiappone, F., Marocco, D., & Milano, N. (2026). Large Language Models as Simulative Agents for Neurodivergent Adult Psychometric Profiles. arXiv preprint arXiv:2601.15319.

Dai, W., Tsai, Y. S., Lin, J., Aldino, A., Jin, H., Li, T., ... & Chen, G. (2024). Assessing the proficiency of large language models in automatic feedback generation: An evaluation study. Computers and Education: Artificial Intelligence, 7, 100299.

Financial Times. (2024, October 20). How AI groups are infusing their chatbots with personality. Financial Times.

Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.

Guenole, N., Samo, A., & Sun, T. (2024). Pseudo-Discrimination Parameters from Language Embeddings. OSF Preprint. https://osf.io/preprints/psyarxiv/9a4qx_v1

Guenole, N., D'Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. (2025). Enhancing Scale Development: Pseudo Factor Analysis of Language Embedding Similarity Matrices. OSF Preprint. https://osf.io/preprints/psyarxiv/vf3se_v2

Hernandez, I., & Nie, W. (2023). The AI‐IP: Minimizing the guesswork of personality scale item development through artificial intelligence. Personnel Psychology, 76(4), 1011-1035.

Hommel, B. E., Wollang, F. J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749-772.

Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2022). Automated video interview personality assessments: Reliability, validity, and generalizability investigations. Journal of Applied Psychology, 107(8), 1323.

Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56(8), 8214-8237.

Jung, J. Y., Tyack, L., & von Davier, M. (2024). Combining machine translation and automated scoring in international large-scale assessments. Large-scale Assessments in Education, 12(1), 10.

Karpathy, A. (2024). nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. https://github.com/karpathy/nanoGPT

Karpathy, A. (2025, February 5). Deep dive into LLMs like ChatGPT [Video]. YouTube. https://www.youtube.com/watch?v=7xTGNNLPyMI

Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.

Laverghetta, A., Jr., Nighojkar, A., Mirzakhalov, J., & Licato, J. (2021, July). Predicting human psychometric properties using computational language models. In The Annual Meeting of the Psychometric Society (pp. 151-169). Springer International Publishing.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient backprop. In Neural networks: Tricks of the trade (pp. 9-50). Springer.

Maeda, H. (2025). Field-testing multiple-choice questions with AI examinees: English grammar items. Educational and Psychological Measurement, 85(2), 221-244.

Maharjan, J., Jin, R., Zhu, J., & Kenne, D. (2025). Psychometric evaluation of large language model embeddings for personality trait prediction. Journal of Medical Internet Research, 27, e75347.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman.

Milano, N., Ponticorvo, M., & Marocco, D. (2026). Human expertise and large language model embeddings in the content validity assessment of personality tests. Educational and Psychological Measurement, 86(1), 30-53.

Ormerod, C., Jafari, A., Lottridge, S., Patel, M., Harris, A., & van Wamelen, P. (2021). The effects of data size on Automated Essay Scoring engines. arXiv preprint arXiv:2108.13275.

Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science, 19(5), 808-826.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Raschka, S. (2024). Build a Large Language Model (From Scratch). Simon and Schuster.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2024, September). Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation. OSF Preprint. https://osf.io/preprints/psyarxiv/fgbj4_v1

Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Shrestha, I., Tay, L., & Srinivasan, P. (2025). Robust Bias Detection in MLMs and its Application to Human Trait Ratings. arXiv preprint arXiv:2502.15600.

Scott, D., & Suppes, P. (1958). Foundational aspects of theories of measurement. The Journal of Symbolic Logic, 23(2), 113-128.

Speer, A. B., Delacruz, A. Y., Chawota, T. A., Perrotta, J., & Rudolph, C. W. (2025). Unpacking the validity of open-ended personality assessments using fine-tuned large language models. Organizational Research Methods, 10944281251413746.

Suppes, P. (2013). Studies in the Methodology and Foundations of Science: Selected Papers from 1951 to 1969 (Vol. 22). Springer Science & Business Media.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

von Davier, A. A. (2017). Computational psychometrics in support of collaborative educational assessments. Journal of Educational Measurement, 54(1), 3-11.

von Davier, M. (2019). Training Optimus Prime, M.D.: Generating medical certification items by fine-tuning OpenAI's gpt2 transformer model. arXiv preprint arXiv:1908.08594.

Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 1-11.

Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., ... & Du, M. (2024). Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2), 1-38.

Part 2: How LLMs learn

Binz, M., Akata, E., Bethge, M., Brändle, F., Callaway, F., Coda-Forno, J., Dayan, P., Demircan, C., Eckstein, M. K., Éltető, N., Griffiths, T. L., Haridi, S., Jagadish, A. K., Ji-An, L., Kipnis, A., Kumar, S., Ludwig, T., Mathony, M., Mattar, M., Modirshanechi, A., Nath, S. S., Peterson, J. C., Rmus, M., Russek, E. M., Saanum, T., Schubert, J. A., Schulze Buschoff, L. M., Singhi, N., Sui, X., Thalmann, M., Theis, F., Truong, V., Udandarao, V., Voudouris, K., Wilson, R., Witte, K., Wu, S., Wulff, D. U., Xiong, H., & Schulz, E. (2025). Centaur: A foundation model of human cognition (Version 3). arXiv. https://arxiv.org/abs/2410.20268

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). https://arxiv.org/abs/2305.14314

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In International conference on machine learning (pp. 2790-2799). PMLR.

Hommel, B. E., Wollang, F.-J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749–772. https://doi.org/10.1007/s11336-021-09823-9

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, W., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR 2022). https://arxiv.org/abs/2106.09685

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s verify step by step. arXiv preprint arXiv:2305.20050. https://doi.org/10.48550/arXiv.2305.20050

Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.

Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. https://doi.org/10.48550/arXiv.2408.03314

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.

Part 3: Psychometric AI methods & tools

Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Nimon, K. F. (2021). Editorial: Semantic Algorithms in the Assessment of Attitudes and Personality. Frontiers in Psychology, 12, 720559. https://doi.org/10.3389/fpsyg.2021.720559

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319. https://doi.org/10.1037/1040-3590.7.3.309

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72(1), 19.

Equal Employment Opportunity Commission (EEOC), Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press.

Evers, A., Muñiz, J., Hagemeister, C., Høstmælingen, A., Lindley, P., Sjöberg, A., & Bartram, D. (2013). Assessing the quality of tests: Revision of the EFPA review model. Psicothema, 25(3), 283-291. https://doi.org/10.7334/psicothema2013.97

Fyffe, S., Lee, P., & Kaplan, S. (2024). “Transforming” personality scale development: Illustrating the potential of state-of-the-art natural language processing. Organizational Research Methods, 27(2), 265-300. https://journals.sagepub.com/doi/abs/10.1177/10944281231155771

Hernandez, I., & Nie, W. (2023). The AI‐IP: Minimizing the guesswork of personality scale item development through artificial intelligence. Personnel Psychology, 76(4), 1011-1035. https://onlinelibrary.wiley.com/doi/full/10.1111/peps.12543

Hickman, L., Liff, J., Rottman, C., & Calderwood, C. (2024). The Effects of the Training Sample Size, Ground Truth Reliability, and NLP Method on Language-Based Automatic Interview Scores’ Psychometric Properties. Organizational Research Methods, 10944281241264027. https://journals.sagepub.com/doi/abs/10.1177/10944281241264027

Hommel, B. E., Wollang, F. J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749-772. https://www.cambridge.org/core/journals/psychometrika/article/transformerbased-deep-neural-language-modeling-for-constructspecific-automatic-item-generation

Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56(8), 8214-8237. https://link.springer.com/article/10.3758/s13428-024-02455-8

International Test Commission. (2014). ITC guidelines on quality control in scoring, test analysis, and reporting of test scores. International Journal of Testing, 14(3), 195-217. https://doi.org/10.1080/15305058.2014.918040

Karpukhin, V., Oguz, B., Min, S., Lewis, P. S., Wu, L., Edunov, S., et al. (2020, November). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6769–6781).

Lee, P., Fyffe, S., Son, M., Jia, Z., & Yao, Z. (2023). A paradigm shift from “human writing” to “machine generation” in personality test development: An application of state-of-the-art natural language processing. Journal of Business and Psychology, 38(1), 163-190.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635-694.

McDonald, R. P. (2013). Test theory: A unified treatment. Psychology Press.

Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). Industrial and Organizational Psychology, 11(S1), 1-97. https://doi.org/10.1017/iop.2018.195

von Davier, A. A., Mislevy, R. J., & Hao, J. (Eds.). (2022). Computational psychometrics: New methodologies for a new generation of digital learning and assessment: With examples in R and Python. Springer Nature.

Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 1-11.

Xue, M., Xiao, X., Liu, Y., & Wilson, M. (2026). On the consistency of automatic scoring with large language models. Educational and Psychological Measurement. Advance online publication. https://doi.org/10.1177/00131644261418138

Part 4: Psychometric AI hybrid models

Guenole, N., Samo, A., & Sun, T. (2024). Pseudo-Discrimination Parameters from Language Embeddings. OSF Preprint. https://osf.io/preprints/psyarxiv/9a4qx_v1

Guenole, N., D'Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. (2025). Enhancing Scale Development: Pseudo Factor Analysis of Language Embedding Similarity Matrices. OSF Preprint. https://osf.io/preprints/psyarxiv/vf3se_v2

Milano, N., Luongo, M., Ponticorvo, M., & Marocco, D. (2025). Semantic analysis of test items through large language model embeddings predicts a-priori factorial structure of personality tests. Current Research in Behavioral Sciences, 8, 100168. https://www.sciencedirect.com/science/article/pii/S2666518225000014

Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2024, September). Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation. https://osf.io/preprints/psyarxiv/fgbj4_v1

Part 5: End-to-end neural models

Part 6: Critical reflections

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).

Forsyth, D. R. (1980). A taxonomy of ethical ideologies. Journal of Personality and Social Psychology, 39(1), 175.

Lindblom, C. (2018). The science of “muddling through”. In Classic readings in urban planning (pp. 31-40). Routledge.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press.

O'Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens democracy. Crown.

Westin, A. F. (1968). Privacy and freedom. Washington and Lee Law Review, 25(1), 166.

Whittaker, M., Crawford, K., Dobbe, R., Fried, G., Kaziunas, E., Mathur, V., West, S. M., Richardson, R., Schultz, J., & Schwartz, O. (2018). AI Now report 2018 (pp. 1-62). AI Now Institute at New York University.

Winner, L. (2017). Do artifacts have politics? In Computer ethics (pp. 177-192). Routledge.

Psychometrics.ai

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
