References

Part 1: AI psychometric foundations

Black, S., Gao, L., Wang, P., Leahy, C., & Biderman, S. (2021). GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. https://github.com/EleutherAI/gpt-neo

Brickman, J., Gupta, M., & Oltmanns, J. R. (2025). Large language models for psychological assessment: A comprehensive overview. Advances in Methods and Practices in Psychological Science, 8(3), 25152459251343582.

Burkov, A. (2025). The hundred-page language models book: Hands-on with PyTorch. Andriy Burkov.

Canagasuriam, D., & Lukacik, E. R. (2025). ChatGPT, can you take my job interview? Examining artificial intelligence cheating in the asynchronous video interview. International Journal of Selection and Assessment, 33(1), e12491.

Casabianca, J. M., McCaffrey, D. F., Johnson, M. S., Alper, N., & Zubenko, V. (2025). Validity arguments for constructed response scoring using generative artificial intelligence applications. arXiv preprint arXiv:2501.02334.

Chiappone, F., Marocco, D., & Milano, N. (2026). Large Language Models as Simulative Agents for Neurodivergent Adult Psychometric Profiles. arXiv preprint arXiv:2601.15319.

Dai, W., Tsai, Y. S., Lin, J., Aldino, A., Jin, H., Li, T., ... & Chen, G. (2024). Assessing the proficiency of large language models in automatic feedback generation: An evaluation study. Computers and Education: Artificial Intelligence, 7, 100299.

Financial Times. (2024, October 20). How AI groups are infusing their chatbots with personality. Financial Times.

Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.

Guenole, N., Samo, A., & Sun, T. (2024). Pseudo-Discrimination Parameters from Language Embeddings. OSF Preprint. https://osf.io/preprints/psyarxiv/9a4qx_v1

Guenole, N., D'Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. (2025). Enhancing Scale Development: Pseudo Factor Analysis of Language Embedding Similarity Matrices. OSF Preprint. https://osf.io/preprints/psyarxiv/vf3se_v2

Hernandez, I., & Nie, W. (2023). The AI‐IP: Minimizing the guesswork of personality scale item development through artificial intelligence. Personnel Psychology, 76(4), 1011-1035.

Hommel, B. E., Wollang, F. J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749-772.

Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2022). Automated video interview personality assessments: Reliability, validity, and generalizability investigations. Journal of Applied Psychology, 107(8), 1323.

Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56(8), 8214-8237.

Jung, J. Y., Tyack, L., & von Davier, M. (2024). Combining machine translation and automated scoring in international large-scale assessments. Large-scale Assessments in Education, 12(1), 10.

Karpathy, A. (2024). nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. https://github.com/karpathy/nanoGPT

Karpathy, A. (2025, February 5). Deep dive into LLMs like ChatGPT [Video]. YouTube. https://www.youtube.com/watch?v=7xTGNNLPyMI

Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.

Laverghetta, A., Jr., Nighojkar, A., Mirzakhalov, J., & Licato, J. (2021, July). Predicting human psychometric properties using computational language models. In The Annual Meeting of the Psychometric Society (pp. 151-169). Springer International Publishing.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient backprop. In Neural networks: Tricks of the trade (pp. 9-50). Springer.

Maeda, H. (2025). Field-testing multiple-choice questions with AI examinees: English grammar items. Educational and Psychological Measurement, 85(2), 221-244.

Maharjan, J., Jin, R., Zhu, J., & Kenne, D. (2025). Psychometric evaluation of large language model embeddings for personality trait prediction. Journal of Medical Internet Research, 27, e75347.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman.

Milano, N., Ponticorvo, M., & Marocco, D. (2026). Human expertise and large language model embeddings in the content validity assessment of personality tests. Educational and Psychological Measurement, 86(1), 30-53.

Ormerod, C., Jafari, A., Lottridge, S., Patel, M., Harris, A., & van Wamelen, P. (2021). The effects of data size on Automated Essay Scoring engines. arXiv preprint arXiv:2108.13275.

Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science, 19(5), 808-826.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Raschka, S. (2024). Build a Large Language Model (From Scratch). Simon and Schuster.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2024, September). Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation. OSF Preprint. https://osf.io/preprints/psyarxiv/fgbj4_v1

Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Shrestha, I., Tay, L., & Srinivasan, P. (2025). Robust Bias Detection in MLMs and its Application to Human Trait Ratings. arXiv preprint arXiv:2502.15600.

Scott, D., & Suppes, P. (1958). Foundational aspects of theories of measurement. The Journal of Symbolic Logic, 23(2), 113-128.

Speer, A. B., Delacruz, A. Y., Chawota, T. A., Perrotta, J., & Rudolph, C. W. (2025). Unpacking the validity of open-ended personality assessments using fine-tuned large language models. Organizational Research Methods, 10944281251413746.

Suppes, P. (2013). Studies in the Methodology and Foundations of Science: Selected Papers from 1951 to 1969 (Vol. 22). Springer Science & Business Media.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

von Davier, A. A. (2017). Computational psychometrics in support of collaborative educational assessments. Journal of Educational Measurement, 54(1), 3-11.

von Davier, M. (2019). Training Optimus Prime, M.D.: Generating medical certification items by fine-tuning OpenAI's gpt2 transformer model. arXiv preprint arXiv:1908.08594.

Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 1-11.

Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., ... & Du, M. (2024). Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2), 1-38.

Part 2: How LLMs learn

Binz, M., Akata, E., Bethge, M., Brändle, F., Callaway, F., Coda-Forno, J., Dayan, P., Demircan, C., Eckstein, M. K., Éltető, N., Griffiths, T. L., Haridi, S., Jagadish, A. K., Ji-An, L., Kipnis, A., Kumar, S., Ludwig, T., Mathony, M., Mattar, M., Modirshanechi, A., Nath, S. S., Peterson, J. C., Rmus, M., Russek, E. M., Saanum, T., Schubert, J. A., Schulze Buschoff, L. M., Singhi, N., Sui, X., Thalmann, M., Theis, F., Truong, V., Udandarao, V., Voudouris, K., Wilson, R., Witte, K., Wu, S., Wulff, D. U., Xiong, H., & Schulz, E. (2025). Centaur: A foundation model of human cognition (Version 3). arXiv. https://arxiv.org/abs/2410.20268

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). https://arxiv.org/abs/2305.14314

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In International conference on machine learning (pp. 2790-2799). PMLR.

Hommel, B. E., Wollang, F.-J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749–772. https://doi.org/10.1007/s11336-021-09823-9

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, W., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR 2022). https://arxiv.org/abs/2106.09685

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s verify step by step. arXiv preprint arXiv:2305.20050. https://doi.org/10.48550/arXiv.2305.20050

Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.

Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. https://doi.org/10.48550/arXiv.2408.03314

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.

Part 3: Psychometric AI methods & tools

Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Nimon, K. F. (2021). Editorial: Semantic Algorithms in the Assessment of Attitudes and Personality. Frontiers in Psychology, 12, 720559. https://doi.org/10.3389/fpsyg.2021.720559

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319. https://doi.org/10.1037/1040-3590.7.3.309

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72(1), 19.

Equal Employment Opportunity Commission (EEOC), Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press.

Evers, A., Muñiz, J., Hagemeister, C., Høstmælingen, A., Lindley, P., Sjöberg, A., & Bartram, D. (2013). Assessing the quality of tests: Revision of the EFPA review model. Psicothema, 25(3), 283-291. https://doi.org/10.7334/psicothema2013.97

Fyffe, S., Lee, P., & Kaplan, S. (2024). “Transforming” personality scale development: Illustrating the potential of state-of-the-art natural language processing. Organizational Research Methods, 27(2), 265-300. https://journals.sagepub.com/doi/abs/10.1177/10944281231155771

Hernandez, I., & Nie, W. (2023). The AI‐IP: Minimizing the guesswork of personality scale item development through artificial intelligence. Personnel Psychology, 76(4), 1011-1035. https://onlinelibrary.wiley.com/doi/full/10.1111/peps.12543

Hickman, L., Liff, J., Rottman, C., & Calderwood, C. (2024). The Effects of the Training Sample Size, Ground Truth Reliability, and NLP Method on Language-Based Automatic Interview Scores’ Psychometric Properties. Organizational Research Methods, 10944281241264027. https://journals.sagepub.com/doi/abs/10.1177/10944281241264027

Hommel, B. E., Wollang, F. J. M., Kotova, V., Zacher, H., & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749-772. https://www.cambridge.org/core/journals/psychometrika/article/transformerbased-deep-neural-language-modeling-for-constructspecific-automatic-item-generation

Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56(8), 8214-8237. https://link.springer.com/article/10.3758/s13428-024-02455-8

International Test Commission. (2014). ITC guidelines on quality control in scoring, test analysis, and reporting of test scores. International Journal of Testing, 14(3), 195-217. https://doi.org/10.1080/15305058.2014.918040

Karpukhin, V., Oguz, B., Min, S., Lewis, P. S., Wu, L., Edunov, S., et al. (2020, November). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6769–6781).

Lee, P., Fyffe, S., Son, M., Jia, Z., & Yao, Z. (2023). A paradigm shift from “human writing” to “machine generation” in personality test development: An application of state-of-the-art natural language processing. Journal of Business and Psychology, 38(1), 163-190.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635-694.

McDonald, R. P. (2013). Test theory: A unified treatment. Psychology Press.

Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). Industrial and Organizational Psychology, 11(S1), 1-97. https://doi.org/10.1017/iop.2018.195

von Davier, A. A., Mislevy, R. J., & Hao, J. (Eds.). (2022). Computational psychometrics: New methodologies for a new generation of digital learning and assessment: With examples in R and Python. Springer Nature.

Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 1-11.

Xue, M., Xiao, X., Liu, Y., & Wilson, M. (2026). On the consistency of automatic scoring with large language models. Educational and Psychological Measurement. Advance online publication. https://doi.org/10.1177/00131644261418138

Part 4: Psychometric AI hybrid models

Guenole, N., Samo, A., & Sun, T. (2024). Pseudo-Discrimination Parameters from Language Embeddings. OSF Preprint. https://osf.io/preprints/psyarxiv/9a4qx_v1

Guenole, N., D'Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. (2025). Enhancing Scale Development: Pseudo Factor Analysis of Language Embedding Similarity Matrices. OSF Preprint. https://osf.io/preprints/psyarxiv/vf3se_v2

Milano, N., Luongo, M., Ponticorvo, M., & Marocco, D. (2025). Semantic analysis of test items through large language model embeddings predicts a-priori factorial structure of personality tests. Current Research in Behavioral Sciences, 8, 100168. https://www.sciencedirect.com/science/article/pii/S2666518225000014

Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2024, September). Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation. https://osf.io/preprints/psyarxiv/fgbj4_v1

Part 5: End-to-end neural models

Part 6: Critical reflections

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).

Forsyth, D. R. (1980). A taxonomy of ethical ideologies. Journal of Personality and Social Psychology, 39(1), 175.

Lindblom, C. (2018). The science of “muddling through”. In Classic readings in urban planning (pp. 31-40). Routledge.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press.

O'Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens democracy. Crown.

Westin, A. F. (1968). Privacy and freedom. Washington and Lee Law Review, 25(1), 166.

Whittaker, M., Crawford, K., Dobbe, R., Fried, G., Kaziunas, E., Mathur, V., West, S. M., Richardson, R., Schultz, J., & Schwartz, O. (2018). AI Now report 2018 (pp. 1-62). AI Now Institute at New York University.

Winner, L. (2017). Do artifacts have politics? In Computer ethics (pp. 177-192). Routledge.

Psychometrics.ai

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
