- Conventional standards
- Aims for this section
- Standards for Educational and Psychological Testing (2014)
- Uniform Guidelines on Employee Selection Procedures (1978)
- Principles for the Validation and Use of Personnel Selection Procedures (2018)
- EFPA Test Review Model (2025)
- EU AI Act requirements (2024)
- References
Conventional standards
The objectives in psychological measurement and how they are conventionally achieved are clearly elaborated in foundational journal articles in psychology (e.g., Borsboom et al., 2004; Clark & Watson, 1995; Drasgow, 1987; Loevinger, 1957) and classic texts in the field (e.g., Bollen, 1989; Embretson & Reise, 2000; Lord & Novick, 1968; McDonald, 1999).
The emergence of AI does not change these ultimate objectives in psychological measurement. In fact, many of the most promising applications of AI in measurement bridge the psychometric methods described in the aforementioned resources with the techniques now available from computer science.
Many of the AI methods proposed in upcoming sections are novel and experimental. When they are applied in practice, they must demonstrate compliance with established standards for reliability, validity, and bias (AERA, APA, NCME, 2014; EEOC, 1978; International Test Commission, 2014; SIOP, 2018).
AI-based assessments introduce additional transparency challenges regarding data provenance, both in model training data and in how scores are derived from inputs. These issues make compliance with privacy regulations and legal frameworks particularly important, including GDPR, the EU AI Act, and applicable US legislation governing automated assessment and decision-making.
Novel concepts, such as validity evidence based on semantic relatedness, represent extensions, rather than replacements, of conventional psychometric criteria. These new methods must be integrated into our foundational quality frameworks, showing how they contribute to achieving the same standards required of non-AI measurement.
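To make the idea concrete, the sketch below shows one way validity evidence based on semantic relatedness might be computed, assuming the open-source sentence-transformers library and a generic embedding model; the model name, items, and construct definitions are illustrative placeholders, not a prescribed method.

```python
# Sketch: semantic-relatedness evidence for item-construct alignment.
# Items should relate more strongly to their intended construct than to
# an alternative construct. All names and texts below are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

construct = "Conscientiousness: being organized, dependable, and self-disciplined."
alternative = "Extraversion: being sociable, assertive, and energetic."
items = [
    "I keep my workspace tidy and plan my tasks in advance.",
    "I follow through on commitments even when it is inconvenient.",
]

item_embs = model.encode(items, convert_to_tensor=True)
target_emb = model.encode(construct, convert_to_tensor=True)
alt_emb = model.encode(alternative, convert_to_tensor=True)

for text, emb in zip(items, item_embs):
    target_sim = util.cos_sim(emb, target_emb).item()
    alt_sim = util.cos_sim(emb, alt_emb).item()
    print(f"{text[:45]:<45} target={target_sim:.2f} alt={alt_sim:.2f}")
```

Evidence of this kind supplements, but does not replace, conventional evidence drawn from empirical item responses.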
Ultimately, new AI methods must be judged against conventional psychometric standards using empirical candidate responses. They must demonstrate reliability and validity, show an absence of bias, meet criteria for perceived fairness, and provide acceptable candidate experiences. Only evidence meeting these standards can establish their utility.
Aims for this section
Psychologists have conventionally considered the acceptability of psychometric assessment methods in terms of their reliability and validity. However, the age of AI may require new concepts for describing AI measurement systems, or at least extensions of these pre-existing ones. New requirements from regulatory bodies must also be considered.
The key documents describing reliability and validity in workplace assessment, the focus of this book, are the Standards for Educational and Psychological Testing (2014), the Uniform Guidelines on Employee Selection Procedures (EEOC, 1978), the Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 2018), and the European Federation of Psychologists' Associations' (EFPA) Test Review Model (2025). More recently, legislation such as the EU AI Act (2024) has emerged.
In this section, I review key aspects of each of these documents; the topic of bias is reviewed in a separate section. No brief review can accurately summarize all aspects of these important documents, and I refer readers to the original documents hyperlinked below. This section is intended as a preliminary reminder of these standards for creators of AI systems, highlighting core elements rather than replacing the documents themselves.
This brief overview of standards documents in our field has two goals. The first is to remind ourselves of the relevant standards that AI measurement methodologies must meet; upcoming sections of this book discuss how these apply to AI methods. The second is to highlight the standards with apparent relevance to AI. However, because most standards pre-date recent AI breakthroughs, they will need updating before their application to AI systems can be considered clear.
Standards for Educational and Psychological Testing (2014)
The Standards define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests”. They contain specific standards for occupational settings in an employment testing section, paraphrased as follows.
- 11.5 Do local validation when research is limited and ground it in prior research.
- 11.6 Use local validation only when jobs, samples, and measures are adequate.
- 11.7 Criteria must represent key job-related behaviors.
- 11.8 Note and report artifacts affecting results.
- 11.9 Generalize studies only when jobs and measures closely match.
- 11.10 Show test scores link to success across job levels where the test will be used.
- 11.11 Use content validity only if the new job and context match.
- 11.12 Show the test measures the predictor construct and links to the criterion.

The Standards define reliability as “the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group”. These standards are general rather than occupationally specific.

- 2.0 Provide evidence of reliability for each score use.
- 2.1 Define the conditions under which reliability is evaluated.
- 2.2 Match reliability evidence to test design and purpose.
- 2.3 Report reliability for all reported scores.
- 2.4 Include reliability for change or difference scores.
- 2.5 Use estimation methods suited to test structure.
- 2.6 Don't interchange different reliability coefficients.
- 2.7 Report interrater and internal consistency when scoring is subjective.
- 2.8 Provide local reliability data for locally scored tests.
- 2.9 Give reliability evidence for all test versions.
- 2.10 Report reliability separately for major test variations.
- 2.16 Estimate consistency of classifications across replications.
- 2.17 Report standard errors for group means.
- 2.18 Reflect sampling design in reliability analyses.
- 2.19 Document reliability methods, samples, and findings.
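To illustrate the kind of evidence these standards call for when applied to AI scoring, the sketch below estimates internal consistency (Cronbach's alpha) and human-AI interrater agreement (cf. standard 2.7) on simulated data; the data are placeholders, and this is not a complete reliability analysis.

```python
# Sketch: two common pieces of reliability evidence on simulated data.
# Cronbach's alpha addresses internal consistency; the human-AI correlation
# is one simple form of interrater evidence when scoring is subjective.
import numpy as np
from scipy import stats

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix of item scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)
trait = rng.normal(size=200)                          # simulated latent trait
items = trait[:, None] + rng.normal(0, 1, (200, 5))   # five noisy item scores

human_scores = trait + rng.normal(0, 0.5, 200)        # human ratings
ai_scores = trait + rng.normal(0, 0.5, 200)           # AI-derived scores

print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
r, p = stats.pearsonr(human_scores, ai_scores)
print(f"Human-AI interrater r = {r:.2f} (p = {p:.2g})")
```

A full analysis would also report standard errors, classification consistency, and reliability for each reported score and test variation, as the standards above require.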
Although the 2014 Standards do not mention “faithfulness” or “plausibility,” their requirements for clear rationales, documentation, and minimization of construct-irrelevant variance align conceptually with modern expectations for transparent and interpretable AI models.
Read the full Standards.
Uniform Guidelines on Employee Selection Procedures (1978)
According to the Uniform Guidelines, there are three acceptable types of validity studies: criterion-related validity studies, content validity studies, and construct validity studies. More specifically, the descriptions are as follows.
- “Evidence of the validity of a test or other selection procedure by a criterion-related validity study should consist of empirical data demonstrating that the selection procedure is predictive of or significantly correlated with important elements of job performance”.
- “Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated”.
- “Evidence of the validity of a test or other selection procedure through a construct validity study should consist of data showing that the procedure measures the degree to which candidates have identifiable characteristics which have been determined to be important in successful performance in the job for which the candidates are to be evaluated”.
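At its simplest, a criterion-related validity study of the kind described above reduces to correlating selection-procedure scores with a job performance criterion. The sketch below uses simulated data and an assumed criterion reliability to illustrate observed and corrected validity coefficients; it is not a substitute for a properly designed study.

```python
# Sketch: criterion-related validity evidence on simulated data, with a
# standard correction for unreliability in the criterion measure.
# The criterion reliability value is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 300
selection = rng.normal(size=n)                        # e.g., AI-scored interview
performance = 0.35 * selection + rng.normal(0, 1, n)  # supervisor ratings

r, p = stats.pearsonr(selection, performance)
criterion_reliability = 0.70                          # assumed, e.g., interrater
r_corrected = r / np.sqrt(criterion_reliability)      # attenuation correction

print(f"Observed r = {r:.2f} (p = {p:.2g}); corrected r = {r_corrected:.2f}")
```

Real studies must also address sample adequacy, range restriction, and the job-relatedness of the criterion measure.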
The words faithfulness and plausibility do not appear in the Uniform Guidelines, but, once again, these criteria could be seen as implied by the requirements for validity studies.
Read the full Uniform Guidelines.
Principles for the Validation and Use of Personnel Selection Procedures (2018)
The SIOP Principles state that the guide “embraces the Standards' definition of validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’” and that “validity is the most important consideration in developing and evaluating selection procedures.” The guide provides more detailed practical design requirements suited to work-related settings. The SIOP Principles extend the Standards with work-specific technical advice, emphasizing thorough documentation and defensible validation, concepts that align in spirit with faithfulness and plausibility, though these terms are not used directly.
Read the original SIOP Principles.
EFPA Test Review Model (2025)
This is a framework for the thorough description and evaluation of psychometric tests from the European Federation of Psychologists' Associations (EFPA). It contains detailed guidance for producing materials that justify the reliability and validity of psychometric assessment tests.
Read the Test Review Model.
EU AI Act requirements (2024)
Closer to the requirements of faithfulness and plausibility are the requirements in AI-specific legislation. Under the EU Artificial Intelligence Act, AI systems used in hiring processes are explicitly classified as high-risk, subjecting them to stringent regulatory obligations to safeguard fundamental rights, mitigate bias, and ensure transparency.
The requirements for high-risk systems must also be addressed during development and prior to deployment. For instance, one requirement is documentation of relevant data-preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment, and aggregation, together with an assessment of the availability, quantity, and suitability of the data sets that are needed.
Read the requirements for high-risk systems under the EU AI Act.
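One lightweight way to meet this documentation duty is to keep a structured record of every data-preparation operation. The sketch below defines a hypothetical record format in Python; the field names and values are illustrative assumptions, not a schema mandated by the Act.

```python
# Hypothetical record for documenting data-preparation operations of the
# kind the EU AI Act's data-governance provisions (Article 10) describe.
# Field names and values are illustrative, not a mandated schema.
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class DataPreparationStep:
    operation: str        # e.g., "annotation", "cleaning", "aggregation"
    description: str      # what was done and why
    performed_on: date
    performed_by: str     # responsible team or role
    datasets: list[str] = field(default_factory=list)

steps = [
    DataPreparationStep(
        operation="labelling",
        description="Two trained raters scored interview transcripts; "
                    "disagreements larger than one point were adjudicated.",
        performed_on=date(2024, 11, 5),
        performed_by="Assessment science team",
        datasets=["interview_transcripts_v3"],
    ),
]

print(json.dumps([asdict(s) for s in steps], default=str, indent=2))
```

Records like these also support the Act's required assessment of the availability, quantity, and suitability of the data sets used.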
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061-1071.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319. https://doi.org/10.1037/1040-3590.7.3.309
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72(1), 19-29.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Uniform guidelines on employee selection procedures, 29 C.F.R. § 1607, as amended up to September 29, 2025. U.S. Government Publishing Office. https://www.ecfr.gov/current/title-29/part-1607
European Parliament & Council of the European Union. (2024, June 13). Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence and amending certain Union legislative acts. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
Evers, A., Muñiz, J., Hagemeister, C., Høstmælingen, A., Lindley, P., Sjöberg, A., & Bartram, D. (2013). Assessing the quality of tests: Revision of the EFPA review model. Psicothema, 25(3), 283-291. https://doi.org/10.7334/psicothema2013.97
International Test Commission. (2014). ITC guidelines on quality control in scoring, test analysis, and reporting of test scores. International Journal of Testing, 14(3), 195-217. https://doi.org/10.1080/15305058.2014.918040
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635-694.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
McDonald, R. P. (1999). Test theory: A unified treatment. Lawrence Erlbaum Associates.
Molnar, C. (2020). Interpretable machine learning. Lulu.com.
Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., & Müller, K. R. (2021). Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3), 247-278.
Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). Industrial and Organizational Psychology, 11(S1), 1-97. https://doi.org/10.1017/iop.2018.195