- Traditional considerations
- Aims for this section
- Standards for Educational and Psychological Testing (2014)
- Uniform Guidelines on Employee Selection Procedures (1978)
- Principles for the Validation and Use of Personnel Selection Procedures (2018)
- EFPA Test Review Model (2025)
- EU AI Act requirements (2024)
- Why Faithfulness and Plausibility matter for AI assessment
- Working framework to evaluate AI measurement systems
- References
Note: This section will eventually present ideas for evaluating AI measurement techniques and scoring systems. The ideas are still being refined based on expert feedback. The section does not yet integrate widely accepted measurement standards across psychology and machine learning or advance them. For now the section primarily lists them. I would value your perspectives on reliability, validity, faithfulness and plausibility in the context of AI measurement as I develop this section. Please get in touch if you have comments that would help advance these ideas!
Traditional considerations
Psychologists have conventionally considered the acceptability of psychometric assessment methods in terms of their reliability and validity. However, new concepts describing AI measurement systems, or at least extensions to these pre-existing concepts, may be required in the age of AI and psychological measurement. New requirements from regulatory bodies must also be considered.
The key documents describing reliability and validity in workplace assessment, the focus of this book, are the Standards for Educational and Psychological Testing (2014), the Uniform Guidelines on Employee Selection Procedures (EEOC, 1978), the Principles for the Validation and Use of Personnel Selection Procedures (2018), and the European Federation of Psychological Associations' Test Review Model (2025). More recently, legislation such as the EU AI Act (2024) has emerged.
Aims for this section
In this section, I review key aspects of each of these documents. The topic of bias is reviewed in a separate section. No brief review can summarize every aspect of these important documents, and I refer readers to the original hyperlinked documents below. This overview is intended as a preliminary reminder of these standards for creators of AI systems and a highlight of their core elements, not a replacement for the documents themselves.
The overview has two goals. The first is to remind ourselves of the relevant standards that AI measurement methodologies must meet; how these apply to AI methods is discussed in upcoming sections of this book. The second is to highlight the standards with apparent relevance to AI. However, because most of these standards pre-date recent AI breakthroughs, they will need updating before clarity on these matters can be assumed.
Notably, I emphasise standards that might be linked to the machine learning concepts of faithfulness/transparency (e.g., Jacovi & Goldberg, 2020; Samek et al., 2021) and plausibility (e.g., Doshi-Velez & Kim, 2017). Faithfulness describes how well we understand a model's actual mechanisms (what the system did and how a score was produced), whereas plausibility describes how justifiable and coherent an algorithm's reasoning appears (why a score was produced).
Standards for Educational and Psychological Testing (2014)
The Standards define validity as "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests". They contain specific standards for occupational settings in an employment testing section, paraphrased as follows.
- 11.5 Do local validation when research is limited, and ground it in prior research.
- 11.6 Use local validation only when jobs, samples, and measures are adequate.
- 11.7 Criteria must represent key job-related behaviors.
- 11.8 Note and report artifacts affecting results.
- 11.9 Generalize studies only when jobs and measures closely match.
- 11.10 Show test scores link to success across the job levels where the test will be used.
- 11.11 Use content validity only if the new job and context match.
- 11.12 Show the test measures the predictor construct and links to the criterion.

The Standards define reliability as "the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group". These standards are general rather than occupationally specific.

- 2.0 Provide evidence of reliability for each score use.
- 2.1 Define the conditions under which reliability is evaluated.
- 2.2 Match reliability evidence to test design and purpose.
- 2.3 Report reliability for all reported scores.
- 2.4 Include reliability for change or difference scores.
- 2.5 Use estimation methods suited to test structure.
- 2.6 Don't interchange different reliability coefficients.
- 2.7 Report interrater and internal consistency when scoring is subjective.
- 2.8 Provide local reliability data for locally scored tests.
- 2.9 Give reliability evidence for all test versions.
- 2.10 Report reliability separately for major test variations.
- 2.16 Estimate consistency of classifications across replications.
- 2.17 Report standard errors for group means.
- 2.18 Reflect sampling design in reliability analyses.
- 2.19 Document reliability methods, samples, and findings.
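To make the reliability requirements concrete, here is a minimal sketch of two common estimates these standards refer to: internal consistency (Cronbach's alpha) and interrater agreement when scoring is subjective (cf. 2.5 and 2.7). The data and variable names are purely illustrative; a real evaluation would use estimation methods matched to the test's structure and sampling design.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency for a respondents-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

def interrater_correlation(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """Pearson correlation between two raters' scores for the same candidates."""
    return float(np.corrcoef(rater_a, rater_b)[0, 1])

# Hypothetical data: six respondents answering four items,
# and two raters scoring the same five open-ended responses.
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
])
print("Cronbach's alpha:", round(cronbach_alpha(responses), 3))

rater_a = np.array([3.0, 4.5, 2.0, 5.0, 3.5])
rater_b = np.array([3.5, 4.0, 2.5, 4.5, 3.0])
print("Interrater r:", round(interrater_correlation(rater_a, rater_b), 3))
```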
Although the 2014 Standards do not mention "faithfulness" or "plausibility," their requirements for clear rationales, documentation, and the minimization of construct-irrelevant variance align conceptually with modern expectations for transparent and interpretable AI models.
Read the full Standards.
Uniform Guidelines on Employee Selection Procedures (1978)
According to the Uniform Guidelines, there are three acceptable types of validity studies: criterion-related validity studies, content validity studies, and construct validity studies. More specifically, the descriptions are as follows. "Evidence of the validity of a test or other selection procedure by a criterion-related validity study should consist of empirical data demonstrating that the selection procedure is predictive of or significantly correlated with important elements of job performance". "Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated". "Evidence of the validity of a test or other selection procedure through a construct validity study should consist of data showing that the procedure measures the degree to which candidates have identifiable characteristics which have been determined to be important in successful performance in the job for which the candidates are to be evaluated".
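To illustrate the empirical core of a criterion-related study, the sketch below correlates selection-procedure scores with later job-performance ratings for the same (hypothetical) hires. The data are invented; a real study would also attend to sample size, range restriction, and the quality of the criterion measure.

```python
import numpy as np
from scipy import stats

# Hypothetical data: selection-procedure scores and later job-performance
# ratings collected for the same group of hires.
selection_scores = np.array([62, 71, 55, 80, 68, 74, 59, 85, 66, 77])
performance_ratings = np.array([3.1, 3.8, 2.9, 4.4, 3.5, 3.9, 3.0, 4.6, 3.3, 4.1])

# The validity coefficient and its significance test: the kind of evidence
# behind "predictive of or significantly correlated with" job performance.
r, p_value = stats.pearsonr(selection_scores, performance_ratings)
print(f"validity coefficient r = {r:.2f}, p = {p_value:.4f}")
```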
The words faithfulness and plausibility do not appear in the Uniform Guidelines, but once again these criteria could be read as implied by the requirements for validity studies.
Read the full Uniform Guidelines.
Principles for the Validation and Use of Personnel Selection Procedures (2018)
The SIOP Principles state that the guide "embraces the Standards' definition of validity as 'the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests'" and that "validity is the most important consideration in developing and evaluating selection procedures." The guide provides more detailed practical design requirements suited to work-related settings. The SIOP Principles extend the Standards with work-specific technical advice, emphasizing thorough documentation and defensible validation, concepts that align in spirit with faithfulness and plausibility, though these terms are not used directly.
Read the original SIOP Principles.
EFPA Test Review Model (2025)
This is a framework for the thorough description and evaluation of psychometric tests from the European Federation of Psychological Associations (EFPA). It contains detailed guidance for producing materials that justify the reliability and validity of psychometric assessment tests.
Read the Test Review Model.
EU AI Act requirements (2024)
Closer to the requirements of faithfulness and plausibility are the requirements in AI specific legislation. Under the EU Artificial Intelligence Act, AI systems used in hiring processes are explicitly classified as high-risk, subjecting them to stringent regulatory obligations to safeguard fundamental rights, mitigate bias, and ensure transparency.
These considerations for high-risk systems should also be addressed during development and prior to deployment. For instance, one requirement is documentation of the relevant data-preparation processing operations (such as annotation, labelling, cleaning, updating, enrichment, and aggregation) and an assessment of the availability, quantity, and suitability of the data sets that are needed.
Read the requirements for high-risk systems under the EU AI Act.
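One way a development team might keep track of this information is a structured data-preparation record maintained alongside the training data. The sketch below is only an illustration of that idea; the field names are my own assumptions, not a format prescribed by the Act.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class DataPreparationRecord:
    """Illustrative record of data-preparation operations for an AI scoring system."""
    dataset_name: str
    operations: list = field(default_factory=list)  # e.g., annotation, labelling, cleaning
    availability: str = ""                          # where the data came from and access terms
    quantity: int = 0                               # number of examples after preparation
    suitability_notes: str = ""                     # why the data fit the intended scoring use
    last_updated: str = str(date.today())

record = DataPreparationRecord(
    dataset_name="interview_responses_v2",
    operations=["annotation by two trained raters",
                "cleaning of empty responses",
                "aggregation of rater scores"],
    availability="collected under informed consent; stored internally",
    quantity=12_500,
    suitability_notes="responses match the target job family and scoring rubric",
)
print(json.dumps(asdict(record), indent=2))
```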
Why Faithfulness and Plausibility matter for AI assessment
The foundational documents reviewed above do not explicitly address faithfulness and plausibility in the way the machine learning community is now grappling with these issues. This may be because, under conventional assessment methods, we write the scoring rules directly and can describe exactly how a score was obtained.
With AI systems that have billions of parameters, we simply cannot trace scoring in the same way. We need systems to be as explicit as possible for legal and ethical reasons (e.g., a system might give an appropriate score from a human perspective but for the wrong reasons). We also need high faithfulness and plausibility for scientific reasons (e.g., so we can be sure no construct-irrelevant variance is entering the assessment scoring system).
Working framework to evaluate AI measurement systems
As a working framework, we can consider an AI scoring system in terms of its mechanical transparency, i.e., traceability, or what happened (Doshi-Velez & Kim, 2017; Molnar, 2020), and its explanatory plausibility, i.e., coherence, or how and why the score was produced (e.g., Jacovi & Goldberg, 2020; Samek et al., 2021). We should make our understanding of these systems as faithful and plausible as possible. The higher the stakes, the higher the standard that must be met.
Complete transparency in AI systems is likely unattainable with transformers, but the traditional psychometric standards don't appear to require it. They instead require sufficient documentation and validation evidence to demonstrate job-relatedness and fairness, which may be achievable even with partially opaque models.
We should further consider that systems that do not directly affect people, such as those used to design assessments that will ultimately be validated empirically, may not need the same level of faithfulness and plausibility as those deployed for scoring. Scoring systems, by contrast, directly affect the outcomes experienced by the humans exposed to them.
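As a rough illustration of how the faithfulness side of this framework might be probed, the sketch below applies a simple perturbation test to a toy scoring function: an explanation that names a feature as important is treated as faithful only if removing that feature noticeably changes the score. The scoring function, feature names, and threshold are all invented for illustration; plausibility would instead require human judgments of whether the stated rationale is coherent and job-relevant, which is not automated here.

```python
def score(features: dict) -> float:
    """Stand-in scoring model: a weighted sum of hypothetical response features."""
    weights = {"relevant_keywords": 2.0, "response_length": 0.01, "sentiment": 0.5}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def faithfulness_check(features: dict, explained_feature: str, threshold: float = 0.05) -> bool:
    """Perturbation test: the explanation naming `explained_feature` as important is
    treated as faithful only if zeroing that feature changes the score noticeably."""
    baseline = score(features)
    ablated = dict(features, **{explained_feature: 0.0})
    relative_change = abs(baseline - score(ablated)) / max(abs(baseline), 1e-9)
    return relative_change >= threshold

candidate = {"relevant_keywords": 6, "response_length": 180, "sentiment": 0.7}

# The explanation claims keyword use drove the score; the check probes that claim.
print("faithful to 'relevant_keywords'?", faithfulness_check(candidate, "relevant_keywords"))
print("faithful to 'sentiment'?", faithfulness_check(candidate, "sentiment"))

# Plausibility would require human judgment, e.g., expert ratings of whether the
# stated rationale is coherent and job-related; no automated stand-in is offered here.
```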
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. 2014. Standards for educational and psychological testing. American Educational Research Association.
Doshi-Velez, F., & Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. 1978. Uniform guidelines on employee selection procedures 29 C.F.R. § 1607, as amended up to September 29, 2025. U.S. Government Publishing Office. https://www.ecfr.gov/current/title-29/part-1607
European Parliament & Council of the European Union. 2024, June 13. Regulation EU 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence and amending certain Union legislative acts. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/o
Jacovi, A., & Goldberg, Y. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.
Molnar, C. 2020. Interpretable machine learning. Lulu.com.
Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., & Müller, K. R. 2021. Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3), 247-278.
Society for Industrial and Organizational Psychology. 2018. Principles for the validation and use of personnel selection procedures 5th ed. https://www.siop.org