From vision to Large Multi-Modal Models (LMMs) for visual reasoning parameter pre-knowledge

In this section I offer a conceptual overview of points to consider when evaluating whether vision and multimodal AI models can help predict visual reasoning item parameters. To date, there is little empirical work on this topic. Recent work has applied vision-language models to visual reasoning benchmarks, including ARC-AGI, but solving visual reasoning items is only indirectly relevant to the focus of this chapter: predicting item parameters from visual reasoning stimuli.

The use of LLMs for parameter prediction in non-cognitive language contexts is well established, and text-based cognitive ability parameter modelling is gaining traction. By contrast, applying Large Multi-Modal Models (LMMs) to obtain visual-reasoning item parameters is so far unexplored. I first discuss image-only transformers before shifting to multimodal architectures, and I raise a series of considerations for undertaking this research. Some of these considerations highlight an overlap between AI psychometric response simulation and work on artificial general intelligence (AGI).

Figure: Pathways for predicting psychometric difficulty and discrimination parameters from visual reasoning items. Three model families are shown: (1) a pure vision encoder with a prediction head for direct parameter estimation, (2) an autoregressive multimodal decoder, and (3) a cross-attention multimodal encoder-decoder. Multimodal models can either predict parameters directly or simulate responses to which an IRT model is fitted to estimate parameters. A baseline using image-encoder embedding similarity and correlation with known parameters is shown for comparison.

Consideration 1: Defining image-based questions

The first consideration relates to what we mean by an image-based question. To begin, we’ll restrict the conversation to images without another information modality, meaning the stimulus is an image and candidates respond by selecting from a set of response options that are also images. Later we will discuss Large Multi-Modal Model (LMM) generalizations.

This framing admits typical Raven-style matrix items, which are image-prompt to image-response selection tasks, but it omits items in the ARC challenge, which require producing rather than choosing a correct response (although the multimodal models discussed later in this chapter can in principle be used when the answer requires generating the correct image). Our initial framing also omits candidate instruction text for now, but we will revisit this shortly.

Consideration 2: Image-only transformer possibilities

The next consideration is deciding between architectures. Vision encoders produce numeric representations, vision decoders produce images from representations, and vision encoder-decoders transform images into new images. We begin with image-only vision encoders to establish whether they are sufficient for the task; if they are not, this motivates the move to multimodal models in later sections.

If difficulty and discrimination emerge from underlying logical rules rather than visual features, image-only models may not yield accurate parameter estimates for the reference population, whether via direct prediction, zero- or few-shot parameter guesses, or by producing answers to the reasoning items for analysis. Even under such circumstances, however, a vision encoder could feed a specialized head serving as the predictive model. This is option one, the first possibility that needs exploring.
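
A minimal sketch of option one, with a fixed random projection standing in for a frozen vision encoder and a ridge-regression head standing in for the specialized prediction head (all names, sizes, and data here are illustrative assumptions, not a proposed implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(images):
    """Stand-in for a frozen vision encoder: flattens each image and
    applies a fixed random projection to a 64-d embedding."""
    flat = images.reshape(len(images), -1)
    W = np.random.default_rng(42).normal(size=(flat.shape[1], 64))
    return flat @ W / np.sqrt(flat.shape[1])

# Toy data: 200 items, each a 16x16 grayscale "stimulus",
# with known (difficulty, discrimination) targets.
images = rng.normal(size=(200, 16, 16))
params = rng.normal(size=(200, 2))

X = encode(images)
X = np.hstack([X, np.ones((len(X), 1))])  # add an intercept column

# Ridge-regression "prediction head" mapping embeddings to both parameters.
lam = 1.0
head = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ params)

preds = X @ head
print(preds.shape)  # (200, 2): predicted difficulty and discrimination
```

In practice the random projection would be replaced by a pretrained vision encoder, and the head could be a small neural network fine-tuned jointly with the encoder rather than a closed-form regression.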

An image-only vision decoder produces images from numeric representations rather than performing discrete choice selection or reasoning. This makes it at best an indirect route to predicting item parameters or selecting among response options (e.g., the generated image responses could be scored right or wrong). A decoder could in principle be adapted with a task-specific head, but since the encoder alone provides this capacity without the decoder, the encoder is the more parsimonious choice.

A vision encoder-decoder transforms images into new images, facing the same limitation as the decoder alone. Its default output modality is image generation rather than discrete prediction or choice selection. A task-specific head could again be added.

A final image-only option is to extract decoder embedding representations as predictors rather than generating outputs directly. However, this is still a variation of the encoder-with-head approach and offers no clear advantage over using encoder embeddings alone. Overall, an encoder with a specialised head appears to be the most straightforward image-only architecture for mapping latent visual features to item parameters.

Consideration 3: Multimodal transformer variations

Frontier AI models today integrate multiple modalities. Current Large Multi-Modal Models (LMMs) typically handle cross-modal integration in two ways: through a shared vocabulary (here, image patches and text tokens) or, more commonly, through separate specialized encoders for each modality whose representations are then fused into a shared latent space.

Multimodal vision-and-text decoders that decode auto-regressively as text may also be effective here and are therefore a second plausible option. This approach could produce simulated responses for analysis, or the model could guess the parameters directly. Alternatively, the model could be prompted for its inferred reasoning rules as an intermediate step, with those rules then embedded to predict item parameters.

A multimodal vision plus text decoder that decodes auto-regressively with cross attention is also a possibility. This third option works like option two, except that an encoder–decoder architecture processes the image with a bi‑directional encoder, and the decoder generates text conditioned on that representation via cross‑attention.

It is noteworthy that the set-up for options 2 and 3 would see psychometricians indirectly entering the conversation about artificial general intelligence. The key difference is that AGI challenges are maximum-capability challenges, whereas parameter estimation requires simulating responses that reflect a distribution of ability, not just maximum ability.

The question of whether we wish to predict parameters or to generate data from which parameters can be estimated is an important one. Option 1 is a prediction approach. Options 2 and 3 can be used either way depending on how the models are prompted, but they lend themselves naturally to generating response data for parameter estimation.
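
To make the simulation route concrete, here is a minimal sketch in which a hypothetical response-simulating model is replaced by a Rasch-style stub, and item difficulty is then recovered from the simulated response matrix via the logit of each item's proportion correct (a rough Rasch approximation for illustration, not a full IRT fit):

```python
import numpy as np

rng = np.random.default_rng(1)

n_items, n_sim = 20, 500
true_b = rng.normal(size=n_items)  # true item difficulties
theta = rng.normal(size=n_sim)     # abilities of simulated "candidates"

def simulate_response(ability, difficulty):
    """Stub standing in for an LMM prompted to answer as a candidate of a
    given ability; responds correctly with Rasch model probability."""
    p = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return rng.random() < p

responses = np.array([[simulate_response(t, b) for b in true_b]
                      for t in theta])

# Crude difficulty estimate: negative logit of each item's proportion
# correct (monotone in the true difficulty when theta ~ N(0, 1)).
p_correct = responses.mean(axis=0)
b_hat = -np.log(p_correct / (1 - p_correct))

r = np.corrcoef(true_b, b_hat)[0, 1]
print(f"recovery correlation: {r:.2f}")
```

With a real LMM, the stub would be replaced by prompting the model to answer as candidates of varying ability, and the response matrix would be passed to a proper 2PL estimation routine rather than the proportion-correct shortcut used here.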

Consideration 4: Stimulus and response option processing

This design choice applies across the architectures discussed above but becomes especially consequential for the multimodal options. A priori, we might expect whole-image processing to be preferable when relational information resides in the global layout or in relations across cells that decomposition would destroy, while sub-image processing may suffice when each cell or option can be evaluated independently.
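
As a minimal illustration of the decomposition route, the sketch below splits a matrix stimulus into cells with plain array slicing; it assumes a clean, equally spaced grid, which real scanned items may not satisfy:

```python
import numpy as np

def split_matrix_item(image, rows=3, cols=3):
    """Decompose a Raven-style matrix stimulus into its cells.
    Assumes equal-size cells on a clean grid (an assumption; real
    items may need registration or cropping first)."""
    h, w = image.shape
    ch, cw = h // rows, w // cols
    return [image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            for r in range(rows) for c in range(cols)]

item = np.arange(36).reshape(6, 6)  # toy 6x6 "image"
cells = split_matrix_item(item)

print(len(cells), cells[0].shape)   # 9 cells of shape (2, 2)
```

Each cell (or response option) could then be encoded independently, whereas the whole-image route would pass `item` to the encoder unchanged and rely on the model to recover the grid structure itself.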

Where multimodal models are applied, instruction text can potentially be encoded alongside the visual stimulus as a text token sequence. This allows output representations (embeddings or parameters) to be conditioned on task instructions, similar to the way candidates use instructions to guide their reasoning.

Consideration 5: Whether to encode the answer

The next consideration is whether to provide the answer key to the model. The appropriateness of this varies according to the proposed method. For option one, the encoder approach, we may wish to encode the answer with the transformer model and use that encoding as the criterion in a machine learning model, predicting it from the stimulus and response-option embeddings. This would show whether the stimulus and responses encode performance-relevant content.

Other versions of option 1 may benefit from encoding the correct answer. The encoding method could involve encoding the full visual image with the correct answer marked, encoding the stimulus and options separately and fusing the embeddings with the correct answer always at the same position, or in the case of fine-tuning, reserving a learnable vector for the correct answer.
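
The second encoding method, fusing separately encoded stimulus and option embeddings with the correct answer always at the same position, might be sketched as follows (the embedding sizes and the key-first position convention are illustrative assumptions):

```python
import numpy as np

def fuse_with_key(stimulus_emb, option_embs, key_index):
    """Concatenate stimulus and option embeddings, always placing the
    correct option first so the key occupies a fixed position."""
    order = [key_index] + [i for i in range(len(option_embs))
                           if i != key_index]
    return np.concatenate([stimulus_emb] + [option_embs[i] for i in order])

# Toy 4-d embeddings: a stimulus and five response options.
stim = np.zeros(4)
options = [np.full(4, i, dtype=float) for i in range(5)]
fused = fuse_with_key(stim, options, key_index=2)

print(fused.shape)   # (24,): stimulus followed by 5 options, key first
print(fused[4:8])    # slot after the stimulus holds the correct option
```

The fused vector would then serve as input to the prediction head; the fixed key position is what lets the model learn where the correct answer sits without a separate marker.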

In options 2 and 3, where we use the multimodal model to predict the actual item parameters directly, it makes sense to inform the model of the correct answer. This is no different from the testing situation, where we know the correct answer prior to administering the items and estimating parameters.

In options 2 and 3, where we instead simulate responses to the items with a multimodal model for later parameter estimation, it does not make sense to encode the answer. Here we need the model to reason through each problem in a way that reflects people of different ability levels.

Consideration 6: Embedding-similarity as a baseline

Simple baselines can be defined by taking fixed image-encoder representations of each item and testing whether the embeddings correlate with the known target IRT parameters, i.e., item difficulty and item discrimination.

Alternatively, the similarity between the response options and the stimulus could be examined to see whether the correct option is closest. These similarities, or the embeddings themselves, could then be used to predict the target parameters. This baseline allows checking how much value multimodal reasoning adds over raw vision.
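
A minimal sketch of this similarity baseline, with random vectors standing in for frozen image-encoder embeddings (the dimensionality and the planted near-duplicate option are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings: one stimulus and five response options; option 3 is
# constructed to resemble the stimulus, mimicking a correct answer.
stimulus = rng.normal(size=16)
options = rng.normal(size=(5, 16))
options[3] = stimulus + 0.1 * rng.normal(size=16)

sims = np.array([cosine(stimulus, o) for o in options])
print(sims.argmax())  # index of the option closest to the stimulus
```

The per-item similarity vector `sims` (or the raw embeddings) would then be correlated with known difficulty and discrimination values; whatever variance this baseline fails to explain is the headroom left for the multimodal options.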

Summary

In summary, the most promising avenues for item parameter prediction for vision reasoning items are likely to be a vision encoder with a specialized prediction head (option 1), a multimodal vision-text decoder (option 2), and a multimodal vision encoder with a text decoder using cross attention (option 3).
