Vision-encoder prediction of matrix reasoning parameters

In this section I share an early investigation of vision transformer encodings of matrix reasoning items and their relation to item difficulty and discrimination. The rationale for checking this is that item parameters might depend on visual complexity, and vision transformers might encode that complexity.

That said, I find it practical to check empirically whether an effect exists before trying to explain it. In this case, across the approaches described below, the vision transformers do not appear to encode features that drive difficulty and discrimination for matrix items.

Whether factors or components of the raw VLM embeddings or their similarity matrices were used, or embedding elements were used as predictors in machine learning models, effect sizes were small (.30 and under). I used one of the MITRE assessments, which has only 30 items, so this check was somewhat underpowered; I will repeat it with more items soon.

You can see the code for this analysis, showing all of the null results, at this GitHub link; the methods that were tried are described below.

Method 1. CLIP, embedding predictions of parameters

Embed each item image with OpenAI's CLIP and predict item parameters with models including ridge, lasso, elastic net, random forest, gradient boosting, and XGBoost.
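A minimal sketch of the prediction setup, using one of the regularised models named above. The arrays here are synthetic stand-ins for the actual CLIP embeddings and IRT estimates, and the dimensions (30 items, 512-dim embeddings) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data: 30 items, 512-dim image
# embeddings, plus an IRT difficulty estimate for each item.
n_items, embed_dim = 30, 512
X = rng.normal(size=(n_items, embed_dim))  # would come from CLIP's image encoder
difficulty = rng.normal(size=n_items)      # would come from an IRT calibration

# With far more features than items, heavy regularisation is essential,
# and cross-validated predictions guard against in-sample overfitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(Ridge(alpha=100.0), X, difficulty, cv=cv)

# Effect size: correlation between cross-validated predictions and difficulty.
r = np.corrcoef(pred, difficulty)[0, 1]
print(f"cross-validated r = {r:.2f}")
```

With random embeddings the cross-validated correlation hovers near zero, which is exactly the null pattern described above; swapping in real CLIP features and repeating over the other model families is the actual check.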

Method 2. CLIP, factor correlations with parameters

Factor analyze the raw CLIP embeddings and the similarity matrix of the embeddings, then check correlations of the factor scores with difficulty and discrimination.
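A sketch of this route, again with synthetic arrays standing in for the real CLIP embeddings and IRT parameters (the three-factor solution and cosine similarity are illustrative choices, not necessarily those used in the analysis):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-ins: 30 items, 512-dim embeddings, and IRT parameters.
n_items, embed_dim = 30, 512
X = rng.normal(size=(n_items, embed_dim))
difficulty = rng.normal(size=n_items)
discrimination = rng.normal(loc=1.0, scale=0.3, size=n_items)

# Factor analyze the raw embeddings...
scores_raw = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)

# ...and the item-by-item cosine similarity matrix of the embeddings.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T
scores_sim = FactorAnalysis(n_components=3, random_state=0).fit_transform(sim)

# Correlate each factor's scores with the item parameters.
for name, scores in [("raw", scores_raw), ("similarity", scores_sim)]:
    for k in range(scores.shape[1]):
        r_diff = np.corrcoef(scores[:, k], difficulty)[0, 1]
        r_disc = np.corrcoef(scores[:, k], discrimination)[0, 1]
        print(f"{name} factor {k}: r(diff)={r_diff:.2f}, r(disc)={r_disc:.2f}")
```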

Method 3. DINO-v2, factor correlations with parameters

Repeat Method 2 with self-supervised DINOv2 embeddings (DINOv2 is not trained on image-text matching like CLIP, so it might capture different visual features and perform better).

Method 4. Multimodal, Vision + LLMs, predictions of parameters

Multimodal: the vision transformer encodes the image, an LLM describes it, and the description's embedding elements are used in factor analysis and as predictors.

Method 4 was not truly multimodal: I embedded the image, described it with the LLM, discarded the vision embedding, and analysed only the LLM embedding.

True multimodal follows the same route but either fuses the vision and language embeddings or generates vision and language embeddings from a common vocabulary. The fused embeddings can then be used in factor analysis or prediction.
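One simple fusion route is late fusion: standardise each modality's embeddings and concatenate them before prediction. The sketch below uses synthetic arrays in place of the real vision and language embeddings, and the dimensions (512 and 768) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins: 30 items with a vision embedding of the item image
# and a language embedding of the LLM's description of that image.
n_items = 30
vision = rng.normal(size=(n_items, 512))
language = rng.normal(size=(n_items, 768))
difficulty = rng.normal(size=n_items)

# Late fusion: z-score each modality so neither dominates by scale,
# then concatenate so the model sees both views of each item.
def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0)

fused = np.concatenate([zscore(vision), zscore(language)], axis=1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(Ridge(alpha=100.0), fused, difficulty, cv=cv)
print("fused embedding shape:", fused.shape)
```

An alternative to concatenation is using a model that already maps both modalities into one space, such as CLIP's paired image and text towers.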
