- Method 1. CLIP, embedding predictions of parameters
- Method 2. CLIP, factor correlations with parameters
- Method 3. DINO-v2, factor correlations with parameters
- Method 4. Multimodal, Vision + LLMs, predictions of parameters
In this section I share an early investigation of vision transformer encodings of matrix reasoning items and their relation to difficulty and discrimination. If I had to give a rationale for why I checked this, it’s that item parameters might depend on visual complexity and vision transformers might encode that.
However, I’m very empirical about these things and find it practical to check whether there is an effect before trying to explain it. In this case, across the approaches described below, the vision transformers don’t seem to effectively encode features that drive difficulty and discrimination for matrix items.
Whether the predictors were factors or components of the raw VLM embeddings (or of their similarity matrices), or the embedding elements themselves fed into machine learning models, effect sizes were small (.30 and under). I used one of the MITRE assessments, which has only 30 items, so this check was somewhat underpowered; I will repeat it with more items soon.
You can see the code for this analysis, showing all of the null results, at this GitHub link, and the methods that were tried are described below.
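To make the "underpowered" point concrete, a quick power calculation for detecting a correlation of .30 with n = 30 items can be done with the Fisher z approximation. This is a generic sketch, not part of the original analysis code:

```python
from math import atanh, sqrt

from scipy.stats import norm


def corr_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test for a Pearson correlation r
    at sample size n, using the Fisher z transform."""
    z = atanh(r)                    # Fisher z of the true correlation
    se = 1 / sqrt(n - 3)            # standard error of z
    crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    # Probability the observed z falls beyond either critical bound
    return norm.sf(crit - z / se) + norm.cdf(-crit - z / se)


# With 30 items, power to detect r = .30 is only about a third,
# so null results here are weak evidence of no effect.
print(round(corr_power(0.30, 30), 2))
```

At n = 30 the power to detect r = .30 is roughly .36, which is why the null results should be treated as provisional until the check is repeated with more items.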
Method 1. CLIP, embedding predictions of parameters
Embed with OpenAI CLIP and use predictor models including ridge, lasso, elastic net, random forest, gradient boosting, and XGBoost.
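A minimal sketch of this prediction setup, using random arrays as stand-ins for the CLIP embeddings and item parameters (the real pipeline embeds the 30 matrix items with OpenAI CLIP and fits the full set of models listed above; here only ridge is shown):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 512))  # stand-in for 30 CLIP image embeddings
y = rng.normal(size=30)         # stand-in for IRT difficulty estimates

# Leave-one-out CV is a reasonable choice at n = 30; swap Ridge for
# Lasso, ElasticNet, RandomForestRegressor, etc. to cover the other models.
scores = cross_val_score(
    Ridge(alpha=10.0), X, y,
    cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
)
print(scores.mean())
```

The same loop is run with discrimination as the target; with 512-dimensional embeddings and 30 items, heavy regularization and leave-one-out evaluation are essentially mandatory.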
Method 2. CLIP, factor correlations with parameters
Factor analyze raw CLIP embeddings and similarity matrix of embeddings, check correlations with difficulty and discrimination.
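A sketch of the factor-analysis route, again with synthetic stand-ins for the embeddings and item parameters. Both variants from the description are shown: factoring the raw embeddings and factoring their cosine-similarity matrix (the number of factors here is an illustrative choice, not the value used in the original analysis):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 512))  # stand-in for 30 CLIP item embeddings
difficulty = rng.normal(size=30)  # stand-in for IRT difficulty estimates

# Variant A: factor-analyse the raw embeddings, then correlate each
# factor score with the item parameter.
fa = FactorAnalysis(n_components=5, random_state=0)
factor_scores = fa.fit_transform(emb)
corrs = [pearsonr(factor_scores[:, k], difficulty)[0]
         for k in range(factor_scores.shape[1])]

# Variant B: factor the cosine-similarity matrix of the embeddings instead.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T  # 30 x 30 item-similarity matrix
sim_scores = FactorAnalysis(n_components=5, random_state=0).fit_transform(sim)
```

The same correlations are then computed against discrimination, and against the similarity-matrix factor scores.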
Method 3. DINO-v2, factor correlations with parameters
Repeat Method 2 with self-supervised DINOv2 embeddings (DINOv2 is not trained on image-text matching like CLIP, so it might perform better).
Method 4. Multimodal, Vision + LLMs, predictions of parameters
Multimodal: the vision transformer encodes the image, an LLM describes it, and the description embeddings are used in factor analysis and as predictors.
Method 4 was not truly multimodal: I embedded the image, described it with the LLM, then discarded the vision embedding and analysed only the LLM embedding.
True multimodal follows the same route but either fuses the vision and language embeddings or generates vision and language embeddings from a common vocabulary. The fused embeddings can then be used in factor analysis or prediction.
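One simple form of the fusion step is late fusion by concatenation, sketched here with synthetic stand-ins for the two embedding matrices (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vision = rng.normal(size=(30, 512))  # stand-in vision-transformer embeddings
text = rng.normal(size=(30, 768))    # stand-in LLM embeddings of descriptions


def l2norm(a):
    """Scale each row to unit length."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)


# L2-normalise each modality before concatenating so that neither
# modality dominates the fused representation on raw scale.
fused = np.hstack([l2norm(vision), l2norm(text)])  # shape (30, 1280)
```

The fused matrix then goes through the same factor-analysis and prediction pipelines as the single-modality embeddings.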