Vision and video transformers for psychologists who know LLMs


Vision transformers

Vision transformers enable automated analysis of visual stimuli at scale. They work in essentially the same way as language transformers; the key difference is that language models process token embeddings created from word fragments, while vision transformers create tokens by splitting images into fixed-size patches that are projected into embeddings (Dosovitskiy et al., 2020; Vaswani et al., 2017).

Vision tokenization

The left panel of the figure below shows an item from the Abstraction and Reasoning Corpus (ARC; Chollet, 2019). It is used here only as a motivating example to illustrate the patch tokenization process for images in vision transformers (Dosovitskiy et al., 2020). Note that vision transformers alone are not well suited to solving ARC puzzles. The image below was generated with Python.

Diagram showing Vision Transformer patch tokenization process in six steps: (1) Original 224×224 pixel colorful grid image, (2) Image divided into 14×14 grid creating 196 non-overlapping 16×16 pixel patches, (3) One extracted 16×16 patch shown with RGB channels, (4) Patch flattened into vector of 768 numbers, (5) Linear projection by 768×768 learned weight matrix to create patch embedding, (6) Prepending of learnable CLS token and addition of positional embeddings. Bottom panel shows final token sequence of 197 tokens (1 CLS plus 196 patches), each 768 dimensions, ready to be fed to transformer encoder.
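The six steps in the figure can be sketched in a few lines of numpy. This is a minimal illustration of the shapes involved, not a real model: the projection matrix, CLS token, and positional embeddings below are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a 224x224 RGB image (random pixels as a stand-in)
image = rng.random((224, 224, 3))
patch = 16
n_side = 224 // patch                 # 14 patches per side
n_patches = n_side * n_side           # 196 patches in total

# Step 2-4: split into non-overlapping 16x16 patches, flatten each to 768 numbers
patches = image.reshape(n_side, patch, n_side, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, patch * patch * 3)

# Step 5: linear projection by a 768x768 matrix (random stand-in for learned weights)
W = rng.standard_normal((768, 768)) * 0.02
embeddings = patches @ W              # (196, 768)

# Step 6: prepend a CLS token and add positional embeddings (also random stand-ins)
cls_token = rng.standard_normal((1, 768)) * 0.02
tokens = np.vstack([cls_token, embeddings]) + rng.standard_normal((197, 768)) * 0.02

print(tokens.shape)                   # (197, 768): 1 CLS + 196 patch tokens
```

The resulting sequence of 197 tokens is what the transformer encoder actually processes; from this point on, the architecture is the same as for text.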

Architectural categories

While the broad architectural categories are the same for language models and vision models, the emphases often differ, as we now discuss. The figure below shows the data flows across the core encoder, decoder, and encoder-decoder vision transformer variants. We then discuss the extension from vision transformers, which use spatial tokens, to video transformers, which use spatiotemporal tokens. First, we consider the forms of vision transformer.

Flowchart comparing three vision transformer architectures at inference: Encoders (ViT, CLIP) convert images to patches to embeddings through self-attention, outputting representations for task-specific heads. Decoders/generative models (Stable Diffusion) process image/text tokens through self-attention with diffusion or auto-regressive methods to generate images. Encoder-decoders (BLIP) combine both, processing images via encoder and text via decoder with cross-attention to generate text or images.

Vision encoders

Vision encoders are transformer architectures applied to images and are widely used for representation learning and recognition tasks. This contrasts with modern generative language models, which are typically decoder-only transformers. In both training and inference, vision encoders use full self-attention, so each token attends to all other tokens. They are commonly paired with a specialised head for downstream tasks like label prediction, such as using a vision encoder to predict visual reasoning item parameters.
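Full self-attention is the defining property here, and it can be shown in a few lines. The sketch below is a single attention head with random stand-in weights: note that the attention score matrix covers every pair of tokens, with no causal mask of the kind a generative language model would apply.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 197, 768                     # 1 CLS + 196 patch tokens

x = rng.standard_normal((n_tokens, d))     # token embeddings entering the block
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)              # (197, 197): every token vs every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                          # (197, 768) updated representations
```

In a decoder, a lower-triangular mask would zero out attention to future tokens before the softmax; the absence of that mask is what makes this an encoder.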

Prototypical modern vision encoders include ViT (Vision Transformer; Dosovitskiy et al., 2020) and CLIP (Radford et al., 2021). In psychology, vision encoders enable automated coding of facial expressions and quantification of visual features in stimulus sets. Because CLIP is in fact a dual encoder with vision and language models, another psychology application is checking how closely embeddings of human descriptions of images align with vision embeddings of the same images.
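The alignment check described above reduces to cosine similarity between an image embedding and the embeddings of candidate descriptions. The sketch below uses random vectors as stand-ins for real CLIP outputs; with a real model, the image vector would come from the vision encoder and the description vectors from the text encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
image_emb = rng.standard_normal(512)       # stand-in for a CLIP image embedding
descriptions = {                           # stand-ins for CLIP text embeddings
    f"participant_description_{i}": rng.standard_normal(512) for i in range(3)
}

scores = {name: cosine_similarity(image_emb, emb)
          for name, emb in descriptions.items()}
best = max(scores, key=scores.get)         # description most aligned with the image
```

In practice the embeddings are usually L2-normalised first, after which cosine similarity is just a dot product.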

Autoregressive decoders and generative models

Vision decoders, in the transformers context, take encoded representations and generate output image representations, which may be discrete tokens or continuous latent representations depending on the architecture. In multimodal architectures, the input representation can combine multiple modalities, such as text, images, and audio.

The decoding process can be causal, meaning that each token is conditioned only on previous tokens. Alternatively, image generation can use a non-transformer diffusion-based strategy in which a noisy latent representation is iteratively refined. A possible application in psychology is generating variations of facial expressions. The prototypical modern diffusion-based image generation model is Stable Diffusion (Rombach et al., 2022).
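The iterative-refinement idea behind diffusion can be illustrated with a deliberately simplified toy loop. A real model such as Stable Diffusion uses a learned network to predict the noise at each step and a carefully scheduled sequence of step sizes; here a simple pull toward a fixed "clean" target stands in for that network, purely to show the shape of the process.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.zeros(64)                  # stand-in for a clean image latent
latent = rng.standard_normal(64)       # start from pure noise

for step in range(50):
    predicted_noise = latent - target      # stand-in for the learned denoiser
    latent = latent - 0.1 * predicted_noise  # one small refinement step

# After many small steps the latent has converged close to the clean target
error = float(np.linalg.norm(latent - target))
```

The key point for readers coming from language models: instead of emitting one token at a time left to right, the whole latent is revised a little on every pass.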

Vision encoder-decoders

Vision encoder-decoders are used for sequence-to-sequence (seq2seq) tasks and may be unimodal (e.g., image reconstruction) or multimodal (e.g., image captioning). They encode visual inputs into latent representations and decode those representations into outputs, which may be visual or linguistic depending on the decoder. Vision encoder-decoders combine bidirectional attention in the encoder, causal self-attention in the decoder, and cross-attention so that the decoder conditions on both the input and previously generated tokens.
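Cross-attention is the piece that distinguishes this architecture, and it differs from self-attention only in where the queries, keys, and values come from: queries come from the decoder's text tokens, while keys and values come from the encoder's image tokens. A minimal single-head sketch with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
text_tokens = rng.standard_normal((5, d))      # decoder side: queries
image_tokens = rng.standard_normal((196, d))   # encoder side: keys and values

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q = text_tokens @ Wq
K, V = image_tokens @ Wk, image_tokens @ Wv

scores = Q @ K.T / np.sqrt(d)          # (5, 196): each text token attends to every patch
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                      # (5, 64): visually conditioned text states
```

Each of the five text tokens ends up as a weighted mixture of the 196 image patch representations, which is how a caption word can "look at" the relevant region of the image.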

A prototypical modern vision encoder-decoder is BLIP (Li et al., 2022). Common applications are image captioning and visual question answering, e.g., “Is it raining in this picture?”. In psychology, a potential application is naming emotions. Extreme care is required with this application, as AI emotion recognition is prohibited in certain contexts under the EU AI Act.

From vision to video

Video transformers extend vision transformers by adding a time dimension to the same three architectural options described above. Video is turned into spatio-temporal tokens: patch embedding sequences that span several frames, sometimes called patch tubes or tubelets. Video transformers apply spatial attention within each frame and temporal attention across frames.

This spatial and temporal attention can happen either sequentially, first within frames and then across frames (an approach called factorisation), or jointly in a single step. Examples include TimeSformer (Bertasius et al., 2021) and ViViT (Arnab et al., 2021). In psychology, video transformers are used for automated analysis of dynamic behaviour, such as coding nonverbal communication.
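It helps to see how quickly the token count grows once time is added. The tubelet sizes below are illustrative assumptions (2 frames × 16 × 16 pixels, a common choice in ViViT-style models), not a fixed standard:

```python
def tubelet_token_count(frames, height, width, t=2, p=16):
    """Number of non-overlapping tubelets of t frames x p x p pixels."""
    return (frames // t) * (height // p) * (width // p)

# A single 224x224 frame yields 196 spatial tokens; a 32-frame clip of the
# same resolution with 2x16x16 tubelets yields 16 x 14 x 14 = 3136 tokens.
per_frame = tubelet_token_count(2, 224, 224)     # 196
per_clip = tubelet_token_count(32, 224, 224)     # 3136
```

Because self-attention cost grows with the square of the sequence length, this sixteen-fold increase in tokens is the main reason factorised attention, which keeps the spatial and temporal attention matrices separate, is attractive.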

Training vision and video transformers

Training for vision and video transformers involves pre-training on large unlabelled datasets to learn general representations and then fine-tuning on task-specific labelled data. For vision, masked image modeling (masking random patches and reconstructing them) and contrastive learning (aligning images with text descriptions) are the core approaches. Video encoders extend these methods to the temporal dimension.
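The masked image modeling objective can be sketched concretely: hide a large fraction of patches, reconstruct them, and score the reconstruction only on the hidden patches. Below, a random linear map stands in for the learned model, and the 75% mask ratio is an illustrative choice rather than a fixed rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d = 196, 768
patches = rng.standard_normal((n_patches, d))   # flattened patch "pixels"

# Mask roughly 75% of patches; the model only sees the rest
mask = rng.random(n_patches) < 0.75
visible = patches.copy()
visible[mask] = 0.0

# Stand-in "model": a random linear map in place of a learned encoder-decoder
W = rng.standard_normal((d, d)) * 0.02
reconstruction = visible @ W

# The loss is computed only where patches were masked
loss = float(np.mean((reconstruction[mask] - patches[mask]) ** 2))
```

During real pre-training, gradients on this masked-patch loss are what teach the encoder general-purpose visual representations before any labelled data is used.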

Video transformers are more resource-intensive than vision transformers. Processing even short clips usually requires significant GPU memory and compute time, which may make large-scale projects difficult unless they are well resourced. Frame sampling strategies can reduce this cost in some situations, but sampling choices affect what the model captures. The best sampling rate depends on the focus of interest, e.g., denser sampling for micro-expressions versus sparser sampling for conversational exchanges.
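A simple uniform sampling scheme makes the trade-off concrete. The clip length and sample counts below are illustrative:

```python
def sample_frame_indices(total_frames, n):
    """Pick n evenly spaced frame indices from a clip of total_frames frames."""
    step = total_frames / n
    return [int(i * step) for i in range(n)]

# A 300-frame clip (10 seconds at 30 fps) sampled sparsely vs densely:
sparse = sample_frame_indices(300, 8)    # ~1 frame every 1.25 s
dense = sample_frame_indices(300, 32)    # ~1 frame every 0.3 s
```

A micro-expression lasting a fraction of a second can fall entirely between two frames of the sparse schedule, while the dense schedule quadruples the token count and hence the compute, which is exactly the trade-off described above.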

References

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6836-6846).

Bertasius, G., Wang, H., & Torresani, L. (2021, July). Is space-time attention all you need for video understanding? In ICML (Vol. 2, No. 3, p. 4).

Chollet, F. (2019). On the Measure of Intelligence. arXiv preprint arXiv:1911.01547.

Chollet, F. (2019). The Abstraction and Reasoning Corpus (ARC). GitHub repository. https://github.com/fchollet/ARC

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning (pp. 12888-12900). PMLR.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
