- Vision transformers
- Vision tokenization
- Architectural categories
- Vision encoders
- Autoregressive decoders and generative models
- Vision encoder-decoders
- From vision to video
- Training vision and video transformers
- References
Vision transformers
Vision transformers enable automated analysis of visual stimuli at scale. Vision transformer models work in essentially the same way as language transformers; the key difference is that language models process token embeddings created from word fragments, while vision transformers create tokens by splitting images into fixed-size patches that are projected into embeddings (Dosovitskiy et al., 2020; Vaswani et al., 2017).
Vision tokenization
The left panel shows a figure from the Abstraction and Reasoning Corpus (ARC) by François Chollet (2019). It is used here only as a motivating example to illustrate the patch tokenization process for images when using vision transformers (Dosovitskiy et al., 2020). Note that vision transformers alone are not well-suited to solving these ARC puzzles. The image below was generated with Python.
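The patch tokenization step can be sketched in a few lines of numpy. This is a toy illustration with assumed sizes (32x32 image, 8x8 patches, 64-dimensional embeddings), not the configuration of any specific model; the random matrix stands in for the learned patch-embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 32          # image height/width (toy values)
C = 3               # colour channels
P = 8               # patch size, giving (32/8)**2 = 16 patches
D = 64              # embedding dimension

image = rng.random((H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)        # (16, 192)

# A linear projection stands in for the learned patch-embedding layer.
W_embed = rng.standard_normal((P * P * C, D))
tokens = patches @ W_embed                      # (16, 64): one token per patch

print(patches.shape, tokens.shape)
```

Each of the 16 patches becomes one token, so the transformer sees the image as a short sequence, exactly as a language model sees a sentence.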
Architectural categories
While the broad architectural categories are the same for language models and vision models, there are often different emphases, which we now discuss. The figure below describes the data flows across the core encoder, decoder, and encoder-decoder vision transformer variants. We also discuss the extension from vision transformers, which use spatial tokens, to video transformers, which use spatiotemporal tokens. First, we discuss the forms of vision transformer model.
Vision encoders
Vision encoders are transformer architectures applied to images and are widely used for representation learning and recognition tasks. This contrasts with modern generative language models, which are typically decoder-only transformers. In training and at inference, vision encoders use full self-attention, so each token attends to all other tokens. They are commonly used in conjunction with a specialised head for downstream tasks like label prediction, such as vision encoder prediction of visual reasoning item parameters.
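The full self-attention described above can be sketched as a single attention head over patch tokens. This is a minimal sketch, not a complete ViT layer (it omits multiple heads, layer norm, and the MLP block); sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

N, D = 16, 64                         # 16 patch tokens, embedding dim 64
X = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(D)         # (N, N): each token scores all others
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                     # (N, D) contextualised tokens

# No causal mask: the attention weights form a full N x N matrix,
# which is what distinguishes an encoder from a decoder.
print(weights.shape, out.shape)
```

The absence of a mask is the key point: every patch can borrow context from every other patch, in both directions.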
Prototypical modern vision encoders include ViT (Vision Transformer) by Dosovitskiy et al. (2020) and the vision encoder of CLIP (Radford et al., 2021). In psychology, this enables automated coding of facial expressions and quantifying visual features in stimulus sets. Given that CLIP is in fact a dual encoder with vision and language models, another psychology application is measuring how closely text embeddings of human descriptions of images align with vision embeddings of the same images.
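The alignment check described above reduces to cosine similarity between embeddings. The sketch below uses random vectors as stand-ins for real CLIP outputs (the 512-dimensional size matches common CLIP variants but is assumed here), just to show the comparison step.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalise(v):
    # Unit-normalise so the dot product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Random stand-ins for one image embedding and three text embeddings
# of candidate human descriptions.
image_emb = normalise(rng.standard_normal(512))
text_embs = normalise(rng.standard_normal((3, 512)))

similarities = text_embs @ image_emb   # cosine similarity per description
best = int(np.argmax(similarities))
print(similarities.round(3), "best description index:", best)
```

With real CLIP embeddings, higher cosine similarity indicates a description that the model places closer to the image in its shared embedding space.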
Autoregressive decoders and generative models
Vision decoders, in the transformers context, take encoded representations and generate output image representations, which may be discrete tokens or continuous latent representations depending on the architecture. In multimodal architectures, the input representation can be based on multiple modalities such as text, image and audio depending on the architecture.
The decoding process can be causal, i.e., next tokens are conditioned on previous tokens only. Alternatively, image generation can use a non-transformer diffusion-based strategy, in which a noisy latent representation is iteratively refined. A possible application in psychology is generating variations of facial expressions. An example of the prototypical modern diffusion-based image generation model is Stable Diffusion by Rombach et al. (2022).
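The iterative-refinement idea can be shown schematically. This is not a real diffusion model: the "denoiser" below is a hand-written stub that nudges a noisy latent toward a known clean target, purely to illustrate the loop structure; in an actual model the denoiser is a learned network and the target is unknown.

```python
import numpy as np

rng = np.random.default_rng(3)

target = np.zeros((8, 8))                      # pretend "clean" latent
latent = target + rng.standard_normal((8, 8))  # start from pure noise

def denoise_step(x, target, strength=0.5):
    # Stub denoiser: move a fraction of the way toward the clean latent.
    # A real diffusion model predicts the noise to remove at each step.
    return x + strength * (target - x)

for step in range(10):
    latent = denoise_step(latent, target)

residual = float(np.abs(latent - target).mean())
print(residual)  # shrinks toward zero as refinement proceeds
```

The essential point is that generation is a loop of small corrections rather than a single left-to-right pass over tokens.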
Vision encoder-decoders
Vision encoder-decoders are used in sequence-to-sequence (seq2seq) tasks and may be unimodal (e.g., image reconstruction) or multimodal (e.g., image captioning). They encode visual inputs into latent representations and decode those representations into outputs, which may be visual or linguistic depending on the decoder. Vision encoder-decoders combine bidirectional attention in the encoder, causal self-attention in the decoder, and cross-attention so the decoder conditions on the input and previously generated tokens.
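The cross-attention step can be sketched as follows: queries come from the decoder's tokens (e.g., the caption generated so far), while keys and values come from the encoder's visual tokens, so each output token conditions on the image. Sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

D = 64
enc_tokens = rng.standard_normal((16, D))   # encoder output (image patches)
dec_tokens = rng.standard_normal((5, D))    # decoder state (caption so far)

Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
Q = dec_tokens @ Wq                         # queries from the decoder
K, V = enc_tokens @ Wk, enc_tokens @ Wv     # keys/values from the encoder

scores = Q @ K.T / np.sqrt(D)               # (5, 16): text attends to image
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                           # (5, 64) image-conditioned tokens

print(weights.shape, out.shape)
```

Contrast this with self-attention: here the attention matrix is rectangular (5 x 16) because the two token sequences come from different modalities.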
An example of the prototypical modern vision encoder-decoder is BLIP by Li et al. (2022). Common applications are image captioning and visual question answering, e.g., "Is it raining in this picture?". In psychology, a potential application is naming emotions. Extreme care is required with this application, as AI emotion recognition is a prohibited application in certain contexts under the EU AI Act.
From vision to video
Video transformers extend vision transformers by adding a time dimension to the same three architectural options described for vision transformers. Video is turned into spatio-temporal tokens: patch embedding sequences sometimes called patch tubes or tubelets. Video transformers apply spatial attention within each frame and temporal attention across frames.
This spatial and temporal attention can happen either sequentially within frames and then across frames, which is called factorisation, or jointly in one step. Examples include TimeSformer by Bertasius et al. (2021) and ViViT by Arnab et al. (2021). In psychology, video transformers are used for automated analysis of dynamic behaviour such as coding nonverbal communication.
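Tubelet tokenisation can be sketched by reshaping a clip into spatio-temporal blocks. Sizes are illustrative (8 frames, 32x32 resolution, tubelets spanning 2 frames and 8x8 pixels); each block becomes one token, so the token count grows with clip length.

```python
import numpy as np

rng = np.random.default_rng(5)

T, H, W, C = 8, 32, 32, 3     # frames, height, width, channels
t, p = 2, 8                   # tubelet depth (frames) and spatial patch size

video = rng.random((T, H, W, C))

# Cut the clip into (t, p, p) blocks and flatten each block into one token.
tubelets = video.reshape(T // t, t, H // p, p, W // p, p, C)
tubelets = tubelets.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * p * p * C)

# (8/2) * (32/8) * (32/8) = 64 tubelet tokens, each flattened to 384 values.
print(tubelets.shape)
```

Note how quickly the sequence grows: a single 32x32 image yields 16 patch tokens, while this short 8-frame clip already yields 64, which is why video transformers are so much more compute-hungry.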
Training vision and video transformers
Training for vision and video transformers involves pre-training on large unlabelled datasets to learn general representations and then fine-tuning on task-specific labelled data. For vision, masked image modeling (masking random patches and reconstructing them) and contrastive learning (aligning images with text descriptions) are the core approaches. Video encoders extend these methods to the temporal dimension.
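The masked-image-modelling objective can be sketched as follows: hide a random subset of patch tokens and score reconstruction only on the hidden ones. The "predictor" here is a stub (the mean token) to keep the sketch self-contained; a real model learns to predict the masked patches from the visible ones. The 75% masking ratio follows common MAE-style setups but is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(6)

tokens = rng.standard_normal((16, 64))   # 16 patch tokens from one image
mask = rng.random(16) < 0.75             # hide ~75% of patches

# Stub predictor: use the mean token as the "reconstruction" of every patch.
reconstruction = np.tile(tokens.mean(axis=0), (16, 1))

# Loss is computed only on the masked positions.
loss = ((reconstruction[mask] - tokens[mask]) ** 2).mean()
print(int(mask.sum()), float(loss))
```

Restricting the loss to masked positions is what forces the model to infer hidden content from visible context rather than copy its input.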
Video transformers are more resource-intensive than vision transformers. Processing even short clips usually requires significant GPU memory and compute time, which may make large-scale projects difficult unless they are well resourced. Frame sampling strategies can reduce this cost in some situations, but sampling choices affect what the model captures. The best sampling rate depends on the focus of interest, e.g., denser sampling for micro-expressions vs. sparser sampling for conversational exchanges.
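The sampling trade-off can be made concrete with a small sketch. The numbers are illustrative: the same 10-second clip at 30 fps yields very different token budgets depending on the sampling stride.

```python
import numpy as np

fps, seconds = 30, 10
frame_indices = np.arange(fps * seconds)   # 300 frames in the raw clip

dense = frame_indices[::2]     # every 2nd frame: fine temporal detail
sparse = frame_indices[::30]   # one frame per second: coarse dynamics only

print(len(frame_indices), len(dense), len(sparse))  # 300, 150, 10
```

Dense sampling preserves brief events like micro-expressions at 15x the cost of the sparse schedule, which may still suffice for slower phenomena such as turn-taking in conversation.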
References
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6836-6846).
Bertasius, G., Wang, H., & Torresani, L. (2021, July). Is space-time attention all you need for video understanding? In International conference on machine learning (Vol. 2, No. 3, p. 4).
Chollet, F. (2019). On the Measure of Intelligence. arXiv preprint arXiv:1911.01547.
Chollet, F. (2019). The Abstraction and Reasoning Corpus (ARC). GitHub repository. https://github.com/fchollet/ARC
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning (pp. 12888-12900). PMLR.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).