Transformers in Vision

What have Vision Transformers been up to?



Initially introduced with the now-famous Attention is all you need1, the Transformer has dominated the field of Natural Language Processing (NLP) for years. Particularly worth noting is the effort gone into scaling up Transformer-based models, such as BERT2, MegatronLM3, T54, and the various GPTs (GPT5, GPT-26 and GPT-37), due to their favourable scaling characteristics8,9.

The success of transformers in NLP has not gone unnoticed in other fields, where they have been responsible for significant breakthroughs such as AlphaFold 2 in the field of protein folding.

Important works adapting transformers (and self-attention) to vision include Attention Augmented Convolutional Networks10, Stand-Alone Self-Attention models11 (SASA models), DETR12, Visual Transformers13 and LambdaNetworks14; as well as Image Transformers15 and Axial Transformers16 in the generative domain.

For an in-depth introduction to these works, we recommend two recent reviews: A Survey in Visual Transformers17 and Transformers in Vision: A Survey18.


This blog post aims to summarize recent research in applying transformers and self-attention to vision, with a focus on (but scope not limited to) image classification. While by no means exhaustive, it can hopefully represent a starting point for a more in-depth dive into the literature.

We first introduce the Vision Transformer, a simple yet powerful architecture that has had a significant influence on recent research due to its performance in large data regimes. We then continue with the many works studying how to achieve similar high performance using transformers (and self-attention) when data is not as plentiful. Finally, we discuss papers studying the robustness of these models to perturbations as well as their performance in self-supervised, medical and video tasks.

Figures are taken from their respective papers unless the source is explicitly provided in the caption.

Before we begin, it is worth noting that familiarity with transformers and self-attention is recommended; great resources include Jay Alammar’s Illustrated Transformer and Peter Bloem’s Transformers from scratch. For high-quality implementations of many models discussed in this post, check out Ross Wightman’s PyTorch Image Models as well as Phil Wang’s work.

Vision Transformers for Image Recognition

An image is worth 16x16 words: the Vision Transformer

First introduced in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale19, Vision Transformers (ViTs) have taken computer vision by storm, leading to hundreds of citations in the span of a few months. The paper’s main goal was to show that a vanilla Transformer, once adapted to deal with data from the visual domain, could compete with some of the most performant convolutional neural networks (CNNs) developed up to that point.

The Vision Transformer architecture is conceptually simple: divide the image into patches, flatten and project them into a \(D\)-dimensional embedding space obtaining the so-called patch embeddings, add positional embeddings (a set of learnable vectors allowing the model to retain positional information) and concatenate a (learnable) class token, then let the Transformer encoder do its magic. Finally, a classification head is applied to the class token to obtain the model’s logits.

A Vision Transformer classifying an image. [Source.](

Figure 1: A Vision Transformer classifying an image. Source.

The model’s performance was acceptable when trained on ImageNet (\(1\)M images), great when pre-trained on ImageNet-21k (\(14\)M images), and state-of-the-art when pre-trained on Google’s internal JFT-300M dataset (\(300\)M images).

The striking performance improvement was due to the reduced inductive bias that characterizes Vision Transformers. By making fewer assumptions about the data, Vision Transformers could better adapt themselves to the given task. However, this ability came at a cost – when the sample size was too small (such as in the ImageNet case), the models overfit, resulting in degraded performance.

The goal for many follow-up papers would be that of matching (and surpassing) the performance of the best convolutional models in the “small” data regime – ImageNet (which is after all over a million images) and below.

Stronger data augmentation allows more efficient learning

Training data-efficient image transformers & distillation through attention20 was the first paper to show that models based on ViTs could be competitive on ImageNet without access to additional data.

Two main contributions characterize the paper:

The Data efficient image Transformer hard-label distillation procedure.

Figure 2: The Data efficient image Transformer hard-label distillation procedure.

The resulting models, called Data efficient image Transformers (DeiTs), were competitive with EfficientNet on the accuracy/step time trade-off, proving that ViT-based models could compete with highly performant CNNs even in the ImageNet data regime. It is however worth noting that DeiTs performed much worse than EfficientNets27 in terms of accuracy/parameters and accuracy/FLOPs.

Incorporating relative Self-Attention in ResNet design

Bottleneck Transformers for Visual Recognition28 investigated a family of hybrid convolution and attention models obtained by incorporating Multi-Head Self-Attention (MHSA) in ResNet designs. In particular, the authors showed that by simply replacing the \(3\times 3\) convolutions with relative position MHSA layers in the bottleneck blocks of the last stage of a ResNet, it was possible to obtain an improvement over several baselines.

Once additional bells-and-whistles were added, such as Squeeze-and-Excite layers29 and SiLU non-linearities30, the models (that the authors call Bottleneck Transformer Networks or BoTNets) demonstrated favorable scaling, outperforming EfficientNets in the accuracy/step-time trade-off beyond 83% top-1 accuracy. The worse performance of small BoTNet models could be due to the fact that the authors do not employ the DeiT training recipe to combat overfitting.

Further, the use of ResNet blocks in the initial stages to efficiently learn lower resolution feature maps allowed the models to perform well in both instance segmentation and object detection tasks, where high-resolution images are typical.

Conditional Positional Encodings

Conditional Positional Encodings for Vision Transformers31 studied alternatives to the positional embeddings and class token used in ViTs. In particular, the paper proposes the use of Positional Encodings Generators (PEGs), a module that produces positional encodings dynamically, and the use of global average pooling as an alternative to the (non-translation-invariant) class token.

PEGs reshape the flattened input sequence to 2D, apply a series of convolutional layers (the authors use depthwise separable convolutions) with zero-padding, and flatten the result. Positional information is thus introduced by the presence of zero-padding, unlike in ViTs, where 1D learnable positional embeddings are used. The method has two main advantages:

The Positional Encoding Generator (PEG) module architecture.

Figure 3: The Positional Encoding Generator (PEG) module architecture.

The resulting models, called Conditional Positional Encoding Vision Transformers (CPVTs), are trained with the DeiT recipe and obtain a small boost in performance (particularly when tested on higher resolution images without fine-tuning).

Modeling the local patch structure

Transformer in Transformer32 (TNT) studied the importance of intra-patch structure by introducing an additional transformer block, dedicated to pixel embeddings, inside the transformer blocks used in ViT.

The Transformer in Transformer (TNT) architecture.

Figure 4: The Transformer in Transformer (TNT) architecture.

The output of the inner transformer block is both sent as-is to the next layer and adapted for use in the outer transformer block which can now take into account both inter-patch relationships (like ViT does) and intra-patch structure. The authors also introduced a separate set of positional embeddings, added to the pixel embeddings before entering the encoder.

The models, trained with the DeiT recipe, outperformed both ViTs and DeiTs on ImageNet, achieving higher parameter and FLOP efficiency, although not quite as high as EfficientNet’s.

Deeper and deeper with Vision Transformers

DeepViT: Towards Deeper Vision Transformer33 investigated how the performance of ViTs changed with increasing depth. After 24 transformer blocks, the authors discovered that adding additional blocks did not lead to improved performance, as both the feature maps and the attention matrices end up being remarkably similar to each other, a problem they name Attention Collapse.

To solve this issue, the authors proposed a simple variant of self-attention that they call Re-Attention, characterized by a learnable transformation matrix \(\Theta \in R^{H \times H}\) (where \(H\) is the number of attention heads) applied directly after the softmax. This matrix (which is shared throughout the network but can also be layer-specific) allows the model to establish cross-head communication, diminishing inter-layer feature map similarity.

The authors also reported positive results following an approach similar to the one proposed in LazyFormer34: reusing the attention matrix for the last half of the network caused no degradation in performance.

Attention and feature map cross layer similarity (left); ViT and DeepViT comparison (right).

Figure 5: Attention and feature map cross layer similarity (left); ViT and DeepViT comparison (right).

The resulting models, called Deep Vision Transformers (DeepViTs), are characterized by better performance as depth increases.

Going deeper with Image Transformers35 identified two main issues in DeiT models: the lack of performance improvement (and even performance degradation) at increased network depth and the double objective that characterizes the transformer encoder, which has to model both inter-patch relationships as well as that between the class token and the patch embeddings. To demonstrate the latter, the authors show that a DeiT where the class token is inserted later on in the network outperforms a normal DeiT.

Two main contributions characterize the paper:

LayerScale is characterized by a layer-specific set of learnable diagonal matrices, whose diagonal values are initialized to a small \(\epsilon\), applied to the output of every residual branch. It is conceptually similar to methods like FixUp36 and SkipInit37 but provides the model with more freedom since it is per-channel (unlike other methods that only use a single scalar).

As mentioned before, Class-attention layers allow efficient extraction of information from processed patch embeddings. The authors find it a better alternative than the simple late-stage insertion of the class token and global average pooling, achieving the same accuracy with lower computational costs.

ViT (left), late class token ViT (center) and CaiT (right).

Figure 6: ViT (left), late class token ViT (center) and CaiT (right).

The authors also proposed the use of additional bells and whistles, such as Talking Heads Attention38.

The resulting models, called Class-attention image Transformers (CaiTs), achieved notable performance on the ImageNet benchmark. CaiTs even outperform the recent NFNets39 in both the accuracy/parameters trade-off and the accuracy/FLOPs trade-off when using both the DeiT training recipe as well as the DeiT distillation technique.

Convolutional insights

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases40 investigated the exciting possibility of initializing self-attention blocks with soft convolutional biases. Basing their work on studies regarding the theoretical relationship between self-attention and convolutional layers41, the authors introduced Gated Positional Self-Attention (GPSA), a variant of self-attention which is characterized by the possibility of being initialized with a locality bias.

More precisely, the initialization of a GPSA block is parameterized by a head-specific center of attention (the position to which the head pays most attention to, given the query patch) and a locality strength (which determines how focused every head is around its center of attention). GPSA blocks also employ a gating mechanism to better balance content and positional information. By appropriately setting the centers of attention and the locality strength of GPSA blocks, the model can compete with CNNs in the low data regime and at the same time enjoy ViT-like expressive power in large data regimes.

Input image (top left), attention maps of an untrained SA block (top right) and GPSA blocks (center and bottom).

Figure 7: Input image (top left), attention maps of an untrained SA block (top right) and GPSA blocks (center and bottom).

The authors demonstrated the effectiveness of this elegant approach through an empirical study comparing DeiTs to the resulting models, which they named Convolutional Vision Transformers (ConViTs): on ImageNet ConViTs enjoy progressively superior performance as the sample size is diminished while retaining DeiT-like performance at full sample size.

Two more papers exploring the application of convolutional insights to Vision Transformers are Incorporating Convolution Designs into Visual Transformers42 and LocalViT: Bringing Locality to Vision Transformers43.

The first paper has three main contributions:

The authors adopted the DeiT training recipe; their models, called Convolution-enhanced image Transformers (CeiTs), obtained superior results against both same-size DeiT models and same-size distilled DeiT models.

The second paper also studied the application of depth-wise convolutions in the feedforward networks while ablating the use of different activation functions and the application of Squeeze-and-Excite layers.

The authors applied this approach to several models, obtaining favorable results.

The importance of Hierarchy

Several papers have studied the application of a hierarchical structure to Vision Transformers.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet44 introduced the Token-to-Token module (T2T), a module that reshapes the input sequence to a 2D structure, applies a soft split (allowing overlapping patches), and flattens the resulting patches. By adjusting the patch size used in the module, the length of the token sequences diminishes progressively throughout the network.

The Token-to-Token module architecture.

Figure 8: The Token-to-Token module architecture.

The authors obtained favorable results, with performance comparable to that of MobileNets45.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions46 introduced a variant of Self-Attention called spatial-reduction attention (SRA), characterized by spatial reduction of both keys and values. By applying SRA at the end of several stages, the spatial dimensions of the feature map slowly decrease throughout the model. The resulting models, called Pyramid Vision Transformers (PVTs), can deal with a variety of tasks, including dense prediction, object detection, and semantic segmentation, where high-resolution images are typical.

The Pyramid Vision Transformer.

Figure 9: The Pyramid Vision Transformer.

Scalable Visual Transformers with Hierarchical Pooling47 explored the use of max-pooling to diminish the sequence length progressively. The authors also replaced the class token used in ViT with a final global average pooling layer. The resulting models, called Hierarchical Vision Transformers (HVTs), are trained with the DeiT recipe and scaled up (in the embedding dimension and number of heads) to have a comparable computational cost to DeiTs, achieving gains in the low (sub \(5\)) GFLOPs regime.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows48 suggested a different route: use local self-attention inside (non-overlapping, local) windows, allow cross-window communication through so-called shifted window partitioning, and produce a hierarchical representation by progressively merging the windows themselves.

The Swin Transformer shifted window partitioning procedure.

Figure 10: The Swin Transformer shifted window partitioning procedure.

In the last stage of the network, all local windows have been merged, resulting in blocks effectively using global self-attention on a feature map whose spatial dimensions have been significantly decreased. It’s worth noting that this approach scales linearly with image size.

Swin Transformer (left) and ViT (right). Attention is applied on patches (black) inside windows (red).

Figure 11: Swin Transformer (left) and ViT (right). Attention is applied on patches (black) inside windows (red).

The authors also reported positive results applying T5-style relative positional bias in attention blocks and obtained promising results on both ImageNet and ImageNet-21k, as well as in object detection and semantic segmentation tasks.

Rethinking Spatial Dimensions of Vision Transformers49 introduced a novel pooling layer, characterized by a depthwise convolution (for patch embeddings) and a fully connected layer (for the class token). This simple change allowed the models, named Pooling-based Vision Transformers (PiTs), to outperform vanilla Vision Transformers in the ImageNet data regime.

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference50 studied a hybrid architecture characterized by a longer convolutional stem and shrinking attention blocks that progressively diminish the spatial dimensions of the feature maps throughout the network. The authors also proposed the use of global average pooling instead of the class token, the injection of positional information through attention bias, and the addition of a GELU activation51 in the attention block.

The authors named the models LeViTs, trained them using both the DeiT training recipe and the DeiT distillation procedure, and showed that the networks are capable of high-speed inference, outperforming EfficientNets and DeiTs in the accuracy/step-time trade-off, on both GPU and CPU.

Multi-Scale features and Cross-Attention

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification52 proposed the use of multi-scale features by adapting the Vision Transformer to have two branches:

The branches employ two separate class tokens. Late in the network (after a separate set of positional embeddings is added), cross-branch communication is established through the use of Cross-Attention Fusion blocks.

In particular, inside cross-attention fusion blocks, class tokens are concatenated to the patch embeddings of the other branch and processed by attention before being returned to their respective branch.

The Cross-Attention fusion layer for the large branch.

Figure 12: The Cross-Attention fusion layer for the large branch.

The output of MLP heads based on the large and small branches tokens is then added together to generate the model’s logits.

The resulting models, which the authors named CrossViTs, are trained with the DeiT recipe and enjoy significant performance boosts, achieving better performance than DeiTs twice as large and twice as computationally expensive.

Integrating convolutions in attention

CvT: Introducing Convolutions to Vision Transformers53 can be seen as a complementary approach to Bottleneck Transformers, where instead of using multi-head self-attention inside of a CNN’s final blocks, convolutions (in this case, depthwise separable ones54) are used inside a Vision Transformer’s self-attention blocks. More precisely, Convolutional vision Transformers (CvTs, whose full-name conflicts with ConViTs) are characterized by two main features:

The (strided) Convolutional Projections.

Figure 13: The (strided) Convolutional Projections.

These two features allow the model to progressively reduce the tokens feature and spatial dimensions, allowing CvTs to adopt a hierarchical multi-stage architecture.

It’s worth noting that CvTs do not employ positional embeddings since positional information is retained through the use of convolutions both in the embedding layers as well as in the attention blocks.

The authors adopted the original ViT training recipe and achieved competitive performance when training CvTs on ImageNet. The largest model presented (CvT-W24), once pre-trained on ImageNet-21k, obtained a stunning \(87.7\%\) top-1 accuracy, outperforming BiT-L (an ensemble of CNNs pre-trained on \(20\) times more data) with a fraction of parameters and compute.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding55 introduced a 2D version of Longformer56, which the authors call Vision Longformer. Conceptually, it is characterized by two different sets of tokens: a set of global tokens, allowed to attend to all tokens, and local tokens, only allowed to attend to global tokens and tokens spatially close to them.

Global and local tokens in Multi-Scale Vision Longformers.

Figure 14: Global and local tokens in Multi-Scale Vision Longformers.

It’s worth noting that, at least in vision-only tasks, global tokens are discarded at the end of every attention block (while local tokens are reshaped and passed to the next one), since they have at that point already fulfilled their role of allowing efficient communication between spatially distant tokens.

The authors adopted DeiT-style training and achieved impressive parameter and FLOP efficiency.

Haloing and Strided local Self-Attention

Scaling Local Self-Attention for Parameter Efficient Visual Backbones57 builds on top of SASA models, by developing a local self-attention-only family of models made more efficient by the newly introduced haloing operation.

More precisely, the authors introduced a novel strategy called Blocked Self-Attention: input is first divided into blocks (which will be used as queries), and neighboring pixels are banded together and padded, in an operation denominated haloing, to generate keys and values. Finally, attention is applied.

Blocked local Self-Attention as used in HaloNets.

Figure 15: Blocked local Self-Attention as used in HaloNets.

It’s worth noting that this operation causes layers not to be translationally equivariant, but the authors took this route to obtain higher hardware utilization.

The network assumes a hierarchical structure thanks to a strided version of self-attention that is applied at the end of every stage (thus replacing SASA models post-attention average pooling).

The resulting models, called HaloNets, achieved extremely high parameter efficiency, slightly surpassing EfficientNets, a feat not obtained by any other model thus far. It’s worth, however, noting that HaloNets have a longer step-time, particularly when using larger configurations.

Second-order pooling

So-ViT: Mind Visual Tokens for Vision Transformer58 recently proposed the use of second-order pooling to extract high-level information from visual tokens (that is all tokens apart from the class one). The model’s logits are finally obtained by summing up the output of two separate linear heads, one applied to the class token, and one applied to the pooled features.

The So-ViT architecture.

Figure 16: The So-ViT architecture.

The resulting models, named Second-order ViTs (So-ViTs), are trained with an expanded DeiT recipe, and obtain competitive results.

A Comparison for Image Classification

Before continuing, we recap the reported performance of most models presented up to this point.

To provide a fair comparison, we only consider models trained at the same resolution (\(224 \times 224\)) apart from ViTs that were trained at \(384 \times 384\). We also denote models fine-tuned at higher (\(384\times384\)) resolution with an upward arrow, and exclude models trained for more than \(400\) epochs or using the DeiT hard-label distillation technique, because only the DeiT, T2T, and CaiT papers report results using it. Some convolutional models (EfficientNets, NFNets, ResNet-RS and EfficientNetV2s) are included for reference, even though they use both different recipes and training resolution.

We follow An Analysis of Deep Neural Network Models for Practical Applications59 in graphing ImageNet top-1 accuracy vs. FLOPs and parameter count. A lot of models are displayed, so make sure to enable and disable the display of the various models (by clicking them in the legend) as to better explore the visualization.

It’s worth keeping in mind that FLOP use and parameter count are not necessarily representative of latency or memory consumption.

As the recent ResNet-RS paper60 explains, “in custom hardware architectures (e.g. TPUs and GPUs), operations are often bounded by memory access costs and have different levels of optimization on modern matrix multiplication units.” For these reasons, FLOPs are a particularly poor proxy for latency time.

Similarly, the number of parameters is a poor indicator of memory consumption during training. Again from the ResNet-RS paper, “parameter count does not necessarily dictate memory consumption during training because memory is often dominated by the size of the activations” which have to be stored to execute the backpropagation algorithm. “At inference, activations can be discarded and parameter count is a better proxy for actual memory consumption.”

ResNet-RS models are a great example of this issue: an ResNet-RS has respectively \(3{-}4\times\) and \(2\times\) the amount of parameters and FLOPs of a similarly accurate EfficientNet, yet it is \(3\times\) as fast and consumes about \(2\times\) less memory. For these reasons, it would be preferable for authors of new studies to include, together with parameter count and FLOP use, also latency and memory usage measurements.

Robustness and Equivariance

Understanding Robustness of Transformers for Image Classification61 studied the robustness of Vision Transformers to input, model, and adversarial perturbations. The authors found Vision Transformers to have an advantage compared to CNNs of similar size for both input and model perturbations (particularly in large data regimes). However, for some kinds of adversarial attacks, Vision Transformers were found to be more susceptible, at least for small sample sizes.

The authors also found vanilla Vision Transformers highly redundant, a finding first reported in DeepViTs, suggesting the possible use of heavy pruning as a path to more efficient models.

In Group Equivariant Stand-Alone Self-Attention For Vision62, the authors developed a self-attention formulation equivariant to arbitrary symmetry groups. The resulting models, called Group Self-Attention Networks (GSA-Nets), enjoy superior parameter efficiency, although not quite reaching that of comparable-size equivariant CNNs.

Vision Transformers in Self-Supervised Learning

Building on top of previous studies (iGPT63 and masked patch prediction in the Vision Transformers paper19), two papers have investigated self-supervised approaches using Vision Transformers.

An Empirical Study of Training Self-Supervised Vision Transformers64 identified an instability encountered during self-supervised training, causing performance degradation, particularly at large batch sizes. The authors propose a curious trick: a fixed, random patch projection, unlike normal ViTs where patch embeddings are learned just like the rest of the network.

A comparison of training stability using learned and random patch embeddings.

Figure 17: A comparison of training stability using learned and random patch embeddings.

The authors showed that, by coupling this trick with the use of batch normalization in MLP blocks (instead of Layer Normalization), self-supervised training of Vision Transformers could be made more stable, allowing the models to achieve superior performance on several downstream tasks.

Unfortunately, the authors also noted that this approach was not sufficient: the instability is alleviated but not solved, particularly at large learning rates.

SiT: Self-supervised vIsion Transformer65, studied a mix of three semi-supervised tasks (rotation prediction, image reconstruction, and a contrastive one) that, when combined, allow Vision Transformers to perform well on downstream tasks.

The Self-supervised vIsion Transformer architecture.

Figure 18: The Self-supervised vIsion Transformer architecture.

Vision Transformers in Medical Research

Vision Transformers have recently been applied to medical research, mainly in segmentation tasks and diagnosis prediction ones.

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation66 explored using a hybrid CNN-ViT encoder in an UNet-style architecture they call TransUNet. The authors reported favorable results in several segmentation tasks, especially when using a cascaded upsampler to obtain the final segmentation mask.

In Medical Transformer: Gated Axial-Attention for Medical Image Segmentation67, the authors proposed the use of two branches, a shallower one, operating on the global context of the image, and a deeper one, operating locally on patches. The basic building block used for the model, which they named Medical Transformer (MedT), is (gated) axial self-attention16. The authors obtained improvements over baselines on several medical segmentation tasks.

The Medical Transformer architecture.

Figure 19: The Medical Transformer architecture.

UNETR: Transformers for 3D Medical Image Segmentation68 adapted ViTs for 3D medical segmentation tasks. The authors showed that a simple adaptation was sufficient to improve over baselines on several 3D segmentation tasks.

Vision Transformers for Video Recognition

Video Transformer Network69 (VTN) proposed the use of a pre-trained 2D spatial backbone (the authors experiment with both CNN-based and ViT-based feature extractors) combined with a Transformer (in this case, a Longformer) operating on the extracted feature maps. The authors showed that this simple approach was competitive with baselines such as SlowFast70.

Is Space-Time Attention All You Need for Video Understanding?71 introduced TimeSformer, an adaption of ViTs to video. After exploring several possible attention variants, the authors proposed one called Divided Space-Time attention. In this approach, frames are first divided into patches and linearly embedded, then fed to a single transformer encoder. Every encoder layer has two attention blocks applied consecutively: the first one on patch embeddings at the same location but in different frames (temporal attention), the second on patch embeddings in the same frame (spatial attention).

The authors adopted ImageNet pretraining for spatial attention blocks and showed that the resulting models achieved state of the art, outperforming previous baselines on standard datasets such as Kinetics-40072.

An Image is Worth 16x16 Words, What is a Video Worth?73 proposed using a Vision Transformer to model the relationship between intra-frame patches, followed by a (temporal) Transformer encoder, modeling inter-frame relationships. It is conceptually similar to a VTN with a ViT-based spatial backbone.

The resulting models (called Spatio and Temporal Transformers, or STAMs) outperformed strong baselines such as X3D74 in the accuracy/FLOPs trade-off.

ViViT: A Video Vision Transformer75 discusses several approaches to adapt ViTs to video, and found the use of tubelet embeddings, linear projections of spatio-temporal tubes, to be most effective.

Tubelet embeddings in ViViT.

Figure 20: Tubelet embeddings in ViViT.

The positional embeddings are initialized by temporally repeating positional embeddings obtained from pre-trained ViTs. Tubelet embeddings are instead initialized using “central frame initialization”, where the temporally central frame’s embedding is initialized using pre-trained ViT patch embeddings, while the rest are initialized with zeros.

The resulting models, called Video Vision Transformers (ViViTs), achieved state of the art, outperforming TimeSformers, on several standard benchmarks. The authors even reported experiments using an extremely large backbone called ViViT-H, initialized from the similarly large ViT-H pre-trained on JFT-300M. The model obtained a substantial performance improvement, albeit with significant computational cost.

Recently, Multiscale Vision Transformers76 proposed the adoption of a hierarchical structure by progressively increasing the feature dimension while decreasing the spatiotemporal one. This is achieved through the use of pooling, applied to keys and values in attention blocks.

The authors demonstrated the effectiveness of this approach by showing how the resulting models, called Multiscale Vision Transformers (MViTs), could outperform TimeSFormers and ViViTs on standard benchmarks while using a fraction of the compute. Importantly, this method does not require additional data (unlike TimeSFormers and ViViTs, that required ImageNet pretraining).


At the time of writing, the only convolutional architecture still unsurpassed by transformers in efficiency is the recent EfficientNetV2 family of models77, obtained through extensive neural architecture search. No similarly extensive search has been performed for ViT-based models, but the small one reported in CvT: Introducing Convolutions to Vision Transformers53 achieved promising results (CvT-13-NAS).

From recent research, we can draw several conclusions:

The success of transformers in vision has had far-reaching consequences. We have discussed their influence on medical and video tasks, but their impact goes as far as audio (with AST78, Keyword Transformer79) and multimodal architectures (such as ViLT80 and VATT81). Their success has also inspired architectures with even fewer inductive biases, such as the Perceiver82.

One cannot help but feel excited as more and more researchers from all over the artificial intelligence field converge toward common, more general architectures.


Thanks to Simone Scardapane, Andrea Panizza, Amedeo Buonanno and Iacopo Poli for help in proofreading and improving this work.

1. Attention is all you need.
Vaswani, A., Shazeer, N., Parmar, N., et al., 2017.
2. BERT: Pre-training of deep bidirectional transformers for language understanding.
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K., 2019.
3. Megatron-LM: Training multi-billion parameter language models using model parallelism.
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J. & Catanzaro, B., 2019.
4. Exploring the limits of transfer learning with a unified text-to-text transformer.
Raffel, C., Shazeer, N., Roberts, A., et al., 2020.
5. Improving language understanding with unsupervised learning.
Radford, A., 2019.
6. Better language models and their implications.
Radford, A., 2018.
7. Language models are few-shot learners.
Brown, T. B., Mann, B., Ryder, N., et al., 2020.
8. Scaling laws for neural language models.
Kaplan, J., McCandlish, S., Henighan, T., et al., 2020.
9. Scaling laws for transfer.
Hernandez, D., Kaplan, J., Henighan, T. & McCandlish, S., 2021.
10. Attention augmented convolutional networks.
Bello, I., Zoph, B., Le, Q., Vaswani, A. & Shlens, J., 2019.
11. Stand-alone self-attention in vision models.
Parmar, N., Ramachandran, P., Vaswani, A., Bello, I., Levskaya, A. & Shlens, J., 2019.
12. End-to-end object detection with transformers.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S., 2020.
13. Visual transformers: Token-based image representation and processing for computer vision.
Wu, B., Xu, C., Dai, X., et al., 2020.
14. LambdaNetworks: Modeling long-range interactions without attention.
Bello, I., 2021.
15. Image transformer.
Parmar, N., Vaswani, A., Uszkoreit, J., et al., 2018.
16. Axial attention in multidimensional transformers.
Ho, J., Kalchbrenner, N., Weissenborn, D. & Salimans, T., 2019.
17. A survey on visual transformer.
Han, K., Wang, Y., Chen, H., et al., 2020.
18. Transformers in vision: A survey.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S. & Shah, M., 2021.
19. An image is worth 16x16 words: Transformers for image recognition at scale.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al., 2021.
20. Training data-efficient image transformers & distillation through attention.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H., 2020.
21. Deep networks with stochastic depth.
Huang, G., Sun, Y., Liu, Z., Sedra, D. & Weinberger, K. Q., 2016.
22. RandAugment: Practical automated data augmentation with a reduced search space.
Cubuk, E. D., Zoph, B., Shlens, J. & Le, Q., 2020.
23. Mixup: Beyond empirical risk minimization.
Zhang, H., Cissé, M., Dauphin, Y. N. & Lopez-Paz, D., 2018.
24. CutMix: Regularization strategy to train strong classifiers with localizable features.
Yun, S., Han, D., Chun, S., Oh, S. J., Yoo, Y. & Choe, J., 2019.
25. Random erasing data augmentation.
Zhong, Z., Zheng, L., Kang, G., Li, S. & Yang, Y., 2020.
26. Dropout: A simple way to prevent neural networks from overfitting.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R., 2014.
27. EfficientNet: Rethinking model scaling for convolutional neural networks.
Tan, M. & Le, Q. V., 2019.
28. Bottleneck transformers for visual recognition.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P. & Vaswani, A., 2021.
29. Squeeze-and-excitation networks.
Hu, J., Shen, L., Albanie, S., Sun, G. & Wu, E., 2020.
30. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.
Elfwing, S., Uchibe, E. & Doya, K., 2018.
31. Conditional positional encodings for vision transformers.
Chu, X., Tian, Z., Zhang, B., et al., 2021.
32. Transformer in transformer.
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C. & Wang, Y., 2021.
33. DeepViT: Towards deeper vision transformer.
Zhou, D., Kang, B., Jin, X., et al., 2021.
34. LazyFormer: Self attention with lazy update.
Ying, C., Ke, G., He, D. & Liu, T.-Y., 2021.
35. Going deeper with image transformers.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G. & Jégou, H., 2021.
36. Fixup initialization: Residual learning without normalization.
Zhang, H., Dauphin, Y. N. & Ma, T., 2019.
37. Batch normalization biases residual blocks towards the identity function in deep networks.
De, S. & Smith, S. L., 2020.
38. Talking-heads attention.
Shazeer, N., Lan, Z., Cheng, Y., Ding, N. & Hou, L., 2020.
39. High-performance large-scale image recognition without normalization.
Brock, A., De, S., Smith, S. L. & Simonyan, K., 2021.
40. ConViT: Improving vision transformers with soft convolutional inductive biases.
d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G. & Sagun, L., 2021.
41. On the relationship between self-attention and convolutional layers.
Cordonnier, J.-B., Loukas, A. & Jaggi, M., 2020.
42. Incorporating convolution designs into visual transformers.
Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F. & Wu, W., 2021.
43. LocalViT: Bringing locality to vision transformers.
Li, Y., Zhang, K., Cao, J., Timofte, R. & Gool, L. V., 2021.
44. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet.
Yuan, L., Chen, Y., Wang, T., et al., 2021.
45. MobileNetV2: Inverted residuals and linear bottlenecks.
Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A. & Chen, L.-C., 2018.
46. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.
Wang, W., Xie, E., Li, X., et al., 2021.
47. Scalable visual transformers with hierarchical pooling.
Pan, Z., Zhuang, B., Liu, J., He, H. & Cai, J., 2021.
48. Swin transformer: Hierarchical vision transformer using shifted windows.
Liu, Z., Lin, Y., Cao, Y., et al., 2021.
49. Rethinking spatial dimensions of vision transformers.
Heo, B., Yun, S., Han, D., Chun, S., Choe, J. & Oh, S. J., 2021.
50. LeViT: A vision transformer in ConvNet’s clothing for faster inference.
Graham, B., El-Nouby, A., Touvron, H., et al., 2021.
51. Bridging nonlinearities and stochastic regularizers with gaussian error linear units.
Hendrycks, D. & Gimpel, K., 2016.
52. CrossViT: Cross-attention multi-scale vision transformer for image classification.
Chen, C.-F., Fan, Q. & Panda, R., 2021.
53. CvT: Introducing convolutions to vision transformers.
Wu, H., Xiao, B., Codella, N., et al., 2021.
54. Xception: Deep learning with depthwise separable convolutions.
Chollet, F., 2017.
55. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding.
Zhang, P., Dai, X., Yang, J., et al., 2021.
56. Longformer: The long-document transformer.
Beltagy, I., Peters, M. E. & Cohan, A., 2020.
57. Scaling local self-attention for parameter efficient visual backbones.
Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B. A. & Shlens, J., 2021.
58. So-ViT: Mind visual tokens for vision transformer.
Xie, J., Zeng, R., Wang, Q., Zhou, Z. & Li, P., 2021.
59. An analysis of deep neural network models for practical applications.
Canziani, A., Paszke, A. & Culurciello, E., 2016.
60. Revisiting ResNets: Improved training and scaling strategies.
Bello, I., Fedus, W., Du, X., et al., 2021.
61. Understanding robustness of transformers for image classification.
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T. & Veit, A., 2021.
62. Group equivariant stand-alone self-attention for vision.
Romero, D. W. & Cordonnier, J.-B., 2021.
63. Generative pretraining from pixels.
Chen, M., Radford, A., Child, R., et al., 2020.
64. An empirical study of training self-supervised vision transformers.
Chen, X., Xie, S. & He, K., 2021.
65. SiT: Self-supervised vIsion transformer.
Ahmed, S. A. A., Awais, M. & Kittler, J., 2021.
66. TransUNet: Transformers make strong encoders for medical image segmentation.
Chen, J., Lu, Y., Yu, Q., et al., 2021.
67. Medical transformer: Gated axial-attention for medical image segmentation.
Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I. & Patel, V. M., 2021.
68. UNETR: Transformers for 3D medical image segmentation.
Hatamizadeh, A., Yang, D., Roth, H. & Xu, D., 2021.
69. Video transformer network.
Neimark, D., Bar, O., Zohar, M. & Asselmann, D., 2021.
70. SlowFast networks for video recognition.
Feichtenhofer, C., Fan, H., Malik, J. & He, K., 2019.
71. Is space-time attention all you need for video understanding?
Bertasius, G., Wang, H. & Torresani, L., 2021.
72. The kinetics human action video dataset.
Kay, W., Carreira, J., Simonyan, K., et al., 2017.
73. An image is worth 16x16 words, what is a video worth?
Sharir, G., Noy, A. & Zelnik-Manor, L., 2021.
74. X3D: Expanding architectures for efficient video recognition.
Feichtenhofer, C., 2020.
75. ViViT: A video vision transformer.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M. & Schmid, C., 2021.
76. Multiscale vision transformers.
Fan, H., Xiong, B., Mangalam, K., et al., 2021.
77. EfficientNetV2: Smaller models and faster training.
Tan, M. & Le, Q. V., 2021.
78. AST: Audio spectrogram transformer.
Gong, Y., Chung, Y.-A. & Glass, J. R., 2021.
79. Keyword transformer: A self-attention model for keyword spotting.
Berg, A., O’Connor, M. & Cruz, M. T., 2021.
80. ViLT: Vision-and-language transformer without convolution or region supervision.
Kim, W., Son, B. & Kim, I., 2021.
81. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text.
Akbari, H., Yuan, L., Qian, R., et al., 2021.
82. Perceiver: General perception with iterative attention.
Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O. & Carreira, J., 2021.



Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Zanichelli (2021, April 28). IAML Distill Blog: Transformers in Vision. Retrieved from

BibTeX citation

  author = {Zanichelli, Niccolò},
  title = {IAML Distill Blog: Transformers in Vision},
  url = {},
  year = {2021}