| ViT | |
|---|---|
| Name | ViT |
| Introduced | 2020 |
| Authors | Alexey Dosovitskiy; Lucas Beyer; Alexander Kolesnikov; Dirk Weissenborn; Xiaohua Zhai; Thomas Unterthiner; Mostafa Dehghani; Matthias Minderer; Georg Heigold; Sylvain Gelly; Jakob Uszkoreit; Neil Houlsby |
| Institutions | Google Research (Brain Team) |
| Original publication | "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" |
| Type | Vision transformer |
ViT (Vision Transformer) is a neural network architecture that adapts the Transformer, originally developed at Google for machine translation, to image recognition, using patch-based tokenization and self-attention in place of convolution. Developed by researchers at Google Research (Brain Team), the model matched or exceeded state-of-the-art convolutional networks on benchmarks such as ImageNet when pretrained on large datasets, influencing follow-up work across industry labs such as Facebook AI Research, Microsoft Research, DeepMind, OpenAI, and Apple, and academic groups at MIT, Stanford University, UC Berkeley, the University of Oxford, and ETH Zurich.
ViT reframes image understanding by splitting an input image into fixed-size patches, linearly projecting each patch into a token embedding, and feeding the resulting sequence into a standard Transformer encoder of the kind used in language models such as BERT and the GPT series (see the sketch below). The key transfer from NLP was methodological as much as architectural: pretrain at scale, then fine-tune on the target task, a recipe popularized by BERT, GPT-2, and GPT-3. The architecture was quickly evaluated and extended by academic groups at Stanford University, Carnegie Mellon University, the University of Toronto, and the University of Cambridge, and by industrial labs such as NVIDIA and Intel.
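The patch tokenization step is simple enough to show directly. Below is a minimal sketch in PyTorch, assuming a 224×224 RGB input and the ViT-Base configuration (16×16 patches, 768-dimensional embeddings); the module name is illustrative, not the authors' reference code. Extracting non-overlapping patches and projecting each one is commonly implemented as a single strided convolution.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    cutting the image into non-overlapping patches and applying a shared
    linear projection to each flattened patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```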
ViT's core is the Transformer encoder introduced by Vaswani et al. at Google; it replaces the convolutional layers of networks such as Kaiming He's ResNets (developed at Microsoft Research and extended at Facebook AI Research) with stacks of multi-head self-attention modules. Input images are divided into patches (e.g., 16×16) that are linearly embedded into tokens, and learned positional embeddings are added to preserve spatial order, echoing the positional encodings of Vaswani et al. A learnable classification token, akin to BERT's [CLS] token, is prepended to the sequence, and each encoder block combines LayerNorm, self-attention, and a feed-forward network with residual connections. The original paper benchmarked ViT against ResNet-based BiT models and EfficientNet-based Noisy Student, with training performed on Google TPUs; most reimplementations target NVIDIA GPUs.
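A compact sketch of the encoder itself, assuming the pre-norm block ordering used in ViT and PyTorch's built-in nn.MultiheadAttention; the class names and ViT-Base dimensions (12 blocks, 12 heads, width 768) are illustrative, not a definitive implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer block: LN -> MHSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class MiniViT(nn.Module):
    """Prepend a [CLS] token (as in BERT), add positional embeddings,
    run the encoder stack, and classify from the [CLS] output."""
    def __init__(self, num_patches=196, dim=768, depth=12, num_classes=1000):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.Sequential(*[EncoderBlock(dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, tokens):                       # tokens: (B, 196, 768)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos
        x = self.blocks(x)
        return self.head(x[:, 0])                    # logits from [CLS]

logits = MiniViT()(torch.randn(2, 196, 768))  # patch tokens from the sketch above
print(logits.shape)                           # torch.Size([2, 1000])
```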
ViT training leverages large-scale pretraining on datasets such as Google's internal JFT-300M corpus and employs widely used optimization techniques: the Adam optimizer introduced by Diederik Kingma and Jimmy Ba, decoupled weight decay as studied by Ilya Loshchilov and Frank Hutter, and data augmentation strategies exemplified by AutoAugment and RandAugment from Google Brain. Fine-tuning follows transfer-learning protocols long established in the deep learning literature, including work by Yoshua Bengio's group at the Université de Montréal, and incorporates regularization techniques such as stochastic depth from Huang et al. and label smoothing from Szegedy et al. Empirical hyperparameter choices echo best practices from ImageNet benchmarking and compute-scaling insights from OpenAI; a sketch of a typical setup follows.
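As a concrete illustration, here is a hedged sketch of such a training setup in PyTorch, using torchvision's ViT-B/16 implementation; the specific learning rate, weight decay, and schedule lengths are illustrative defaults, not the paper's exact values.

```python
import math
import torch
from torchvision.models import vit_b_16
from torch.optim.lr_scheduler import LambdaLR

model = vit_b_16(weights=None)  # ViT-Base/16, randomly initialized
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # Szegedy et al.

# AdamW: Adam (Kingma & Ba) with decoupled weight decay (Loshchilov & Hutter).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

def warmup_cosine(step, warmup=500, total=10_000):
    """Linear warmup followed by cosine decay, a schedule commonly used for ViT."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

# One illustrative step; random tensors stand in for a real data loader.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step(); scheduler.step(); optimizer.zero_grad()
```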
ViT achieved strong results on ImageNet and on downstream datasets used by research groups at Stanford University, UC Berkeley, MIT, and Carnegie Mellon University when pretrained on very large corpora assembled by Google Research, with evaluation on benchmarks curated by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) organizers. Comparative studies included ResNet families from Kaiming He, hybrid CNN-Transformer designs from Facebook AI Research, and efficient architectures such as MobileNet from Google. Performance analyses appeared in papers and technical reports from DeepMind, OpenAI, Microsoft Research, and university labs, and results were reproduced on NVIDIA GPUs and Google TPUs.
Following the original ViT, researchers at Facebook AI Research, Google Research, DeepMind, Microsoft Research, OpenAI, and universities including Stanford, Oxford, and ETH Zurich proposed a range of variants. These include hybrid models combining convolutional inductive biases from ResNets or other ConvNets with Transformer encoders; lightweight variants for mobile devices inspired by the MobileNet and EfficientNet lines; hierarchical designs such as the Swin Transformer from Microsoft Research; masked-image-modeling pretraining analogous to BERT, exemplified by BEiT from Microsoft Research and MAE from Facebook AI Research; and vision-language extensions such as OpenAI's CLIP and Google's ALIGN, along with multimodal transformers at Facebook AI Research. Other extensions explore efficient and sparse attention mechanisms studied at Google Brain, DeepMind, and the University of Toronto; a minimal sketch of the masking step appears below.
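To make the masked-image-modeling idea concrete, here is a minimal sketch of MAE-style random patch masking, in which only a small visible subset of patch tokens is encoded and the missing patches are reconstructed later; the function name and the 75% mask ratio are illustrative assumptions, not a reference implementation.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens, MAE-style.

    tokens: (B, N, D) patch embeddings. Returns the kept tokens and the
    shuffle order (its argsort gives the restore order used when the
    decoder reconstructs the masked patches).
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # one random score per patch
    ids_shuffle = noise.argsort(dim=1)       # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_shuffle

tokens = torch.randn(2, 196, 768)
kept, order = random_masking(tokens)
print(kept.shape)  # torch.Size([2, 49, 768]) -- only 25% of patches are encoded
```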
ViT and its variants have been applied across computer vision tasks pursued by academic and industry groups: image classification on ImageNet; object detection on COCO, notably at Facebook AI Research; semantic segmentation projects led by researchers at the University of Oxford and MIT; medical imaging collaborations with hospitals and labs affiliated with Harvard Medical School and Stanford Medicine; remote sensing partnerships involving NASA and ESA; and multimodal systems at OpenAI, Google, and Microsoft. Deployments have targeted products and services from Google Cloud, Azure AI, AWS, and Apple, as well as startups spun out of UC Berkeley and Stanford University.
Critiques from researchers at MIT, Stanford University, the University of Oxford, Carnegie Mellon University, and corporate labs such as DeepMind and Facebook AI Research note that ViT generally requires large-scale pretraining data, such as Google's JFT-300M, to outperform convolutional models, raising concerns about compute and environmental costs highlighted in compute-scaling studies from OpenAI and DeepMind. Analyses by NVIDIA and Intel engineers discuss hardware-optimization challenges relative to convolutional kernels that are highly tuned in libraries such as cuDNN. Theoretical and robustness critiques from ETH Zurich and the University of Toronto explore adversarial vulnerability and the weaker spatial inductive biases of attention-only models compared with convolutional architectures. Ongoing work at Microsoft Research, Google Research, and universities aims to address data efficiency, interpretability, and fairness issues raised by research ethics teams at Stanford University and the Harvard Kennedy School.