Demystifying the Vision Transformer (ViT)
🧠 Paper Summary: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
Authors: Alexey Dosovitskiy et al.
Published by: Google Research, Brain Team (2020)
Link: arxiv.org/pdf/2010.11929
📌 Overview
This paper introduces the Vision Transformer (ViT), one of the first successful applications of a pure Transformer architecture, originally designed for NLP, directly to image classification.
The key idea: instead of processing pixels through convolutional layers, an image is split into patches, each treated as a “token” (similar to words in a sentence).
ViT demonstrates that with sufficient data and compute, Transformers can outperform CNNs in image recognition — marking a major shift in computer vision research.
🧩 Key Concepts
1. Patch Embedding
- The input image (e.g., 224×224) is divided into fixed-size patches (e.g., 16×16 pixels).
- Each patch is flattened and linearly projected to form a vector embedding.
- These patch embeddings serve as input tokens for the Transformer encoder.
2. Positional Encoding
- Since Transformers have no inherent notion of spatial order, positional embeddings are added to patch embeddings to preserve spatial relationships.
3. Transformer Encoder
- The encoder consists of multi-head self-attention (MHSA) and MLP blocks, similar to the standard NLP Transformer (Vaswani et al., 2017).
- Layer normalization is applied before each sub-layer (pre-norm), with residual connections around each one; a minimal end-to-end sketch follows this list.
4. Classification Token ([CLS])
- A special learnable token is prepended to the sequence of patch embeddings.
- The final representation of this token is used for image classification.
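Putting the four pieces above together, here is a minimal PyTorch sketch of a ViT-style classifier. This is not the paper's official implementation: the class name MiniViT is made up for illustration, and the defaults only roughly follow the ViT-Base/16 configuration (12 layers, hidden size 768, 12 heads, MLP size 3072).

```python
# Minimal ViT-style classifier: patch embedding + [CLS] token +
# learned positional embeddings + a pre-norm Transformer encoder.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196

        # 1. Patch embedding: a stride-16 conv is equivalent to flattening
        #    each 16x16 patch and projecting it with a shared linear layer.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # 4. Learnable [CLS] token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

        # 2. Learned positional embeddings, one per token including [CLS]
        #    (zero-initialized here for simplicity).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # 3. Pre-norm Transformer encoder: MHSA + MLP blocks with
        #    LayerNorm before each sub-layer and residual connections.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Classification head applied to the final [CLS] representation.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.patch_embed(x)               # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, 197, dim)
        x = x + self.pos_embed                # add positional information
        x = self.encoder(x)                   # (B, 197, dim)
        return self.head(x[:, 0])             # logits from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

As a rough sanity check, `sum(p.numel() for p in MiniViT().parameters())` comes out to about 86M, which lines up with the ViT-B/16 row in the results table below.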
⚙️ Training Setup
- Pretraining Dataset: JFT-300M (≈300M images, 18k classes) and ImageNet-21k (14M images, 21k classes).
- Fine-tuning: After pretraining, the model is fine-tuned on smaller datasets like ImageNet, CIFAR-100, and VTAB.
- Optimization: Adam with high weight decay and a linear learning-rate warmup/decay for pretraining; fine-tuning uses SGD with momentum and a cosine schedule (a minimal loop sketch follows this list).
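A hedged sketch of the fine-tuning side of that recipe, assuming the MiniViT class from the Key Concepts section above: SGD with momentum plus cosine learning-rate decay. The learning rate, step count, and the random tensors standing in for a data loader are placeholders, not values from the paper.

```python
# Illustrative fine-tuning loop: SGD with momentum and cosine LR decay,
# in the spirit of the fine-tuning recipe summarized above.
# All hyperparameters and the random "data" are placeholders.
import torch

model = MiniViT(num_classes=100)          # e.g., a CIFAR-100 classification head
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
total_steps = 10                          # placeholder; real runs use far more steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(total_steps):
    images = torch.randn(8, 3, 224, 224)  # stand-in for a real data loader batch
    labels = torch.randint(0, 100, (8,))
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                      # decay the learning rate each step
```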
📈 Results
| Model | Pretraining Dataset | ImageNet Top-1 Accuracy | Params |
|---|---|---|---|
| ViT-B/16 | ImageNet-21k | ≈84.0% | 86M |
| ViT-L/16 | JFT-300M | ≈87.8% | 307M |
| ViT-H/14 | JFT-300M | ≈88.6% | 632M |
- ViT outperforms ResNet and other CNN-based models when pretrained on large datasets.
- On smaller datasets (without large-scale pretraining), ViT underperforms due to weaker inductive biases compared to CNNs.
🧠 Insights & Contributions
- Transformers can scale to vision tasks with sufficient data and computational power.
- CNN inductive biases (translation invariance, locality) are not strictly necessary for high performance — they can be learned from data.
- ViT models transfer well across different datasets via fine-tuning.
- Demonstrated favorable scaling behavior: performance improves predictably as data, model size, and compute grow.
⚖️ Pros & Cons
| Pros | Cons |
|---|---|
| Simple, elegant architecture — no convolutions. | Requires massive data for effective training. |
| Strong transfer learning performance. | Computationally expensive at large scale. |
| Flexible: easily extended to detection, segmentation, etc. | Weak inductive biases hurt in small-data settings. |
🔍 Follow-up Work
- DeiT (Data-efficient Image Transformers): Trains competitive ViTs on ImageNet-1k alone, using strong augmentation and knowledge distillation from a CNN teacher.
- Swin Transformer: Introduces hierarchical structure and local attention windows for dense prediction tasks.
- Hybrid Models: Combine CNN feature extractors with ViT-style attention blocks.
🧭 Takeaway
“ViT marks the turning point where pure Transformers became competitive with CNNs in computer vision — given enough data.”
It simplifies the architecture but relies on data scale and transfer learning, setting the foundation for the transformer revolution in visual tasks.
Prepared for technical sharing session — concise summary by Rizky.