Demystifying the Vision Transformer (ViT)

🧠 Paper Summary: An Image is Worth 16x16 Words — Transformers for Image Recognition at Scale

Authors: Alexey Dosovitskiy et al.
Published by: Google Research, Brain Team (2020)
Link: arxiv.org/pdf/2010.11929


📌 Overview

This paper introduces the Vision Transformer (ViT), one of the first works to apply a pure Transformer architecture, originally designed for NLP, directly to image classification at scale.
The key idea: instead of processing pixels through convolutional layers, an image is split into patches, each treated as a “token” (similar to words in a sentence).

ViT demonstrates that with sufficient data and compute, Transformers can outperform CNNs in image recognition — marking a major shift in computer vision research.


🧩 Key Concepts

1. Patch Embedding

  • The input image (e.g., 224×224) is divided into fixed-size patches (e.g., 16×16 pixels), yielding (224/16)² = 196 patches.
  • Each patch is flattened and linearly projected to form a vector embedding.
  • These patch embeddings serve as input tokens for the Transformer encoder (see the sketch below).
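
To make the tokenization concrete, here is a minimal PyTorch sketch (an assumption: the paper's reference code is in JAX/Flax, and the class and variable names here are illustrative). A convolution whose kernel and stride both equal the patch size is equivalent to flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embed_dim-d token."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196 for the defaults
        # kernel = stride = patch_size: one output position per patch == shared linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, 196, 768): one 768-d token per patch
        return x

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```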

2. Positional Encoding

  • Since Transformers have no inherent notion of spatial order, learnable 1D positional embeddings are added to the patch embeddings to preserve spatial information (see the sketch below).
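
ViT uses learnable position embeddings (one vector per token position) rather than the fixed sinusoidal encodings of the original Transformer. A minimal sketch, continuing the illustrative names above:

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196
# One learnable vector per position; the +1 accounts for the [CLS] token described below.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)

# tokens: (B, num_patches + 1, embed_dim) once the [CLS] token has been prepended
# tokens = tokens + pos_embed              # broadcast addition over the batch dimension
```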

3. Transformer Encoder

  • The encoder consists of multi-head self-attention (MHSA) and MLP blocks, similar to the standard NLP Transformer (Vaswani et al., 2017).
  • Layer normalization is applied before each sub-layer (pre-norm), and residual connections are added around both the MHSA and MLP sub-layers (see the block sketch below).
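
A sketch of one pre-norm encoder block using PyTorch's built-in nn.MultiheadAttention as a stand-in for the paper's implementation (dimensions follow ViT-Base: 768-d embeddings, 12 heads, MLP ratio 4; the class name is illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: pre-norm MHSA and MLP, each wrapped in a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                    # x: (B, N, dim)
        h = self.norm1(x)                                    # LayerNorm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual around MHSA
        x = x + self.mlp(self.norm2(x))                      # residual around MLP
        return x

print(EncoderBlock()(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```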

4. Classification Token ([CLS])

  • A special learnable token is prepended to the sequence of patch embeddings.
  • The final representation of this token (its state after the last encoder layer) is passed to the classification head, as sketched below.
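
An illustrative sketch of how the [CLS] token is prepended and later read out for classification (in the full model, the encoder stack and a final LayerNorm sit between the concatenation and the head):

```python
import torch
import torch.nn as nn

batch, num_patches, dim, num_classes = 2, 196, 768, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [CLS] embedding
patch_tokens = torch.randn(batch, num_patches, dim)     # stand-in for the patch embedding output

# Prepend [CLS] so the sequence has 197 tokens; after the encoder, only the
# [CLS] position is fed to the classification head.
tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)  # (2, 197, 768)
head = nn.Linear(dim, num_classes)
logits = head(tokens[:, 0])                              # (2, 1000), from the [CLS] token only
print(logits.shape)
```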

⚙️ Training Setup

  • Pretraining Dataset: JFT-300M (≈300M images, 18k classes) and ImageNet-21k (14M images, 21k classes).
  • Fine-tuning: After pretraining, the model is fine-tuned on smaller datasets like ImageNet, CIFAR-100, and VTAB.
  • Optimization: Adam with a high weight decay for pretraining; SGD with momentum and a cosine learning-rate schedule for fine-tuning. ViT relies more on dataset scale than on heavy data augmentation.
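
A hedged example of the transfer-learning recipe using torchvision's ViT-B/16 (an assumption: this relies on torchvision >= 0.13 and its ImageNet-1k weights, since the paper's JFT-300M checkpoints are not publicly released; the hyperparameters shown are illustrative):

```python
import torch
import torch.nn as nn
import torchvision

# Load a pretrained ViT-B/16 and swap the classification head for a
# 100-class downstream dataset such as CIFAR-100.
weights = torchvision.models.ViT_B_16_Weights.IMAGENET1K_V1
model = torchvision.models.vit_b_16(weights=weights)
model.heads.head = nn.Linear(model.heads.head.in_features, 100)

# Fine-tune with SGD + momentum and a cosine learning-rate schedule,
# mirroring the paper's fine-tuning setup.
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```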

📈 Results

| Model | Pretraining Dataset | ImageNet Top-1 Accuracy | Params |
|---|---|---|---|
| ViT-B/16 | ImageNet-21k | 84.0% | 86M |
| ViT-L/16 | JFT-300M | 87.76% | 307M |
| ViT-H/14 | JFT-300M | 88.55% | 632M |
  • ViT outperforms ResNet and other CNN-based models when pretrained on large datasets.
  • On smaller datasets (without large-scale pretraining), ViT underperforms due to weaker inductive biases compared to CNNs.

🧠 Insights & Contributions

  1. Transformers can scale to vision tasks with sufficient data and computational power.
  2. CNN inductive biases (translation invariance, locality) are not strictly necessary for high performance — they can be learned from data.
  3. ViT models transfer well across different datasets via fine-tuning.
  4. Demonstrated favorable scaling behavior: performance improves predictably as data and compute grow, with no sign of saturation in the reported experiments.

⚖️ Pros & Cons

| Pros | Cons |
|---|---|
| Simple, elegant architecture with no convolutions. | Requires massive data for effective training. |
| Strong transfer learning performance. | Computationally expensive at large scale. |
| Flexible: easily extended to detection, segmentation, etc. | Poor inductive bias for small-data settings. |

🔍 Follow-up Work

  • DeiT (Data-efficient Image Transformer): Improves ViT training efficiency with smaller datasets using knowledge distillation.
  • Swin Transformer: Introduces hierarchical structure and local attention windows for dense prediction tasks.
  • Hybrid Models: Combine CNN feature extractors with ViT-style attention blocks.

🧭 Takeaway

“ViT marks the turning point where pure Transformers became competitive with CNNs in computer vision — given enough data.”

It simplifies the architecture but relies on data scale and transfer learning, laying the foundation for the Transformer revolution in computer vision.


Prepared for a technical sharing session; concise summary by Rizky.