Demystifying the Vision Transformer (ViT)
🧠 Paper Summary: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
Authors: Alexey Dosovitskiy et al.
Published by: Google Research, Brain Team (2020)
Link: arxiv.org/pdf/2010.11929
📌 Overview
This paper introduces the Vision Transformer (ViT), one of the first successful applications of a pure Transformer architecture, originally designed for NLP, directly to image classification.
The key idea: instead of processing pixels through convolutional layers, an image is split into patches, each treated as a “token” (similar to words in a sentence).
ViT demonstrates that with sufficient data and compute, Transformers can outperform CNNs in image recognition — marking a major shift in computer vision research.
🧩 Key Concepts
1. Patch Embedding
- The input image (e.g., 224×224) is divided into fixed-size patches (e.g., 16×16 pixels).
- Each patch is flattened and linearly projected to form a vector embedding.
- These patch embeddings serve as input tokens for the Transformer encoder.
2. Positional Encoding
- Since Transformers have no inherent notion of spatial order, positional embeddings are added to patch embeddings to preserve spatial relationships.
3. Transformer Encoder
- The encoder consists of multi-head self-attention (MHSA) and MLP blocks, similar to the standard NLP Transformer (Vaswani et al., 2017).
- Layer normalization is applied before each sub-layer (pre-norm), with residual connections around each one; a minimal end-to-end sketch follows this list.
4. Classification Token ([CLS])
- A special learnable token is prepended to the sequence of patch embeddings.
- The final representation of this token is used for image classification.
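Putting the four pieces above together, here is a minimal PyTorch sketch of a ViT-style classifier. This is not the paper's official implementation: the class name MiniViT is made up for illustration, and the defaults only roughly follow the ViT-Base/16 configuration (12 layers, hidden size 768, 12 heads, MLP size 3072).

```python
# Minimal ViT-style classifier: patch embedding + [CLS] token +
# learned positional embeddings + a pre-norm Transformer encoder.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196

        # 1. Patch embedding: a stride-16 conv is equivalent to flattening
        #    each 16x16 patch and projecting it with a shared linear layer.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # 4. Learnable [CLS] token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

        # 2. Learned positional embeddings, one per token including [CLS]
        #    (zero-initialized here for simplicity).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # 3. Pre-norm Transformer encoder: MHSA + MLP blocks with
        #    LayerNorm before each sub-layer and residual connections.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Classification head applied to the final [CLS] representation.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.patch_embed(x)               # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, 197, dim)
        x = x + self.pos_embed                # add positional information
        x = self.encoder(x)                   # (B, 197, dim)
        return self.head(x[:, 0])             # logits from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

As a rough sanity check, `sum(p.numel() for p in MiniViT().parameters())` comes out to about 86M, which lines up with the ViT-B/16 row in the results table below.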
⚙️ Training Setup
- Pretraining Dataset: JFT-300M (≈300M images, 18k classes) and ImageNet-21k (14M images, 21k classes).
- Fine-tuning: After pretraining, the model is fine-tuned on smaller datasets like ImageNet, CIFAR-100, and VTAB.
- Optimization: Adam with high weight decay and a linear learning-rate warmup/decay for pretraining; fine-tuning uses SGD with momentum and a cosine schedule (a minimal loop sketch follows this list).
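A hedged sketch of the fine-tuning side of that recipe, assuming the MiniViT class from the Key Concepts section above: SGD with momentum plus cosine learning-rate decay. The learning rate, step count, and the random tensors standing in for a data loader are placeholders, not values from the paper.

```python
# Illustrative fine-tuning loop: SGD with momentum and cosine LR decay,
# in the spirit of the fine-tuning recipe summarized above.
# All hyperparameters and the random "data" are placeholders.
import torch

model = MiniViT(num_classes=100)          # e.g., a CIFAR-100 classification head
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
total_steps = 10                          # placeholder; real runs use far more steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(total_steps):
    images = torch.randn(8, 3, 224, 224)  # stand-in for a real data loader batch
    labels = torch.randint(0, 100, (8,))
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                      # decay the learning rate each step
```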
📈 Results
| Model | Pretraining Dataset | ImageNet Top-1 Accuracy | Params |
|---|---|---|---|
| ViT-B/16 | ImageNet-21k | ≈84.0% | 86M |
| ViT-L/16 | JFT-300M | ≈87.8% | 307M |
| ViT-H/14 | JFT-300M | ≈88.6% | 632M |
- ViT outperforms ResNet and other CNN-based models when pretrained on large datasets.
- On smaller datasets (without large-scale pretraining), ViT underperforms due to weaker inductive biases compared to CNNs.
🧠 Insights & Contributions
- Transformers can scale to vision tasks with sufficient data and computational power.
- CNN inductive biases (translation invariance, locality) are not strictly necessary for high performance — they can be learned from data.
- ViT models transfer well across different datasets via fine-tuning.
- Demonstrated favorable scaling behavior: performance improves predictably as data, model size, and compute grow.
⚖️ Pros & Cons
| Pros | Cons |
|---|---|
| Simple, elegant architecture — no convolutions. | Requires massive data for effective training. |
| Strong transfer learning performance. | Computationally expensive at large scale. |
| Flexible: easily extended to detection, segmentation, etc. | Weak inductive biases hurt in small-data settings. |
🔍 Follow-up Work
- DeiT (Data-efficient Image Transformers): Trains competitive ViTs on ImageNet-1k alone, using strong augmentation and knowledge distillation from a CNN teacher.
- Swin Transformer: Introduces hierarchical structure and local attention windows for dense prediction tasks.
- Hybrid Models: Combine CNN feature extractors with ViT-style attention blocks.
🧭 Takeaway
“ViT marks the turning point where pure Transformers became competitive with CNNs in computer vision — given enough data.”
It simplifies the architecture but relies on data scale and transfer learning, setting the foundation for the transformer revolution in visual tasks.
Prepared for technical sharing session — concise summary by Rizky.