Vision Transformer

An Image is Worth 16 x 16 Words

TL;DR: the researchers found a way to transform the input image into "tokens" so that it can be fed into a standard Transformer.

To do this, they break the input image into patches and turn each patch into a token. For example, a 224 * 224 image can be split into 14 * 14 = 196 non-overlapping 16 * 16 patches. The input can then be viewed as a sequence of 196 tokens, where each token is a flattened patch (16 * 16 * 3 = 768 values for an RGB image) that is then linearly projected to the model's embedding dimension; a sketch of this step follows below.
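As a rough illustration (in PyTorch; the tensor layout here is my own choice, not the paper's official code), the patch-splitting step could look like this:

```python
import torch

# Hypothetical example: one RGB image of size 224 x 224
image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)

patch_size = 16
# unfold extracts non-overlapping 16 x 16 windows along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# patches: (1, 3, 14, 14, 16, 16) -> 14 * 14 = 196 patches per image

# Flatten each patch into a single vector of 3 * 16 * 16 = 768 values
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)
print(tokens.shape)                    # torch.Size([1, 196, 768])
```

In the actual model this flattening is typically fused with the linear projection, e.g. as a single convolution with kernel size and stride equal to the patch size.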

The model itself is essentially the original Transformer encoder used in NLP tasks. This was intentional: the authors wanted to show that a Transformer can be applied to computer vision tasks with as little adaptation as possible.
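To make the overall flow concrete, here is a minimal, non-official sketch of a ViT-style classifier built from PyTorch's stock Transformer encoder. The sizes (embedding dimension 768, 12 layers, 12 heads) follow the ViT-Base configuration, but details such as the MLP head, normalization placement, and initialization are simplified and assumed for illustration.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """A ViT-style classifier sketch; not the authors' implementation."""
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # 196 for 224 / 16
        # Linear projection of flattened patches, done here as a strided conv
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)              # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, dim): the token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])            # classify from the [class] token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))  # shape (2, 1000)
```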