Vision Transformer
An Image is Worth 16x16 Words
TL;DR: the researchers found a way to transform the input image into "tokens" so that it can be fed into a Transformer.
To do this, they split the input image into patches and transformed each patch into a token individually. For example, a 224 x 224 image can be broken down into 14 x 14 = 196 non-overlapping 16 x 16 patches. The input can then be viewed as a sequence of 196 tokens, with each token being a flattened 16 x 16 patch (16 x 16 x 3 = 768 dimensions for an RGB image).
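As a concrete illustration, here is a minimal PyTorch sketch of this patchification step. The tensor shapes are assumptions matching the 224 x 224 example above, not the paper's actual code:

```python
# A minimal sketch of the patch-tokenization step, assuming a
# 224x224 RGB image and 16x16 patches (illustrative, not the paper's code).
import torch

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16

# Split the image into a 14x14 grid of 16x16 patches.
# unfold along height, then width: (1, 3, 14, 14, 16, 16)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten each patch into a single vector (16 * 16 * 3 = 768 dimensions).
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)

print(tokens.shape)  # torch.Size([1, 196, 768]) -- a sequence of 196 tokens
```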
The model is essentially the original Transformer used in NLP tasks. This was deliberate: the authors wanted to show that a Transformer can be applied to computer vision tasks with as little adaptation as possible.
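To make that concrete, the forward pass can be sketched with PyTorch's stock Transformer encoder standing in for the paper's (standard) one. The layer counts below match the ViT-Base configuration (12 layers, 12 heads, hidden size 768); everything else is an illustrative assumption, not the authors' implementation:

```python
# A minimal ViT-style forward pass using PyTorch's built-in encoder
# as a stand-in for the paper's standard Transformer (a sketch, not
# the reference implementation).
import torch
import torch.nn as nn

num_tokens, dim, num_classes = 196, 768, 1000

proj = nn.Linear(768, dim)  # linear projection of the flattened patches
cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))  # position embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=dim, nhead=12, dim_feedforward=3072,
        batch_first=True, norm_first=True,  # ViT uses pre-norm blocks
    ),
    num_layers=12,
)
head = nn.Linear(dim, num_classes)  # classification head on the [class] token

# Stand-in for the patch tokens produced by the previous snippet.
tokens = torch.randn(1, num_tokens, 768)

x = proj(tokens)
x = torch.cat([cls_token.expand(x.size(0), -1, -1), x], dim=1)  # prepend [class]
x = encoder(x + pos_embed)
logits = head(x[:, 0])  # predict from the [class] token's output

print(logits.shape)  # torch.Size([1, 1000])
```

Note how little is vision-specific here: besides the patch projection, position embeddings, and the prepended [class] token, the encoder itself is an off-the-shelf NLP Transformer.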