Vision Transformers

Patchify - how images become tokens

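Vision transformers do not read images pixel by pixel. They split the image into square patches, flatten each patch, and treat each flattened patch as a token. For an H x W image with C channels and patch size p, this gives N = (H/p) * (W/p) tokens, each of dimension D = C*p*p, so the patch size controls both the token count and the token dimension.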

Step 1 - divide into patches

The image is divided into a grid of non-overlapping patches. Each patch is a p x p block of pixels per channel.
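To make the grid arithmetic concrete, here is a minimal sketch in plain Python. The function name `patch_grid` and the 224 x 224 image with patch size 16 are illustrative assumptions, not values taken from the page; the sketch also assumes the image size is an exact multiple of the patch size.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols, num_patches) for a non-overlapping patch grid.

    Hypothetical helper for illustration; assumes the image dimensions
    are exact multiples of the patch size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    rows = height // patch_size   # patches along the height (pH)
    cols = width // patch_size    # patches along the width (pW)
    return rows, cols, rows * cols


rows, cols, num_patches = patch_grid(224, 224, 16)
print(rows, cols, num_patches)  # 14 14 196
```

Halving the patch size doubles the number of rows and columns, so the token count grows by a factor of four.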

[Figure: original image and its patch grid]

Step 2 - flatten each patch to a vector

The p x p pixel values from each of a patch's C channels are concatenated into a single vector of length C*p*p.
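As a rough illustration of flattening a single patch, the sketch below slices one patch out of a small tensor and flattens it. It assumes PyTorch and a channel-first (C, H, W) layout; the helper name `flatten_patch` and the toy sizes are made up for the example.

```python
import torch

# Toy image: 3 channels, 8 x 8 pixels, filled with 0..191 so values are easy to trace.
C, H, W, p = 3, 8, 8, 4
image = torch.arange(C * H * W, dtype=torch.float32).reshape(C, H, W)


def flatten_patch(image, row, col, p):
    """Flatten the patch at grid position (row, col) into a 1-D vector.

    Hypothetical helper; assumes a (C, H, W) tensor and patch size p.
    """
    patch = image[:, row * p:(row + 1) * p, col * p:(col + 1) * p]  # (C, p, p)
    # Channel-major order: all values of channel 0, then channel 1, and so on.
    return patch.reshape(-1)


token = flatten_patch(image, 0, 1, p)
print(token.shape)  # torch.Size([48])
```

Here 48 = 3 * 4 * 4, matching the C*p*p token dimension.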

What reshape + permute does

1. reshape(B, C, pH, p, pW, p) splits each spatial axis into (number of patches, patch size), where pH = H/p and pW = W/p.
2. permute(0, 2, 4, 1, 3, 5) reorders the axes to (B, pH, pW, C, p, p) so that each patch's values are grouped together.
3. .contiguous() copies the data so that the memory layout matches the new axis order.
4. reshape(B, pH*pW, C*p*p) produces the final token sequence (B, N, D), with N = pH*pW and D = C*p*p.
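The four steps above can be put together as follows. This is a sketch assuming PyTorch and a (B, C, H, W) input whose height and width are divisible by p; the function name `patchify` and the example sizes are illustrative choices.

```python
import torch


def patchify(x, p):
    """Turn a (B, C, H, W) batch into a (B, N, D) token sequence.

    Illustrative sketch; assumes H and W are exact multiples of p.
    """
    B, C, H, W = x.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by p"
    pH, pW = H // p, W // p

    x = x.reshape(B, C, pH, p, pW, p)   # step 1: split each spatial axis
    x = x.permute(0, 2, 4, 1, 3, 5)     # step 2: (B, pH, pW, C, p, p)
    x = x.contiguous()                  # step 3: match memory to the new order
    return x.reshape(B, pH * pW, C * p * p)  # step 4: (B, N, D)


x = torch.randn(2, 3, 32, 32)
tokens = patchify(x, 8)
print(tokens.shape)  # torch.Size([2, 16, 192])

# Token 5 is the patch at grid row 1, column 1 (row-major order, pW = 4).
expected = x[:, :, 8:16, 8:16].reshape(2, -1)
assert torch.equal(tokens[:, 5], expected)
```

The explicit .contiguous() call is what would allow .view() in the last step. With .reshape(), PyTorch makes the copy itself when needed, so the call mainly documents where the copy happens.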