Patchify Explainer

Vision Transformers

Patchify - how images become tokens

Vision transformers do not read images pixel by pixel. They split the image into non-overlapping square patches, flatten each patch into a vector, and treat each flattened patch as a token. The image size, patch size, and number of channels together determine how many tokens there are and how long each one is.
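As a rough illustration of how those three settings interact, the sketch below computes the token count N and token length D. The function name patch_token_shape and the example sizes are made up for illustration, and it assumes the image size divides evenly by the patch size.

```python
def patch_token_shape(height, width, patch_size, channels):
    """Return (num_tokens, token_dim) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    grid_h = height // patch_size                    # patches along the height (pH)
    grid_w = width // patch_size                     # patches along the width (pW)
    num_tokens = grid_h * grid_w                     # N = pH * pW
    token_dim = channels * patch_size * patch_size   # D = C * p * p
    return num_tokens, token_dim


print(patch_token_shape(224, 224, 16, 3))  # (196, 768)
print(patch_token_shape(224, 224, 32, 3))  # (49, 3072)
```

Doubling the patch size cuts the token count by a factor of four and makes each token four times longer.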


Step 1 - divide into patches

The image is divided into a pH × pW grid of patches, where pH = H / p and pW = W / p for an H × W image and patch size p. Each patch is a p × p block of pixels per channel.
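One way to picture the grid is to slice a single patch out by its grid position. This is only a sketch: it assumes the image is stored as nested lists indexed image[c][y][x], and the helper names patch_bounds and extract_patch are hypothetical.

```python
def patch_bounds(i, j, patch_size):
    """Pixel ranges covered by the patch at grid row i, grid column j."""
    top, left = i * patch_size, j * patch_size
    return (top, top + patch_size), (left, left + patch_size)


def extract_patch(image, i, j, patch_size):
    """Slice one patch out of an image stored as image[c][y][x] nested lists."""
    (top, bottom), (left, right) = patch_bounds(i, j, patch_size)
    return [[row[left:right] for row in channel[top:bottom]] for channel in image]


# A 1-channel 4x4 image whose pixel values are 0..15, split into 2x2 patches.
image = [[[y * 4 + x for x in range(4)] for y in range(4)]]
print(patch_bounds(0, 1, 2))           # ((0, 2), (2, 4))
print(extract_patch(image, 0, 1, 2))   # [[[2, 3], [6, 7]]]
```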

[Figure: original image → patch grid]

Step 2 - flatten each patch to a vector

Each patch's p × p pixels from all C channels are concatenated into a single vector of length C·p·p.
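A small sketch of the flattening step follows. It assumes the patch is stored as nested lists indexed patch[c][u][v] and that values are laid out channel first, then row, then column. The function name flatten_patch and the sample values are illustrative only.

```python
def flatten_patch(patch):
    """Flatten a patch stored as patch[c][u][v] into one list of length C*p*p."""
    return [value for channel in patch for row in channel for value in row]


# Hypothetical 2-channel, 2x2 patch.
patch = [
    [[1, 2], [3, 4]],       # channel 0
    [[10, 20], [30, 40]],   # channel 1
]
vector = flatten_patch(patch)
print(vector)        # [1, 2, 3, 4, 10, 20, 30, 40]
print(len(vector))   # 8, which is C * p * p = 2 * 2 * 2
```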

What reshape + permute does

reshape(B, C, pH, p, pW, p) splits each spatial axis into a (number of patches, patch size) pair: H becomes (pH, p) and W becomes (pW, p).

permute(0,2,4,1,3,5) reorders the axes to (B, pH, pW, C, p, p), so each patch's values are grouped together.

.contiguous() copies the data so the memory layout matches the new axis order; permute alone only changes the strides.

reshape(B, pH*pW, C*p*p) produces the final token sequence (B, N, D), with N = pH*pW and D = C*p*p.
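The four calls above can be put together roughly as follows. This is a minimal PyTorch sketch, not a reference implementation: the function name patchify, the example sizes, and the divisibility check are assumptions added for illustration.

```python
import torch


def patchify(x, p):
    """Turn images of shape (B, C, H, W) into patch tokens of shape (B, N, D)."""
    B, C, H, W = x.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    pH, pW = H // p, W // p
    x = x.reshape(B, C, pH, p, pW, p)         # split H -> (pH, p) and W -> (pW, p)
    x = x.permute(0, 2, 4, 1, 3, 5)           # axes become (B, pH, pW, C, p, p)
    x = x.contiguous()                        # copy data into the permuted order
    return x.reshape(B, pH * pW, C * p * p)   # (B, N, D)


B, C, H, W, p = 2, 3, 8, 8, 4
images = torch.arange(B * C * H * W, dtype=torch.float32).reshape(B, C, H, W)
tokens = patchify(images, p)
print(tokens.shape)  # torch.Size([2, 4, 48])

# Token n = i * pW + j should equal the flattened patch at grid position (i, j).
i, j = 1, 0
expected = images[0, :, i * p:(i + 1) * p, j * p:(j + 1) * p].reshape(-1)
print(torch.equal(tokens[0, i * (W // p) + j], expected))  # True
```

The last check compares one token against the same patch sliced directly from the image, which is a quick way to confirm that the axis order in permute is the intended one.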