Vision Transformers
Vision transformers do not read images pixel by pixel. They split the image into square patches, flatten each patch, and treat each flattened patch as a token.
The image is divided into a pH × pW grid of patches. Each patch is a p × p block of pixels per channel.
Each patch's values across all C channels are flattened and concatenated into a vector of length D = C*p*p.
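The arithmetic behind the token count and token dimension can be sketched in a few lines of Python. The helper name patch_token_shape and the 224 × 224 RGB input are illustrative assumptions, not values taken from the page.

```python
def patch_token_shape(height, width, channels, patch_size):
    """Return (num_tokens, token_dim) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches_high = height // patch_size   # pH
    patches_wide = width // patch_size    # pW
    num_tokens = patches_high * patches_wide           # N = pH * pW
    token_dim = channels * patch_size * patch_size     # D = C * p * p
    return num_tokens, token_dim


# Assumed example input: a 224 x 224 RGB image at three patch sizes.
for p in (8, 16, 32):
    n, d = patch_token_shape(224, 224, 3, p)
    print(f"patch size {p:>2}: {n:>3} tokens of dimension {d}")
```

Doubling the patch size cuts the number of tokens by a factor of four and multiplies the token dimension by four, so the total number of values stays the same.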
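reshape(B, C, pH, p, pW, p) splits each spatial axis into (number of patches, patch size): height becomes (pH, p) and width becomes (pW, p).
permute(0, 2, 4, 1, 3, 5) reorders the axes to (B, pH, pW, C, p, p), so each patch's values are grouped together.
.contiguous() copies the data so that the memory layout matches the new axis order (permute only changes strides).
reshape(B, pH*pW, C*p*p) produces the final token sequence of shape (B, N, D), where N = pH*pW and D = C*p*p.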
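The four steps above can be put together in one small function. This is a minimal sketch that uses NumPy instead of the tensor library the steps are written for: transpose stands in for permute and np.ascontiguousarray stands in for .contiguous(). The function name patchify and the tiny test image are assumptions made for illustration.

```python
import numpy as np


def patchify(images, p):
    """Turn a (B, C, H, W) array into a (B, N, D) sequence of patch tokens."""
    B, C, H, W = images.shape
    if H % p or W % p:
        raise ValueError("H and W must be divisible by the patch size p")
    pH, pW = H // p, W // p

    # Split each spatial axis into (number of patches, patch size).
    x = images.reshape(B, C, pH, p, pW, p)
    # Reorder to (B, pH, pW, C, p, p) so each patch's values sit together.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    # Copy so the memory layout matches the new axis order.
    x = np.ascontiguousarray(x)
    # Merge the grid axes into N and the per-patch axes into D.
    return x.reshape(B, pH * pW, C * p * p)


# Assumed toy input: 2 images, 3 channels, 4 x 6 pixels, patch size 2.
images = np.arange(2 * 3 * 4 * 6).reshape(2, 3, 4, 6)
tokens = patchify(images, 2)
print(tokens.shape)  # (2, 6, 12)

# The last token of the first image is its bottom-right patch, flattened.
assert np.array_equal(tokens[0, 5], images[0, :, 2:4, 4:6].reshape(-1))
```

Tokens are ordered row by row across the patch grid, so the patch at grid position (i, j) ends up at index i*pW + j.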