Vision Transformer Inputs

Patch embedding and positional embedding

This animation visualizes how flattened patch vectors become projected patch embeddings, then shows how the class token and positional embeddings are applied before the sequence enters the Transformer encoder.


1) Flattened patch token xₚⁱ

Dimension = C x P x P (raw patch values: C channel values for each pixel of a P x P patch).
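
The flattening step can be sketched in NumPy as below. This is a minimal illustration, not the page's own implementation: the helper name `flatten_patches`, the (C, H, W) memory layout, and the sizes C = 3, P = 4 on an 8 x 8 image are all assumptions, chosen so that C x P x P = 48 matches the 48-dimensional input shown in the linear-layer illustration.

```python
import numpy as np

def flatten_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) image into non-overlapping P x P patches and
    flatten each patch into a vector of length C * P * P."""
    c, h, w = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the patch size")
    # (C, H/P, P, W/P, P): separate the patch grid from within-patch offsets
    grid = image.reshape(c, h // p, p, w // p, p)
    # (H/P, W/P, C, P, P): one (C, P, P) block per patch, patches in row-major order
    grid = grid.transpose(1, 3, 0, 2, 4)
    # (N, C*P*P) with N = (H/P) * (W/P)
    return grid.reshape(-1, c * p * p)

# Illustrative sizes (assumptions): C = 3 channels, P = 4, 8 x 8 image
image = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
patches = flatten_patches(image, patch_size=4)
print(patches.shape)  # (4, 48)
```

Each row of `patches` plays the role of one token xₚⁱ; with these sizes there are N = 4 tokens of 48 values each.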

2) Linear projection E

Map xₚⁱ from dimension C x P x P to dimension D with the trainable matrix E.

Input: ℝ⁴⁸ → Linear layer → Output: ℝ⁶⁴
E (learnable)
W: 48 x 64
b: 64
xₚⁱE maps from ℝ^(C·P·P) to ℝ^D
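
As a rough sketch, the projection is a single matrix multiplication plus a bias. The shapes 48 and 64 come from the illustration above; the number of patches (N = 4), the random initialization, and the variable names are assumptions made only so the example runs. In a trained model, W and b would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

raw_dim, embed_dim = 48, 64  # C*P*P and D, as in the illustration
num_patches = 4              # N, chosen only for this example

# Random stand-ins for the learnable parameters of E
W = rng.normal(scale=0.02, size=(raw_dim, embed_dim))  # weight, 48 x 64
b = np.zeros(embed_dim)                                # bias, 64

# Stand-in for N flattened patch tokens, one per row
x = rng.normal(size=(num_patches, raw_dim))

# (N, 48) @ (48, 64) + (64,) -> (N, 64); the bias broadcasts over rows
z = x @ W + b
print(z.shape)  # (4, 64)
```

Every patch is projected with the same W and b, so the whole step is one batched matrix product.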

3) Sequence + positional embedding

Prepend the class token, then add Eₚₒₛ to each token embedding.
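
A minimal sketch of this step, assuming N = 4 patches and D = 64 and using random arrays in place of learned values. The names `cls_token` and `pos_embedding` are illustrative, and a learned positional embedding with one row per sequence position is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 4, 64  # N and D (illustrative sizes)

# Stand-ins for projected patch embeddings and learnable parameters
patch_embeddings = rng.normal(size=(num_patches, embed_dim))   # (N, D)
cls_token = rng.normal(size=(1, embed_dim))                    # (1, D)
pos_embedding = rng.normal(size=(num_patches + 1, embed_dim))  # (N + 1, D)

# Prepend the class token: the sequence grows from N to N + 1 tokens
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)

# Add one positional vector per position, including the class-token slot
encoder_input = tokens + pos_embedding
print(encoder_input.shape)  # (5, 64)
```

The result has shape (N + 1) x D, which is the sequence handed to the Transformer encoder.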

Step 1: A single patch token xₚⁱ is a flattened vector with C x P x P values.