Vision Transformer Inputs
This animation shows how flattened patch vectors are projected to patch embeddings, then how the class token is prepended and positional embeddings are added before the sequence enters the Transformer encoder.
Each flattened patch has dimension C x P x P (raw patch values), where C is the number of channels and P is the patch side length.
Map each flattened patch x_p^i from dimension C x P x P to dimension D with a trainable matrix E.
Prepend the class token, then add the positional embedding E_pos to each token embedding.
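The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `patchify` and `embed`, the example sizes, the channel-first flattening order, and the random or zero initial values are all assumptions made for the example. In a real model E, the class token, and E_pos would be learned parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split a (C, H, W) image into N flattened patches of length C*P*P.

    Hypothetical helper; the channel-first flattening order is an assumption.
    """
    c, h, w = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    # (C, H/P, P, W/P, P) -> (H/P, W/P, C, P, P) -> (N, C*P*P)
    patches = image.reshape(c, h // p, p, w // p, p)
    patches = patches.transpose(1, 3, 0, 2, 4)
    return patches.reshape(-1, c * p * p)

def embed(patches, E, cls_token, E_pos):
    """Project patches to D dims, prepend the class token, add positions."""
    tokens = patches @ E                                            # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N+1, D)
    return tokens + E_pos                                           # (N+1, D)

# Example sizes (assumed for illustration only).
C, H, W, P, D = 3, 32, 32, 8, 64
rng = np.random.default_rng(0)

image = rng.standard_normal((C, H, W))
patches = patchify(image, P)                 # N = (H/P) * (W/P) = 16 patches
N = patches.shape[0]

# Stand-ins for trainable parameters.
E = rng.standard_normal((C * P * P, D)) * 0.02   # projection matrix
cls_token = np.zeros(D)                          # class token
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # one row per token position

z0 = embed(patches, E, cls_token, E_pos)
print(patches.shape, z0.shape)  # (16, 192) (17, 64)
```

Note that E_pos has N + 1 rows rather than N, because the class token occupies position 0 and receives its own positional embedding.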