Patch Embedding and Positional Embedding

Vision Transformer Inputs

This animation visualizes the transition from flattened patch vectors to projected patch embeddings, then shows how the class token and positional embeddings are applied before the sequence enters the Transformer encoder.

Parameters: channels (C), patch size (P), embedding dim (D), number of patches (N).

Derived quantities: raw patch dim (C × P × P), projected dim (D), projection matrix E, sequence shape.

1) Flattened patch token x_pⁱ

Dimension: C × P × P (the raw values of one patch, flattened into a single vector).
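The flattening step can be sketched in NumPy as follows. This is a minimal illustration, not the animation's own code: the helper name `patchify`, the channels-first `(C, H, W)` layout, and the values C = 3, P = 4, H = W = 8 are all assumptions, chosen so that each patch has 3 × 4 × 4 = 48 values, matching the 48-dimensional input in the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) image into N flattened patches of length C*P*P.

    Hypothetical helper for illustration; assumes a channels-first layout.
    """
    C, H, W = image.shape
    P = patch_size
    if H % P != 0 or W % P != 0:
        raise ValueError("H and W must be multiples of the patch size")
    # (C, H/P, P, W/P, P): split each spatial axis into (block, offset)
    patches = image.reshape(C, H // P, P, W // P, P)
    # (H/P, W/P, C, P, P): one (C, P, P) block per patch position
    patches = patches.transpose(1, 3, 0, 2, 4)
    # (N, C*P*P): flatten each patch into a single vector
    return patches.reshape(-1, C * P * P)

# Assumed sizes: C=3, H=W=8, P=4, so N = (8/4) * (8/4) = 4 patches
image = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
x_p = patchify(image, patch_size=4)
print(x_p.shape)  # (4, 48)
```

Each row of `x_p` is one patch token x_pⁱ; the first row holds the top-left 4 × 4 block of all three channels.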

2) Linear projection E

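Map x_pⁱ from C × P × P dimensions to D dimensions with the trainable matrix E.

Example: input in R⁴⁸ → linear layer → output in R⁶⁴.

E (learnable): weight W of shape 48 × 64, bias b of shape 64.

The product x_pⁱE maps from R^(C·P·P) to R^D; in this example the linear layer also adds the bias b.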
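As a rough sketch, the projection is a single matrix multiply plus a bias. The shapes 48 and 64 come from the example above; the number of patches (N = 4), the random inputs, and the small-normal weight initialization are assumptions made only so the snippet runs.

```python
import numpy as np

rng = np.random.default_rng(0)

N, raw_dim, D = 4, 48, 64  # N = 4 is assumed; 48 and 64 follow the example

# Stand-in for N flattened patch tokens, one per row
x_p = rng.standard_normal((N, raw_dim)).astype(np.float32)

# Learnable parameters of the projection E (initialization here is arbitrary)
W = (rng.standard_normal((raw_dim, D)) * 0.02).astype(np.float32)  # 48 x 64
b = np.zeros(D, dtype=np.float32)                                  # 64

# Every patch is projected with the same W and b
z = x_p @ W + b
print(z.shape)  # (4, 64)
```

The same weights are shared across all patches, so the projection does not depend on where a patch came from in the image.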

3) Sequence + positional embedding

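Prepend the class token to the N projected patch embeddings, then add the positional embedding E_pos to every token. The resulting sequence has shape (N + 1) × D, and E_pos has the same shape.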
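A minimal sketch of this step, assuming N = 4 patches and D = 64: the class token is initialized to zeros and E_pos to small random values purely for illustration (in a trained model both are learned parameters), and the names `x_class`, `E_pos`, and `z0` are chosen here rather than taken from the animation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 4, 64  # assumed sizes: 4 patches, embedding dim 64

z = rng.standard_normal((N, D))                 # projected patch embeddings
x_class = np.zeros((1, D))                      # class token (learned in practice)
E_pos = rng.standard_normal((N + 1, D)) * 0.02  # one position vector per token

# Prepend the class token, then add positions element-wise
z0 = np.concatenate([x_class, z], axis=0) + E_pos
print(z0.shape)  # (5, 64)
```

Row 0 of `z0` is the class token plus its position vector; rows 1 to N are the patch embeddings plus theirs. This `z0` is what enters the Transformer encoder.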
