GLU Visual Explainer

Transformer Feed-Forward

GLU visual flow

See how a GLU feed-forward block projects each token into two streams, applies an activation to one, gates it with the other, and projects back to d_model.

Settings shown: Variant = GELU activation, d_model = 256, d_ff_gated = 384

1) Split

Each token vector of size d_model passes through one linear layer, and the output is split into two equal halves: a and gate.

Input: d_model = 256. Output: 2 x 384 = 768

Linear proj_in

W_in: 256 x 768

b_in: 768

Outputs: a branch and gate branch

shape: 2 x d_ff_gated

x -> proj_in -> [a, gate]
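The split step can be sketched in plain Python using the sizes shown above. This is a minimal illustration, not the explainer's own code: the `linear` helper, the random initialization, and the assumption that a is the first half and gate the second half of the projection are all hypothetical choices made for the example.

```python
import random

D_MODEL, D_FF_GATED = 256, 384  # sizes shown in the explainer

def linear(x, W, b):
    # W is stored as in_features x out_features, matching "W_in: 256 x 768".
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

rng = random.Random(0)  # fixed seed; the values are illustrative only
W_in = [[rng.gauss(0.0, 0.02) for _ in range(2 * D_FF_GATED)]
        for _ in range(D_MODEL)]
b_in = [0.0] * (2 * D_FF_GATED)

x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # one token vector
h = linear(x, W_in, b_in)                          # length 768
a, gate = h[:D_FF_GATED], h[D_FF_GATED:]           # assumed order: [a, gate]

print(len(h), len(a), len(gate))  # 768 384 384
```

Using one 256 x 768 projection and slicing it gives the same result as two separate 256 x 384 projections, with a single matrix multiply.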


2) Activate a

The activation is applied to the a branch only; the gate branch passes through unchanged.

a (raw) -> GELU -> a_hat

activation: GELU

a_hat = GELU(a)
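As a rough illustration of this step, the sketch below applies GELU to a few made-up values. It assumes the exact erf-based definition, GELU(v) = v * Phi(v); the explainer does not say whether it uses this form or the common tanh approximation.

```python
import math

def gelu(v):
    # Exact GELU: v * Phi(v), where Phi is the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

a = [-2.0, -0.5, 0.0, 0.5, 2.0]   # made-up raw values for the a branch
a_hat = [gelu(v) for v in a]

for raw, act in zip(a, a_hat):
    print(f"{raw:+.1f} -> {act:+.4f}")
```

Negative inputs are pushed toward zero without being clipped exactly to zero, and large positive inputs pass through almost unchanged.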


3) Gate multiply

Element-wise modulation: the gate stream multiplies a_hat element by element.

a_hat ⊙ gate -> gated

operation: a_hat ⊙ gate

gated = a_hat ⊙ gate
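A small sketch of the gate multiply, with made-up three-element vectors standing in for the 384-element branches:

```python
a_hat = [0.5, 1.0, 2.0]    # made-up activated values
gate = [2.0, 0.0, -1.0]    # made-up gate values (no activation applied)

# Element-wise product: position i of the output depends only on position i
# of each input.
gated = [u * g for u, g in zip(a_hat, gate)]
print(gated)  # [1.0, 0.0, -2.0]
```

A gate value of 0 blocks its channel entirely, while other values scale the channel or flip its sign.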


4) Project out

Project the gated vector back to d_model.

Input: d_ff_gated = 384. Output: d_model = 256

Linear proj_out

W_out: 384 x 256

b_out: 256

shape: d_model

y = proj_out(gated)
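Putting the four steps together, the whole block can be sketched as one function. This is a minimal sketch under stated assumptions rather than the explainer's implementation: the helper names (`linear`, `gelu`, `init`, `glu_ffn`), the random weights, the exact erf-based GELU, and the [a, gate] split order are all choices made for illustration.

```python
import math
import random

D_MODEL, D_FF_GATED = 256, 384  # sizes shown in the explainer

def linear(x, W, b):
    # W is stored as in_features x out_features ("256 x 768", "384 x 256").
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def gelu(v):
    # Exact GELU (assumed); a tanh approximation is also common.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def init(rows, cols, rng):
    # Hypothetical small random weights, for illustration only.
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def glu_ffn(x, W_in, b_in, W_out, b_out):
    h = linear(x, W_in, b_in)                     # 1) project to 2 * d_ff_gated
    half = len(h) // 2
    a, gate = h[:half], h[half:]                  #    split into [a, gate]
    a_hat = [gelu(v) for v in a]                  # 2) activate the a branch only
    gated = [u * g for u, g in zip(a_hat, gate)]  # 3) element-wise gate
    return linear(gated, W_out, b_out)            # 4) project back to d_model

rng = random.Random(0)
W_in, b_in = init(D_MODEL, 2 * D_FF_GATED, rng), [0.0] * (2 * D_FF_GATED)
W_out, b_out = init(D_FF_GATED, D_MODEL, rng), [0.0] * D_MODEL

x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # one token vector
y = glu_ffn(x, W_in, b_in, W_out, b_out)

# Parameter count: W_in + b_in + W_out + b_out
n_params = (D_MODEL * 2 * D_FF_GATED + 2 * D_FF_GATED
            + D_FF_GATED * D_MODEL + D_MODEL)
print(len(y), n_params)  # 256 295936
```

With these sizes the input projection holds about two thirds of the parameters, because it produces both the a and gate branches.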