Transformer Feed-Forward

GLU visual flow

See how a GLU feed-forward block projects a token into two streams, applies an activation to one, gates it with the other, and projects the result back to d_model.

1) Split

A token of width d_model passes through one linear layer, and the output is split into two equal halves: a and gate.

Input: d_model = 256
Output: 2 x 384
Linear: proj_in
W_in: 256 x 768
b_in: 768
Branches: a, gate (384 each)
Shape: 2 x d_ff_gated
x -> proj_in -> [a, gate]
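
The split step can be sketched in plain Python, using only the standard library so every value is easy to trace. The sizes (256, 384, 768) come from the figure. The random weights, the zero bias, the helper name linear, and the choice that the first half of the projection is a and the second half is gate are all assumptions made for illustration.

```python
import random

D_MODEL = 256      # input width, from the figure
D_FF_GATED = 384   # width of each branch, from the figure


def linear(x, W, b):
    """Compute x @ W + b for one token x; W is a list of rows (in x out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]


rng = random.Random(0)  # arbitrary seed; the weights are random placeholders
W_in = [[rng.gauss(0.0, 0.02) for _ in range(2 * D_FF_GATED)]
        for _ in range(D_MODEL)]          # 256 x 768
b_in = [0.0] * (2 * D_FF_GATED)           # 768
x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

h = linear(x, W_in, b_in)                 # one projection, 768 values
# Assumed ordering: first half is the a branch, second half is the gate branch.
a, gate = h[:D_FF_GATED], h[D_FF_GATED:]
print(len(h), len(a), len(gate))
```

A single 256 x 768 projection followed by a split is equivalent to two separate 256 x 384 projections, but it needs only one matrix multiply.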

2) Activate a

The activation is applied only to the a branch; the gate branch is left unchanged.

a (raw) -> GELU -> a_hat
Activation: GELU
a_hat = GELU(a)
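
As a rough illustration of this step, the sketch below applies GELU to each element of a. The figure does not say whether the exact form or the tanh approximation is used; the exact form based on the error function is assumed here, and the input values are made up.

```python
import math


def gelu(v):
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


a = [-2.0, -0.5, 0.0, 0.5, 2.0]     # hypothetical raw values of the a branch
a_hat = [gelu(v) for v in a]        # applied independently to each element

for raw, act in zip(a, a_hat):
    print(f"{raw:+.2f} -> {act:+.4f}")
```

Large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and the shape of a is preserved.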

3) Gate multiply

Element-wise modulation

Inputs: a_hat, gate
Output: gated
The gate stream multiplies a_hat element-wise.
Operation: a_hat ⊙ gate
gated = a_hat ⊙ gate
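
The element-wise product can be sketched as follows. The function name gate_multiply and the four-element vectors are hypothetical; in the figure both streams have 384 elements.

```python
def gate_multiply(a_hat, gate):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a_hat) != len(gate):
        raise ValueError("a_hat and gate must have the same length")
    return [u * g for u, g in zip(a_hat, gate)]


a_hat = [0.5, -0.1, 2.0, 0.0]    # hypothetical activated values
gate = [2.0, 3.0, 0.0, -1.5]     # hypothetical gate values
gated = gate_multiply(a_hat, gate)
print(gated)
```

A gate value of zero removes the corresponding feature entirely, while values above one amplify it. No dimensions are mixed at this step.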

4) Project out

Back to d_model

Input: d_ff_gated
Output: d_model
Linear: proj_out
W_out: 384 x 256
b_out: 256
Shape: d_model
y = proj_out(gated)
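
Putting the four steps together, a minimal end-to-end sketch in plain Python might look like the following. The sizes match the figure. The random weights, zero biases, the function names linear, gelu and glu_ffn, the exact-erf form of GELU, and the assumption that a is the first half of the projection are all illustrative choices, not details taken from the figure.

```python
import math
import random


def linear(x, W, b):
    """Compute x @ W + b for one token x; W is a list of rows (in x out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]


def gelu(v):
    """Exact GELU (assumed; the tanh approximation is also common)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def glu_ffn(x, W_in, b_in, W_out, b_out):
    h = linear(x, W_in, b_in)                      # 1) project in
    half = len(h) // 2
    a, gate = h[:half], h[half:]                   #    split (assumed ordering)
    a_hat = [gelu(v) for v in a]                   # 2) activate a
    gated = [u * g for u, g in zip(a_hat, gate)]   # 3) gate multiply
    return linear(gated, W_out, b_out)             # 4) project out


D_MODEL, D_FF_GATED = 256, 384
rng = random.Random(0)  # arbitrary seed; the weights are random placeholders


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


W_in, b_in = rand_matrix(D_MODEL, 2 * D_FF_GATED), [0.0] * (2 * D_FF_GATED)
W_out, b_out = rand_matrix(D_FF_GATED, D_MODEL), [0.0] * D_MODEL
x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

y = glu_ffn(x, W_in, b_in, W_out, b_out)
print(len(x), "->", len(y))
```

The output has the same width as the input, so the block can sit on a residual connection. A practical implementation would use a tensor library and process whole batches rather than one token at a time.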