Transformer Feed-Forward
See how a gated linear unit (GLU) feed-forward layer splits into two streams, applies an activation, gates the values, and projects back.
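A d_model-dimensional token goes through one linear layer, and the output is split into two halves: a and gate.
The activation is applied only to the a branch.
The gate modulates the activated values element-wise.
A final linear layer projects the result back to d_model.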
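
The four steps above can be sketched in plain Python for a single token. This is a minimal illustration under stated assumptions, not a reference implementation: the names glu_feed_forward, linear, and d_hidden are hypothetical, the activation is assumed to be a sigmoid (as in the original GLU), and it is applied to the a branch as the captions describe. Many implementations apply it to the gate branch instead; because the modulation is an element-wise product, the difference is only which half gets which name.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def glu_feed_forward(x, w_in, b_in, w_out, b_out, activation=sigmoid):
    # 1. One linear layer maps d_model -> 2 * d_hidden (d_hidden is an assumed name).
    h = linear(x, w_in, b_in)
    # 2. Split the result into two equal halves: a and gate.
    d_hidden = len(h) // 2
    a, gate = h[:d_hidden], h[d_hidden:]
    # 3. Activation on the a branch only; the gate branch stays linear.
    a = [activation(v) for v in a]
    # 4. Element-wise modulation of the activated values by the gate.
    gated = [ai * gi for ai, gi in zip(a, gate)]
    # 5. Project back to d_model.
    return linear(gated, w_out, b_out)


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 8
w_in = random_matrix(2 * d_hidden, d_model, rng)
b_in = [0.0] * (2 * d_hidden)
w_out = random_matrix(d_model, d_hidden, rng)
b_out = [0.0] * d_model

token = [0.1, -0.2, 0.3, 0.4]
out = glu_feed_forward(token, w_in, b_in, w_out, b_out)
print(len(out))  # 4, the same width as the input token
```

Passing a different activation (for example GELU or SiLU) in place of sigmoid gives the GEGLU and SwiGLU variants of the same layout.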