Attention Shape Checklist
A note on keeping tensor shapes straight when implementing self-attention from scratch.
When implementing attention by hand, I mostly fail on shape bookkeeping before I fail on theory.
Minimal checklist
Given input x with shape [batch, tokens, dim]:
- W_q, W_k, W_v map the last dimension
- after projection, split into heads
- transpose so attention is computed over tokens
- confirm the score matrix shape is [batch, heads, tokens, tokens]
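The checklist above can be walked through in a minimal NumPy sketch. The sizes are arbitrary and the random weights stand in for learned projections; the commented shape after each step is the point:

```python
import numpy as np

batch, tokens, dim, heads = 2, 5, 16, 4
head_dim = dim // heads

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, tokens, dim))    # [batch, tokens, dim]
W_q = rng.standard_normal((dim, dim))
W_k = rng.standard_normal((dim, dim))
W_v = rng.standard_normal((dim, dim))

# projections map the last dimension only
q = x @ W_q                                      # [batch, tokens, dim]
k = x @ W_k
v = x @ W_v

# split into heads, then transpose so attention runs over tokens
q = q.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
k = k.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
v = v.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
# q, k, v: [batch, heads, tokens, head_dim]

# scores contract head_dim, leaving a tokens x tokens matrix per head
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
assert scores.shape == (batch, heads, tokens, tokens)
```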
Common mistakes
- Forgetting to transpose the key's last two axes before the query-key matmul.
- Applying softmax over the wrong axis.
- Reshaping before confirming contiguous memory layout (e.g. PyTorch's .view requires a contiguous tensor; .reshape does not).
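The softmax-axis mistake in particular is easy to check: each query row of the attention matrix should sum to 1 over the key axis, which is the last axis of a [batch, heads, tokens, tokens] score tensor. A small sketch (the softmax helper and sizes here are illustrative, not from the original note):

```python
import numpy as np

def softmax(a, axis):
    # subtract the max for numerical stability before exponentiating
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.default_rng(1).standard_normal((2, 4, 5, 5))
attn = softmax(scores, axis=-1)   # correct: normalize over keys
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Normalizing over axis=-2 instead would also produce a tensor of the right shape, which is exactly why this bug survives shape checks and has to be caught by a sum-to-one check.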
Rule of thumb
If a transformer implementation feels confusing, write down the tensor shape after every line before debugging the math.
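One way to make that habit mechanical is a tiny assertion helper. The name expect_shape is hypothetical, not part of any library; the idea is just to fail fast with a readable message instead of a cryptic broadcasting error several lines later:

```python
import numpy as np

def expect_shape(name, tensor, expected):
    # hypothetical helper: assert the shape you wrote down in the margin
    assert tensor.shape == expected, (
        f"{name}: got {tensor.shape}, expected {expected}"
    )

x = np.zeros((2, 5, 16))
expect_shape("x", x, (2, 5, 16))   # passes silently
```

Sprinkling one of these after every projection, reshape, and transpose turns the "write down the shape after every line" rule into executable documentation.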