Attention Shape Checklist
A note on keeping tensor shapes straight when implementing self-attention from scratch.
When implementing attention by hand, I mostly fail on shape bookkeeping before I fail on theory.
Minimal checklist
Given input x with shape [batch, tokens, dim]:
- W_q, W_k, W_v map the last dimension
- after projection, split into heads
- transpose so attention is computed over tokens
- confirm the score matrix shape is [batch, heads, tokens, tokens]
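The checklist above can be walked through in a minimal NumPy sketch. The sizes are arbitrary and the random weights stand in for learned projections; the commented shape after each step is the point:

```python
import numpy as np

batch, tokens, dim, heads = 2, 5, 16, 4
head_dim = dim // heads

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, tokens, dim))    # [batch, tokens, dim]
W_q = rng.standard_normal((dim, dim))
W_k = rng.standard_normal((dim, dim))
W_v = rng.standard_normal((dim, dim))

# projections map the last dimension only
q = x @ W_q                                      # [batch, tokens, dim]
k = x @ W_k
v = x @ W_v

# split into heads, then transpose so attention runs over tokens
q = q.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
k = k.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
v = v.reshape(batch, tokens, heads, head_dim).transpose(0, 2, 1, 3)
# q, k, v: [batch, heads, tokens, head_dim]

# scores contract head_dim, leaving a tokens x tokens matrix per head
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
assert scores.shape == (batch, heads, tokens, tokens)
```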
Common mistakes
- Forgetting to transpose the key's last two axes before the query-key matmul.
- Applying softmax over the wrong axis.
- Reshaping before confirming contiguous memory layout (e.g. PyTorch's .view requires a contiguous tensor; .reshape does not).
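The softmax-axis mistake in particular is easy to check: each query row of the attention matrix should sum to 1 over the key axis, which is the last axis of a [batch, heads, tokens, tokens] score tensor. A small sketch (the softmax helper and sizes here are illustrative, not from the original note):

```python
import numpy as np

def softmax(a, axis):
    # subtract the max for numerical stability before exponentiating
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.default_rng(1).standard_normal((2, 4, 5, 5))
attn = softmax(scores, axis=-1)   # correct: normalize over keys
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Normalizing over axis=-2 instead would also produce a tensor of the right shape, which is exactly why this bug survives shape checks and has to be caught by a sum-to-one check.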
Rule of thumb
If a transformer implementation feels confusing, write down the tensor shape after every line before debugging the math.
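One way to make that habit mechanical is a tiny assertion helper. The name expect_shape is hypothetical, not part of any library; the idea is just to fail fast with a readable message instead of a cryptic broadcasting error several lines later:

```python
import numpy as np

def expect_shape(name, tensor, expected):
    # hypothetical helper: assert the shape you wrote down in the margin
    assert tensor.shape == expected, (
        f"{name}: got {tensor.shape}, expected {expected}"
    )

x = np.zeros((2, 5, 16))
expect_shape("x", x, (2, 5, 16))   # passes silently
```

Sprinkling one of these after every projection, reshape, and transpose turns the "write down the shape after every line" rule into executable documentation.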