Backprop is under the hood

In [1], the raw attention scores are divided by \(\sqrt{d_{\rho}}\); the derivation below shows where this scaling factor comes from.

First, some notation from [2]. The Frobenius inner product of two matrices can be computed by collapsing row-wise dot products or, equivalently, by summing the Hadamard product over all entries:

\[
\underbrace{\mathbf{A} \cdot \mathbf{B}}_{\substack{\text{Frobenius} \\ \text{scalar} \in \mathbb{R}}}
  = \underbrace{\sum_{\text{rows}}}_{\substack{\text{collapse} \\ \text{rows}}}
  \underbrace{\mathbf{A} \ominus \mathbf{B}}_{\substack{\text{row-wise dot} \\ \text{vector} \in \mathbb{R}^{n}}}
  = \underbrace{\sum_{\text{rows}} \sum_{\text{cols}}}_{\substack{\text{collapse} \\ \text{everything}}}
  \underbrace{\mathbf{A} \circ \mathbf{B}}_{\substack{\text{Hadamard} \\ \text{matrix} \in \mathbb{R}^{n \times d}}}
\]
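As a sanity check, here is a minimal NumPy sketch of that chain of equalities (the shapes \(n = 3\), \(d = 4\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))

frobenius = np.sum(A * B)               # A . B, a scalar
row_dots = np.einsum("ij,ij->i", A, B)  # row-wise dots, a vector in R^n
hadamard = A * B                        # Hadamard product, a matrix in R^{n x d}

# All three collapse to the same scalar.
assert np.isclose(frobenius, row_dots.sum())
assert np.isclose(frobenius, hadamard.sum())
```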
Now for the promised derivation:

\begin{align*}
  \rho^{\text{raw}}_{\alpha\beta}
  &\;=\; \mathbf{q}_\alpha \cdot \mathbf{k}_\beta
  \;=\; \sum_{i=1}^{d_\rho} (q_\alpha)_i \,(k_\beta)_i,
  \quad (q_\alpha)_i,\;(k_\beta)_i \;\stackrel{\text{i.i.d.}}{\sim}\; \mathcal{N}(0,\,1) \\[8pt]
  %
  \mu\!\left(\rho^{\text{raw}}_{\alpha\beta}\right)
  &\;=\; \mu\!\left(\sum_{i=1}^{d_\rho} (q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \mu\!\left((q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \mu\!\left((q_\alpha)_i\right) \cdot \mu\!\left((k_\beta)_i\right)
  \;=\; 0 \\[8pt]
  %
  \sigma^2\!\left(\rho^{\text{raw}}_{\alpha\beta}\right)
  &\;=\; \sigma^2\!\left(\sum_{i=1}^{d_\rho} (q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \sigma^2\!\left((q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \sigma^2\!\left((q_\alpha)_i\right) \cdot \sigma^2\!\left((k_\beta)_i\right)
  \;=\; d_\rho
\end{align*}

The mean passes through the sum by linearity; the variance passes through because the summands are independent; and both products factor, \(\mu(xy) = \mu(x)\,\mu(y)\) and \(\sigma^2(xy) = \sigma^2(x)\,\sigma^2(y)\), because the components are independent with zero mean. Hence the raw scores have standard deviation \(\sqrt{d_\rho}\), and dividing by \(\sqrt{d_\rho}\) brings them back to unit variance.

Compare the corresponding footnote in “Attention is all you need”: “To illustrate why the dot products get large, assume that the components of \(q\) and \(k\) are independent random variables with mean \(0\) and variance \(1\). Then their dot product, \(\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_{k}} q_{i} k_{i}\), has mean \(0\) and variance \(d_{k}\).”

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
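The derivation is also easy to confirm empirically. Below is a Monte Carlo sketch (the choices \(d_{\rho} = 64\) and \(10^{5}\) samples are arbitrary) showing that the raw scores have variance close to \(d_{\rho}\), and that dividing by \(\sqrt{d_{\rho}}\) restores unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
d_rho, num_samples = 64, 100_000

# i.i.d. standard-normal query/key components, one (q, k) pair per row
q = rng.standard_normal((num_samples, d_rho))
k = rng.standard_normal((num_samples, d_rho))

rho_raw = np.einsum("ij,ij->i", q, k)    # one raw dot-product score per sample

print(rho_raw.mean())                    # ~ 0
print(rho_raw.var())                     # ~ d_rho = 64
print((rho_raw / np.sqrt(d_rho)).var())  # ~ 1 after scaling
```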

Notation used above:

  - Hadamard product \(\circ\): element-wise matrix multiplication.
  - Frobenius inner product: \(\langle A, B \rangle_F\), also written \(A \cdot B\).
  - Differential of the loss: \(dL = \left\langle \frac{\partial L}{\partial W},\, dW \right\rangle_F\) (checked numerically in the sketch below).
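That last identity is the workhorse of the matrix-calculus approach: the first-order change of the loss is the Frobenius pairing of the gradient with the perturbation. Here is a finite-difference sketch for the toy loss \(L(W) = \tfrac{1}{2}\lVert XW - Y \rVert_F^2\), whose gradient is \(X^{\top}(XW - Y)\) (the loss and all shapes are illustrative assumptions, not taken from [1] or [2]):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 2))
Y = rng.standard_normal((5, 2))

def loss(W):
    R = X @ W - Y
    return 0.5 * np.sum(R * R)            # 0.5 * ||XW - Y||_F^2

grad = X.T @ (X @ W - Y)                  # dL/dW for this toy loss

dW = 1e-6 * rng.standard_normal(W.shape)  # small random perturbation
dL = loss(W + dW) - loss(W)               # actual change in the loss
dL_frob = np.sum(grad * dW)               # <dL/dW, dW>_F

print(dL, dL_frob)                        # agree to first order in ||dW||
```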

  1. Boué, Laurent. “Deep learning for pedestrians: backpropagation in Transformers.” arXiv preprint arXiv:2512.23329 (2025).
  2. Boué, Laurent. “Deep learning for pedestrians: backpropagation in CNNs.” arXiv preprint arXiv:1811.11987 (November 2018). https://arxiv.org/abs/1811.11987.
