Backprop is under the hood

In [1], the attention logits are divided by \(\sqrt{d_{\rho}}\). The derivation below shows why: with query and key components drawn i.i.d. from \(\mathcal{N}(0, 1)\), the raw score \(\rho^{\text{raw}}_{\alpha\beta}\) has mean \(0\) and variance \(d_{\rho}\), so scaling by \(1/\sqrt{d_{\rho}}\) restores unit variance.

\begin{align*}
  \rho^{\text{raw}}_{\alpha\beta}
  &\;=\; \mathbf{q}_\alpha \cdot \mathbf{k}_\beta
  \;=\; \sum_{i=1}^{d_\rho} (q_\alpha)_i \,(k_\beta)_i,
  \quad (q_\alpha)_i,\;(k_\beta)_i \;\stackrel{\text{i.i.d.}}{\sim}\; \mathcal{N}(0,\,1) \\[8pt]
  %
  \mu\!\left(\rho^{\text{raw}}_{\alpha\beta}\right)
  &\;=\; \mu\!\left(\sum_{i=1}^{d_\rho} (q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \mu\!\left((q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \mu\!\left((q_\alpha)_i\right) \cdot \mu\!\left((k_\beta)_i\right)
  \;=\; 0 \\[8pt]
  %
  \sigma^2\!\left(\rho^{\text{raw}}_{\alpha\beta}\right)
  &\;=\; \sigma^2\!\left(\sum_{i=1}^{d_\rho} (q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \sigma^2\!\left((q_\alpha)_i (k_\beta)_i\right)
  \;=\; \sum_{i=1}^{d_\rho} \sigma^2\!\left((q_\alpha)_i\right) \cdot \sigma^2\!\left((k_\beta)_i\right)
  \;=\; d_\rho
\end{align*}
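
Both splittings of the sum rely on the independence of the components across \(i\), and the product identities \(\mu(XY) = \mu(X)\,\mu(Y)\) and \(\sigma^2(XY) = \sigma^2(X)\,\sigma^2(Y)\) hold here because the factors are independent with zero mean. A quick Monte Carlo check makes the result concrete; this is a minimal NumPy sketch of my own (the dimension \(d_\rho = 64\) and sample count are arbitrary choices, not from [1]):

```python
import numpy as np

# Monte Carlo check of the derivation above (my own sketch, not from [1]):
# raw attention scores between i.i.d. standard-normal queries and keys
# should have mean ~0 and variance ~d_rho, so dividing by sqrt(d_rho)
# restores unit variance.
rng = np.random.default_rng(0)
d_rho = 64            # head dimension (arbitrary choice)
n_samples = 200_000   # number of independent (q, k) pairs

q = rng.standard_normal((n_samples, d_rho))
k = rng.standard_normal((n_samples, d_rho))

rho_raw = np.sum(q * k, axis=1)               # rho_raw = q . k for each pair
print(rho_raw.mean())                         # ~ 0
print(rho_raw.var())                          # ~ d_rho = 64
print((rho_raw / np.sqrt(d_rho)).var())       # ~ 1
```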

Notation used throughout: the Hadamard product \(A \circ B\) denotes element-wise matrix multiplication; the Frobenius inner product \(\langle A, B \rangle_F\) (also written \(A \cdot B\)) is \(\sum_{i,j} A_{ij} B_{ij}\); and the differential of the loss with respect to a weight matrix is \(dL = \langle \frac{\partial L}{\partial W}, dW \rangle_F\).
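
Since \(\langle A, B \rangle_F\) is just the sum of the entries of \(A \circ B\), the identity \(dL = \langle \frac{\partial L}{\partial W}, dW \rangle_F\) can be verified numerically. Here is a minimal sketch on a toy quadratic loss \(L(W) = \lVert Wx - y \rVert^2\) (the loss and the shapes are my own illustrative choices, not from [1]):

```python
import numpy as np

# Numerical check of dL ~= <dL/dW, dW>_F on a toy quadratic loss
# L(W) = ||W x - y||^2 (illustrative choice, not from [1]).
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
y = rng.standard_normal(3)

def loss(W):
    r = W @ x - y
    return r @ r

grad = 2.0 * np.outer(W @ x - y, x)        # dL/dW for this particular loss

dW = 1e-6 * rng.standard_normal(W.shape)   # small perturbation of W
lhs = loss(W + dW) - loss(W)               # actual change in L
rhs = np.sum(grad * dW)                    # <dL/dW, dW>_F = sum of the Hadamard product
print(lhs, rhs)                            # agree to first order in dW
```

The two printed values agree to first order in \(dW\), which is exactly what the Frobenius-inner-product form of the differential asserts.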

  1. Boué, Laurent. “Deep learning for pedestrians: backpropagation in Transformers.” arXiv preprint arXiv:2512.23329 (2025).
