Weight-sparse transformers have interpretable circuits

Transform: encoder and decoder, from tokens to embeddings and back to tokens | analogous to electricity to magnetics to electricity, or to a Fourier transform | LLM Visualization

Overall Setup

Superposition5

Sparse Model Training

Sparse models contain small, disentangled circuits that are both understandable and sufficient to perform a given behavior; a circuit is defined as a set of nodes connected by edges.

Circuit: a set of nodes connected by edges | node: an individual neuron | edge: a non-zero entry in a weight matrix.

\(L_{0}\) norm constraint on the weights: see A.2, Weight-Sparse Model Optimization.
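A minimal sketch of enforcing a weight \(L_0\) budget via magnitude top-k projection; the paper's exact optimization procedure is in its Appendix A.2, and the function name and per-matrix budget here are assumptions:

```python
import torch

def project_weights_topk(weight: torch.Tensor, l0_budget: int) -> torch.Tensor:
    """Keep only the l0_budget largest-magnitude entries of a weight matrix,
    zeroing the rest (a sketch of an L0 constraint on weights, not the paper's
    exact projection or annealing schedule)."""
    flat = weight.flatten()
    if l0_budget >= flat.numel():
        return weight
    # Indices of the top-l0_budget entries by absolute value.
    topk = torch.topk(flat.abs(), l0_budget).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[topk] = True
    return (flat * mask).reshape(weight.shape)
```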

Activation: AbsTopK, which zeroes out all but the \(k\) largest values by magnitude.
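A small PyTorch sketch of AbsTopK as described above (the function name is mine; the paper may implement it differently):

```python
import torch

def abs_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries along the last dim, zero the rest."""
    idx = torch.topk(x.abs(), k, dim=-1).indices
    out = torch.zeros_like(x)
    return out.scatter(-1, idx, x.gather(-1, idx))
```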

Root Mean Square Layer Normalization (RMSNorm6): re-scaling invariance keeps the output representations intact when both inputs and weights are randomly scaled; re-centering (which RMSNorm drops) is what makes a model insensitive to shift noise on inputs and weights.
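A compact RMSNorm sketch along the lines of Zhang & Sennrich (2019): rescale by the root mean square of the input, with no mean subtraction, then apply a learned gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescale by the RMS of the input, no re-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms
```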

Attention Sinks7: initial tokens that attract a disproportionately large share of attention mass.

Training (A.2, Weight-Sparse Model Optimization): AdamW with weight-decay (\(L_{2}\)) regularization | anneal the \(L_{0}\) constraint linearly over the first 50% of training | sharkfin learning-rate schedule, with a larger learning rate for sparser models | gradient clipping: clip the root-mean-square of the gradient to 1.
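The gradient-clipping note can be sketched as a global rescale of the gradients so their root-mean-square stays at or below 1 (a sketch under that reading, not necessarily the paper's exact implementation):

```python
import torch

@torch.no_grad()
def clip_grad_rms_(parameters, max_rms: float = 1.0) -> None:
    """Rescale all gradients in place so their global root-mean-square is at most max_rms."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    total_sq = sum(g.pow(2).sum() for g in grads)
    total_n = sum(g.numel() for g in grads)
    rms = (total_sq / total_n).sqrt()
    if rms > max_rms:
        scale = max_rms / rms
        for g in grads:
            g.mul_(scale)
```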

Pruning

After pretraining, prune nodes to find the minimal circuit for a specific task:

Tasks: Python-syntax-based next-token binary prediction tasks (see the tasks table), e.g. set_or_string.
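A hedged sketch of task-specific pruning: greedily zero out nodes and keep each removal only if task loss barely changes. The `task_loss` callback, node granularity, and tolerance are assumptions, not the paper's exact procedure.

```python
def prune_nodes(model, task_loss, node_masks, tol: float = 1e-3):
    """Greedy pruning sketch: disable one node at a time and keep the change
    only if the task loss increases by less than tol.

    node_masks: list of (mask_tensor, index) handles, one per candidate node,
                where the model multiplies each node's activation by its mask entry.
    task_loss:  callable model -> float, evaluated on the task dataset.
    """
    base = task_loss(model)
    for mask, idx in node_masks:
        old = mask[idx].item()
        mask[idx] = 0.0                      # tentatively remove the node
        if task_loss(model) > base + tol:
            mask[idx] = old                  # restore if the task degrades
    return model
```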

Bridges

Bridges: start from an existing dense model and train a sparse model connected to it via bridges | dense-to-sparse encoder: a linear map followed by AbsTopK on activations | sparse-to-dense decoder: a linear map back to dense activations | the sparse model is trained jointly with the encoder/decoder from the dense model | the autoencoder's latent space is the sparse model's residual activations.

  • Loss: bridging-loss terms + pretraining loss
  • \(\mathcal{L}_{\mathrm{NMSE}}\): normalized mean squared error | \(\mathcal{L}_{\mathrm{KL, s\to d}}\) and \(\mathcal{L}_{\mathrm{KL, d\to s}}\): KL-divergence terms between the sparse and dense models (sketched after this list)
  • Autoencoder9
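A minimal sketch of the bridging terms named above: normalized MSE between dense activations and their reconstruction through the bridge, plus KL terms between the two models' next-token distributions. Function names, weights, and the exact KL directions are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def nmse(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Normalized mean squared error between reconstructed and target activations."""
    return (recon - target).pow(2).mean() / target.pow(2).mean().clamp_min(1e-8)

def kl_logits(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between next-token distributions given logits."""
    return F.kl_div(F.log_softmax(q_logits, dim=-1),
                    F.log_softmax(p_logits, dim=-1),
                    log_target=True, reduction="batchmean")

def bridge_loss(dense_act, recon_act, dense_logits, sparse_logits,
                w_nmse=1.0, w_sd=1.0, w_ds=1.0):
    """Sum of the bridging terms: L_NMSE + KL(s->d) + KL(d->s).
    Weights are placeholders; the paper adds these terms to the pretraining loss."""
    return (w_nmse * nmse(recon_act, dense_act)
            + w_sd * kl_logits(sparse_logits, dense_logits)
            + w_ds * kl_logits(dense_logits, sparse_logits))
```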

Task loss of sparse vs. dense models | plot of interpretability versus capability

Limitation

  • Inefficiency: unstructured weight sparsity is not well suited to GPUs, and dead neurons waste compute.

Feature quality → activation sparsity: \(L_{0}\) weight sparsity also induces activation sparsity.

Twitter: https://twitter.com/seoscottsdale/status/1998787418904998054

HackerNews: https://news.ycombinator.com/item?id=45926371

Hardware:

https://huggingface.co/openai/circuit-sparsity

Sparse Attention Post-Training for Mechanistic Interpretability10

  1. Understanding neural networks through sparse circuits
  2. Gao, Leo, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, and Dan Mossing. “Weight-sparse transformers have interpretable circuits.” arXiv preprint arXiv:2511.13653 (2025).
  3. GitHub
  4. Hugging Face
  5. Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds et al. “Toy models of superposition.” arXiv preprint arXiv:2209.10652 (2022).
  6. Zhang, Biao, and Rico Sennrich. “Root mean square layer normalization.” Advances in Neural Information Processing Systems 32 (2019).
  7. Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. “Efficient streaming language models with attention sinks.” arXiv preprint arXiv:2309.17453 (2023).
  8. Fetterman, Abraham J., Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, and Kanjun Qiu. “Tune as you scale: Hyperparameter optimization for compute efficient training.” arXiv preprint arXiv:2306.08055 (2023).
  9. Tschannen, Michael, Olivier Bachem, and Mario Lucic. “Recent advances in autoencoder-based representation learning.” arXiv preprint arXiv:1812.05069 (2018).
  10. Draye, Florent, Anson Lei, Ingmar Posner, and Bernhard Schölkopf. “Sparse Attention Post-Training for Mechanistic Interpretability.” arXiv preprint arXiv:2512.05865 (2025).
