Weight-sparse transformers have interpretable circuits

Transform: encoder and decoder, from tokens to embeddings and back to tokens | analogous to electricity to magnetics to electricity, or to a Fourier transform | LLM Visualization

Overall Setup

Superposition5

Sparse Model Training

Sparse models contain small, disentangled circuits that are both understandable and sufficient to perform a given behavior; a circuit is defined as a set of nodes connected by edges.

Circuit: a set of nodes connected by edges | node: an individual neuron | edge: a non-zero entry in a weight matrix.

\(L_{0}\) norm constraint on the weights: see A.2, Weight-Sparse Model Optimization.
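A minimal sketch of enforcing a weight \(L_0\) budget via magnitude top-k projection; the paper's exact optimization procedure is in its Appendix A.2, and the function name and per-matrix budget here are assumptions:

```python
import torch

def project_weights_topk(weight: torch.Tensor, l0_budget: int) -> torch.Tensor:
    """Keep only the l0_budget largest-magnitude entries of a weight matrix,
    zeroing the rest (a sketch of an L0 constraint on weights, not the paper's
    exact projection or annealing schedule)."""
    flat = weight.flatten()
    if l0_budget >= flat.numel():
        return weight
    # Indices of the top-l0_budget entries by absolute value.
    topk = torch.topk(flat.abs(), l0_budget).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[topk] = True
    return (flat * mask).reshape(weight.shape)
```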

Activation: AbsTopK, which zeroes out all but the \(k\) largest values by magnitude.
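A small PyTorch sketch of AbsTopK as described above (the function name is mine; the paper may implement it differently):

```python
import torch

def abs_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries along the last dim, zero the rest."""
    idx = torch.topk(x.abs(), k, dim=-1).indices
    out = torch.zeros_like(x)
    return out.scatter(-1, idx, x.gather(-1, idx))
```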

Root Mean Square Layer Normalization (RMSNorm6): re-scaling invariance keeps the output representations intact when both inputs and weights are randomly scaled; re-centering (which RMSNorm drops) is what makes a model insensitive to shift noise on inputs and weights.
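A compact RMSNorm sketch along the lines of Zhang & Sennrich (2019): rescale by the root mean square of the input, with no mean subtraction, then apply a learned gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescale by the RMS of the input, no re-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms
```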

Attention Sinks7: initial tokens that attract a disproportionately large share of attention mass.

Training (A.2, Weight-Sparse Model Optimization): AdamW with weight-decay (\(L_{2}\)) regularization | anneal the \(L_{0}\) constraint linearly over the first 50% of training | sharkfin learning-rate schedule, with a larger learning rate for sparser models | gradient clipping: clip the root-mean-square of the gradient to 1.
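The gradient-clipping note can be sketched as a global rescale of the gradients so their root-mean-square stays at or below 1 (a sketch under that reading, not necessarily the paper's exact implementation):

```python
import torch

@torch.no_grad()
def clip_grad_rms_(parameters, max_rms: float = 1.0) -> None:
    """Rescale all gradients in place so their global root-mean-square is at most max_rms."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    total_sq = sum(g.pow(2).sum() for g in grads)
    total_n = sum(g.numel() for g in grads)
    rms = (total_sq / total_n).sqrt()
    if rms > max_rms:
        scale = max_rms / rms
        for g in grads:
            g.mul_(scale)
```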

Pruning

After pretraining, prune nodes to find the minimal circuit for a specific task:

Tasks: Python-syntax-based next-token binary prediction tasks (see the tasks table), e.g. set_or_string.
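A hedged sketch of task-specific pruning: greedily zero out nodes and keep each removal only if task loss barely changes. The `task_loss` callback, node granularity, and tolerance are assumptions, not the paper's exact procedure.

```python
def prune_nodes(model, task_loss, node_masks, tol: float = 1e-3):
    """Greedy pruning sketch: disable one node at a time and keep the change
    only if the task loss increases by less than tol.

    node_masks: list of (mask_tensor, index) handles, one per candidate node,
                where the model multiplies each node's activation by its mask entry.
    task_loss:  callable model -> float, evaluated on the task dataset.
    """
    base = task_loss(model)
    for mask, idx in node_masks:
        old = mask[idx].item()
        mask[idx] = 0.0                      # tentatively remove the node
        if task_loss(model) > base + tol:
            mask[idx] = old                  # restore if the task degrades
    return model
```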

Bridges

Bridges: start from an existing dense model and train a sparse model connected to it via bridges | dense-to-sparse encoder: a linear map followed by AbsTopK on activations | sparse-to-dense decoder: a linear map back to dense activations | the sparse model is trained jointly with the encoder/decoder from the dense model | the autoencoder's latent space is the sparse model's residual activations.

  • Loss: bridging-loss terms + pretraining loss
  • \(\mathcal{L}_{\mathrm{NMSE}}\): normalized mean squared error | \(\mathcal{L}_{\mathrm{KL, s\to d}}\) and \(\mathcal{L}_{\mathrm{KL, d\to s}}\): KL-divergence terms between the sparse and dense models (sketched after this list)
  • Autoencoder9
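A minimal sketch of the bridging terms named above: normalized MSE between dense activations and their reconstruction through the bridge, plus KL terms between the two models' next-token distributions. Function names, weights, and the exact KL directions are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def nmse(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Normalized mean squared error between reconstructed and target activations."""
    return (recon - target).pow(2).mean() / target.pow(2).mean().clamp_min(1e-8)

def kl_logits(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between next-token distributions given logits."""
    return F.kl_div(F.log_softmax(q_logits, dim=-1),
                    F.log_softmax(p_logits, dim=-1),
                    log_target=True, reduction="batchmean")

def bridge_loss(dense_act, recon_act, dense_logits, sparse_logits,
                w_nmse=1.0, w_sd=1.0, w_ds=1.0):
    """Sum of the bridging terms: L_NMSE + KL(s->d) + KL(d->s).
    Weights are placeholders; the paper adds these terms to the pretraining loss."""
    return (w_nmse * nmse(recon_act, dense_act)
            + w_sd * kl_logits(sparse_logits, dense_logits)
            + w_ds * kl_logits(dense_logits, sparse_logits))
```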

Task loss of sparse vs. dense models | plot of interpretability versus capability

Limitation

  • Inefficiency: unstructured weight sparsity is not well suited to GPUs, and dead neurons waste compute.

Feature quality → activation sparsity: \(L_{0}\) weight sparsity also induces activation sparsity.

Twitter: https://twitter.com/seoscottsdale/status/1998787418904998054

HackerNews: https://news.ycombinator.com/item?id=45926371

Hardware:

https://huggingface.co/openai/circuit-sparsity

Sparse Attention Post-Training for Mechanistic Interpretability10

  1. Understanding neural networks through sparse circuits
  2. Gao, Leo, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, and Dan Mossing. “Weight-sparse transformers have interpretable circuits.” arXiv preprint arXiv:2511.13653 (2025).
  3. GitHub
  4. Hugging Face
  5. Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds et al. “Toy models of superposition.” arXiv preprint arXiv:2209.10652 (2022).
  6. Zhang, Biao, and Rico Sennrich. “Root mean square layer normalization.” Advances in Neural Information Processing Systems 32 (2019).
  7. Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. “Efficient streaming language models with attention sinks.” arXiv preprint arXiv:2309.17453 (2023).
  8. Fetterman, Abraham J., Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, and Kanjun Qiu. “Tune as you scale: Hyperparameter optimization for compute efficient training.” arXiv preprint arXiv:2306.08055 (2023).
  9. Tschannen, Michael, Olivier Bachem, and Mario Lucic. “Recent advances in autoencoder-based representation learning.” arXiv preprint arXiv:1812.05069 (2018).
  10. Draye, Florent, Anson Lei, Ingmar Posner, and Bernhard Schölkopf. “Sparse Attention Post-Training for Mechanistic Interpretability.” arXiv preprint arXiv:2512.05865 (2025).
