Weight-sparse transformers have interpretable circuits1234: train models whose weights are sparse, prune them into minimal circuits to study interpretability, and connect the sparse and dense models via bridges.
Transformer: encoder and decoder, from tokens to embeddings and back to tokens | analogous to an electrical transformer going from electricity to magnetism and back to electricity | Fourier Transform | LLM Visualization
Superposition 5
Sparse Model Training
Sparse models contain small, disentangled circuits that are both understandable and sufficient to perform the behavior.
Circuit: a set of nodes connected by edges | node: an individual neuron | edge: a non-zero entry in a weight matrix.
Weight sparsity via an \(L_{0}\) norm constraint: A.2. Weight-Sparse Model Optimization
Activation function: AbsTopK zeroes out all but the \(k\) largest values by magnitude.
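A minimal PyTorch sketch of AbsTopK as described above (the per-layer \(k\) values used in the paper are not reproduced here):

```python
import torch

def abs_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries along the last dim; zero out the rest."""
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask

x = torch.tensor([[0.3, -2.0, 0.1, 1.5]])
print(abs_topk(x, k=2))  # tensor([[ 0.0000, -2.0000,  0.0000,  1.5000]])
```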
Root Mean Square Layer Normalization (RMSNorm6): re-scaling invariance keeps the output representations intact when both inputs and weights are randomly scaled; re-centering makes the model insensitive to shift noise in both inputs and weights.
- internal covariate shift: a layer’s input distribution changes as previous layers are updated.
- ensure that 0 has a privileged meaning in the residual stream: RMSNorm does not adjust the mean | fold all normalization weights into the MLP/attention weights without altering the weight \(L_{0}\): \(y = W(\gamma \odot x) = W\,\mathrm{diag}(\gamma)\,x = W'x\), with \(W' = W\,\mathrm{diag}(\gamma)\) (sketched below).
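A quick numeric check of the folding step, assuming an RMSNorm with gain \(\gamma\) and no bias followed by a weight matrix \(W\); a PyTorch sketch, not the paper's code:

```python
import torch

torch.manual_seed(0)
d = 4
x = torch.randn(d)
gamma = torch.randn(d)   # RMSNorm gain
W = torch.randn(3, d)    # the following MLP/attention weight matrix

def rmsnorm(x, eps=1e-6):
    # no re-centering: zero stays meaningful in the residual stream
    return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

y = W @ (gamma * rmsnorm(x))        # original: y = W (gamma ⊙ RMSNorm(x))
W_folded = W * gamma                # W' = W diag(gamma): rescales columns, leaves weight L0 unchanged
y_folded = W_folded @ rmsnorm(x)

print(torch.allclose(y, y_folded))  # True
```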
Attention Sinks7:
Training (A.2. Weight-Sparse Model Optimization): AdamW with weight decay (\(L_{2}\) regularization) | anneal the \(L_{0}\) constraint linearly over the first 50% of training | sharkfin learning-rate schedule: larger LR for sparser models | gradient clipping: clip the root-mean-square of the gradient to 1 (sketched below).
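A sketch of the gradient-clipping rule (rescale the gradient so its root-mean-square is at most 1), plus an assumed shape for the linear \(L_{0}\) anneal; function names and schedule details are illustrative, not taken from the paper's code:

```python
import torch

def clip_grad_rms(params, max_rms: float = 1.0):
    """Rescale all gradients so their global root-mean-square is at most max_rms."""
    grads = [p.grad for p in params if p.grad is not None]
    n = sum(g.numel() for g in grads)
    rms = torch.sqrt(sum(g.pow(2).sum() for g in grads) / n)
    if rms > max_rms:
        for g in grads:
            g.mul_(max_rms / rms)
    return rms

def nonzero_frac_target(step: int, total_steps: int, final_frac: float) -> float:
    """Assumed linear anneal: start dense, reach the target nonzero fraction
    at 50% of training, then hold it for the rest of training."""
    progress = min(step / (0.5 * total_steps), 1.0)
    return 1.0 + (final_frac - 1.0) * progress
```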
Pruning
Prune nodes after pretraining to obtain a minimal circuit for a specific task:
- structured pruning algorithm: learn masks that gate each node location, \(x_i \mapsto x_i \odot \mathrm{Heaviside}(\tau_i) \approx x_i \odot \sigma(\tau_i)\) | backprop through the Heaviside step with a sigmoid-derivative surrogate gradient (sketched after this list) | loss: task cross-entropy + number of nonzero mask entries | mask discretization: threshold at 0.15 | hyperparameter tuning: Cost-Aware Pareto-Region Bayesian Search (CARBS8) | A.5. Pruning algorithm
- mask = 0 performs mean ablation: the activation is frozen at its mean over the pretraining distribution. | E. Validating circuit hypotheses with mean ablations | A.6. Dataset for pretraining
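A minimal sketch of the mask gating with a sigmoid-derivative surrogate gradient and mean ablation for masked nodes; the penalty coefficient and training-loop details are illustrative assumptions:

```python
import torch

class HeavisideSTE(torch.autograd.Function):
    """Forward: hard 0/1 gate on the mask logit tau.
    Backward: sigmoid derivative as the surrogate gradient."""

    @staticmethod
    def forward(ctx, tau):
        ctx.save_for_backward(tau)
        return (tau > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (tau,) = ctx.saved_tensors
        s = torch.sigmoid(tau)
        return grad_out * s * (1 - s)

def gate_nodes(x, tau, mean_act):
    """Kept nodes pass through; masked nodes are frozen at their mean activation (mean ablation)."""
    m = HeavisideSTE.apply(tau)
    return m * x + (1 - m) * mean_act

def mask_size_penalty(tau, coeff=1e-3):
    """Differentiable proxy for the number of nonzero mask entries (coefficient is illustrative)."""
    return coeff * torch.sigmoid(tau).sum()
```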
Tasks: Python-syntax-based binary next-token prediction tasks (see the tasks table), e.g. set_or_string.
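A hedged sketch of how such a binary next-token task might be scored, by comparing the logits of two candidate next tokens; the model interface and token ids here are hypothetical stand-ins, not the paper's exact task definitions:

```python
def binary_task_accuracy(next_token_logits_fn, prompts, labels, token_a: int, token_b: int):
    """Score a binary next-token task by comparing the logits of two candidate tokens.
    next_token_logits_fn, token_a, and token_b are hypothetical stand-ins for the
    model interface and the two possible next tokens of a task like set_or_string."""
    correct = 0
    for prompt, label in zip(prompts, labels):
        logits = next_token_logits_fn(prompt)   # logits over the vocabulary for the next token
        pred = 0 if logits[token_a] > logits[token_b] else 1
        correct += int(pred == label)
    return correct / len(labels)
```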
Bridges
Bridges: start from an existing dense model and train a sparse model alongside it, connected by bridges between their activations | encoder: linear map + AbsTopK from dense activations to sparse activations | decoder: linear map from sparse activations back to dense | together they form an autoencoder whose latent space is the sparse model’s residual activations (sketched after the list below).
- Loss: bridging loss terms + pretraining loss
- \(\mathcal{L}_{\mathrm{NMSE}}\): normalized mean squared error on reconstructed activations | \(\mathcal{L}_{\mathrm{KL},\,s\to d}\) and \(\mathcal{L}_{\mathrm{KL},\,d\to s}\): KL divergence between the two models’ output distributions, in each direction
- Autoencoder9
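A sketch of the bridge maps and losses above, assuming dense and sparse residual widths d_dense and d_sparse; layer placement, loss weights, and the exact KL formulation are simplified assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bridge(nn.Module):
    """Encoder: linear map + AbsTopK from dense activations to sparse activations.
    Decoder: linear map from sparse activations back to dense activations."""

    def __init__(self, d_dense: int, d_sparse: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_dense, d_sparse, bias=False)
        self.dec = nn.Linear(d_sparse, d_dense, bias=False)
        self.k = k

    def encode(self, h_dense):
        z = self.enc(h_dense)
        idx = z.abs().topk(self.k, dim=-1).indices
        mask = torch.zeros_like(z, dtype=torch.bool).scatter_(-1, idx, True)
        return z * mask                      # AbsTopK: keep only the k largest-magnitude entries

    def decode(self, h_sparse):
        return self.dec(h_sparse)

def nmse(pred, target, eps=1e-8):
    """Normalized mean squared error between reconstructed and target activations."""
    return (pred - target).pow(2).sum() / (target.pow(2).sum() + eps)

def kl(p_logits, q_logits):
    """KL(p || q) between next-token distributions; logits have shape (batch, vocab)."""
    return F.kl_div(F.log_softmax(q_logits, -1), F.log_softmax(p_logits, -1),
                    reduction="batchmean", log_target=True)
```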
Task loss of sparse vs. dense models | plot of interpretability versus capability
Limitation
- Inefficiency: unstructured weight sparsity is not GPU-friendly, and the model has many dead neurons.
Feature quality → activation sparsity: weight \(L_{0}\) sparsity induces activation sparsity.
https://twitter.com/seoscottsdale/status/1998787418904998054
HackerNews https://news.ycombinator.com/item?id=45926371
Hardware:
https://huggingface.co/openai/circuit-sparsity
Sparse Attention Post-Training for Mechanistic Interpretability10
- Understanding neural networks through sparse circuits[↩]
- Gao, Leo, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, and Dan Mossing. “Weight-sparse transformers have interpretable circuits.” arXiv preprint arXiv:2511.13653 (2025).[↩]
- GitHub[↩]
- Hugging Face[↩]
- Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds et al. “Toy models of superposition.” arXiv preprint arXiv:2209.10652 (2022).[↩]
- Zhang, Biao, and Rico Sennrich. “Root mean square layer normalization.” Advances in neural information processing systems 32 (2019).[↩]
- Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. “Efficient streaming language models with attention sinks.” arXiv preprint arXiv:2309.17453 (2023).[↩]
- Fetterman, Abraham J., Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, and Kanjun Qiu. “Tune as you scale: Hyperparameter optimization for compute efficient training.” arXiv preprint arXiv:2306.08055 (2023).[↩]
- Tschannen, Michael, Olivier Bachem, and Mario Lucic. “Recent advances in autoencoder-based representation learning.” arXiv preprint arXiv:1812.05069 (2018).[↩]
- Draye, Florent, Anson Lei, Ingmar Posner, and Bernhard Schölkopf. “Sparse Attention Post-Training for Mechanistic Interpretability.” arXiv preprint arXiv:2512.05865 (2025).[↩]