FRESH

Hacker News

Home

Weight-sparse transformers have interpretable circuits [pdf]

75 points by 0x79de

by oli5679

1 subcomments

This ties directly into the superposition theory.
It is believed dense models cram many features into shared weights, making circuits hard to interpret.
Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.

by lambdaone

5 subcomments

I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?

by m_ke

5 subcomments

We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors but current hardware only excels at dense GMMs and structured sparsity.

by edvardas

0 subcomment

HTML version: https://arxiv.org/html/2511.13653v1

by robrenaud

0 subcomment

I worked on a similiar problem about a year ago, on large dense models.
https://www.lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sp...
In both cases, the goal is to actually learn a concrete circuit inside a network that solves specific Python next-token prediction tasks. We each end up with a crisp wiring diagram saying “these are the channels/neurons/heads that implement this particular bit of Python reasoning.”
Both projects cast circuit discovery as a gradient-based selection problem over a fixed base model. We train a mask that picks out a sparse subset of computational nodes as “the circuit,” while the rest are ablated. Their work learns masks over a weight-sparse transformer; ours learns masks over SAE latents and residual channels. But in both cases, the key move is the same: use gradients to optimize which nodes are included, rather than relying purely on heuristic search or attribution patching. Both approaches also use a gradual hardening schedule (continuous masks that are annealed or sharpened over time) so that we can keep gradients useful early on, then spend extra compute to push the mask towards a discrete, minimal circuit that still reproduces the model’s behavior.
The similarities extend to how we validate and stress-test the resulting circuits. In both projects, we drill down enough to notice “bugs” or quirks in the learned mechanism and to deliberately break it: by making simple, semantically small edits to the Python source, we can systematically cause the pruned circuit to fail and those failures generalize to the unpruned network. That gives us some confidence that we’re genuinely capturing the specific mechanism the model is using.

by Xmd5a

1 subcomments

Related:
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning – https://arxiv.org/pdf/2505.17117 (Lecun/Jurafsky)
> Large Language Models (LLMs) demonstrate striking linguistic capabilities that suggest semantic understanding (Singh et al., 2024; Li et al., 2024). Yet, a critical question remains unanswered: Do 1arXiv:2505.17117v5 [cs.CL] 25 Sep 2025LLMs navigate the compression-meaning trade-off similarly to humans, or do they employ fundamentally different representational strategies? This question matters because true understanding, which goes beyond surface-level mimicry, requires representations that balance statistical efficiency with semantic richness (Tversky, 1977; Rosch, 1973b).
> To address this question, we apply Rate-Distortion Theory (Shannon, 1948) and Information Bottleneck principles (Tishby et al., 2000) to systematically compare LLM and human conceptual structures. We digitize and release seminal cognitive psychology datasets (Rosch, 1973b; 1975; McCloskey & Glucksberg, 1978), which are foundational studies that shaped our understanding of human categorization but were previously unavailable in a machine-readable form. These benchmarks, comprising 1,049 items across 34 categories with both membership and typicality ratings, offer unprecedented empirical grounding for evaluating whether LLMs truly understand concepts as humans do. It also offers much better quality data than the current crowdsourcing paradigm.
From typicality tests in the paper above, we can jump to:
The Guppy Effect as Interference – https://arxiv.org/abs/1208.2362
> One can refer to the situation wherein people estimate the typicality of an exemplar of the concept combination as more extreme than it is for one of the constituent concepts in a conjunctive combination as overextension. One can refer to the situation wherein people estimate the typicality of the exemplar for the concept conjunction as higher than that of both constituent concepts as double overextension. We posit that overextension is not a violation of the classical logic of conjunction, but that it signals the emergence of a whole new concept. The aim of this paper is to model the Guppy Effect as an interference effect using a mathematical representation in a complex Hilbert space and the formalism of quantum theory to represent states and calculate probabilities. This builds on previous work that shows that Bell Inequalities are violated by concepts [7, 8] and in particular by concept combinations that exhibit the Guppy Effect [1, 2, 3, 9, 10], and add to the investigation of other approaches using interference effects in cognition [11, 12, 13].
And from quantum interferences
Quantum-like contextuality in large language models – https://royalsocietypublishing.org/doi/epdf/10.1098/rspa.202...
> This paper provides the ﬁrst large-scale experimental evidence for contextuality in the large language model BERT. We constructed a linguistic schema modelled over a contextual quantum scenario, instantiated it in the Simple English Wikipedia, and extracted probability distributions for the instances. This led to the discovery of sheaf-contextual and CbD contextual instances. We prove that these contextual instances arise from semantically similar words by deriving an equation that relates degrees of contextuality to the Euclidean distance of BERT’s embedding vectors.
How can large language models become more human – https://discovery.ucl.ac.uk/id/eprint/10196296/1/2024.cmcl-1...
> Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal.

by peter_d_sherman

0 subcomment

>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.
Sparse circuits are defined as a set of nodes connected by edges."
...which could also be considered/viewed as Graphs...
(Then from earlier in the paper):
>"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.
And (jumping around a bit more in the paper):
>"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."
A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...)
Anyway, great paper!