- In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.
It made me cautiously optimistic that all of Anthropic’s work on alignment, done for AI safety, is actually the cause of Claude Code’s comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I’m suddenly really interested in experiments like this.
by brendanashworth
1 subcomment
- Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.
[1] https://shap.readthedocs.io/en/latest/
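For readers unfamiliar with what SHAP computes: it approximates Shapley values, which split a model's score fairly across input features. A from-scratch sketch of the exact computation on a toy token-level scoring function (the tokens and `score` function here are made up for illustration; real SHAP uses sampling and model-specific approximations, not this brute-force loop):

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, score):
    """Exact Shapley attribution of score() across token indices.
    score(subset) -> float, where subset is a tuple of kept token indices."""
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                # weight for a coalition of size |S| out of n players
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (score(S + (i,)) - score(S))
    return phi

# Toy "model": 1.0 iff both "not" and "bad" survive (an interaction),
# plus 0.5 whenever "good" survives.
tokens = ["not", "bad", "good"]
def score(subset):
    kept = {tokens[j] for j in subset}
    return (1.0 if {"not", "bad"} <= kept else 0.0) + (0.5 if "good" in kept else 0.0)

phi = shapley_values(tokens, score)
# Efficiency property: the attributions sum to score(all) - score(empty) = 1.5,
# and the "not"/"bad" interaction is split evenly between the two tokens.
```

The combinatorial cost is exponential in the number of tokens, which is part of why this is rarely applied directly to long LLM contexts.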
- Looks neat and original, congrats!
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to arXiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
- This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it's either solved, or too far out of reach to bother stopping and thinking about.
by kamranjon
1 subcomment
- I'm really interested in using this, but I wonder if the unique architecture means it won't be convertible to a GGUF and used by Ollama or llama.cpp? I certainly understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local AI server (basically just Ollama + Tailscale) and see how it works as a regular model.
- Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them.
Token-level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.
by killerstorm
1 subcomment
- This seems too coarse-grained to be useful: all science-y content will read as "analytical" and be associated with sources like arXiv.
But there might be bad, malicious articles on arXiv too, so the attribution doesn't really say anything about veracity.
Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.
- The one big thing missing from LLMs is the ability to express how confident they are in the truth of what they’re saying.
Perhaps this could be a step in that direction, if we can associate the attribution with the likelihood of being true; e.g., arXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I’m guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
by crimsonnoodle58
3 subcomments
- So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?
by deepdarkforest
2 subcomments
- Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.
How granular can you get the source data attribution? Down to, let's say, individual Wikipedia topics? Probably not URLs?
Would be interested to see this scale to 30/70B.
by great_psy
1 subcomment
- Maybe I’m not creative enough to see the potential, but what value does this bring?
Given the example I saw about CRISPR, what does this model give in its output over a different, non-explaining model?
Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
- This seems really interesting. While Anthropic used dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
by rippeltippel
1 subcomment
- Also featured on TechCrunch: https://news.ycombinator.com/item?id=47129292
by in-silico
1 subcomment
- Either I'm missing something or this is way overstated.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
by potato-peeler
1 subcomment
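For concreteness, here is a minimal sketch (assuming PyTorch) of the kind of objective the comment above describes: reconstruction plus an L1 sparsity penalty plus a supervised term tying the first few latents to labelled concepts. `ConceptSAE`, the loss weightings, and the toy batch are all hypothetical, not the model's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptSAE(nn.Module):
    """Sparse autoencoder whose first n_concepts latents are tied to labels."""
    def __init__(self, d_model, d_latent, n_concepts):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts

    def forward(self, h):
        z = F.relu(self.enc(h))   # non-negative, hopefully sparse code
        return self.dec(z), z

def sae_loss(model, h, concept_labels, l1=1e-3, align=1.0):
    h_hat, z = model(h)
    recon = F.mse_loss(h_hat, h)          # keep the information in h
    sparsity = z.abs().mean()             # encourage few active latents
    # supervised alignment: first n_concepts latents act as concept logits
    logits = z[:, :model.n_concepts]
    aligned = F.binary_cross_entropy_with_logits(logits, concept_labels)
    return recon + l1 * sparsity + align * aligned

# toy batch of hidden states and binary concept labels
torch.manual_seed(0)
sae = ConceptSAE(d_model=16, d_latent=64, n_concepts=4)
h = torch.randn(8, 16)
labels = torch.randint(0, 2, (8, 4)).float()
loss = sae_loss(sae, h, labels)
loss.backward()
```

The "Most Forbidden Technique" worry maps directly onto the `align` term: training on the interpretability signal can make the labelled latents fire plausibly without proving they causally drive the decoder's output.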
- Looks very interesting. Is there a published paper/article on your algorithm? I would like to take a stab at implementing this on my own.
I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
by schopra909
0 subcomments
- This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear.
- Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?
by 7777777phil
2 subcomments
- If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.
by ZeroAurora
0 subcomments
- Always happy to see improvements on explainable LLMs. Congrats!
- Hilariously, I read this as "can't explain" for a second and was like, "Wait, isn't that what today's models do?"
- Does anybody know if I can try this online?
- Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use cases that are currently off limits.
We'll see.
by MagicMoonlight
1 subcomment
- Seems pretty cool. You can simply block the concept of Tiananmen Square and it will be permanently removed from the brain. Ideal.
by michaelmrose
1 subcomment
- Can you use this to decrease hallucinations?
- It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.
by SignalStackDev
0 subcomments