- In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.
It made me cautiously optimistic that all of Anthropic’s work on alignment, done for AI safety, is actually the cause of Claude Code’s comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I’m suddenly really interested in experiments like this.
by brendanashworth
1 subcomment
- Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.
[1] https://shap.readthedocs.io/en/latest/
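For readers unfamiliar with what SHAP computes: it approximates Shapley values, which split a model's score fairly across input features. A from-scratch sketch of the exact computation on a toy token-level scoring function (the tokens and `score` function here are made up for illustration; real SHAP uses sampling and model-specific approximations, not this brute-force loop):

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, score):
    """Exact Shapley attribution of score() across token indices.
    score(subset) -> float, where subset is a tuple of kept token indices."""
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                # weight for a coalition of size |S| out of n players
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (score(S + (i,)) - score(S))
    return phi

# Toy "model": 1.0 iff both "not" and "bad" survive (an interaction),
# plus 0.5 whenever "good" survives.
tokens = ["not", "bad", "good"]
def score(subset):
    kept = {tokens[j] for j in subset}
    return (1.0 if {"not", "bad"} <= kept else 0.0) + (0.5 if "good" in kept else 0.0)

phi = shapley_values(tokens, score)
# Efficiency property: the attributions sum to score(all) - score(empty) = 1.5,
# and the "not"/"bad" interaction is split evenly between the two tokens.
```

The combinatorial cost is exponential in the number of tokens, which is part of why this is rarely applied directly to long LLM contexts.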
- Looks neat and original, congrats!
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to arXiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
- This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it's either solved, or too far out of reach to bother stopping and thinking about.
by kamranjon
1 subcomment
- I'm really interested in using this, but I wonder if the unique architecture means it won't be convertible to a GGUF and used by Ollama or llama.cpp? I certainly understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local AI server (basically just Ollama + Tailscale) and see how it works as a regular model.
- Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them.
Token-level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.
by killerstorm
1 subcomment
- This seems too coarse-grained to be useful: all science-y content will read as "analytical" and be associated with sources like arXiv.
But there might be bad, malicious articles on arXiv too, so the attribution doesn't really say anything about veracity.
Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.
- The one big thing missing from LLMs is the ability to express how confident they are in the truth of what they’re saying.
Perhaps this could be a step in that direction, if we can associate the attribution with the likelihood of being true; e.g., arXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I’m guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
by crimsonnoodle58
3 subcomments
- So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?
by deepdarkforest
2 subcomments
- Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.
How granular can you get the source data attribution? Down to, let's say, individual Wikipedia topics? Probably not URLs?
Would be interested to see this scale to 30/70B.
by great_psy
1 subcomment
- Maybe I’m not creative enough to see the potential, but what value does this bring?
Given the example I saw about CRISPR, what does this model give in its output over a different, non-explaining model?
Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
- This seems really interesting. While Anthropic used dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
by rippeltippel
1 subcomment
- Also featured on TechCrunch: https://news.ycombinator.com/item?id=47129292
by in-silico
1 subcomment
- Either I'm missing something or this is way overstated.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
by potato-peeler
1 subcomment
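For concreteness, here is a minimal sketch (assuming PyTorch) of the kind of objective the comment above describes: reconstruction plus an L1 sparsity penalty plus a supervised term tying the first few latents to labelled concepts. `ConceptSAE`, the loss weightings, and the toy batch are all hypothetical, not the model's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptSAE(nn.Module):
    """Sparse autoencoder whose first n_concepts latents are tied to labels."""
    def __init__(self, d_model, d_latent, n_concepts):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts

    def forward(self, h):
        z = F.relu(self.enc(h))   # non-negative, hopefully sparse code
        return self.dec(z), z

def sae_loss(model, h, concept_labels, l1=1e-3, align=1.0):
    h_hat, z = model(h)
    recon = F.mse_loss(h_hat, h)          # keep the information in h
    sparsity = z.abs().mean()             # encourage few active latents
    # supervised alignment: first n_concepts latents act as concept logits
    logits = z[:, :model.n_concepts]
    aligned = F.binary_cross_entropy_with_logits(logits, concept_labels)
    return recon + l1 * sparsity + align * aligned

# toy batch of hidden states and binary concept labels
torch.manual_seed(0)
sae = ConceptSAE(d_model=16, d_latent=64, n_concepts=4)
h = torch.randn(8, 16)
labels = torch.randint(0, 2, (8, 4)).float()
loss = sae_loss(sae, h, labels)
loss.backward()
```

The "Most Forbidden Technique" worry maps directly onto the `align` term: training on the interpretability signal can make the labelled latents fire plausibly without proving they causally drive the decoder's output.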
- Looks very interesting. Is there a published paper/article on your algorithm? I would like to take a stab at implementing this on my own.
I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
by schopra909
0 subcomments
- This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear.
- Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?
by 7777777phil
2 subcomments
- If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.
by ZeroAurora
0 subcomments
- Always happy to see improvements on explainable LLMs. Congrats!
- Hilariously, I read this as "can't explain" for a second and was like, "Wait, isn't that what today's models do?"
- Does anybody know if I can try this online?
- Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use cases that are currently off limits.
We'll see.
by MagicMoonlight
1 subcomment
- Seems pretty cool. You can simply block the concept of Tiananmen Square and it will be permanently removed from the brain. Ideal.
by michaelmrose
1 subcomment
- Can you use this to decrease hallucinations?
- It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.
by SignalStackDev
0 subcomments