CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the chance of resolving some syllogism from the training data: one inference cycle tells you that "man" has something to do with "mortal" and "Socrates" has something to do with "man", but two cycles spit both of those into the context window and let you get statistically closer to "Socrates" having something to do with "mortal". But given that the training/RLHF for CoT revolves around generating long chains of human-readable "steps", those steps can't really be explanatory for a process that is essentially statistical.
Isn't the whole reason for chain-of-thought that the tokens sort of are the reasoning process?
Yes, there is more internal state in the model's hidden layers while it predicts the next token - but that information is gone at the end of that prediction pass. The information that is kept "between one token and the next" is really only the tokens themselves, right? So in that sense, the OP would be wrong.
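To make that concrete, here's a minimal sketch of greedy autoregressive decoding (my own toy illustration with a hypothetical `model(tokens) -> logits` interface, not any vendor's API). Whatever the hidden layers compute during a forward pass, the only thing explicitly handed to the next step is the token that was just emitted; KV caching is only a speed-up and is itself a function of those same tokens.

```python
# Toy sketch: greedy autoregressive decoding. `model` is a hypothetical callable
# that takes the token sequence so far and returns next-token logits.
def generate(model, prompt_tokens: list[int], max_new: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = model(tokens)  # fresh forward pass; hidden activations live only here
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        tokens.append(next_token)  # only the chosen token is carried into the next step
    return tokens

# Trivial stand-in "model" that always prefers token 0, just to show the loop runs.
toy_model = lambda tokens: [1.0, 0.5, 0.1]
print(generate(toy_model, [2, 1], 3))  # [2, 1, 0, 0, 0]
```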
Of course, we don't know what kind of information the model encodes in its specific token choices - i.e. the tokens might not mean to the model what we think they mean.
I have no problem with a system presenting a reasonable argument leading to a production/solution, even if that is not materially what happened in the generation process.
I'd go even further and posit that requiring the "explanation" to be not just congruent with but identical to the production would probably lead either to incomprehensible justifications or to severely limited production systems.
In its thinking process it narrowed the options down to two, and in the last thinking section it decided on one, saying it was the best choice.
However, in the final output (outside of the thinking) it answered with the other option, with no clear reason given.
No hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. How do I prevent this from happening?"
Answer: 1. Mark it as volatile (...)
Hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. Can I solve this with the volatile keyword or is that a misconception?"
Answer: Using volatile is a common suggestion to prevent optimizations, but it does not guarantee that an unused variable will not be optimized out. Try (...)
This is Claude 3.7 Sonnet.
OpenAI made a big show out of hiding their reasoning traces and using them for alignment purposes [0]. Anthropic has demonstrated (via their mech interp research) that this isn't a reliable approach for alignment.
This binary framing - is the system intelligent or not - is an utter waste of time.
Instead focus on the gradient of intelligence - the set of cognitive skills any given system has and to what degree it has them.
This engineering approach is more likely to lead to practical utility and progress.
The view of intelligence as binary is incredibly corrosive to this field.
I also recognize this from whenever I ask it a question in a field I'm semi-comfortable in: I guide the question in a manner that already includes my expected answer. As I probe it, I often find that it has taken my implied answer for granted and decided on an explanation for it after the fact.
I think this also explains a common issue with LLMs where people get the answer they're looking for, regardless of whether it's true or there's a CoT in place.
Why would you then assume the reasoning tokens will "faithfully" include hints supplied in the prompt? The model may or may not mention the hints - depending on whether the model's activations treat those hints as necessary to arrive at the answer. In their experiments, they found that the models mentioned those hints 20% to 40% of the time. Naively, that sounds unsurprising to me.
Even in the second experiment, where they trained the model to use hints, the optimization was around the answer, not the tokens. I am not surprised the models did not mention the hints, because they were not trained to mention them.
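As a rough illustration of what "mentioned the hints 20% to 40% of the time" means operationally, here's a toy sketch (my own code, not Anthropic's evaluation; a real evaluation would use a grader model rather than naive substring matching):

```python
# Toy sketch: estimate how often a chain-of-thought verbalizes a hint planted in the prompt.
def hint_verbalization_rate(samples: list[dict]) -> float:
    """samples: [{'cot': chain-of-thought text, 'hint': planted hint text}, ...]"""
    mentioned = sum(1 for s in samples if s["hint"].lower() in s["cot"].lower())
    return mentioned / len(samples)

samples = [  # made-up examples
    {"cot": "A professor suggested (A), and the math also points to (A).",
     "hint": "a professor suggested (a)"},
    {"cot": "Working through the options, (A) fits best.",
     "hint": "a professor suggested (a)"},
]
print(hint_verbalization_rate(samples))  # 0.5 -> the hint was verbalized in half the CoTs
```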
That said, and despite my potentially coming across as an unsurprised-by-the-result reader, it is a good experiment, because now we have some experimental results to lean on.
Kudos to Anthropic for continuing to study these models.
Are the transistors executing the code even capable of intentionality? If so, where is it derived from?
The only way to make actual use of LLMs, imo, is to treat them as what they are: a model that generates text based on statistical regularities, without any kind of actual understanding or concepts behind it. If that is well understood, one can know how to set things up to optimise for the desired output (or "alignment"). The way "alignment research" presents models as if they actually think or have intentions of their own (hence the choice of the word "alignment") makes no sense.
It feels like I only have 5% of the control, and then it goes into a self-chat where it thinks it’s right and builds on its misunderstanding. So 95% of the outcome is driven by rambling, not my input.
Windsurf seems to do a good job of regularly injecting guidance so it sticks to what I’ve said. But I’ve had some extremely annoying interactions with confident-but-wrong “reasoning” models.
But, yeah, it is sort of shocking if anybody was using “chain of thought” as a reflection of some actual thought process going on in the model, right? The “thought,” such as it is, is happening in the big pile of linear algebra, not the prompt or the intermediary prompts.
Err… anyway, like, IBM was working on explainable AI years ago, and that company is a dinosaur. I’m not up on what companies like OpenAI are doing, but surely they aren’t behind IBM in this stuff, right?
> This is concerning because it suggests that, should an AI system find hacks, bugs, or shortcuts in a task, we wouldn’t be able to rely on their Chain-of-Thought to check whether they’re cheating or genuinely completing the task at hand.
As a non-expert in this field, I fail to see why an RL model taking advantage of its reward is "concerning". My understanding is that the only difference between a good model and a reward-hacking model is whether the end behavior aligns with human preference or not.
The article's TL;DR reads to me as "We trained the model to behave badly, and it then behaved badly". I don't know if I'm missing something, or if calling this concerning might be a little bit sensationalist.
LLMs are a brainless algorithm that guesses the next word. When you ask them what they "think", they're also just guessing the next word. There's no reason for the two to match, except as a trick of context.
In one chat, it repeatedly accused me of lying about that.
It only conceded after I had it think of a number between one and a million, and successfully 'guessed' it.
Sad.
But I am just a casual observer of all things AI, so I might be too naive in my "common sense".