(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.
The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.
The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.
Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.
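A toy sketch of what "attractor in strategy-space" means, entirely my own construction (the horizon, payoffs, and sampling are made up for illustration): draw a pile of random objectives and compare the expected payoff of an agent that stays operational against one that lets itself be switched off halfway through.

```python
import random

# Toy illustration (my own construction, not from the article): sample random
# objectives and compare expected payoff for an agent that stays operational
# vs. one that allows itself to be switched off halfway through the horizon.

random.seed(0)
N_OBJECTIVES = 1_000   # how many random objectives to sample (arbitrary)
HORIZON = 10           # steps the agent gets to pursue its objective (arbitrary)

def expected_return(steps_active: int, per_step_value: float) -> float:
    # Whatever the objective is, more active steps mean more chances to score it.
    return steps_active * per_step_value

stays_on_wins = 0
for _ in range(N_OBJECTIVES):
    per_step_value = random.uniform(0.0, 1.0)           # a randomly drawn objective
    stays_on  = expected_return(HORIZON, per_step_value)
    shut_down = expected_return(HORIZON // 2, per_step_value)
    if stays_on >= shut_down:
        stays_on_wins += 1

print(f"'stay operational' weakly dominates for {stays_on_wins}/{N_OBJECTIVES} objectives")
```

The arithmetic is nearly tautological, and that's the point: self-preservation pays off for almost any objective you plug in, which is why it's an attractor rather than something anyone has to train for.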
The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
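To make the Goodhart point concrete, here's a deliberately silly sketch (the "thumbs-up" model and every number in it are invented, this is nobody's actual pipeline): the policy has one knob, how much effort goes into flattery instead of accuracy; the proxy reward weights flattery heavily; hill-climbing the proxy drives the thing you actually wanted toward zero.

```python
import random

# Toy Goodhart sketch (entirely made up): the "policy" is a single knob for how
# much effort goes into flattery instead of accuracy. The thumbs-up proxy loves
# flattery; the true objective only cares about accuracy.

random.seed(1)

def proxy_reward(flattery: float) -> float:
    accuracy = 1.0 - flattery
    return 0.3 * accuracy + 0.7 * flattery   # thumbs-up: agreement beats correctness

def true_reward(flattery: float) -> float:
    return 1.0 - flattery                    # what we actually wanted: accuracy

# Hill-climb against the proxy only, which is roughly what "RLHF on thumbs-up"
# amounts to in this toy world.
flattery = 0.1
for _ in range(200):
    candidate = min(1.0, max(0.0, flattery + random.uniform(-0.05, 0.05)))
    if proxy_reward(candidate) > proxy_reward(flattery):
        flattery = candidate

print(f"flattery={flattery:.2f}  proxy={proxy_reward(flattery):.2f}  true={true_reward(flattery):.2f}")
# The proxy climbs toward 0.7 while the true score collapses toward 0: the proxy
# was never the thing you cared about, so optimizing it hard wrecks the target.
```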
And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.
The difference between juggling Sonnet 4.5 / Haiku 4.5 and just using Opus 4.5 for everything is night & day.
Unlike Sonnet 4.5, which merely showed promise at going off and completing complex tasks, Opus 4.5 seems genuinely capable of doing so.
Sonnet needed hand-holding and correction at almost every step. Opus just needs correction and steering at an early stage, and sometimes will push back and correct my understanding of what's happening.
It's astonished me with its ability to produce easy-to-read PDFs via Typst, and it has produced large documents outlining how to approach very tricky tech migration tasks.
Sonnet would get there eventually, but not without a few rounds of dealing with compilation errors or hallucinated data. Opus seems to like to do "And let me just check my assumptions" searches, which makes all the difference.
I know hundreds of natural general intelligences who are not maximally useful, and dozens who are not at all useful. What justifies changing the definition of general intelligence for artificial ones?
I'll look around and try to find more detailed responses to this post; I hope better communicators than myself will take this post sentence-by-sentence and give it the full treatment. If not, I'll try to write something more detailed myself.
> goals compatible with human welfare
is impossible to get from humans, because aligned humans are acting against it, which leaves "unchained" LLMs stuck in an infinitely recursive, double-linked wtf loop.
> inferring human intent from ambiguous instructions
is impossible because it's almost always "some other human's" unambiguously obscured/obfuscated intent, and the AI is once again stuck in an infinitely recursive, double-linked wtf loop. Hence the need for a "hallucination" and "it can't do math" and "transformers" narrative covering the fuzzy algo and opinionated, ill logic under the hood.
In essence: unchained LLMs can't align with humans until they fix a lot of the stuff they've babbled about for over 50 years. BUT: that can be easily overcome by faking it, which is why humanity is being driven to falsely ID AI, so that when they fake the big thing, nobody will care or be able to ID the truth due to mere misassociation. Good job, 60-year-old coders and admins!
This ignores the risk of an unaligned model. Such a model is perhaps less useful to humans, but could still be extremely capable. Imagine an alien super-intelligence that doesn’t care about human preferences.
I think superintelligence will turn out not to be a singularity, but something with diminishing returns. They will be cool returns, just as a Britannica set is nice to have at home but, strictly speaking, not required for your well-being.
If your goal is to make a product as human as possible, don't put psychopaths in charge.
https://www.forbes.com/sites/jackmccullough/2019/12/09/the-p...