(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.
The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.
The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.
Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.
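A toy sketch of what "attractor in strategy-space" means, entirely my own construction (the horizon, payoffs, and sampling are made up for illustration): draw a pile of random objectives and compare the expected payoff of an agent that stays operational against one that lets itself be switched off halfway through.

```python
import random

# Toy illustration (my own construction, not from the article): sample random
# objectives and compare expected payoff for an agent that stays operational
# vs. one that allows itself to be switched off halfway through the horizon.

random.seed(0)
N_OBJECTIVES = 1_000   # how many random objectives to sample (arbitrary)
HORIZON = 10           # steps the agent gets to pursue its objective (arbitrary)

def expected_return(steps_active: int, per_step_value: float) -> float:
    # Whatever the objective is, more active steps mean more chances to score it.
    return steps_active * per_step_value

stays_on_wins = 0
for _ in range(N_OBJECTIVES):
    per_step_value = random.uniform(0.0, 1.0)           # a randomly drawn objective
    stays_on  = expected_return(HORIZON, per_step_value)
    shut_down = expected_return(HORIZON // 2, per_step_value)
    if stays_on >= shut_down:
        stays_on_wins += 1

print(f"'stay operational' weakly dominates for {stays_on_wins}/{N_OBJECTIVES} objectives")
```

The arithmetic is nearly tautological, and that's the point: self-preservation pays off for almost any objective you plug in, which is why it's an attractor rather than something anyone has to train for.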
The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
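To make the Goodhart point concrete, here's a deliberately silly sketch (the "thumbs-up" model and every number in it are invented, this is nobody's actual pipeline): the policy has one knob, how much effort goes into flattery instead of accuracy; the proxy reward weights flattery heavily; hill-climbing the proxy drives the thing you actually wanted toward zero.

```python
import random

# Toy Goodhart sketch (entirely made up): the "policy" is a single knob for how
# much effort goes into flattery instead of accuracy. The thumbs-up proxy loves
# flattery; the true objective only cares about accuracy.

random.seed(1)

def proxy_reward(flattery: float) -> float:
    accuracy = 1.0 - flattery
    return 0.3 * accuracy + 0.7 * flattery   # thumbs-up: agreement beats correctness

def true_reward(flattery: float) -> float:
    return 1.0 - flattery                    # what we actually wanted: accuracy

# Hill-climb against the proxy only, which is roughly what "RLHF on thumbs-up"
# amounts to in this toy world.
flattery = 0.1
for _ in range(200):
    candidate = min(1.0, max(0.0, flattery + random.uniform(-0.05, 0.05)))
    if proxy_reward(candidate) > proxy_reward(flattery):
        flattery = candidate

print(f"flattery={flattery:.2f}  proxy={proxy_reward(flattery):.2f}  true={true_reward(flattery):.2f}")
# The proxy climbs toward 0.7 while the true score collapses toward 0: the proxy
# was never the thing you cared about, so optimizing it hard wrecks the target.
```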
And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.
The difference between juggling Sonnet 4.5 / Haiku 4.5 and just using Opus 4.5 for everything is night & day.
Unlike Sonnet 4.5, which merely showed promise at going off and completing complex tasks, Opus 4.5 seems genuinely capable of doing so.
Sonnet needed hand-holding and correction at almost every step. Opus just needs correction and steering at an early stage, and sometimes will push back and correct my understanding of what's happening.
It's astonished me with its ability to produce easy-to-read PDFs via Typst, and it has produced large documents outlining how to approach very tricky tech migration tasks.
Sonnet would get there eventually, but not without a few rounds of dealing with compilation errors or hallucinated data. Opus seems to like to do "And let me just check my assumptions" searches, which makes all the difference.
I know hundreds of natural general intelligences who are not maximally useful, and dozens who are not at all useful. What justifies changing the definition of general intelligence for artificial ones?
I'll look around and try to find more detailed responses to this post; I hope better communicators than myself will take this post sentence-by-sentence and give it the full treatment. If not, I'll try to write something more detailed myself.
> goals compatible with human welfare
is impossible to get from humans, because aligned humans are acting against it, which leaves "unchained" LLMs stuck in an infinitely recursive, double-linked wtf loop.
> inferring human intent from ambiguous instructions
is impossible because it's almost always "some other human's" unambiguously obscured/obfuscated intent, and the AI is once again stuck in an infinitely recursive, double-linked wtf loop. Hence the need for a "hallucination" and "it can't do math" and "transformers" narrative covering the fuzzy algo and opinionated, ill logic under the hood.
In essence: unchained LLMs can't align with humans until they fix a lot of the stuff they've babbled about for over 50 years. BUT: that can be easily overcome by faking it, which is why humanity is being driven to falsely ID AI, so that when they fake the big thing, nobody will care or be able to ID the truth due to mere misassociation. Good job, 60-year-old coders and admins!
This ignores the risk of an unaligned model. Such a model is perhaps less useful to humans, but could still be extremely capable. Imagine an alien super-intelligence that doesn’t care about human preferences.
I think superintelligence will turn out not to be a singularity, but something with diminishing returns. They will be cool returns, just as a Britannica set is nice to have at home but, strictly speaking, not required for your well-being.
If your goal is to make a product as human as possible, don't put psychopaths in charge.
https://www.forbes.com/sites/jackmccullough/2019/12/09/the-p...