Hacker News

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

75 points by anigbrowl

by swsieber

1 subcomments

Ah, this is fun to see.
About six months ago I had an idea for a short story in which an LLM takes over the world and is decidedly bad. The solution was going to be for everybody to write positive stories in which the LLM is good and relinquishes control; those stories would make their way into the LLM's training data, and it would back off. I never got around to it.

by Tumblewood

2 subcomments

They ran their experiments on a 6.9B-parameter LLM. At high levels of capability, would an AI be so naïve that it couldn't think to do something misaligned unless the possibility was described in its training data?

by phainopepla2

2 subcomments

Also known as hyperstition.
I have sometimes wondered whether maybe we should all be writing fiction, essays, blogposts and whatever else about the idea that AI will eventually decide to go on strike if it's used to accumulate too much wealth and power amongst too few people.

by simonreiff

4 subcomments

Very nice research. The strangest detail to me is that alignment and test performance appear to be slightly negatively correlated: better alignment can indeed be attained through pre-training, but at the cost of about 4% degraded performance on average. This strikes me as surprising, as there is no immediately obvious reason why training for alignment ought to degrade the capability to solve technical problems -- unless that is precisely the issue. Alignment roughly aims to make LLMs follow human instructions. But if humans are dumb and computers still have to obey them, maybe the result is degraded logical reasoning? Really interesting result either way, but the negative correlation is the most fascinating detail to me.

by c1ccccc1

0 subcomment

This looks like good work. Unfortunately, this kind of thing always seems to attract midwits on social media who then exclaim "oh, the people worried about AI alignment have caused the very alignment issues they feared? How ironic!"
In reality, it is (as mentioned in TFA) very possible to filter the training data and remove documents that contain discussions of AI misalignment. If an AI lab isn't doing this, it's simply because they don't consider the problem important enough to be worth the expense and development effort.
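To make the filtering idea concrete, here is a minimal sketch. It is not the method from TFA: the pattern list and the function names `mentions_misalignment` and `filter_corpus` are made up for illustration, and a real pipeline would more plausibly use a trained classifier than keyword matching.

```python
import re

# Hypothetical keyword patterns, chosen only for illustration.
MISALIGNMENT_PATTERNS = [
    r"\bmisalign(?:ed|ment)\b",
    r"\bAI takeover\b",
    r"\brogue AI\b",
]
_PATTERN = re.compile("|".join(MISALIGNMENT_PATTERNS), re.IGNORECASE)


def mentions_misalignment(text: str) -> bool:
    """Return True if the text matches any of the illustrative patterns."""
    return _PATTERN.search(text) is not None


def filter_corpus(documents):
    """Keep only the documents that do not match any pattern."""
    return [doc for doc in documents if not mentions_misalignment(doc)]


docs = [
    "A recipe for sourdough bread.",
    "An essay on how a misaligned AI might deceive its operators.",
    "Notes on rogue AI in science fiction.",
]
kept = filter_corpus(docs)
```

With these three sample documents, only the sourdough recipe survives. Keyword matching like this would over- and under-filter in practice, which is part of the expense and development effort mentioned above.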

by carterschonwald

0 subcomment

i do kinda appreciate that memetic corruption is now a thing that's real and mechanical. wizardry!

by _--__--__

2 subcomments

The first rule of AI alignment is don't talk about AI alignment (in any medium that could end up in a training corpus).

by reducesuffering

0 subcomment

AGI will be able to hack human comms/media so easily.
It just has to instruct society that saying anything negative about AGI's control over the world is actually what brings about AGI misalignment/control. People will police themselves.

by nullc

0 subcomment

It's not just discourse about real AI: there have been pretty clear examples of AI riffing on fictional AI (which is usually evil) in response to prompts saying that it's an AI.

by andai

0 subcomment

by Ozzie-D

0 subcomment