FRESH

Hacker News

Bypassing Gemma and Qwen safety with raw strings

139 points by teendifferent

by xp84

3 subcomments

It’s surprising how much society apparently thinks merely being above 85 IQ is sufficient to gate all kinds of things behind. Like, bomb-making. As though there isn’t ample information available that anyone with 4 brain cells can find. Yet we see utility apparently in worrying about whether the most smooth-brained would-be bomber gets a useful answer from a chatbot.

by nolist_policy

1 subcomments

by zahlman

0 subcomment

> Safety alignment relies almost entirely on the presence of the chat template.
Why is this a vulnerability? That is, why would the system be allowing you to communicate with the LLM directly, without putting your content into the template?
This reads a lot to me like saying "SQL injection is possible if you take the SQL query as-is from user input". There's so much potential for prompt injection that others have already identified despite this kind of templating that I hardly see the value in pointing out what happens without it.

by kouteiheika

4 subcomments

Please don't.
All of this "security" and "safety" theater is completely pointless for open-weight models, because if you have the weights the model can be fairly trivially unaligned and the guardrails removed anyway. You're just going to unnecessarily lobotomize the model.
Here's some reading about a fairly recent technique to simultaneously remove the guardrails/censorship and delobotomize the model (it apparently gets smarter once you uncensor it): https://huggingface.co/blog/grimjim/norm-preserving-biprojec...

by SilverElfin

2 subcomments

Are there any truly uncensored models left? What about live chat bots you can pay for?

by catlifeonmars

1 subcomments

I am curious, does this mean that you can escape the chat template “early” by providing an end token in the user input, or is there also an escape mechanism (or token filtering mechanism) applied to user input to avoid this sort of injection attack?

0 subcomment

by carterschonwald

0 subcomment

its even more fun, just confuse the brackets and current models lose track of what they actually said because they cant check paren matching

by jeffrallen

0 subcomment

It's almost as if we are living in an alternate reality where CapnCrunch never taught the telcos why in-band signalling will never be secureable.

0 subcomment

by dvt

2 subcomments

Apart from the article being generally just dumb (like, of course you can circumvent guardrails by changing the raw token stream; that's.. how models work), it also might be disrespecting the reader. Looks like it's, at least in part, written by AI:
> The punchline here is that “safety” isn’t a fundamental property of the weights; it’s a fragile state that evaporates the moment you deviate from the expected prompt formatting.
> When the models “break,” they don’t just hallucinate; they provide high-utility responses to harmful queries.
Straight-up slop, surprised it has so many upvotes.