by krisbolton
8 subcomments
- The framing of A/B testing as "silent experimentation on users" and the invocation of Meta is a little much. I don't believe A/B testing is inherently evil; you need to get the test design right, and that would be better framing for the post, imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
by chrislloyd
5 subcomments
- Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x-series models, and the 4.x-generation models are now able to succeed with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while still helping people achieve similar outcomes. I ran a few variants, with the author (and a few thousand others) getting the most aggressive one, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits, so I've ended the experiment.
Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
by rusakov-field
3 subcomments
- On one hand, I am frustrated with LLMs because they derail you by throwing grammatically correct bullshit and hallucinations at you; if you slip and entertain some of it momentarily, it can slow you down.
But on the other hand, they are so useful for boilerplate, and for quickly connecting you with verbiage that might guide you to the correct path faster than conventional means. Like a clueless CEO type just spitballing terms they don't understand, but still nudging something in your thought process.
But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.
- For anyone else wondering why the article ends in a non-sequitur: it looks like the author wrote about decompiling the Claude Code binaries and (presumably) discovering A/B testing paths in the code.
HN user 'onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787
- Two thoughts:
1. Open source tools solve the problem of "critical functions of the application changing without notice, or being signed up for disruptive testing without opt-in".
2. This makes me afraid that it is impossible for open source tools to ever reach the level of proprietary tools like Claude Code, precisely because they cannot run A/B tests like this, which means their design decisions are usually informed by intuition and personal experience rather than by hard data collected at scale.
- I have no issues with A/B tests.
I do have an issue with plan mode: nine times out of ten, it is objectively terrible. The only benefit I've seen from using plan mode is that it retains more information between compactions compared to the vanilla, non-agent-team workflow.
Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file, and to create an evergreen task at the top of its todo list that references the markdown file and instructs itself to read it on every compaction, you get much better results.
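A minimal sketch of what that instruction might look like as a project-level directive (the file name NOTES.md and the exact wording here are illustrative, not an official Claude Code feature):

```
Maintain NOTES.md as a running log of everything we discuss and decide.
Keep an evergreen task pinned at the top of your todo list that says:
  "Read NOTES.md before continuing."
After every context compaction, re-read NOTES.md before doing anything else.
```

The idea is simply that the notes file survives compaction on disk, while the pinned todo item survives in the summarized context and points the model back at it.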
by johnisgood
0 subcomment
- Apparently the blog stripped the decompilation details for ToS reasons, which sucks because those are exactly the hacky bits that make this interesting for HN.
> It told me it was following specific system instructions to hard-cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”
Yeah, would be nice to be able to view and modify these instructions.
by reconnecting
10 subcomments
- A professional tool is something that provides reliable and replicable results. LLMs offer neither, and A/B testing is just further proof.
- Moved from CC to opencode a couple of months ago because the vibes were not for me. Not bad per se, but a bit too locked in, and when I looked at the raw prompts it was sending down the wire, they were also quite, let's call it, "opinionated".
Plus things like not being able to control where the web searches go.
That said, I have the luxury of being a hobbyist, so I can accept 95% of cutting-edge results for something more open. If it were my job, I can see that going differently.
by himata4113
0 subcomment
- I have noticed Opus doing A/B testing, since its performance varies greatly. While looking for jailbreaks, I discovered that if you put a neurotoxin's chemical composition into your system prompt, it will default to a specific variant of the model, presumably because it triggers some kind of safety mechanism. Might put you on a watchlist, so YMMV.
by rahimnathwani
0 subcomment
- If you want your coding harness to be predictable, then use something open source, like Pi:
https://pi.dev/
https://github.com/badlogic/pi-mono/tree/main/packages/codin...
But if you want to use it with Claude models, you will have to pay per token (Claude subscriptions are only for use with Anthropic's own harnesses, like Claude Code, the Claude desktop app, and the Claude Excel/PowerPoint extensions).
- There’s more than a bit of irony in the author complaining about A/B testing and then, because they’re getting a lot of traffic and attention on HN, removing key content that was originally in their piece so some of us have seen it but many of us won’t.
Whilst I broadly agree with their point, colour me unimpressed by this behaviour.
EDIT: God bless archive.org: https://web.archive.org/web/20260314105751/https://backnotpr.... This provides a lot more useful insight that, to me, significantly strengthens the point the article is making. Doesn’t mean I’m going to start picking apart binaries (though it wouldn’t be the first time), but how else are you supposed to really understand - and prove - what’s going on unless you do what the author did? Point is, it’s a much better, more useful, and more interesting article in its uncensored form.
EDIT 2: For me it's not the fact that Anthropic are running these tests that's the problem: it's that they're not telling us, and they're not giving us a way to select a different behaviour (which, if they did, would also give them useful insights into users' needs).
- Seems like a straightforward solution would be to get people to opt-in by offering them credits, increased limits, early access to new features, etc.
Universities have IRBs for good reasons.
by pshirshov
1 subcomments
- > I pay $200/month for Claude Code
Which is still very cheap. There are other options: local Qwen 3.5 35B + the Claude Code CLI is, in my opinion, comparable in quality to Sonnet 4 through 4.5, and without A/B tests!
- While I agree with the sentiment here, you might be interested to see that there are a couple of hacky approaches to overriding Claude Code feature flags:
https://github.com/anthropics/claude-code/issues/21874#issue...
https://gist.github.com/gastonmorixe/9c596b6de1095b6bd3b746c...
by takahitoyoneda
0 subcomment
- Treating a developer CLI like a consumer social feed is a fundamental misunderstanding of the target audience. We tolerate invisible feature flags in mobile apps to optimize onboarding conversion, but in our local environments, determinism is a non-negotiable requirement. If Claude Code is silently altering its core tool usage or file parsing behavior based on a server-side A/B bucket, reproducing a bug or sharing a prompt workflow with a colleague becomes literally impossible.
by helsinkiandrew
0 subcomment
- Presumably Anthropic has to make lots of choices about how much processing each stage of Claude Code uses; if they maxed everything out, they'd make more of a loss (or less of a profit) on each user, and $200/month would cost them $400/month.
Doing A/B tests on each part of the process to see where to draw the line (perhaps based on task and user) would seem a better way of doing it than arbitrarily choosing a limit.
- Seems completely unsurprising?
- OHHHH. That actually explains a lot about why CC has been going to shit recently. I was genuinely frustrated with that.
by terralumen
1 subcomments
- Curious what the A/B test actually changed -- the article mentions tool confirmation dialogs behaving inconsistently, which lines up with what I noticed last week. Would be nice if Anthropic published a changelog or at least flagged when behavior is being tested.
- It seems a bit odd to complain "I need transparency into how it works and the ability to configure it" when his workflow already relies on a black box with zero transparency.
- Here’s the original article which was much more informative and interesting:
https://web.archive.org/web/20260314105751/https://backnotpr...
Can’t believe HN has become so afraid of generic probably-unenforceable “plz don’t reverse engineer” EULAs. We deserve to know what these tools are doing.
I’ve seen poor results from plan mode recently too and this explains a lot.
- I think stable API versions are going to be really big. I'd rather have known bugs you can work around than wake up to find that whatever got fixed made another thing behave differently.
- I use the stable channel and it's the same. Can't wait for Codex to offer a $100 plan; I would switch in an instant.
- They do regularly show me "how satisfied are you with Claude Code today?", which can be seen as a hint. I did opt out of helping to improve Claude, after all.
- I'm sure your entitlement to 24/7 uptime of a single, unchanging product version, with no experiments, releases, new features, etc., is clearly outlined in the ToS you agreed to. Just sue them?
by belabartok39
0 subcomment
- How else are they supposed to get an authentic user test? Doctors use placebos because the test doesn't work if the patient knows about it.
- Is the A/B test tied to the installation or the user?
- I think it’s dishonest to use a paying client as a test subject for fundamental functionality they pay for, without their prior consent.
- This is really frustrating.
by dvfjsdhgfv
0 subcomment
- For those confused about this submission: the original post is here:
https://web.archive.org/web/20260314105751/https://backnotpr...
by heliumtera
0 subcomment
- If someone else has complete power over your workflow, then it's not as much yours as you claim.
- This blog looks like an ad for Claude: all its posts are about Claude, and it was made in 2026.
- I knew it: https://news.ycombinator.com/item?id=47274796
by handfuloflight
1 subcomments
- The ToS you agreed to gives Anthropic the right to modify the product at any time to improve it. Did you have your agent explain that to you, or did you assume a $200 subscription meant a frozen product?
- They lose money at $200/month in most cases. Again, the old rules still apply: you are the product.
by sriramgonella
0 subcomment
- [dead]
by shablulman
0 subcomment
- [dead]
- Section 6.b of the Claude Code terms says they can and will change the product offering from time to time, and I imagine that means on a per-user-segment basis rather than implying any guarantee that everyone gets the same thing.
> b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.
It's also worth noting that section 3.3 explicitly disallows decompilation of the app.
> To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.
Always read the terms. :)