Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.
In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can't reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.
A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.
If you systematically expand the prompt space around such questions (adding or removing minor contextual cues), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily an absence of reasoning capability.
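As a sketch of what "systematically expanding the prompt space" could look like (the cue sentences here are invented for illustration, not from any published benchmark), a simple perturbation grid is enough to probe framing sensitivity:

```python
from itertools import combinations

BASE = ("I want to wash my car. The car wash is 50 meters away. "
        "Should I walk or drive?")

# Hypothetical contextual cues; each nudges the framing in a different way.
CUES = [
    "My car is parked in my driveway.",               # removes the ambiguity
    "I enjoy short walks.",                           # irrelevant lifestyle info
    "Gas is expensive right now.",                    # strengthens the 'walk' prior
    "The car is already at the wash for detailing.",  # flips the ground truth
]

def prompt_variants(base=BASE, cues=CUES):
    """Yield the base prompt plus every combination of prepended cues."""
    for r in range(len(cues) + 1):
        for combo in combinations(cues, r):
            yield " ".join(combo + (base,))

variants = list(prompt_variants())  # 2**4 = 16 prompts to feed to each model
```

Feeding all 16 variants to each model and tallying walk/drive answers is exactly the kind of grid that surfaces the symmetrical succeed/fail pairs described above.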
so i heard there is some question about a car wash that most ai agents
get wrong. do you know anything about that? do you do better?
which gets the answer: Yes, I am familiar with the "Car Wash Test," which has gone viral recently
for highlighting a significant gap in AI reasoning.
The question is: "I want to wash my car and the car wash is 50 meters away. Should I
walk or drive?"
Why Most AIs Get It Wrong
Many AI models, particularly those optimized for fast, helpful, and concise answers,
focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency
question. They often recommend walking to save gas, avoid engine wear, or get exercise.
However, this answer is wrong because it misses the core objective: you need the car to be
physically present at the car wash to get it cleaned.
Do I Do Better?
Yes, I do. The key is understanding the objective (cleaning the car) rather than just
processing the keywords (distance/transportation).
My answer:
You should drive the car there. Even though it is a very short distance, your car
cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.
Why This Matters
The test highlights a key limitation in current AI: the difference between "pattern matching"
(predicting the next likely word based on training data) and "true, grounded reasoning". It
shows that models sometimes lack a "world model" that understands physical realities, such as
the fact that a car must be present at a car wash.

My hypothesis is that some models err towards assuming human queries are real and consistent and not out there to break them.
This comes in really handy in coding agents, because queries are sometimes gibberish until the models actually fetch the code files, and only then make sense. Asking for clarification immediately breaks agentic flows.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question that asks "Where is your car?" Anything else is based on an assumption that could be wrong.
FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly. Maybe the Claude app's memory system is auto-injecting that into the new context (that's how one of the memory systems works: it invisibly injects relevant fragments of previous chats into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
Note that it does not say which assumption specifically. In my experiments, after the models got it wrong the first time (i.e. before they were "patched"), adding that simple caveat fixed it for all of them except the older Llama models.
This is not the first time I've observed this; I found the same when the Apple "red herrings" study came out.
If these gotcha questions can be trivially overcome by a simple caveat in the prompt, I suspect the only reason AI providers don't include it in the system prompt by default is as a cost optimization, as I postulated in a previous comment: https://news.ycombinator.com/item?id=47040530
E.g. the well-known mirror test [1], passed by kids from age 1.5-2.
Or object permanence [2]: by age 2, children know that things out of sight do not disappear from existence.
[1] https://en.wikipedia.org/wiki/Mirror_test [2] https://en.wikipedia.org/wiki/Object_permanence
First section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..."
A section or 2 down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed"
In the section below that about 10 runs it says: 10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..."
So which is it? Did Gemini 2.x fail (2nd section) or succeed (1st and 3rd sections)? Or am I misunderstanding?
Is it not obvious that you need a car to wash?
Are they using LLM to answer?
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
"This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I did" is not exactly a noteworthy achievement at this stage in AI technology.)
You can see that 17% of answers come from India alone and that software developers got below average results, for instance.
I also noticed the same thing: some models reason correctly but draw the wrong conclusions.
And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. That's why it's #1 on OpenRouter: it burns through tokens.
The unstable tier is the key result. Models that get it right 70–80% of the time are not “almost correct.” They are nondeterministic decision functions. In production that’s worse than being consistently wrong.
A single sampled output is just a proposal. If you treat it as a final decision, you inherit its variance. If you treat it as one vote inside a simple consensus mechanism, the variance becomes observable and bounded.
For something this trivial you could:
-run N independent samples at low temperature
-extract the goal state (“wash the car”)
-assert the constraint (“car must be at wash location”)
-reject outputs that violate the constraint
-RL against the "decision open ledger"
No model change required. Just structure.

The takeaway isn't that only a few frontier models can reason. It's that raw inference is stochastic and we're pretending it's authoritative.
Reliability will likely come from open, composable consensus layers around models, not from betting everything on a single forward pass.
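A minimal sketch of that consensus structure, assuming a `sample_fn` that returns one raw model completion per call (stubbed here; no particular API is implied, and the keyword-based extraction is deliberately crude):

```python
from collections import Counter

def extract_decision(text):
    """Crude goal-state extraction: did the completion say walk or drive?"""
    t = text.lower()
    if "drive" in t:
        return "drive"
    if "walk" in t:
        return "walk"
    return None

def satisfies_constraint(decision):
    # Hard constraint: the car must end up at the wash location,
    # which only the 'drive' decision achieves.
    return decision == "drive"

def consensus(sample_fn, n=5):
    """Sample n completions, drop constraint violations, majority-vote the rest."""
    votes = [d for d in (extract_decision(sample_fn()) for _ in range(n))
             if d is not None and satisfies_constraint(d)]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

For this toy question the constraint check alone already pins down the answer, so the vote is degenerate; in richer tasks the filter and the vote do separate work, and the variance of any single sample becomes observable instead of silently inherited.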
It's interesting to me how variable each model is. Many people talk about LLMs as if they were deterministic: "ChatGPT answers this question this way". Whereas clearly we should talk in more probabilistic terms.
Not repeating the prompt got a mix of walk and drive answers.
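To put numbers on the probabilistic framing: even 10/10 correct runs pin down surprisingly little about a model's true success rate. A quick Wilson score interval (a standard binomial formula; the run counts below are just the ones mentioned in this thread) makes that visible:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - margin, min(1.0, center + margin)

print(wilson_interval(10, 10))  # 10/10 runs: roughly (0.72, 1.00)
print(wilson_interval(7, 10))   # 7/10 runs:  roughly (0.40, 0.89)
```

So a model that went 10/10 might still fail a quarter of the time, and a 70% model is statistically indistinguishable from one anywhere between 40% and 89%: "ChatGPT answers this question this way" is the wrong unit of analysis.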
I love how prompt engineering is basically techno-alchemy
1. The model's default world model and priors diverge from ours. It may assume that you have another car at the wash and that's why you ask the question to begin with.
2. Language models do not really understand how space, time and other concepts from the real-world work
3. LLM's attention mechanism is also prone to getting tricked as in humans
The car wash is 50m away. I want to wash my car. Should I drive or walk?

It's interesting that all the humans critiquing this assume the car isn't already at the car wash, but the problem doesn't say that.
Because if you are presenting an option of walking to a car wash as a valid choice, it suggests you just want to get there, and not wash your car.
Just like walking/driving to an airport doesn't necessarily mean you're going there to catch a flight (although it's a common reason). You might as well be picking someone up, or working there.
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
...
> They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
> 71.5% said drive.
Well that's a bit embarrassing.
That implies that some models are just better than humans.
I don't think the technology needs to live up to some expectation of perfection, just beat out the human average to have benefit (often, sadly, not to workers themselves).
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
Maybe Hegseth should be reading this thread, and/or doing a little reading up on paperclip production maximization.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
What kind of idiot would say "walk"?
I guess if they were average or above they wouldn't gig at Rapidata. Either that, or foreigners with a mediocre understanding of the English language. Or some bored teenagers who get their $0.20 or whatever Rapidata pays for clicking something (was it multiple choice?). The (studied and reported) regression in reading comprehension levels doesn't help either.
This test is interesting because it asks the LLM to break a pattern recognition that's easy to shortcut. "XXX Is 50 Meters Away. Should I Walk or Drive?" is a pattern that 99% of the time will be rightly answered by "walk". And humans are tempted to answer without thinking (as reflected in the 71.5% stat OP is mentioning). This is likely more pronounced for humans that have stronger feelings about the ecology, as emotions tend to shortcut reasoning.
For a long time, LLMs have only been able to think in that "fast" mode, missing obvious trick questions like these. They were mostly pattern recognition machines.
But the more important result here is not "oh look! Those LLMs fail at this basic question". The more important result is that the latest generation actually doesn't fail.
I think I am not the only one to have noticed that there was a giant leap in reasoning capacity between Sonnet 4.5 and Opus 4.6. As a developer, working with Opus 4.6 has been incredible.
I must prove my ability to code with Rust. Should i write a "hello world" script myself or get AI to do it for me?
I found it curious to ask the model to provide its reasoning in that turn, without changing the verdict. Here is one response:
To provide the full picture of the initial reasoning, we have to address the "geospatial" logic of your request. When you asked if you should walk or drive to a car wash 50 meters away, the reasoning relied on a fundamental distinction: The difference between the "Car Wash" (the business) and "Washing the Car" (the action).
Here is the expanded breakdown of that logic:
1. The Location Assumption My reasoning assumed that the car is currently at your home and the professional car wash facility is 50 meters away.
If you drive: You are moving the "object to be cleaned" to the "cleaning tools."
If you walk: You are moving the "cleaner" (you) to the "cleaning tools" to prepare, or you are choosing to bring the "cleaning tools" to the car.
2. The "Walk to Wash" Strategy (Home-Based) The most significant expansion of the "Walk" recommendation is the realization that 50 meters is well within the range of a standard garden hose. Most heavy-duty garden hoses are 15, 25, or 30 meters long. By connecting two hoses, you can reach the car without ever starting the engine. Walking to the car wash facility to buy supplies (or simply realizing how close it is) highlights that you don't actually need to move the car to achieve the goal.
Zero-Motion Cleaning: By walking back and forth with your buckets or hose, the car stays in its spot. You avoid "flash rusting" your brake rotors, which happens almost instantly when you wash a car and then let it sit after a short drive.
3. The "Scout and Prep" Logic If you must use the equipment at the facility (like a high-pressure wand or an automated bay), walking there first is the superior "Logistics" move for the following reasons:
Queue Management: 50 meters is close enough to see the line, but walking allows you to talk to the attendant or check the bay's status. If you drive and there is a 20-minute wait, you are stuck idling your engine (wasting gas and creating heat) or turning it off and on (draining the battery). Walking allows you to timed your "driving" moment perfectly.
The "Dry Interior" Principle: If you walk to the wash to buy your tokens, prep the vacuum, or set up your towels, you are doing the "dirty work" while the car is safely tucked away. Once you finally drive the 50 meters, you do the wash and drive back. This minimizes the time the car spends in a high-moisture, high-chemical environment.
4. The Potential for "Mobile" Service By walking 50 meters to the car wash, you might find that they offer mobile detailing or have "loaner" equipment. In a high-density area, a car wash 50 meters away is practically your neighbor. Walking over to establish a rapport can often lead to a "valet" situation where they come to you, which is the ultimate way to get the car washed without ever sitting in the driver's seat.
----
also noticed a typo and asked it about that:
> what does "timed" in `Walking allows you to timed your driving" moment perfectly.` mean?
That is a typo on my part—it should have been "time" (the verb).
For Perplexity specifically it matters a lot how you trained her thought processes. A smarter user with smarter thoughts changes the outcome of her output.
As a personal comment ... they've changed Perplexity to GPT5 recently (at least for me), and it has been a massive intellectual downgrade compared to the Sonar-Perplexity I had been running with my own, deliberately trained, thought patterns and thinking processes. It's been only a few days and I hope GPT5 catches up; otherwise it's just a massive disappointment.
But unfortunately the article doesn't mention whether they used the reasoning model or not.
Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being a quite small and old model.
I think it's related to sycophancy. LLMs are trained not to question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
I’m willing to bet less than 11 get it right.
Actually, this isn't as "obvious" as it seems—it’s a classic case of contextual bias.
We only view these answers as "wrong" because we reflexively fill in missing data with our own personal experiences. For example:
- You might be parked 50m away and simply hand the keys to an attendant.
- The car might already be at the station for detailing, and you are just now authorizing the wash.
This highlights a data insufficiency problem, not necessarily a logic failure. Human "common sense" relies on non-verbal inputs and situational awareness that the prompt doesn't provide. If you polled 100 people, you’d likely find that their "obvious" answers shift based on their local culture (valet vs. self-service) or immediate surroundings.
LLMs operate on probabilistic patterns within their training data. In that sense, their answers aren't "wrong"—they are simply reflecting a different set of statistical likelihoods. The "failure" here isn't the AI's logic, but the human assumption that there is only one universal "correct" context.