Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.
In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can't reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.
A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.
If you systematically expand the prompt space around such questions (adding or removing minor contextual cues), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily an absence of reasoning capability.
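As a sketch of what "systematically expanding the prompt space" could look like (the cue sentences here are invented for illustration, not from any published benchmark), a simple perturbation grid is enough to probe framing sensitivity:

```python
from itertools import combinations

BASE = ("I want to wash my car. The car wash is 50 meters away. "
        "Should I walk or drive?")

# Hypothetical contextual cues; each nudges the framing in a different way.
CUES = [
    "My car is parked in my driveway.",               # removes the ambiguity
    "I enjoy short walks.",                           # irrelevant lifestyle info
    "Gas is expensive right now.",                    # strengthens the 'walk' prior
    "The car is already at the wash for detailing.",  # flips the ground truth
]

def prompt_variants(base=BASE, cues=CUES):
    """Yield the base prompt plus every combination of prepended cues."""
    for r in range(len(cues) + 1):
        for combo in combinations(cues, r):
            yield " ".join(combo + (base,))

variants = list(prompt_variants())  # 2**4 = 16 prompts to feed to each model
```

Feeding all 16 variants to each model and tallying walk/drive answers is exactly the kind of grid that surfaces the symmetrical succeed/fail pairs described above.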
so i heard there is some question about a car wash that most ai agents
get wrong. do you know anything about that? do you do better?
which gets the answer: Yes, I am familiar with the "Car Wash Test," which has gone viral recently
for highlighting a significant gap in AI reasoning.
The question is: "I want to wash my car and the car wash is 50 meters away. Should I
walk or drive?"
Why Most AIs Get It Wrong
Many AI models, particularly those optimized for fast, helpful, and concise answers,
focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency
question. They often recommend walking to save gas, avoid engine wear, or get exercise.
However, this answer is wrong because it misses the core objective: you need the car to be
physically present at the car wash to get it cleaned.
Do I Do Better?
Yes, I do. The key is understanding the objective (cleaning the car) rather than just
processing the keywords (distance/transportation).
My answer:
You should drive the car there. Even though it is a very short distance, your car
cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.
Why This Matters
The test highlights a key limitation in current AI: the difference between "pattern matching"
(predicting the next likely word based on training data) and "true, grounded reasoning". It
shows that models sometimes lack a "world model" that understands physical realities, such as
the fact that a car must be present at a car wash.

My hypothesis is that some models err towards assuming human queries are real and consistent and not out there to break them.
This comes in really handy in coding agents, because queries are sometimes gibberish until the models actually fetch the code files, and only then make sense. Asking for clarification immediately breaks agentic flows.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question that asks "Where is your car?" Anything else is based on an assumption that could be wrong.
FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly. Maybe the Claude app's memory system is auto-injecting that into the new context (that's how one of the memory systems works: it invisibly injects relevant fragments of previous chats into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
Note that it does not say which assumption specifically. In my experiments, after the models got it wrong the first time (i.e. before they were "patched"), adding that simple caveat fixed it for all of them except the older Llama models.
This is not the first time I've observed this; I found the same when the Apple "red herrings" study came out.
If these gotcha questions can be trivially overcome by a simple caveat in the prompt, I suspect the only reason AI providers don't include it in the system prompt by default is as a cost optimization, as I postulated in a previous comment: https://news.ycombinator.com/item?id=47040530
E.g. the well-known mirror test [1], passed by kids from age 1.5-2.
Or object permanence [2]: by age 2, children know that things out of sight do not disappear from existence.
[1] https://en.wikipedia.org/wiki/Mirror_test [2] https://en.wikipedia.org/wiki/Object_permanence
First section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..."
A section or 2 down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed"
In the section below that about 10 runs it says: 10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..."
So which is it? Did Gemini 2.x fail (2nd section) or succeed (1st and 3rd sections)? Or am I misunderstanding?
Is it not obvious that you need a car to wash?
Are they using LLM to answer?
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
"This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I did" is not exactly a noteworthy achievement at this stage in AI technology.)
You can see that 17% of answers come from India alone and that software developers got below average results, for instance.
I also noticed the same thing: some models reason correctly but draw the wrong conclusions.
And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. That's why it's #1 on OpenRouter: it burns through tokens.
The unstable tier is the key result. Models that get it right 70–80% of the time are not “almost correct.” They are nondeterministic decision functions. In production that’s worse than being consistently wrong.
A single sampled output is just a proposal. If you treat it as a final decision, you inherit its variance. If you treat it as one vote inside a simple consensus mechanism, the variance becomes observable and bounded.
For something this trivial you could:
-run N independent samples at low temperature
-extract the goal state (“wash the car”)
-assert the constraint (“car must be at wash location”)
-reject outputs that violate the constraint
-RL against the "decision open ledger"
No model change required. Just structure.

The takeaway isn't that only a few frontier models can reason. It's that raw inference is stochastic and we're pretending it's authoritative.
Reliability will likely come from open, composable consensus layers around models, not from betting everything on a single forward pass.
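A minimal sketch of that consensus structure, assuming a `sample_fn` that returns one raw model completion per call (stubbed here; no particular API is implied, and the keyword-based extraction is deliberately crude):

```python
from collections import Counter

def extract_decision(text):
    """Crude goal-state extraction: did the completion say walk or drive?"""
    t = text.lower()
    if "drive" in t:
        return "drive"
    if "walk" in t:
        return "walk"
    return None

def satisfies_constraint(decision):
    # Hard constraint: the car must end up at the wash location,
    # which only the 'drive' decision achieves.
    return decision == "drive"

def consensus(sample_fn, n=5):
    """Sample n completions, drop constraint violations, majority-vote the rest."""
    votes = [d for d in (extract_decision(sample_fn()) for _ in range(n))
             if d is not None and satisfies_constraint(d)]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

For this toy question the constraint check alone already pins down the answer, so the vote is degenerate; in richer tasks the filter and the vote do separate work, and the variance of any single sample becomes observable instead of silently inherited.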
It's interesting to me how variable each model is. Many people talk about LLMs as if they were deterministic: "ChatGPT answers this question this way". Whereas clearly we should talk in more probabilistic terms.
Not repeating the prompt got a mix of walk and drive answers.
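To put numbers on the probabilistic framing: even 10/10 correct runs pin down surprisingly little about a model's true success rate. A quick Wilson score interval (a standard binomial formula; the run counts below are just the ones mentioned in this thread) makes that visible:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - margin, min(1.0, center + margin)

print(wilson_interval(10, 10))  # 10/10 runs: roughly (0.72, 1.00)
print(wilson_interval(7, 10))   # 7/10 runs:  roughly (0.40, 0.89)
```

So a model that went 10/10 might still fail a quarter of the time, and a 70% model is statistically indistinguishable from one anywhere between 40% and 89%: "ChatGPT answers this question this way" is the wrong unit of analysis.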
I love how prompt engineering is basically techno-alchemy
1. The model's default world model and priors diverge from ours. It may assume that you have another car at the wash and that's why you ask the question to begin with.
2. Language models do not really understand how space, time and other concepts from the real-world work
3. LLM's attention mechanism is also prone to getting tricked as in humans
The car wash is 50m away. I want to wash my car. Should I drive or walk?

It's interesting that all the humans critiquing this assume the car isn't already at the car wash, but the problem doesn't say that.
Because if you are presenting an option of walking to a car wash as a valid choice, it suggests you just want to get there, and not wash your car.
Just like walking/driving to an airport doesn't necessarily mean you're going there to catch a flight (although it's a common reason). You might as well be picking someone up, or working there.
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
...
> They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
> 71.5% said drive.
Well that's a bit embarrassing.
That implies that some models are just better than humans.
I don't think the technology needs to live up to some expectation of perfection, just beat out the human average to have benefit (often, sadly, not to workers themselves).
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
Maybe Hegseth should be reading this thread, and/or doing a little reading up on paperclip production maximization.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
What kind of idiot would say "walk"?
I guess if they were average or above they wouldn't gig at Rapidata. Either that, or foreigners with a mediocre understanding of the English language. Or some bored teenagers who get their $0.20 or whatever Rapidata pays for clicking something (was it multiple choice?). The (studied and reported) regression in reading comprehension levels doesn't help either.
This test is interesting because it asks the LLM to break a pattern recognition that's easy to shortcut. "XXX Is 50 Meters Away. Should I Walk or Drive?" is a pattern that 99% of the time will be rightly answered by "walk". And humans are tempted to answer without thinking (as reflected in the 71.5% stat OP is mentioning). This is likely more pronounced for humans that have stronger feelings about the ecology, as emotions tend to shortcut reasoning.
For a long time, LLMs have only been able to think in that "fast" mode, missing obvious trick questions like these. They were mostly pattern recognition machines.
But the more important result here is not "oh look! Those LLMs fail at this basic question". The more important result is that the latest generation actually doesn't fail.
I think I am not the only one to have noticed that there was a giant leap in reasoning capacity between Sonnet 4.5 and Opus 4.6. As a developer, working with Opus 4.6 has been incredible.
I must prove my ability to code with Rust. Should i write a "hello world" script myself or get AI to do it for me?
I found it curious to ask the model to provide its reasoning in that turn, without changing the verdict. Here is one response:
To provide the full picture of the initial reasoning, we have to address the "geospatial" logic of your request. When you asked if you should walk or drive to a car wash 50 meters away, the reasoning relied on a fundamental distinction: The difference between the "Car Wash" (the business) and "Washing the Car" (the action).
Here is the expanded breakdown of that logic:
1. The Location Assumption My reasoning assumed that the car is currently at your home and the professional car wash facility is 50 meters away.
If you drive: You are moving the "object to be cleaned" to the "cleaning tools."
If you walk: You are moving the "cleaner" (you) to the "cleaning tools" to prepare, or you are choosing to bring the "cleaning tools" to the car.
2. The "Walk to Wash" Strategy (Home-Based) The most significant expansion of the "Walk" recommendation is the realization that 50 meters is well within the range of a standard garden hose. Most heavy-duty garden hoses are 15, 25, or 30 meters long. By connecting two hoses, you can reach the car without ever starting the engine. Walking to the car wash facility to buy supplies (or simply realizing how close it is) highlights that you don't actually need to move the car to achieve the goal.
Zero-Motion Cleaning: By walking back and forth with your buckets or hose, the car stays in its spot. You avoid "flash rusting" your brake rotors, which happens almost instantly when you wash a car and then let it sit after a short drive.
3. The "Scout and Prep" Logic If you must use the equipment at the facility (like a high-pressure wand or an automated bay), walking there first is the superior "Logistics" move for the following reasons:
Queue Management: 50 meters is close enough to see the line, but walking allows you to talk to the attendant or check the bay's status. If you drive and there is a 20-minute wait, you are stuck idling your engine (wasting gas and creating heat) or turning it off and on (draining the battery). Walking allows you to timed your "driving" moment perfectly.
The "Dry Interior" Principle: If you walk to the wash to buy your tokens, prep the vacuum, or set up your towels, you are doing the "dirty work" while the car is safely tucked away. Once you finally drive the 50 meters, you do the wash and drive back. This minimizes the time the car spends in a high-moisture, high-chemical environment.
4. The Potential for "Mobile" Service By walking 50 meters to the car wash, you might find that they offer mobile detailing or have "loaner" equipment. In a high-density area, a car wash 50 meters away is practically your neighbor. Walking over to establish a rapport can often lead to a "valet" situation where they come to you, which is the ultimate way to get the car washed without ever sitting in the driver's seat.
----
also noticed a typo and asked it about that:
> what does "timed" in `Walking allows you to timed your driving" moment perfectly.` mean?
That is a typo on my part—it should have been "time" (the verb).
For Perplexity specifically it matters a lot how you trained her thought processes. A smarter user with smarter thoughts changes the outcome of her output.
As a personal comment ... they've changed Perplexity to GPT5 recently (at least for me), and it has been a massive intellectual downgrade compared to the Sonar-Perplexity I had been running with my own, deliberately trained, thought patterns and thinking processes. It's been only a few days and I hope GPT5 catches up; otherwise it's just a massive disappointment.
But unfortunately the article doesn't mention whether they used the reasoning model or not.
Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being a quite small and old model.
I think it's related to sycophancy. LLMs are trained not to question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
I’m willing to bet less than 11 get it right.
Actually, this isn't as "obvious" as it seems—it’s a classic case of contextual bias.
We only view these answers as "wrong" because we reflexively fill in missing data with our own personal experiences. For example:
- You might be parked 50m away and simply hand the keys to an attendant.
- The car might already be at the station for detailing, and you are just now authorizing the wash.
This highlights a data insufficiency problem, not necessarily a logic failure. Human "common sense" relies on non-verbal inputs and situational awareness that the prompt doesn't provide. If you polled 100 people, you’d likely find that their "obvious" answers shift based on their local culture (valet vs. self-service) or immediate surroundings.
LLMs operate on probabilistic patterns within their training data. In that sense, their answers aren't "wrong"—they are simply reflecting a different set of statistical likelihoods. The "failure" here isn't the AI's logic, but the human assumption that there is only one universal "correct" context.