OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
- I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See, for example, this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary and that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
- I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs' favour, which does not represent clinical practice. These reasoning cases are not benchmarks for doctors; they are learning tools.
I think it's important to note that diagnosis also relies on an accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources and trying to filter out what is important. This may come from the patient, who may not be able to communicate clearly or may be non-verbal, or from carers and next of kin. History-taking is a skill in itself, as is examination. Here those data are given.
For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, whether to order imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.
by programmertote
2 subcomments
- My spouse is a hematologist-oncologist. She and all of her coworkers use ChatGPT. Before that, they looked stuff up on UpToDate [ https://www.uptodate.com/login ] (they sometimes still do). I went to medical school for three years and quit because I couldn't stand the rote memorization part of the studies. Too many facts to remember, IMO.
Even as an AI-neutral person, I'm very confident that AI/ML-based computer systems, once trained specifically for medicine, will consistently do better than human doctors. Believe it or not, a lot of human errors are made in the medical field (doctors just don't admit it, so we don't know about them) due to doctors' lack of time, incompetence, or simply forgetting a fact or two they should have checked when diagnosing or coming up with a treatment.
by creativeSlumber
13 subcomments
- > "An AI and a pair of human doctors were each given the same standard electronic health record to read"
This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even from a brief observation of the patient.
- Besides myself and my wife, I've also used LLMs to diagnose my dogs. I'm convinced there's a huge opportunity for AI-based veterinary care, especially one which then runs bidding across the local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80-year-old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they're extremely susceptible to pressure.
- I wouldn't put much weight on this study, but I think a lot of us can still attest to the usefulness of LLMs in self-diagnosis. The reality in the US is that it is difficult to get the attention and care of a doctor, so we're left having to do it ourselves. 10 years ago you'd hear docs complaining about patients coming in with things they found on Google, but now I don't think there's an alternative.
Case in point: I went to a podiatrist for foot and ankle issues. He diagnosed my foot issues from the X-ray but just shrugged his shoulders at the ankle issues and said the X-ray didn't show anything. My 15-minute allocation of his attention expired, and I left without a clue as to the issue or what corrective actions to take. 5 minutes with an LLM and I had a plausible explanation for the ankle issues which aligned with the diagnosis of my foot.
- Hyped title. It was exclusively text-based diagnosis after physicians did the whole interview, exam, labs, etc.
Also, later in the encounter, with more chart information, AI scored 82%, physicians 70–79%; that difference was reportedly not statistically significant.
So current AI can aid in diagnosing like we've all known.
- If this is repeatable and holds true across testing groups and practitioners that would be amazing! Doctors could finally spend time with patients rather than rushing to probe, document, test and diagnose. They are so pressed to maximize their time that any time back could go straight into real care. Am I being blindly optimistic here?
- It would have been interesting to see how a doctor with access to LLMs would perform, compared to only LLMs and only doctors.
If doctors with LLM access still scored 67%, then someone with no medical knowledge could potentially score the same, which would make ER triage a task replaceable by AI. But I am sure that is not the case. Competent doctors, with the background they have, can use LLMs to brainstorm and analyze different paths, and score higher.
- I know a cardiologist who founded a training & knowledge-base startup for doctors. He once told me (this was before LLMs) that it's super common for a doc to tell a patient they need to look something up in the patient's history, and then instead google the symptoms. Or, even more often, quickly text a colleague.
I have no way of knowing if this is true. But I'd rather have a complete, guided prompt be the basis of a diagnosis than a 2-minute Google search.
- Obviously anecdotal, but a couple of years ago my friend's kid was sick, and doctors were trying to figure out what was going on. My friend threw the symptoms and test results into ChatGPT, and it said the likely cause was leukemia. A few hours later the doctors handed them an official leukemia diagnosis.
I think AI, like in all other fields, will become a great tool for augmentation. Throw the patient data in, get a response, and that can be the first thing the doctor checks for - but they shouldn't simply take the AI's output as truth.
P.S. My friend's kid is doing great - it was caught early enough. They are due to be completely done with treatment in just a couple of months!
- Not long ago I started having an issue with my eye. I called around and they said I should get seen ASAP, same day if possible, but it wasn’t worth the ER and it was a five day wait for an appointment.
I was pretty freaked out. During that time, I tried diagnosing it with AI. When I finally got to the appointment, the actual doctor sat down, looked at all the unremarkable images, asked me one (1) question, ordered another image and diagnosed the issue. When I looked back, in all that time, the AI had mentioned it exactly one time early on, ruled it out immediately based on a flawed understanding of the symptoms, and never brought it up again.
Just my anecdotal evidence, but I’d never trust any AI on its own. My doctor can use it if they want, I can’t.
by OptionOfT
3 subcomments
- As a 37-year-old male with 2 THRs (total hip replacements), I'm glad AI was NOT used in my diagnosis. All the models I used to look at my X-rays said nothing was wrong, even when I added symptoms. When I added my age, they said the patient was too young.
(In those X-rays I was ~3 months away from being wheelchair-bound.)
The worst one was Gemini. Upload an X-ray of just the right hip, and it started to talk about how good the left hip looked.
I think with AI taking over, it's gonna be harder to get a solution when your problem isn't run-of-the-mill.
- o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?
- The paper: https://www.science.org/doi/10.1126/science.adz4433 (April 30, 2026)
- I'm in ophthalmology, where AI diagnostics have been promised for almost a decade. We have an FDA-approved diagnostic for diabetic retinopathy screening that has been commercially available since 2018, and papers claiming board-certified-ophthalmologist-level classification accuracy as far back as Inception v3. Maybe it's just an economic barrier, but these tools still haven't made any meaningful impact in the US. Other countries without healthcare access? It's helpful for culling the herd, but it doesn't fix the last-mile problem of what you do when you find referable disease that needs treatment.
My philosophical take: if AI can outperform the average, it’s probably a net benefit for society that I won’t have a job. Until then, I’m going to take my income and save up for an early retirement.
- LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
- I advise a medical non-profit, and we ran a series of tests against cases doctors had entered into our system when looking for specialist recommendations.
We found that gpt-5-mini performed better than gpt-5, Sonnet 4, and MedGemma.
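Roughly the shape of such a harness, for the curious - the client, model names, and exact-match scoring below are illustrative, not our actual code:

    # Sketch of a model-comparison harness for specialist-recommendation cases.
    # Assumes an OpenAI-compatible API; each case pairs a free-text description
    # with a reference specialty chosen by a doctor.
    from openai import OpenAI

    client = OpenAI()

    def recommend_specialist(model: str, case_text: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Reply with exactly one medical specialty."},
                {"role": "user", "content": case_text},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    def accuracy(model: str, cases: list[dict]) -> float:
        hits = sum(recommend_specialist(model, c["text"]) == c["specialty"].lower()
                   for c in cases)
        return hits / len(cases)

Even the exact-match comparison at the end is a judgment call ("cardiology" vs. "interventional cardiology"), which is one concrete reason the scoring is hard.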
I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.
by chromacity
0 subcomment
- All the other points raised in this thread aside, it seems like an odd thing to benchmark, because a significant proportion of ER practice is dealing with emergencies, often accidental injuries. There's not a whole lot of diagnosing going on if you show up to the ER with a gash on your forehead or a missing finger.
by SkiFreeWin3
0 subcomment
- Yes, but what was the overlap?
- Since when do "triage doctors" attempt diagnosis, or have the expectation of doing so? They're just trying to figure out who needs to see the actual doctor first.
- How big is the difference between 67% and 55%, really? Did the research consider the same patients as the doctors saw?
How effective can this be as science if it isn't compared side by side - how each scenario was evaluated by both, and how they came to different conclusions?
Who can ensure a doctor couldn't spot some blind spot the AI missed in the remaining 33%?
Tools are for combining efforts, not for replacement.
Throwing such percentages at the public is very irresponsible.
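To make the first question concrete: whether 67% vs. 55% means anything depends entirely on the sample size, which the headline never tells you. A quick sketch (the n values are hypothetical, since they aren't given here):

    # Two-proportion z-test on the headline numbers. Sample sizes are made up
    # to show how much the conclusion depends on them.
    from math import sqrt

    def two_prop_z(p1: float, p2: float, n1: int, n2: int) -> float:
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    print(two_prop_z(0.67, 0.55, 50, 50))    # ~1.23, below the ~1.96 cutoff
    print(two_prop_z(0.67, 0.55, 500, 500))  # ~3.89, highly significant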
- Sensitivity vs specificity
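Presumably the point being that a single accuracy number hides the error trade-off. A toy illustration (all counts made up):

    # Two diagnostic policies with the same accuracy but opposite failure modes.
    def sens_spec(tp: int, fn: int, tn: int, fp: int):
        sensitivity = tp / (tp + fn)  # sick patients correctly flagged
        specificity = tn / (tn + fp)  # healthy patients correctly cleared
        accuracy = (tp + tn) / (tp + fn + tn + fp)
        return sensitivity, specificity, accuracy

    print(sens_spec(tp=95, fn=5, tn=40, fp=60))  # (0.95, 0.40, 0.675): over-flags
    print(sens_spec(tp=40, fn=60, tn=95, fp=5))  # (0.40, 0.95, 0.675): under-flags

Both land at ~67% accuracy, but in an ER the second one quietly sends sick people home.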
by swisniewski
0 subcomment
- Let's assume the AI does outperform the doctor.
I still want humans in the loop, interpreting the LLMs findings and providing a sanity check.
You can’t hold an LLM accountable.
That's the minimum responsible bar for LLM-authored code, which normally doesn't really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.
The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.
We should embrace these tools.
But, "eliminating doctors"… hardly.
- I can't help visualizing the scene in Idiocracy where there is an examination. The guy gets multiple wires put in his hands, mouth, and rectum. The guy that assists (aka the doctor) switches the wires after each person.
If we trust machines too much...
- I wonder about the nuance within the data. For example, does the AI do much worse with children than with adults, but still better overall? Or with biological males vs. females? We'd want it to do better across all groups, ages, etc., so we're not introducing some kind of horrible bias resulting in deaths or serious health consequences for some groups.
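A per-subgroup breakdown would be the minimum check - a sketch, assuming per-case demographics were recorded (column names hypothetical):

    # Overall accuracy can look fine while one subgroup does badly.
    import pandas as pd

    df = pd.DataFrame({
        "age_group": ["child", "child", "adult", "adult", "adult", "adult"],
        "sex":       ["f",     "m",     "f",     "m",     "f",     "m"],
        "correct":   [0,       0,       1,       1,       1,       1],
    })

    print(df["correct"].mean())                                # ~0.67 overall
    print(df.groupby(["age_group", "sex"])["correct"].mean())  # children: 0.0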
by SpyCoder77
1 subcomments
- This is a rather new article about an old model...
- The Pitt third season leak? All of the ER is fired and Robbie is fighting schizophrenia with 15 agents and Dana?
- This reminds me of GPT-4-era studies where the LLM did better on a law school exam than a student. We are not in 2023 anymore - or, in the case of medicine, are we? If so, that is bad news for health-related applications, as the low-hanging fruit in LLMs has already been picked.
- One shouldn't trust AI regarding medical matters; things can go downhill, you know.
by DeepYogurt
0 subcomment
- Who's accountable for the 33%?
by LeCompteSftware
1 subcomments
- It is easy to overinterpret this based on the headline; the doctors were actually at a slight disadvantage. This isn't how they normally work - it's a little more like a med school pop quiz:
An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.
"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.
- o1 has a METR time horizon of around 40 minutes; Opus 4.7 has an implied horizon of 18 hours based on its ECI score. This study is on a model that's several generations behind with respect to the kinds of tasks it can complete. It would be shocking if this number were anywhere near as low with GPT 5.5, to the point that it seems nearly irrelevant to talk about these results.
by david_mchale
0 subcomment
- Having been in ERs too many times when they are beyond capacity, something like this would be better than patients slipping through the cracks; at least you get a chance.
by getnormality
0 subcomment
- Wow, amazing. They had an AI robot running o1 look at live ER patients coming in just like a real doctor and they did that much better? Incredible! (literally)
by theshrike79
3 subcomments
- I'll repeat my idea on how this MUST be done:
1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.
2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.
3. Doctor sees the AI's diagnosis.
4. Doctor can adjust their diagnosis, BUT the original stays in the system.
This way the AI stays as the assistant and won't affect the doctor's decision, but they can change their mind after getting the extra data.
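A sketch of what that locked, versioned record could look like - field names and storage are illustrative; a real system would persist this in an audit log:

    # Append-only diagnosis record: the doctor's blind first call is frozen
    # before the AI's suggestion is revealed, and revisions never overwrite it.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class DiagnosisRecord:
        ai_diagnosis: str  # step 1, hidden from the doctor at first
        versions: list[tuple[str, str, str]] = field(default_factory=list)

        def lock_initial(self, doctor_dx: str) -> str:
            assert not self.versions, "initial diagnosis already locked"
            self.versions.append(("blind", doctor_dx, self._now()))
            return self.ai_diagnosis  # step 3: reveal only after locking

        def revise(self, doctor_dx: str) -> None:
            assert self.versions, "lock a blind diagnosis first"
            self.versions.append(("post-ai", doctor_dx, self._now()))

        @staticmethod
        def _now() -> str:
            return datetime.now(timezone.utc).isoformat()

Because versions is append-only, you can later measure how often seeing the AI's answer changed the doctor's call, and in which direction.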
by 1980phipsi
0 subcomment
- How much time do the doctors spend diagnosing versus o1?
- I don't think critical situations like this are a good use case for AI. Maybe in a decade we'll have AI helping doctors with a pre-check. But what if the AI finds nothing and the doctor doesn't bother to look into it further? From my POV, that small question is what breaks the technology, from any angle, later down the road. AI has to stay optional here.
Even if AI is used to sample or summarize more data than a human could in the available time: what if it misses something that a human wouldn't? What if, conversely, a human misses something that the AI wouldn't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)
- Can't happen soon enough. If the bar was as high as it needed to be, there'd be like one qualified doctor on Earth so far.
- I mean, an LLM is a slightly stirred-up soup of current human knowledge. It has an advantage in the quantity of accumulated data, and maybe in connecting seemingly less-connected parts of that data - but not reliably. The human has an advantage (for now) in data collection (seeing, hearing, sensing the patient), actual agency, real-world experience, and getting the useful data out of the stirred-up soup. Both human and LLM are susceptible to bias and harmful influence. Let's simply isolate them in the diagnostic process and then compare their output: human collects data -> both human and LLM evaluate independently -> compare the results -> human may get new insights -> final diagnosis by human.
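As a sketch of that isolation (everything here is a placeholder for the real actors):

    # Blinded parallel evaluation: neither party sees the other's answer
    # before committing; the final call stays with the human.
    def diagnose(case_data, human_diagnose, llm_diagnose, human_reconcile):
        human_dx = human_diagnose(case_data)  # committed first, blind
        llm_dx = llm_diagnose(case_data)      # computed independently
        if llm_dx != human_dx:
            # Disagreement surfaces new insights; the human decides.
            return human_reconcile(case_data, human_dx, llm_dx)
        return human_dx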
by colechristensen
1 subcomments
- I think this is more a commentary on how bad ER diagnosis is.
- Gell-Mann Amnesia kicks in hard as soon as the LLM topic changes to a profession other than our own. It’s much easier to believe an LLM can outperform someone else doing their job than to believe that it’s a good idea to replace your own work with an LLM.
The number in the headline isn't even a good comparison, because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative when diagnosing from someone else's notes, because it's their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence.
When they allowed both the doctors and the AI access to more information about the case, the difference between the groups collapsed into statistical insignificance:
> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.
- Off topic, is a “reject all and subscribe” cookie popup button legal?
I thought websites have to make it as easy to give consent as withdraw consent[1] - and here one cannot withdraw consent without an extra step (subscribing).
Instead I would expect access to the article, with same ads as in the “user consented” path, just not personalized.
[1]: “The GDPR is specific that consent must be as 'easy to withdraw as to give'”, https://en.wikipedia.org/wiki/HTTP_cookie
- I'm curious - I'd like to know whether that 33% is a subset of the 45-50%.
If it isn't a subset, then how serious were those errors? More deaths? Longer recovery times? What did that difference translate into?
by gizmodo59
4 subcomments
- The negative reactions here are baffling me. The fact that we can even get to, say, 30% with a computer is amazing. So much hatred towards AI and anything from frontier labs like OpenAI (or Google, for that matter) makes no sense.
- Radiology already had its "AI beats doctors" moment. Radiologists are still here. What changed first was the workflow, not the specialty. The ER is probably next.
by economistbob
0 subcomment
- What we need is a completely walled garden during the ER sign-in process where the patient says what they think the problem is. Then things proceed as normal. We need some data to know whether patients are less than fifty percent accurate or not.
Fifty percent accuracy. That's terrible.
by adamtaylor_13
1 subcomments
- Despite what I suspect the general consensus on HN may be, this does not surprise me at all.
My wife was recently diagnosed with Mast Cell Activation Syndrome (MCAS) after a pretty scary series of ER visits. It's a very strange and stubborn autoimmune disease that manifests with a number of symptoms that, taken individually, could indicate damn near anything.
You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.
Anyway... it lit a bit of a fire in me to dig deeper, and one day Claude suggested MCAS. I started plugging in more labs, asking Claude to cross-reference journals mentioning MCAS, and sure enough: it's MCAS.
idk what the moral of the story is except our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.
- How much confidence does 67% carry? Was it looking at the same patients with the same info? If not, it's just clickbait.
- But what was the overlap?
by gamerslexus
1 subcomments
- Hold on. Does this mean ER diagnoses are marginally better than pure chance?
by lowbloodsugar
0 subcomment
- Computers have been better at this since the 80s. But the doctors have a really good union, and they’re smart enough not to call it a “union” so it sounds like it’s about standards and ethics.
- Triage deliberately considers rarer conditions that would be more serious or require more urgent treatment, so they can be ruled out.
- I would rather be incorrectly diagnosed by a doctor than have chudgpt treat me.
- I've some family in medicine and it scares me how much they now rely on AI. Some even quote it like the Bible.
by SilverElfin
1 subcomments
- I've had much better luck with AI diagnosis of my own family's issues than with doctors. Usually now, I'm feeding doctors more information to begin with, so that their 30-minute office visits are not wasted, requiring another expensive follow-up appointment.
While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.
- Would it ever diagnose incorrectly to save more lives? Kinda weird that an AI would decide who dies so others may survive, but I guess whatever.
by Aboutplants
0 subcomment
- Now show me the results of triage doctors aided by AI.
- JFC, when does this AI boosting finally stop?
by bluefirebrand
0 subcomment
- Unfortunately, from my understanding, doctors don't necessarily diagnose for accuracy; they often diagnose to limit liability.
They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.
Edit: I'm not trying to say doctors deliberately diagnose wrong. Just that if there are two possible diagnoses - one common that matches some of the symptoms, and one rare that matches all of them - doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc.
by nikhilpareek13
0 subcomment
- The Guardian needs to raise its bar on what to report and how to give readers full context on the ongoing "NFT, AI, trust-me-bro, crypto" scam - and that context would be that this is a mathematical model of human language, not a medical expert or a replacement for one.
- I'd love to see a follow-up to that radiologist evaluation, where AI failed so miserably at the thing it was supposed to be best at that there's now a shortage of radiologists.
- Humans could not diagnose and treat me correctly. They almost killed me. Curious where I could feed my symptoms and the same data I gave to an ER to an AI to test it.
- As a 60-year-old I developed my own AI medical assistant [1] and have used it extensively for many conditions; I couldn't be happier. After analyzing some lab tests it even recommended a marker that the doctor hadn't considered at first. So no, it won't replace doctors, but it is a very helpful tool for self-diagnosing simple conditions and for second opinions.
[1] https://mediconsulta.net (DeepSeek)
- Believable and not shocking. LLMs may literally have saved my sons, and potentially their mother too, by allowing us to fact-check a lot of nonsense data and scare tactics from a group of at least 5 different doctors ambushing us to make a life-changing decision in minutes. The problem is that doctors, at least in the US, prioritize liability exposure over patients' long-term outcomes.
Let's say you need an intervention where two options, A and B, are available to you. A carries a 1% risk of complications but a great outcome. Option B has a 0.1% risk of complications, but once you are discharged the short-term effects are challenging and the long-term effects are not well understood. Well, 10/10 times doctors will suggest option B and will do anything they can to nudge you into making that choice, like not telling you the absolute numbers and constantly using the word "death". They also lie about the outcomes because, again, once you accept the procedure, sign, and are sent home, they have nothing to do with you.