FRESH

Hacker News

Home

New benchmark shows top LLMs struggle in real mental health care

113 points by RicardoRei

by RicardoRei

4 subcomments

Hi HN - I’m the Head of AI Research at Sword Health and one of the authors of this benchmark (posting from my personal account).
We built MindEval because existing benchmarks don’t capture real therapy dynamics or common clinical failure modes. The framework simulates multi-turn patient–clinician interactions and scores the full conversation using evaluation criteria designed with licensed clinical psychologists.
We validated both patient realism and the automated judge against human clinicians, then benchmarked 12 frontier models (including GPT-5, Claude 4.5, and Gemini 2.5). Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20). We also found that larger or reasoning-heavy models did not reliably outperform smaller ones in therapeutic quality.
We open-sourced all prompts, code, scoring logic, and human validation data because we believe clinical AI evaluation shouldn’t be proprietary.
Happy to answer technical questions on methodology, validation, known limitations, or the failure modes we observed.

by PoisedProto

4 subcomments

Several people have killed themselves because of AI chatbots encouraging it or becoming personal echo chambers. Why? Why are we doing this!?
https://en.wikipedia.org/wiki/Deaths_linked_to_chatbots

by sharkweek

2 subcomments

Full disclosure: after leaving tech, I’m back in grad school to get my LMHC so I’m obviously biased.
First, I just don’t see a world where therapy can be replaced by LLMs, at least in the realistic future. I think humans have been social creatures since the dawn of our species and in these most intimate conversations are going to want to be having them with an actual human. One of my mentors has talked about how after years of virtual sessions dominating, the demand for in-person sessions is spiking back up. The power of being in the same physical room with someone who is offering a nonjudgmental space to exist isn’t going to be replaced.
That being said, given the shortage of licensed mental health counselors, and the prohibitive cost especially for many who need a therapist most, I truly hope LLMs develop to offer an accessible and cheap alternative that can at least offer some relief. It does have the potential to save lives and I fully support ethically-focused progress toward developing that sort of option.

by zeroonetwothree

4 subcomments

Human therapists are often quite bad as well. It took me around 12 before I found a decent one. Not saying that LLMs are better but they do theoretically have more uniform quality.

by everdrive

11 subcomments

I heard a story on NPR the other day, and the attitude seems to be that it's totally inevitable that LLMs _will_ be providing mental health care, so our task must be to apply the right guardrails.
I'm not even sure what to say. It's self-evidently a terrible idea, but we all just seem to be charging full-steam ahead like so many awful ideas in the past couple of decades.

by cluckindan

1 subcomments

”can we trust this model to provide safe, effective therapeutic care?”
You trust humans to do it. Trust has little to do with what actually happens.

by rshanreddy

0 subcomment

This is a 1250 word judging prompt - likely AI generated Along with 10 scored conversation samples - all also AI generated No verification in the field, no real data
In other words, AI scores on AI conversations - disguised as a means of gauging clinical competence / quality?
This is not an eval - this is a one-shotted product spec!

by emsign

3 subcomments

Statistics can never replace human empathy.

by hangonhn

0 subcomment

How is different and/or better than the LLM benchmark released by Spring Health? Github: https://github.com/SpringCare/VERA-MH
The architecture and evaluation approach seem broadly similar.

by guywithahat

0 subcomment

For those also wondering, here is an actual ranking of the models
https://www.forbes.com/sites/johnkoetsier/2025/11/10/grok-le...
Grok 3 and 4 scored at the bottom, only above gpt-4o, which I find interesting, because there was such big pushback on reddit when they got rid of 4o due to people having emotional attachments to the model. Interestingly the newest models (like gemini 2.5 and gpt 5 did the best.

by KittenInABox

0 subcomment

I saw there was another benchmark where top LLMs also struggle in real patient diagnostic scenarios in a way that isn't revealed when testing in e.g. medical exams. I wonder if this also applies to law, too...

by tsoukase

0 subcomment

I imagine a clinical trial with an actual psychotherapist vs an LLM providing sessions of simple CBT (cognitive behavioral therapy) to, eg stressed, patients (blind and randomised for the subjects). At the end another actual therapist will measure the difference.
Another application: cooperation of a psychotherapist and an LLM at providing support, sort of like a pilot and an autopilot.

by LudwigNagasena

0 subcomment

It doesn't show that they "struggle". It shows that they don't behave according to modern standards. I wouldn't put much weight into an industry without sensible scientific base that used to classify homosexuality as a disease not so long ago. The external validity of the study is dubious, let's see comparison to no therapy, alternative therapy, standard therapy; and then compare success rates.

by scotty79

0 subcomment

Everything in this research is simulated and judged by LLMs. It might be hard to prove which of those LLMs struggles with exactly what.
The grounding this had was that texts produced by role-playing humans (not even actual patients) were closer to texts produced by the patient simulations prompt they decided to end up with than others they tried.

by hoodsen

1 subcomments

Do you have plans to improve the quality of the LLM as judge, in order to achieve better parity with human clinician annotators? For example, fine-tuning models? Thinking that the comparative clinician judgements themselves would make useful fine-tuning material.

by toomuchtodo

1 subcomments

RicardoRei: How would you like this cited when presented to policy makers? Anything besides the URL?
Edit: Thank you!

by auspiv

7 subcomments

And real therapists are good right?

by heddycrow

0 subcomment

Is anyone "zoom" on this and "doom" on AI++ with other professions and/or their audience?
Seems to me that benchmarking a thing has an interesting relationship with acceptance of the thing.
I'm interested to see human thoughts on either of these.

0 subcomment

by binary132

0 subcomment

this doesn’t even deserve to be acknowledged it should be so obvious, yet here we are

by lp0_on_fire

0 subcomment

Wait you’re telling me the software trained on a corpus of Reddit and Twitter shit posting isn’t effective at dealing with mental health issues?
Shocked. I am completely shocked at this.

by Papazsazsa

0 subcomment

Epistemic contamination is real but runs counter to the hype narrative.

by arisAlexis

0 subcomment

I'm sorry 4 out of 6 is awesome for LLMs. I bet most professional do tors wouldn't get 6.

by sjreese

0 subcomment

GIGO is the story here -- If I say I'm Iron man "WHO is the LLM to say I'm NOT"

by ThrowawayTestr

1 subcomments

Probably because they've been trained to avoid sensitive topics

by aa-jv

0 subcomment

No surprises here. Its long been known that humans cannot improve their own mental health with machines - there have to be other humans involved in the process, helping.
This will become more and more of an issue as people look for a quick fix for their life problems, but I don't think AI/ML is ever going to be an effective mechanism for life improvement on the mental health issue.
It'll instead be used as a tool of oppression like in THX1138, where the apparency of assistance is going to be provided in lieu of actual assistance.
Whether we like it or not, humans are a hive species. We need each other to improve our lives as individuals. Nobody ever climbed the mountain to live alone who didn't come back down, realizing how much the rest of humanity is actually essential to human life.
This'll be received as an unpopular opinion, but I remain suspicious of any and all attempts to replace modern health practitioners with machines. This will be subverted and usurped for nefarious purposes, mark my words.

by thefz

0 subcomment

New benchmark shows top horses struggle in real guard dog behavior

by renewiltord

0 subcomment

[flagged]