This article gave an LLM a bunch of health metrics, asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.
The most obvious thing that jumps out to me is that I've noticed doctors generally, for better or worse, consider "health" much differently than the fitness community does. It's different toolsets and different goals. If this person's VO2 max estimate was under 30, that's objectively a poor VO2 max by most standards, and an LLM trained on the internet's entire repository of fitness discussion is likely going to give this person a bad score in terms of cardio fitness. But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.
I'd go so far to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.
There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame rests more on Apple for falsely representing the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).
What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all, as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear what part of the data is "good" vs. "unreliable", especially when the company that collected that data claims it's good data.
I will also preface this by saying I do not think any LLM is better than the average doctor, and that you are far better served going to your doctor than asking ChatGPT what your health is like on any factor.
But I'll also say that the quality of doctors varies massively, and that a good number of doctors learn what they learn in school and do not keep up with the latest advances in research, particularly those covering broad spectrums, such as GPs. LLMs that search the scientific literature, etc., might point you in the direction of research that the doctors are not aware of. Or hallucinate you into having some random disease that affects 3 out of every million people and send you down a rabbit hole for months.
Unfortunately, it's difficult to resolve this without extremely good insurance or money to burn. The depth you get and the level of information a good preventative-care cardiologist has is just miles ahead of where your average family medicine practitioner is. Statins are an excellent example: new prescriptions for atorvastatin are still insanely high despite it being a fairly poor choice compared to rosuvastatin or pitavastatin for a good chunk of the people on it. Family medicine docs are often behind on the latest recommendations from the NLA, AHA, etc.
There's a world where LLMs or similar can empower everyday people to talk to their doctor about their options and where they stand on health, where they don't have to hope their doc is familiar with where the science has shifted over the past 5-10 years, or cough up the money for someone who specializes in it. But that's not the world of today.
In the meantime, I do think people should be comfortable being their own advocates with their doctors. I'm lucky enough that my primary care doc is open to reading the studies I send over and working with me. Or at least patient enough to humor me. But it's let me get on medications that treat my symptoms without side effects and has improved my quality of life (and hopefully lifespan/healthspan). There have also been things I've misinterpreted; I don't pick a fight with him if we come to opposite conclusions. He's shown good faith in agreeing with me where it makes sense and pushing back where it doesn't, and I acknowledge he's the expert.
For it to get better, it needs to know outcomes of its diagnosis.
Are people just typing back to ChatGPT saying "you're wrong / you're right"?
You can't feed an LLM years of meteorological time-series data and expect it to work as a specialized weather model, and you can't feed it years of medical time series and expect it to work like a model specifically trained and validated on that kind of data.
An LLM generates a stream of tokens. Feed it a giant set of CSVs and, if it was not RL'd to do something useful with them, it will just try to make whatever sense of them it can and generate something that most probably has no strong numerical relationship to your data. It will simulate an analysis; it won't actually do one.
You may have a giant context window, but attention is sparse; the attention mechanism doesn't see all of your data at once. It can do some simple comparisons, like figuring out that if I say my current blood pressure is 210/180 I should call an ER immediately. But once I send it a time series of my twice-a-day blood-pressure measurements for the last 10 years, it can't make any real sense of it.
Indeed, it would have been better for the author to ask the LLM to generate a Python notebook to do some data analysis on it, and then run the notebook and share the result with the doctor.
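Something like this minimal sketch is what I have in mind (assuming a hypothetical bp.csv export with date, systolic, and diastolic columns; the column names and the 140/90 threshold are my own choices, not anything from the article):

```python
import numpy as np
import pandas as pd

# Hypothetical export: one row per reading, columns date, systolic, diastolic
bp = pd.read_csv("bp.csv", parse_dates=["date"])

# Collapse ~7,300 raw readings (10 years, twice a day) into monthly means
monthly = (bp.set_index("date")[["systolic", "diastolic"]]
             .resample("MS").mean().dropna())

# Long-term trend in mmHg per year via a least-squares fit
years = (monthly.index - monthly.index[0]).days / 365.25
slope = np.polyfit(years, monthly["systolic"], 1)[0]

print(monthly.tail(12).round(1))
print(f"Systolic trend: {slope:+.2f} mmHg/year")
print(f"Share of readings at or above 140/90: "
      f"{((bp.systolic >= 140) | (bp.diastolic >= 90)).mean():.1%}")
```

That's the kind of output you could actually hand to a doctor, instead of the model's vibes about a wall of numbers.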
First of all, wrist-based HR measurements are not reliable. If you feed ChatGPT a ton of HR data that is just plain wrong, expect a bad result. Everyone who wants to track HR reliably should invest in a chest strap. The VO2 max calculation is heavily based on your pace at a given heart rate, and it makes some generalizations about your running biomechanics. For example, if your "real" lab-tested VO2 max stays constant but you improve your biomechanics / running efficiency, you can run faster at the same effort, and your Apple Watch will increase your VO2 max number.
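To make that concrete, here's a rough back-of-the-envelope sketch of why pace at a given heart rate dominates the estimate. It's my own simplification using the ACSM running equation and heart-rate reserve, not Apple's actual algorithm:

```python
def estimate_vo2max(pace_min_per_km, avg_hr, resting_hr, age):
    speed = 1000.0 / pace_min_per_km              # metres per minute
    vo2_at_pace = 0.2 * speed + 3.5               # ACSM level-ground running equation, ml/kg/min
    max_hr = 208 - 0.7 * age                      # Tanaka age-predicted max heart rate
    hr_fraction = (avg_hr - resting_hr) / (max_hr - resting_hr)   # fraction of heart-rate reserve
    return (vo2_at_pace - 3.5) / hr_fraction + 3.5                # %HRR roughly tracks %VO2 reserve

# Same effort (150 bpm), but running economy improves from 6:00/km to 5:30/km:
print(estimate_vo2max(6.0, 150, 60, 35))  # ~49 ml/kg/min
print(estimate_vo2max(5.5, 150, 60, 35))  # ~53 ml/kg/min
```

Same heart rate, better economy, higher estimated number, even though nothing about your actual oxygen uptake changed.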
Imagine if, as a dev, someone came to you and told you everything that is wrong with your tech stack because they copy-pasted some console errors into ChatGPT. There's a reason doctors need to spend almost a decade in training to parse this kind of info. If you do the above, then please do it with respect for their profession.
The basic idea was to adapt JEPA (Yann LeCun's Joint-Embedding Predictive Architecture) to multivariate time series, in order to learn a latent space of human health from purely unlabeled data. Then we tested the model using supervised fine-tuning and evaluation on a bunch of downstream tasks, such as predicting a diagnosis of hypertension (~87% accuracy). In theory, this model could also be aligned to the latent space of an LLM, similar to how CLIP aligns a vision model to an LLM.
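For anyone curious, the training objective looks roughly like this. This is a stripped-down PyTorch sketch of the general JEPA-for-time-series idea, not our actual code; the dimensions, transformer backbone, and masking scheme are placeholders:

```python
import copy
import torch
import torch.nn as nn

class TSEncoder(nn.Module):
    """Embeds a multivariate time series (batch, time, channels) into a latent sequence."""
    def __init__(self, n_channels, d_model=64):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, x):
        return self.encoder(self.proj(x))   # (batch, time, d_model)

context_enc = TSEncoder(n_channels=8)
target_enc = copy.deepcopy(context_enc)      # EMA copy of the context encoder, updated outside this snippet
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(64, 64)                # real JEPA predictors also condition on target positions

def jepa_loss(x, mask):
    """mask: (batch, time) bool, True where the span is hidden from the context encoder."""
    with torch.no_grad():
        targets = target_enc(x)                          # latent embeddings of the full series
    x_ctx = x.masked_fill(mask.unsqueeze(-1), 0.0)       # hide the masked span from the context
    preds = predictor(context_enc(x_ctx))
    return ((preds - targets) ** 2)[mask].mean()         # predict latents, not raw sensor values
```

The point is that the loss lives in latent space, so the model learns structure in the signals rather than trying to reconstruct noisy raw values.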
IMO, this shows that accuracy in consumer health will require specialized models alongside standard LLMs.
They usually require more data. It is not a great idea to diagnose anything with so little information. But in general I am optimistic about the use of LLMs in health.
I'm definitely not going with Apple. Are there any minimally obtrusive trackers that provide downloadable data?
A family member recently passed away from a rare, clinically diagnosed disease. ChatGPT knew what it was a couple months before the relevant specialists diagnosed it.
... and you won't believe what happened next!
Can we do away with the clickbait from MSN? The article is about LLMs misdiagnosing cardiovascular status when given fitness tracker data
Sure, LLM companies and proponents bear responsibility for the positioning of LLM tools, and particularly their presentation as chat bots.
But from a systems point of view, it's hard to ignore the inequity and inconvenience of the US health system driving people to unrealistic alternatives.
(I wonder if anyone's gathering comparable stats on "Doctor LLM" interactions in different countries... there were some interesting ones that showed how "Doctor Google" was more of a problem in the US than elsewhere.)
At the end of the day, it’s yet another tool that people can use to help their lives. They have to use their brain. The culture of seeing the doctor as a god doesn’t hold up anymore. So many people have had bad experiences when the entire health care industry, at least in the US, is primarily a business rather than something that helps society get healthy.
Paywall-free version at https://archive.ph/k4Rxt
Look, AI Healthbros, I'll tell you quite clearly what I want from your statistical pattern analyzers, and you don't even have to pay me for the idea (though I wouldn't say no to a home or Enterprise IT gig at your startup):
I want an AI/ML tool to not merely analyze my medical info (ON DEVICE, no cloud sharing kthx), but also extrapolate patterns involving weather, location, screen time, and other "non-health" data.
Do I record taking tylenol when the barometric pressure drops? Start alerting me ahead of time so I can try to avoid a headache.
Does my screen time correlate to immediately decreased sleep scores? Send me a push notification or webhook I can act upon/script off of, like locking me out of my device for the night or dimming my lights.
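Something like this, run locally against hypothetical screen_time.csv and sleep.csv exports (the file names, columns, and webhook URL are all made up for illustration), is all I'm really asking for:

```python
import pandas as pd

screen = pd.read_csv("screen_time.csv", parse_dates=["date"])   # columns: date, minutes
sleep = pd.read_csv("sleep.csv", parse_dates=["date"])          # columns: date, score

# Pair each evening's screen time with that night's sleep score
merged = screen.merge(sleep, on="date")
r = merged["minutes"].corr(merged["score"])
print(f"Screen time vs. sleep score: r = {r:.2f}")

# If the relationship is strong and negative, fire a local webhook I can script
# against: dim the lights, lock the device for the night, whatever I choose.
if r < -0.4:
    import requests
    requests.post("http://localhost:8080/hooks/wind-down", json={"r": float(r)})
```

The same pattern covers the barometric-pressure/headache case above: correlate two local streams, alert me when the pattern holds, keep everything on device.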
Am I recording higher-intensity workouts in colder temperatures or inclement weather? Start tracking those metrics and maybe keep better track of balance readings during those events for improved mobility issue detection.
Got an app where I track cannabis use or alcohol consumption? Tie that to my mental health journal or biological readings to identify red flags or concerns about misuse.
Stop trying to replace people like my medical care team, and instead equip them with better insights and datasets they can more quickly act upon. "Subject has been reporting more negative moods in his mental health journal, an uptick in alcohol consumption above his baseline, and inconsistent cannabis use compared to prior patterns" equips the care team with a quick, verifiable blurb from larger datasets that can accelerate care and improve patient outcomes - without the hallucinations of generative AI.
I strongly dislike the author conflating HIPAA with PHI but this is a losing battle for me. And clearly editors don’t spot it, neither do AI systems - where is Clippy?! It simply serves as an indicator the author is a pretty ignorant medical consumer in the US, and this case study is stunning. Some people really should not be allowed to engage with magic.