FRESH

Hacker News

Home

Scientific datasets are riddled with copy-paste errors

148 points by jruohonen

by TrackerFF

3 subcomments

It wouldn't surprise me one bit if many of these things can be attributed to Excel usage. I'm a "power user" of excel, and when working on larger problems with tens of sheets, smaller mistakes can easily carry on. Even more so if you're not a proficient user.
One of my first jobs as an analyst was to clean up messy spreadsheets made by people, even very senior employees, who never bothered to learn excel properly.

by coppsilgold

6 subcomments

What should give people pause is how not complicated (I'd hesitate to say easy) it would be to create a python script that would generate fake data such that it would be all but impossible to determine whether it's real or not. You just need to model the measuring device and hypothesis you want to support, then sample away.
The people who get caught red handed like this are lazy, incompetent and stupid. Makes you wonder what about the ones not getting caught.

by jcattle

1 subcomments

Just a thought: This data engineering can only really occur in sciences with a significant "moat".
Expensive tools, expensive test setups, live, gene-altered animals, etc.
In fields such as deep learning or other more digital fields (my field is using a lot of freely available satellite data) replication is often cheaper and actual application of research outcomes is a lot more common.

by steve_adams_86

6 subcomments

This is legitimately so challenging to avoid, because loads of scientific processes are—to some degrees or others—bespoke and difficult to fully streamline and introduce efficient, well-structured, comprehensive QA.
A LOT of labour goes into making it work. Most scientists I know and work with are very diligent people who care a lot about the outputs being as correct as possible, but wow, their workflows aren't great.
My job is to try and address this in whatever ways are practical for the data and the people doing the science, and it's kind of like Saas in that you think it should be easy enough to spot problems, solve them, and carry on/become a billionaire, but... The world is much more complicated than that, and it's easier to fail in this endeavour than it is to break even.
The classic "DropBox is just rsync" or "I could build Airbnb in a weekend" sentiments have their commonalities and counterparts in science, and the reality is similarly defeating and punishing on both sides. Making science go faster while maintaining correctness is exceedingly difficult. There are so many moving parts. So many disparate participants who are wildly technical and capable, or brilliant at studying bacteria in starfish yet terrified to run a command in a terminal. Your user base has virtually nothing in common in terms of ability and willingness to do anything other than get their own work done. It's brutal.
So, I sympathize with the authors of these papers and I hope readers don't assume they're bad at what they do or that it's done in bad faith. It's genuinely difficult.
An anecdote: I created a tool for validating biodiversity data against a specification called Darwin Core. Initially our published data was failing to validate so much that I thought I'd made the tool wrong. Rather, the spec is so complex and vast that the people I work with were unable to manage to get valid data into the public repositories. And yet! They were able to publish, because the public repositories' own validation is... Invalid. That's the state of things.
Granted, the data is still correct enough to be useful, and the errors don't cause the results to indicate anything that they shouldn't. It's more like minor metadata issues, failures to maintain referential integrity across different datasets, etc. But it's a very real, very difficult problem.
Science isn't easy at all. So many hoops to jump through, so much rigor, so much data. Mistakes are inevitable.

by l5870uoo9y

3 subcomments

> It could be either a fat-finger mistake when editing the Excel file or deliberate tampering to cover up real data that didn't tell the right story.
I can easily imagine after spending years or decades devoted to discovering a scientific breakthrough that some could be tempted to slightly alter the data. I believe there was some scandal about this a few years back with climate data. Fixing this is however something that AI would do fairly well.

by kriro

0 subcomment

Innocent mistakes and frustrating back and forth are also very common, especially for interdisciplinary teams. The mismatch in tooling and workflows and manual copy & paste conversion is a thing to behold. Add multiple countries and Excel to the mix (dot vs comma, formulas being language specific), maybe have a couple of Chinese, Japanese, Russian or Arabic speking researchers in the group for some extra UTF-8 magic. Line endings on Linux vs. OSX vs. Windows.

by cowartc

0 subcomment

The real rate is certainly higher because this only catches the laziest form of error. The harder problem is the same one we see in production ML. Your pipeline can produce confident results on garbage data and nothing in the system tells you. The first step isn't better models or better tools, its profiling the input before you trust anything downstream of it.

by hackeraccount

0 subcomment

I always think about that Simpsons quote about alcohol "The cause of... and solution to... all of life's problems"
That's how I feel about copy-paste. Nothing is ever so janus-faced.

by mmooss

0 subcomment

Note this exchange with the OP author:
> [Paper author:] Englund's claim that the Model 680 "records raw light measurements as it sees them without any post-processing" is incorrect. [...] It converts analog intensity signals to absorbance values using Beer's law, rounding results to the nearest 0.001 OD.
> [OP author:] Ok, that was incorrect on my part and shows my lack of knowledge about photometers. I was attempting to paraphrase an email from Bio-rad where they said: “The system [Bio-rad 680] was very basic and recorded OD as seen, it did not do any onboard manipulation of the data, it gave raw results for the user to interpret.”
For someone focusing on accuracy and sloppy work, that's a significant problem. How much of the OP is based on their "lack of knowledge" and reckless application of ignorance?
For the first issue, the section titled "Verdict" is followed by,
> the authors have so far not responded.
I would not issue verdicts without making sure I understand the other's perspective (note that it is a requirement in courts of law). That applies especially when I lack direct knowledge or expertise.
When I haven't taken that step, I've learned a thousand times that my certainty usually reflects my lack of imagination, knowledge, or careful thought. Even when I'm 'right', I'm 'wrong': even if the 'verdict' is the same, the truth differs from what I was so certain about. I used to express my certainty prematurely; now I know to keep it to myself until I know what I'm talking about, which frequently saves me from major and/or embarassing errors.

by shevy-java

0 subcomment

Not only that but sometimes at universities, they use AI to generate descriptions.
Recent example I found (semi-accidentally, I was only looking for microscopy related courses):
https://ufind.univie.ac.at/de/course.html?lv=301053&semester...
At the end of the description it has:
"Übersetzt mit DeepL.com (kostenlose Version)"
This means, in english, "translated via DeepL.com (free version)" aka the not-paid-for version. What I found baffling is that even for a single paragraph, some are too lazy to write stuff on their own - or, at the least, remove that disclaimer. Other people also pointed out that they saw this in autogenerated brochures/booklets, in the USA for instance; think I saw this about 3 months ago but I forgot which booklet it was. But the whole booklet was AI-autogenerated. To me this is all spam. I can not want to be bothered to read AI "content" when it is really just glorified slop-spam.

by X1a0Ch3n

0 subcomment

[dead]

by aaron695

0 subcomment

[dead]

by evolighting

0 subcomment

[dead]