'Comically bad' datasets used to train clinical models for stroke and diabetes
63 points by leephillips
by Legend2440
4 subcomments
A lot of researchers think their job is to build models. They don't want to collect their own data, so they go find whatever dataset they can on Kaggle or from a previous paper or wherever.
This is backwards. The model is the easy part. Getting good data is 99% of the job, and nearly any clown can make a good model once you hand them a good dataset.
by matusp
0 subcomments
Dataset quality is a huge issue in ML in general. You can often list a few dozen random samples from any given dataset and instantly find something weird going on.
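For anyone who wants to try the "list a few dozen random samples" check, here is a rough sketch using only the Python standard library. The function names `sample_rows` and `quick_flags`, the default of 24 rows, and the toy CSV are all made up for illustration; the two flags it counts (empty fields and exact duplicate rows) are just examples of "something weird", not a complete audit.

```python
import csv
import io
import random


def sample_rows(csv_text, k=24, seed=0):
    """Return up to k randomly chosen rows (as dicts) from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    return rng.sample(rows, min(k, len(rows)))


def quick_flags(rows):
    """Count two cheap warning signs: empty fields and exact duplicate rows."""
    empty = 0
    dupes = 0
    seen = set()
    for row in rows:
        for value in row.values():
            # DictReader fills short rows with None; blank strings count too.
            if value is None or (isinstance(value, str) and not value.strip()):
                empty += 1
        key = tuple(str(v) for v in row.values())
        if key in seen:
            dupes += 1
        seen.add(key)
    return {"empty_fields": empty, "duplicate_rows": dupes}


# Toy data, invented for this example (not from any real clinical dataset).
toy_csv = "age,glucose,stroke\n54,110,0\n54,110,0\n,95,1\n61,,0\n"
picked = sample_rows(toy_csv)
for row in picked:
    print(row)
print(quick_flags(picked))
```

The counts are only a prompt to go read the rows; actually looking at the printed samples is the part that tends to surface the strange stuff.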