Hacker News

Top AI models fail at >96% of tasks

24 points by codexon

by codexon

1 subcomments

This paper creates a new benchmark composed of real remote work tasks sourced from the remote working website Upwork. The best commercial LLMs, including Opus, GPT, Gemini, and Grok, were tested.
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, their results on this benchmark will probably not be much different.

by tessitore

0 subcomment

This post really should be edited to say 96% of tasks posted on Upwork, since we would all expect that to happen.

by zb3

1 subcomments

You think they don't? You think AI can replace programmers, today?
Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051

by Venn1

0 subcomment

ChatGPT: when you want spellcheck to argue with you.

by scotty79

1 subcomments

Kinda sus that the least-known model did best and none of the more recent models were tested. Capabilities grow very fast, so things that now routinely succeed rarely ever succeeded even half a year ago.
