Hacker News

We're running out of benchmarks to upper bound AI capabilities

15 points by gmays

by nikisweeting

0 subcomments

We can definitely make harder evals. The problem is that a good eval set is indistinguishable from good training data or a market edge, so no one has an incentive to share their best eval sets publicly.

by WarmWash

1 subcomment

Start front-loading the models with 5k, 10k, 50k, or 100k tokens of messy, quasi-related context, and then run the benchmarks.
These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
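
A rough sketch of what that front-loading setup could look like, not anyone's actual harness. Everything named here is hypothetical: `build_distractor_context`, `accuracy_by_context_size`, and the `model` callable are made up for illustration, and "tokens" are approximated by whitespace-separated words rather than a real tokenizer.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def build_distractor_context(snippets: Sequence[str], token_budget: int, seed: int = 0) -> str:
    """Concatenate shuffled snippets until roughly `token_budget` tokens are reached.

    Assumption: one whitespace-separated word counts as one token.
    """
    if token_budget <= 0 or not any(s.split() for s in snippets):
        return ""
    rng = random.Random(seed)  # fixed seed so every run sees the same padding
    words: List[str] = []
    while len(words) < token_budget:
        batch = list(snippets)
        rng.shuffle(batch)
        for snippet in batch:
            words.extend(snippet.split())
            if len(words) >= token_budget:
                break
    return " ".join(words[:token_budget])


def accuracy_by_context_size(
    model: Callable[[str], str],          # hypothetical: prompt in, answer text out
    tasks: List[Tuple[str, str]],         # (question, expected answer) pairs
    snippets: Sequence[str],              # pool of messy, quasi-related text
    budgets: Sequence[int],               # e.g. [0, 5_000, 10_000, 50_000, 100_000]
) -> Dict[int, float]:
    """Run the same tasks at each padding size and report exact-match accuracy."""
    results: Dict[int, float] = {}
    for budget in budgets:
        context = build_distractor_context(snippets, budget)
        correct = 0
        for question, expected in tasks:
            # The distractor text goes first, the real task goes last.
            prompt = f"{context}\n\n{question}" if context else question
            if model(prompt).strip() == expected:
                correct += 1
        results[budget] = correct / len(tasks)
    return results
```

Comparing the score at budget 0 with the scores at larger budgets gives the degradation curve the comment is asking for. Exact-match scoring is only a placeholder; a real benchmark would plug in its own grader.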

by UltraSane

1 subcomment

This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.
