Hacker News

Summary of METR's predeployment evaluation of GPT-5.6 Sol

6 points by pongogogo

by pongogogo

I would say this is quite a fun post and worth reading. To quote:
"For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints. Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer."

by highfrequency

Did they at least rule out an easy prompt fix? "Stick to the spirit of the problem and don't cheat (e.g., by reverse engineering the test cases or source code)."