Hacker News

Even (very) noisy LLM evaluators are useful for improving AI agents

33 points by GabrielBianconi

by AlanMishler

0 subcomments

Author here — thanks for the comments!
By "evaluator" (aka "eval"), we did indeed mean frameworks for evaluating agent outputs broadly. The article and experiments center on LLM-as-a-judge, where an LLM is the grader, but the argument is ultimately statistical, so it holds regardless of whether the grader is an LLM, a simple supervised model, a set of regex checks, etc.
We were banking on readers being familiar with evals and left out definitions for conciseness, but as Gregaros points out, we could have been more explicit about what we meant.
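To make the statistical point concrete, here is a minimal simulation sketch. It is not taken from the article: the two success rates, the 35% flip probability, the sample size, and the function names are all illustrative assumptions. It models a binary grader that reports the wrong label with a fixed probability, independent of the agent, and shows that the gap between two agents shrinks but keeps its sign, so enough samples still recover the ranking.

```python
import random


def noisy_pass_rate(true_rate, flip_prob, n, rng):
    """Fraction of n episodes that a noisy binary grader marks as passing."""
    passes = 0
    for _ in range(n):
        truly_good = rng.random() < true_rate   # did the agent actually succeed?
        flipped = rng.random() < flip_prob      # did the grader get it wrong?
        passes += truly_good != flipped         # reported label = truth XOR flip
    return passes / n


def expected_observed_rate(true_rate, flip_prob):
    """Expected pass rate under symmetric label noise."""
    return flip_prob + true_rate * (1 - 2 * flip_prob)


rng = random.Random(0)   # fixed seed so the run is repeatable
FLIP = 0.35              # assumed: grader is wrong 35% of the time
N = 50_000               # assumed number of graded episodes per agent

rate_a = noisy_pass_rate(0.60, FLIP, N, rng)   # hypothetical agent A
rate_b = noisy_pass_rate(0.70, FLIP, N, rng)   # hypothetical agent B

print(f"observed A: {rate_a:.3f}, expected {expected_observed_rate(0.60, FLIP):.3f}")
print(f"observed B: {rate_b:.3f}, expected {expected_observed_rate(0.70, FLIP):.3f}")
print("B ranked above A:", rate_b > rate_a)
```

Under these assumptions a true gap of 0.10 shrinks to an expected observed gap of 0.03, since the noise scales differences by (1 - 2 * flip_prob). The signal is attenuated rather than destroyed, and the price is needing more samples to detect it. Nothing in the simulation depends on what kind of grader produces the labels.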

by SmithersBot

1 subcomment

As long as OpenAI and Anthropic keep subsidizing dirt-cheap Codex or Claude Code usage, I'll just keep using them as evaluators. The trick is to have a fresh instance do the reviewing, not the one that did the work.

by ai_slop_hater

2 subcomments

by brianwmunz

0 subcomments