By "evaluator" (aka "eval"), we did indeed mean frameworks for evaluating agent outputs broadly. The article and experiments center on LLM-as-a-judge, where an LLM is the grader, but the argument is ultimately statistical, so it holds regardless of whether the grader is an LLM, a simple supervised model, a set of regex checks, etc.
We were banking on readers being familiar with evals and left out definitions for conciseness, but as Gregaros points out, we could have been more explicit about what we meant.