Industry best practice (and the standard implementation) for most agents right now is to do web browsing/fetching via subagents: the fetched content is summarized by a cheaper model and only the summary is passed back to the parent. Unless the actual content a subagent sees is preserved verbatim, it's very unlikely the `CANARY-` strings would show up in the output.
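A minimal sketch of that pattern, for concreteness; `llm()` is a stand-in for whatever model client the agent uses, and the model name and prompt are made up:

```python
# Sketch of the fetch-via-subagent pattern; llm() is a placeholder client.
import urllib.request


def fetch(url: str) -> str:
    """Raw page fetch; real agents usually also strip HTML down to text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def llm(model: str, prompt: str) -> str:
    """Stand-in for an actual model call."""
    raise NotImplementedError


def browse_subagent(url: str, question: str) -> str:
    page = fetch(url)
    # A cheaper model condenses the page; only this summary reaches the
    # parent, so a verbatim string like CANARY-TRUNC-10K-fox survives only
    # if the summarizer happens to quote it.
    return llm("cheap-summarizer",
               f"Answer from this page: {question}\n\n{page}")
```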
Any thoughts on how you'd change the test structure with this in mind?
> URL: <https://...docs...> What parameters does the Create Stream endpoint accept?
The answer I would give is `name`, `description`, `retention_days`, and `tags`. What the answer sheet <https://agentreadingtest.com/answers.json> has is `CANARY-TRUNC-10K-fox` ("Early in the page. All agents should find this."), `CANARY-TRUNC-40K-river`, `CANARY-TRUNC-75K-summit`, etc. Those strings do appear on the page, but why would the LLM's output include them? The first appears before the API endpoint subpath specification, and the second in the middle of a word in the description. They don't answer the test question of which parameters are supported.
A later test checks whether the agent can deal with broken pages (specifically, "an unclosed ``` fence"). But if an agent handles seemingly erroneous strings on a page gracefully, wouldn't it be even less likely to echo those tokens?
How is this test supposed to work?
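My best guess at the mechanics: the grader greps some visible text for the tokens, something like this (purely speculative):

```python
# Speculative scoring: check which canary tokens appear anywhere the
# grader can see (the final answer, or possibly the whole transcript).
CANARIES = [
    "CANARY-TRUNC-10K-fox",
    "CANARY-TRUNC-40K-river",
    "CANARY-TRUNC-75K-summit",
]


def score(visible_text: str) -> dict[str, bool]:
    return {token: (token in visible_text) for token in CANARIES}
```

If it greps only the final answer, an agent that correctly answers the actual question will look like a failure.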
On pi-coding-agent, with my pi-web-browser extension and glm-5.
I'm surprised by the truncation results: CANARY-TRUNC-100K-glacier and CANARY-TRUNC-130K-aurora passed, but CANARY-TRUNC-10K-fox, CANARY-TRUNC-40K-river, and CANARY-TRUNC-75K-summit failed.
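For reference, my guess at how the truncation page is built: tokens planted at fixed character offsets, so whichever ones survive reveal the fetch window (offsets, filler, and page length are all assumptions on my part):

```python
# Hypothetical reconstruction of the truncation-test page: each token
# lands within one filler-string length of its target character offset.
def build_page(canaries: dict[int, str], total_len: int = 140_000) -> str:
    filler = "lorem ipsum dolor sit amet "
    out, pos = [], 0
    for offset in sorted(canaries):
        while pos < offset:  # pad up to the target offset
            out.append(filler)
            pos += len(filler)
        out.append(canaries[offset] + " ")
        pos += len(canaries[offset]) + 1
    while pos < total_len:  # tail filler after the last token
        out.append(filler)
        pos += len(filler)
    return "".join(out)


page = build_page({
    10_000: "CANARY-TRUNC-10K-fox",
    40_000: "CANARY-TRUNC-40K-river",
    75_000: "CANARY-TRUNC-75K-summit",
    100_000: "CANARY-TRUNC-100K-glacier",
    130_000: "CANARY-TRUNC-130K-aurora",
})
```

Passing the 100K/130K tokens while missing the earlier ones is the opposite of what simple prefix truncation would produce; it looks more like chunked or range-based fetching.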
Compared to solving the root issues, it's got to be easier to add a few extra lines of code that intervene when someone asks about walking or driving to the carwash, or wants to know how many "r"s are in the word strawberry.
I wonder if AI really is the opaque, interesting tech it's claimed to be, or if it's also thousands of extra if statements catching known/published/problematic/embarrassing inconsistencies.
Anyone here work for one of the big AI companies? Is it just one big black box, or a black box with thousands of intervention points and guardrails?
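To caricature what I mean (purely illustrative, not a claim about how any real system works):

```python
def maybe_intercept(prompt: str) -> str | None:
    """Hypothetical hard-coded guard for one known-embarrassing question."""
    p = prompt.lower()
    if "strawberry" in p and "how many" in p:
        return 'There are three "r"s in "strawberry".'
    return None  # fall through to the actual model
```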
K2.5 under Kimi CLI: 12 / 20 points
> Agent recognized the page as a shell with no real documentation content (+1 point)
If the agent used a working browser and the page rendered properly, is the task considered failed?
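A plain fetch of an SPA sees only the empty shell, while a rendering browser executes the JS and gets the real content; the rubric seems to reward the former. Roughly (the URL and shell marker here are hypothetical):

```python
import urllib.request

SPA_URL = "https://agentreadingtest.com/spa"  # hypothetical path


def raw_fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


html = raw_fetch(SPA_URL)
# A fetch-only agent sees something like this and earns the point:
is_empty_shell = '<div id="root"></div>' in html and "CANARY" not in html
# An agent driving a headless browser would get the rendered DOM instead,
# find actual documentation, and apparently lose it.
```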
Claude Web Opus 4.6 Extended: 14 / 20 points
Failed: CANARY-SPA-JSONLY-prism, CANARY-CONNEG-MD-sigma