The failure mode that bit us hardest: agents sharing a single context window where early tool outputs pollute later reasoning. Fixed it by treating each agent turn as append-only -- worker writes output to a structured log, reviewer reads only that log (not the raw conversation history). Isolated drift. Night and day.
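To make the append-only part concrete, here's roughly the shape of ours (a minimal sketch; `append_turn`/`read_log` and the JSONL layout are my stand-ins, not anyone's actual API):

```python
import json
import time

def append_turn(log_path: str, agent: str, output: dict) -> None:
    """Append-only: each turn is one JSON line; earlier entries are never rewritten."""
    entry = {"ts": time.time(), "agent": agent, "output": output}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def read_log(log_path: str) -> list[dict]:
    """The reviewer reads only this structured log, never the raw conversation history."""
    with open(log_path) as f:
        return [json.loads(line) for line in f]
```

The point is that the reviewer's input is a clean, bounded artifact, so one polluted tool output can't leak into every later turn.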
The confidence score idea is underutilized. We log tool call outcomes as: {action, result, confidence: 0-3}. The reviewer agent pattern-matches on low-confidence streaks before they compound into something unfixable.
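The streak check is simple enough to sketch (threshold and streak length here are made-up knobs, not our actual values):

```python
from dataclasses import dataclass

LOW_CONF_THRESHOLD = 1  # confidence 0-1 counts as "low" (assumed cutoff)
STREAK_LIMIT = 3        # consecutive low-conf calls before the reviewer flags (assumed)

@dataclass
class ToolOutcome:
    action: str
    result: str
    confidence: int  # 0-3, as logged by the worker

def low_confidence_streak(log: list[ToolOutcome]) -> bool:
    """True when the tail of the log is an unbroken run of low-confidence calls."""
    streak = 0
    for outcome in reversed(log):
        if outcome.confidence <= LOW_CONF_THRESHOLD:
            streak += 1
            if streak >= STREAK_LIMIT:
                return True
        else:
            break
    return False
```

Scanning from the tail matters: one confident call in between resets the streak, which is exactly the "before they compound" behavior you want.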
On the multi-model review question from another commenter: different models catch different failure types. Claude catches logical inconsistencies; a smaller/faster model catches format errors and incomplete outputs. Cheap pre-check before the expensive reviewer saves a lot of token burn.
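The tiering is just short-circuiting (sketch only -- `cheap_check` and `deep_review` are hypothetical callables wrapping the two models, each returning an approved/reason pair):

```python
def two_stage_review(output: str, cheap_check, deep_review):
    """Run the fast model's format check first; only escalate passing
    outputs to the expensive reviewer."""
    ok, reason = cheap_check(output)
    if not ok:
        return False, f"precheck: {reason}"  # caught early, no expensive tokens spent
    return deep_review(output)
```

Most malformed/incomplete outputs die at the precheck, so the expensive reviewer only ever sees well-formed candidates.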
What's your retry strategy when the reviewer blocks -- exponential backoff on the same worker context, or fresh context each retry? We do fresh context after 2 failures.
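For reference, our "fresh context after 2 failures" loop looks roughly like this (a sketch; `run_worker`, `approves`, and `new_context` are hypothetical hooks, not a real framework API):

```python
def run_with_review(task, run_worker, approves, new_context,
                    same_ctx_retries=2, max_attempts=6):
    """Retry on the same worker context, but rebuild it from scratch after
    `same_ctx_retries` consecutive reviewer rejections."""
    context = new_context(task)
    failures = 0
    for _ in range(max_attempts):
        output = run_worker(context)
        if approves(output):
            return output
        failures += 1
        if failures >= same_ctx_retries:
            context = new_context(task)  # discard the polluted history
            failures = 0
    raise RuntimeError("reviewer never approved within max_attempts")
```

The rationale: after two rejections the worker's context usually contains the failure itself, and retrying inside it just reinforces the mistake.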
The key thing to get right: make the retry idempotent. If worker retries the same task, it should produce the same side effects as a fresh run, not double them. This is harder than it sounds when agents are calling real APIs or writing files.
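One way to get that idempotency is keying side effects on a hash of the task plus its inputs, so a retry maps onto the already-applied effect (minimal sketch; the in-memory set stands in for durable storage, and `apply_once` is an illustrative helper, not a library function):

```python
import hashlib
import json

_applied: set[str] = set()  # stand-in for durable storage (assumption)

def idempotency_key(task_id: str, payload: dict) -> str:
    """Same task + same inputs -> same key, so a retried run maps to the
    same side effect instead of a second one."""
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{task_id}:{blob}".encode()).hexdigest()

def apply_once(task_id: str, payload: dict, side_effect) -> str:
    key = idempotency_key(task_id, payload)
    if key in _applied:
        return "skipped"   # retry: the write/API call already happened
    side_effect(payload)   # real API call or file write goes here
    _applied.add(key)
    return "applied"
```

Note `sort_keys=True`: without it, two dicts with the same contents can serialize differently and defeat the dedup.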
How does OpenSwarm handle the case where worker keeps failing reviewer? Is there a max retry count, and if so, what happens to the Linear issue?
the failure mode I'd worry about most is cascading context drift, where each agent in the chain slightly misunderstands the task, and by the time you get to the test agent it's validating the wrong thing entirely. fwiw I think the LanceDB memory is the right call for this kind of setup; keeping shared context grounded is probably what prevents most of those drift issues.