Hacker News

Towards a science of scaling agent systems: When and why agent systems work

104 points by gmays

by zkmon

1 subcomments

> We found that independent multi-agent systems (agents working in parallel without talking) amplified errors by 17.2x
The paper seems too shallow. The error data doesn't come with a rationale, or any correlation to the architecture. Specifically, what makes the SAS architecture have the lowest error rates while the similar architecture with independent agents has the highest? The conclusion doesn't seem well grounded in reasoning.

by 0xbadcafebee

1 subcomments

> Conversely, on tasks requiring strict sequential reasoning (like planning in PlanCraft), every multi-agent variant we tested degraded performance by 39-70%. In these scenarios, the overhead of communication fragmented the reasoning process, leaving insufficient "cognitive budget" for the actual task.
> As tasks require more tools (e.g., a coding agent with access to 16+ tools), the "tax" of coordinating multiple agents increases disproportionately.
This aligns well with the principle of highly cohesive, loosely coupled design for software components. If you instruct the AI to design this way, it should result in components that are simpler to reason about and require fewer sequential steps to work on. You can think of cohesion in many different ways, but one is common functions, and another is tool/library dependency.

by ayushl

0 subcomment

"All MAS and SAS configurations were matched for total reasoning-token budget (mean 4,800 tokens per trial)"
A single-agent system (SAS) uses this budget for a deep, unified reasoning stream (averaging 7.2 turns), multi-agent teams would fragment the same budget into dozens of coordination messages
I wonder whether the same results would be observed if the budget were increased (say, to 50k)?
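For a rough sense of scale, here is the back-of-envelope arithmetic on the quoted numbers. The 4,800-token budget and the 7.2 turns come from the quote; the count of 24 coordination messages is a made-up stand-in for "dozens", not a figure from the paper.

```python
# Figures quoted from the paper.
BUDGET_TOKENS = 4_800
SAS_TURNS = 7.2

# Hypothetical figure for illustration only (the quote just says "dozens").
MAS_MESSAGES = 24


def tokens_per_unit(budget: float, units: float) -> float:
    """Average tokens available per turn or message under a fixed budget."""
    return budget / units


sas_per_turn = tokens_per_unit(BUDGET_TOKENS, SAS_TURNS)
mas_per_message = tokens_per_unit(BUDGET_TOKENS, MAS_MESSAGES)
mas_per_message_50k = tokens_per_unit(50_000, MAS_MESSAGES)

print(f"SAS at 4.8k: ~{sas_per_turn:.0f} tokens per turn")
print(f"MAS at 4.8k: ~{mas_per_message:.0f} tokens per message")
print(f"MAS at 50k:  ~{mas_per_message_50k:.0f} tokens per message")
```

Under that assumed message count, a 50k budget would give each coordination message more room than a single-agent turn gets at 4.8k, which is why the question seems worth testing.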

by localghost3000

1 subcomments

I’ve been building a lot of agent workflows at my day job. Something I’ve had a lot of success with when deciding on an orchestration strategy is asking the agent what it recommends as part of the planning phase. This technique of using the agent to help you improve its performance has been a game changer for me in leveraging this tech effectively. YMMV of course. I mostly use Claude Code, so who knows with the others.

by with

0 subcomment

It's true that most problems can be solved with context + prompt. I have actively seen teams within large organizations complicate it into complex "agentic orchestration" just to impress leadership who lack the expertise to realize it's not even necessary. Hell, there are various startups who make this their moat.
Good for promo projects though, lol

by CuriouslyC

1 subcomments

This is a neat idea but there are so many variables here that it's hard to make generalizations.
Empirically, the setup I've seen produce the best results in various heterogeneous environments is a top-level orchestrator that calls out to a planning committee, then generates a task DAG from the plan, which gets executed in parallel where possible. As models evolve, crosstalk may become less of a liability.
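The "task DAG executed in parallel where possible" part can be sketched with Python's standard-library `graphlib`. This is only an illustration of the scheduling idea, not the commenter's setup: the task names and the `parallel_waves` helper are hypothetical, and the planning committee and the actual agent calls are left out.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)  # a real orchestrator would dispatch these together
        sorter.done(*ready)
    return waves


# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "design_api": set(),
    "write_server": {"design_api"},
    "write_client": {"design_api"},
    "integration_test": {"write_server", "write_client"},
}

waves = parallel_waves(plan)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Here the server and client tasks land in the same wave and could go to separate agents, while the integration test waits for both.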

by kioku

0 subcomment

I found the captions on Figure 1 quite interesting.
> Average performance (%) across four agentic benchmarks improves consistently with increasing model Intelligence Index.
> Centralized and hybrid coordination generally yield superior scaling efficiency, suggesting that collaborative agentic structures amplify capability gains more effectively than individual scaling alone.
Then again, the deltas between SAS and the best-performing MAS approach are ~8%, so I can't help but wonder if it's worth the extra cost, at least for the generation of models that was studied.

by numpad0

0 subcomment

It feels like everyone these days is thinking of Markdown IPC hierarchical multi-agent orchestration. Just the other day I saw this[1] vibecoded thing. I wonder if there are any notable ones, or maybe I should try my hand at it.
1: https://github.com/yohey-w/multi-agent-shogun

by AuthAuth

0 subcomment

>we developed a predictive model (R^2 = 0.513) that uses measurable task properties like tool count and decomposability to predict which architecture will perform best.
Is this going to be released for general use?
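For context on the quoted R^2 = 0.513: by the standard definition, that means the model accounts for roughly half of the variance in the outcome it predicts. A minimal sketch of that definition follows; the `r_squared` helper and the toy numbers are made up for illustration and have nothing to do with the paper's data.

```python
def r_squared(actual: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values, invented for this example only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

score = r_squared(actual, predicted)
print(f"R^2 = {score:.3f}")
```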

by Falimonda

3 subcomments

I've been building something in this space ("Clink" - multi-agent coordination layer) and this research confirms some of the assumptions that motivated the project. You can't just throw more agents at a problem and expect it to get better.
The error amplification numbers are wild! 17x for independent agents vs 4x with some central coordination. Clink provides users (and more importantly their agents) the primitives to choose their own pattern.
The most relevant features are...
- work queues with claim/release for parallelizable tasks
- checkpoint dependencies when things need to be sequential
- consensus voting as a gate before anything critical happens
The part about tool count increasing coordination overhead is interesting too. I've been considering exposing just a single tool to address this, but I wonder how this plays out as people start stacking more MCP servers together. It feels like we're all still learning what works here. The docs are at https://docs.clink.voxos.ai if anyone wants to poke around!
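To make two of those primitives concrete (claim/release and the voting gate), here is a minimal in-memory sketch. It is not Clink's API: the `WorkQueue` class, the `consensus` function, and the strict-majority threshold are all hypothetical, and a real coordination layer would also need persistence and locking.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkQueue:
    """Toy in-memory work queue (hypothetical, for illustration only)."""

    pending: list[str] = field(default_factory=list)
    claimed: dict[str, str] = field(default_factory=dict)  # task -> agent

    def claim(self, agent: str) -> Optional[str]:
        """Hand the oldest pending task to `agent`, or None if the queue is empty."""
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.claimed[task] = agent
        return task

    def release(self, task: str) -> None:
        """Return a claimed task to the queue so another agent can take it."""
        if self.claimed.pop(task, None) is not None:
            self.pending.append(task)


def consensus(votes: dict[str, bool], threshold: float = 0.5) -> bool:
    """Pass the gate only if strictly more than `threshold` of voters approve."""
    if not votes:
        return False
    return sum(votes.values()) / len(votes) > threshold


queue = WorkQueue(pending=["lint", "test"])
first = queue.claim("agent-a")   # agent-a takes "lint"
queue.release(first)             # agent-a gives up; "lint" goes to the back
second = queue.claim("agent-b")  # agent-b takes "test"

approved = consensus({"agent-a": True, "agent-b": True, "agent-c": False})
print(first, second, queue.pending, approved)
```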

by maxdo

0 subcomment

by pevansgreenwood

0 subcomment

by detroitwebsites

2 subcomments

by verdverm

1 subcomments

Gonna read this with a grain of salt because I have been rather unimpressed with Google's AI products, save direct API calls to Gemini.
The rest is trash they are forcing down our throats.