"Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”."
If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".
I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.
In both cases, the project shouldn't have shipped.
With more detailed power measurements, it could be possible to determine if this is thermal/power budget related? It does sound like the feature was intended to conserve power…
That being said, some workloads will see a small regression, however AMD has made some small performance improvements since launch.
They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.
Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).