
Hacker News

Performance Debugging with LLVM-mca: Simulating the CPU

22 points by signa11

by pornel

1 subcomments

The tool has great potential, but I always found it too limited, fiddly, or imprecise when I needed to optimize some code.
It only supports consecutive instructions in the innermost loops. It can neither include nor ignore any setup/teardown cost. This means I can't feed it any function as-is (even a tiny one). I need to manually cut out the loop body.
It doesn't support branches at all. I know it's a very hard problem, but that's the problem I have. Quite often I'd like to compare branchless vs branchy versions of an algorithm. I have to manually remove branches that I think are predictable and hope that doesn't alter the analysis.
It's not designed to compare different versions of code, so I need to manually rescale the metrics to compare them (different versions of the loop can be unrolled a different number of times, or process a different number of elements per iteration, etc.).
Overall that's laborious, and doesn't work well when I want to tweak the high-level C or Rust code to get the best-optimizing version.
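
The manual rescaling described above can be sketched roughly like this. This is only an illustration, not something llvm-mca provides: the `LoopVariant` class, the `cycles_per_element` helper, and all the numbers are made up, and it assumes you copy "Total Cycles" and "Iterations" by hand from the llvm-mca summary and know how many elements one pass of each loop body handles.

```python
from dataclasses import dataclass


@dataclass
class LoopVariant:
    """One version of a loop, with figures copied by hand from llvm-mca output."""
    name: str
    total_cycles: float           # "Total Cycles" line of the summary
    iterations: int               # "Iterations" line of the summary
    elements_per_iteration: int   # elements handled by one pass of the loop body


def cycles_per_element(v: LoopVariant) -> float:
    """Normalize simulated cycles so differently unrolled loops are comparable."""
    return v.total_cycles / (v.iterations * v.elements_per_iteration)


# Hypothetical figures, not real measurements.
variants = [
    LoopVariant("scalar", total_cycles=410, iterations=100, elements_per_iteration=1),
    LoopVariant("unrolled_x4", total_cycles=1210, iterations=100, elements_per_iteration=4),
]

# Print the variants from cheapest to most expensive per element.
for v in sorted(variants, key=cycles_per_element):
    print(f"{v.name}: {cycles_per_element(v):.3f} cycles/element")
```

Dividing by elements rather than by iterations is the point: a 4x-unrolled loop looks slower per iteration even when it is faster per element.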

by camel-cdr

1 subcomments

One thing to keep in mind with llvm-mca is that not every processor has its own scheduling model, and the models that do exist vary in accuracy.
E.g. Cortex-A72 uses the Cortex-A57 model, as does Cortex-A76, even Cortex-A78.
The Neoverse V1 model has an issue width of 15, while the Neoverse V2 model (and V3, which uses the V2 model) has an issue width of 6.
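
To see why the issue width in the model matters, here is a back-of-the-envelope sketch. It assumes the width is the only limit (no port pressure, no dependency chains), which a real simulation does not. The `dispatch_bound_cycles` helper and the 30-uop loop body are hypothetical; the widths 15 and 6 are the ones quoted above.

```python
def dispatch_bound_cycles(uops_per_iteration: int, issue_width: int) -> float:
    """Lower bound on cycles per iteration if issue width were the only limit."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return uops_per_iteration / issue_width


uops = 30  # hypothetical loop body size in micro-ops

# Widths as quoted in the comment above, not checked against LLVM sources here.
for label, width in [("issue width 15", 15), ("issue width 6", 6)]:
    bound = dispatch_bound_cycles(uops, width)
    print(f"{label}: at least {bound:.1f} cycles per iteration")
```

The same loop body gets a very different floor depending on which model is selected, so a throughput estimate is only as good as the width the model assumes.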

by b0a04gl

0 subcomment

llvm-mca was always one of those tools i bookmark but never touch. this post finally made it feel usable; seeing uop breakdowns and bottlenecks right in the cli was super clarifying