For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
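For reference, a hedged C sketch of what the loop reduces to once the arithmetic series is recognized (the function name and unsigned type are illustrative, not from the post; working through the shifts, the multiply-and-shift sequence above computes this same triangular-number formula):

unsigned long long triangular(unsigned long long n) {
    /* sum of 1..n without a loop: n * (n + 1) / 2 */
    return n * (n + 1) / 2;
}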
Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.
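As a hedged sketch of that last point (bar_u and bar_nonneg are illustrative names; the __builtin_unreachable hint is a common GCC/Clang way to encode such a promise and typically unlocks the plain shift):

/* Unsigned division by two is already a single shift. */
unsigned bar_u(unsigned num) { return num / 2; }

/* For signed ints, promising the compiler the value is non-negative
   generally lets it drop the negative-number fix-up. */
int bar_nonneg(int num) {
    if (num < 0) __builtin_unreachable();  /* promise: num >= 0 (GCC/Clang extension) */
    return num / 2;
}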
eg 2:
int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6
eg 3:
int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello", 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it: https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.

Then, there is a (very long) list of checks for specific patterns and replacements with shorter sequences of code, things like recognizing the pattern of bswap and replacing it with a bswap instruction. There's no end to adding patterns to check for.
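For instance, here is a hedged sketch of the classic byte-swap idiom that GCC and Clang usually pattern-match into a single bswap (x86) or rev (ARM) instruction at -O2 and above (the function name is illustrative):

#include <stdint.h>

uint32_t bswap32(uint32_t x) {
    /* shift-and-mask idiom: each byte is moved to its mirrored position */
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}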
unsigned int popcount(unsigned int n)
{
    return (n &= n - 1u) ? (1u + popcount(n)) : 0u;
}
Clang 21.1 x64:
popcount:
mov eax, -1
.LBB0_1:
lea ecx, [rdi - 1]
inc eax
and ecx, edi
mov edi, ecx
jne .LBB0_1
ret
GCC 15.2:
popcount:
blsr edi, edi
popcnt eax, edi
ret
Both compiled with -O3 -march=znver5.

Anything HPC will benefit from thinking about how things map onto hardware (or, in the case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce them through some more involved analysis?
let a = expr
let b = expr2
if (a || b) { return true; }
Is the compiler allowed to lazily compute this if it is indeed faster to do it that way? Or declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one: I like to do a memcpy every single time, even though the result might not be used or might be overwritten by other memcpys. Is the compiler smart enough to not perform those, and to reorder my program so that only the last relevant memcpy is performed?
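For the memcpy case specifically, dead-store elimination usually does help; a hedged sketch (copy_twice and the fixed size are illustrative, and this assumes nothing reads dst between the two calls):

#include <string.h>

void copy_twice(char *dst, const char *a, const char *b) {
    memcpy(dst, a, 16);  /* fully overwritten below, so GCC/Clang at -O2 typically drop it */
    memcpy(dst, b, 16);  /* only this copy survives */
}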
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
A bad example recently is something like
const S * s =;
let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;
}
becomes (with armv8-a clang 21.1.0 -O3):
add(unsigned int, unsigned int):
.LBB0_1:
ands w8, w0, w1
eor w1, w0, w1
lsl w0, w8, #1
b.ne .LBB0_1
mov w0, w1
ret

Also, optimizers have a limit; they can't reason as abstractly as humans. For example:
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}
bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
I tried with both gcc and clang; the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.

The moral is that you still have to optimize algorithms (O notation) and math operations / expressions.
I don't think it always did the best job, and I saw a bunch of register spills I thought were unnecessary, but I couldn't justify the time and effort to do it in assembly…
unsigned add_v5(unsigned x, unsigned y) {
    if (x == y) return 2 * x;
    return x + y;
}
Results in:
add_v5(unsigned int, unsigned int):
lsl w8, w0, #1
add w9, w1, w0
cmp w0, w1
csel w0, w8, w9, eq
ret
(armv8-a clang 21.1.0 with -O3)

If compiler folks can chime in: I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't be.
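A hedged sketch of what the question is really asking the compiler to prove: on the x == y path, 2 * x and x + y are the same value (even with unsigned wraparound), so the branch is redundant and the whole function could collapse to a single add (the name is illustrative; this is not what the compiler currently emits for add_v5):

unsigned add_v5_simplified(unsigned x, unsigned y) {
    return x + y;  /* also covers x == y, since then x + y == 2 * x */
}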
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
unsigned mult(unsigned x, unsigned y) {
    unsigned y0 = y;
    while (x--) y = add_v1(y, y0);
    return y;
}
optimizes to:
mult(unsigned int, unsigned int):
madd w0, w1, w0, w1
ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)

E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table, but for an ELF binary it could, no? Why doesn't it do that?
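A hedged sketch of the linkage distinction involved (assuming GCC/Clang defaults without LTO; the names are illustrative): a function with internal linkage can be inlined and dropped entirely, while an exported function generally keeps a standalone copy because other translation units might call it.

/* Internal linkage: if every call is inlined, no standalone code is emitted. */
static unsigned add_a(unsigned x, unsigned y) { return x + y; }

/* External linkage: body and symbol typically stay in the binary,
   unless LTO or -ffunction-sections plus --gc-sections removes them at link time. */
unsigned add_b(unsigned x, unsigned y) { return x + y; }

unsigned use(unsigned x, unsigned y) {
    return add_a(x, y);  /* usually reduced to a single add instruction */
}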
In the OP examples, instead of optimization, what I would prefer is a separate analysis tool that reports what optimizations are possible and a compiler that makes it easy to write both high level and machine code as necessary. Now instead of the compiler opaquely rewriting your code for you, it helps guide you into writing optimal code at the source level. This, for me, leads to a better equilibrium where you are able to express your intent at a high level and then, as needed, you can perform lower level optimizations in a transparent and deterministic way.
For me, the big value of existing optimizing compilers is that I can use them to figure out what instructions might be optimal for my use case and then I can directly write those instructions where the highest performance is needed. But I do not need to subject myself to the slow compilation times (which compounds as the compiler repeatedly reoptimizes the same function thousands of times during development -- a cost that is repeated with every single compilation of the file) nor the possibility that the optimizer breaks my code in an opaque way that I won't notice until something bad and inscrutable happens at runtime.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
It's super cool to see this in practice, and for me it helps put more trust in the compiler that it does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
#include <vector>

unsigned add(unsigned x, unsigned y) {
    std::vector vx {x};
    std::vector vy {y};
    auto res = vx[0] + vy[0];
    return res;
}

The extent to which you can "fool the optimizer" is highly dependent on the language and the code you're talking about. Python is a great example of a language that is devilishly hard to optimize precisely because of the language semantics. C and C++ are entirely different examples with entirely different optimization issues, which usually have to do with pointers and references and what the compiler is allowed to infer.
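A hedged sketch of the pointer-aliasing point for C (illustrative functions; restrict is the standard way to rule aliasing out):

/* Without restrict, dst and src may alias, so src[0] must be re-read
   after the store to dst[0]. */
void scale2(float *dst, const float *src) {
    dst[0] = src[0] * 2.0f;
    dst[1] = src[0] * 2.0f;
}

/* With restrict, the compiler may keep src[0] in a register for both stores. */
void scale2_restrict(float *restrict dst, const float *restrict src) {
    dst[0] = src[0] * 2.0f;
    dst[1] = src[0] * 2.0f;
}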
The point? Don't just assume your compiler will magically make all your performance issues go away and produce optimal code. Maybe it will, maybe it won't.
As always, the main performance lessons should always be "1) Don't prematurely optimize", and "2) If you see perf issues, run profilers to try to definitively nail where the perf issue is".