For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
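For reference, a hedged C sketch of what the loop reduces to once the arithmetic series is recognized (the function name and unsigned type are illustrative, not from the post; working through the shifts, the multiply-and-shift sequence above computes this same triangular-number formula):

unsigned long long triangular(unsigned long long n) {
    /* sum of 1..n without a loop: n * (n + 1) / 2 */
    return n * (n + 1) / 2;
}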
Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.
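As a hedged sketch of that last point (bar_u and bar_nonneg are illustrative names; the __builtin_unreachable hint is a common GCC/Clang way to encode such a promise and typically unlocks the plain shift):

/* Unsigned division by two is already a single shift. */
unsigned bar_u(unsigned num) { return num / 2; }

/* For signed ints, promising the compiler the value is non-negative
   generally lets it drop the negative-number fix-up. */
int bar_nonneg(int num) {
    if (num < 0) __builtin_unreachable();  /* promise: num >= 0 (GCC/Clang extension) */
    return num / 2;
}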
eg 2:
int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6
eg 3:
int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello", 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it: https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.

Then, there is a (very long) list of checks for specific patterns and replacements with shorter sequences of code, things like recognizing the pattern of bswap and replacing it with a bswap instruction. There's no end to adding patterns to check for.
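For instance, here is a hedged sketch of the classic byte-swap idiom that GCC and Clang usually pattern-match into a single bswap (x86) or rev (ARM) instruction at -O2 and above (the function name is illustrative):

#include <stdint.h>

uint32_t bswap32(uint32_t x) {
    /* shift-and-mask idiom: each byte is moved to its mirrored position */
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}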
unsigned int popcount(unsigned int n)
{
    return (n &= n - 1u) ? (1u + popcount(n)) : 0u;
}
Clang 21.1 x64:
popcount:
mov eax, -1
.LBB0_1:
lea ecx, [rdi - 1]
inc eax
and ecx, edi
mov edi, ecx
jne .LBB0_1
ret
GCC 15.2:
popcount:
blsr edi, edi
popcnt eax, edi
ret
Both compiled with -O3 -march=znver5.

Anything HPC will benefit from thinking about how things map onto hardware (or, in the case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce them through some more involved analysis?
let a = expr
let b = expr2
if (a || b) { return true; }
Is the compiler allowed to lazily compute this if it is indeed faster to do it that way? Or declaring a bunch of variables that may or may not be used in all of the branches: is the compiler smart enough to only compute them when necessary? AFAIK this is not allowed in C-like languages; things have to materialize. Another one: I like to do a memcpy every single time, even though the result might not be used or might be overwritten by other memcpys. Is the compiler smart enough to not perform those, and to reorder my program so that only the last relevant memcpy is performed?
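For the memcpy case specifically, dead-store elimination usually does help; a hedged sketch (copy_twice and the fixed size are illustrative, and this assumes nothing reads dst between the two calls):

#include <string.h>

void copy_twice(char *dst, const char *a, const char *b) {
    memcpy(dst, a, 16);  /* fully overwritten below, so GCC/Clang at -O2 typically drop it */
    memcpy(dst, b, 16);  /* only this copy survives */
}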
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like to write code in consistent and simple ways, but I need compilers to be much smarter than they are today.
A bad example recently is something like
const S * s =;
let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;
}
becomes (with armv8-a clang 21.1.0 -O3):
add(unsigned int, unsigned int):
.LBB0_1:
ands w8, w0, w1
eor w1, w0, w1
lsl w0, w8, #1
b.ne .LBB0_1
mov w0, w1
ret

Also, optimizers have a limit; they can't reason as abstractly as humans. For example:
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}
bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
I tried with both gcc and clang; the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.

The moral is that you still have to optimize algorithms (O notation) and math operations / expressions.
I don't think it always did the best job, and I saw a bunch of register spills I thought were unnecessary, but I couldn't justify the time and effort to do it in assembly…
unsigned add_v5(unsigned x, unsigned y) {
    if (x == y) return 2 * x;
    return x + y;
}
Results in:
add_v5(unsigned int, unsigned int):
lsl w8, w0, #1
add w9, w1, w0
cmp w0, w1
csel w0, w8, w9, eq
ret
(armv8-a clang 21.1.0 with -O3)

If compiler folks can chime in: I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't be.
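A hedged sketch of what the question is really asking the compiler to prove: on the x == y path, 2 * x and x + y are the same value (even with unsigned wraparound), so the branch is redundant and the whole function could collapse to a single add (the name is illustrative; this is not what the compiler currently emits for add_v5):

unsigned add_v5_simplified(unsigned x, unsigned y) {
    return x + y;  /* also covers x == y, since then x + y == 2 * x */
}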
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
unsigned mult(unsigned x, unsigned y) {
    unsigned y0 = y;
    while (x--) y = add_v1(y, y0);
    return y;
}
optimizes to:
mult(unsigned int, unsigned int):
madd w0, w1, w0, w1
ret
(and this produces the same result when substituting any of the `add_vN`s from TFA)

E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table, but for an ELF binary it could, no? Why doesn't it do that?
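A hedged sketch of the linkage distinction involved (assuming GCC/Clang defaults without LTO; the names are illustrative): a function with internal linkage can be inlined and dropped entirely, while an exported function generally keeps a standalone copy because other translation units might call it.

/* Internal linkage: if every call is inlined, no standalone code is emitted. */
static unsigned add_a(unsigned x, unsigned y) { return x + y; }

/* External linkage: body and symbol typically stay in the binary,
   unless LTO or -ffunction-sections plus --gc-sections removes them at link time. */
unsigned add_b(unsigned x, unsigned y) { return x + y; }

unsigned use(unsigned x, unsigned y) {
    return add_a(x, y);  /* usually reduced to a single add instruction */
}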
In the OP examples, instead of optimization, what I would prefer is a separate analysis tool that reports what optimizations are possible and a compiler that makes it easy to write both high level and machine code as necessary. Now instead of the compiler opaquely rewriting your code for you, it helps guide you into writing optimal code at the source level. This, for me, leads to a better equilibrium where you are able to express your intent at a high level and then, as needed, you can perform lower level optimizations in a transparent and deterministic way.
For me, the big value of existing optimizing compilers is that I can use them to figure out what instructions might be optimal for my use case and then I can directly write those instructions where the highest performance is needed. But I do not need to subject myself to the slow compilation times (which compounds as the compiler repeatedly reoptimizes the same function thousands of times during development -- a cost that is repeated with every single compilation of the file) nor the possibility that the optimizer breaks my code in an opaque way that I won't notice until something bad and inscrutable happens at runtime.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
It's super cool to see this in practice, and for me it helps put more trust in the compiler that it does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
#include <vector>

unsigned add(unsigned x, unsigned y) {
    std::vector vx {x};
    std::vector vy {y};
    auto res = vx[0] + vy[0];
    return res;
}

The extent to which you can "fool the optimizer" is highly dependent on the language and the code you're talking about. Python is a great example of a language that is devilishly hard to optimize precisely because of the language semantics. C and C++ are entirely different examples with entirely different optimization issues, which usually have to do with pointers and references and what the compiler is allowed to infer.
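A hedged sketch of the pointer-aliasing point for C (illustrative functions; restrict is the standard way to rule aliasing out):

/* Without restrict, dst and src may alias, so src[0] must be re-read
   after the store to dst[0]. */
void scale2(float *dst, const float *src) {
    dst[0] = src[0] * 2.0f;
    dst[1] = src[0] * 2.0f;
}

/* With restrict, the compiler may keep src[0] in a register for both stores. */
void scale2_restrict(float *restrict dst, const float *restrict src) {
    dst[0] = src[0] * 2.0f;
    dst[1] = src[0] * 2.0f;
}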
The point? Don't just assume your compiler will magically make all your performance issues go away and produce optimal code. Maybe it will, maybe it won't.
As always, the main performance lessons should always be "1) Don't prematurely optimize", and "2) If you see perf issues, run profilers to try to definitively nail where the perf issue is".