I am very sympathetic to wanting nice static binaries that can be shipped around as a single artifact[0], but... surely at some point we have to ask if it's worth it? If nothing else, that feels like a little bit of a code smell; surely if your actual executable code doesn't even fit in 2GB it's time to ask if that's really one binary's worth of code or if you're actually staring at like... a dozen applications that deserve to be separate? Or get over it the other way and accept that sometimes the single artifact you ship is a tarball / OCI image / EROFS image for systemd[1] to mount+run / self-extracting archive[2] / ...
[0] Seriously, one of my background projects right now is trying to figure out if it's really that hard to make fat ELF binaries.
[1] https://systemd.io/PORTABLE_SERVICES/
[2] https://justine.lol/ape.html > "PKZIP Executables Make Pretty Good Containers"
FAANG or not, nothing you're running should require an executable with a .text section approaching 2 GB. If you're bumping into that limit, your build process likely lacks dead code elimination in the linking step. You should be using LTO for release builds. Even the traditional approach (compile your object files with -ffunction-sections and link with --gc-sections) does a good job of culling dead code at function-level granularity.
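To make that concrete, here's a minimal sketch of the function-sections route (file and function names are made up; the flags are the standard GCC/Clang ones):

```c++
// dce_demo.cc: hypothetical translation unit with an unreferenced function.
//
// Compile each function into its own section, then let the linker drop the
// ones nothing references:
//   g++ -O2 -ffunction-sections -fdata-sections -c dce_demo.cc
//   g++ -Wl,--gc-sections -Wl,--print-gc-sections dce_demo.o -o dce_demo
//
// --print-gc-sections should report .text._Z10never_usedv being removed.

#include <cstdio>

// Never called from anywhere; with -ffunction-sections it gets its own
// .text.* section, which --gc-sections can garbage-collect at link time.
int never_used() {
    return 42;
}

int main() {
    std::puts("only main() survives in .text");
    return 0;
}
```

LTO goes further because the compiler sees the whole program at link time, but even this flag pair is cheap to adopt and usually shrinks .text noticeably.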
Move all the hot BBs near each other, right?
Facebook's solution: https://github.com/llvm/llvm-project/blob/main/bolt%2FREADME...
Google's: https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...
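For anyone wondering what "hot BBs near each other" buys you: BOLT and Propeller do this per basic block, driven by real profiles. You can see a crude, source-level approximation of the same idea with the hot/cold attributes (toy example, names invented):

```c++
// hotcold_demo.cc: hand-annotated hot/cold separation. BOLT/Propeller do the
// equivalent automatically, per basic block, from real profiles; this is only
// a source-level approximation of the idea.
//
// Build (GCC): g++ -O2 -c hotcold_demo.cc
// With real profiles, -fprofile-use drives similar splitting without attributes.

#include <cstdio>
#include <cstdlib>

// Rarely executed error path: marking it cold asks the compiler to place it
// with other unlikely code (e.g. .text.unlikely), off the hot i-cache lines.
[[gnu::cold]] [[noreturn]] void die(const char* msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    std::abort();
}

// Hot path: the kind of code the layout tools try to pack densely together.
[[gnu::hot]] long sum(const int* v, long n) {
    if (v == nullptr) die("null input");  // unlikely branch, jumps to cold code
    long s = 0;
    for (long i = 0; i < n; ++i) s += v[i];
    return s;
}
```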
I see this often, even in communities of software engineers: people who are unaware of certain limitations at scale will announce that the research is unnecessary.
Makes sense, but in the assembly output just after, there is not a single JMP instruction. Instead, CALL <immediate> is replaced with putting the address in a 64-bit register, then CALL <register>, which makes even more sense. But why mention the JMP thing then? Is it a mistake or am I missing something? (I know some calls are replaced by JMP, but that's done regardless of -mcmodel=large)
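To make the contrast concrete, here's roughly what I'd expect either way (function names invented; the exact asm varies with compiler version and PIC settings, so treat the comments as a sketch, not gospel):

```c++
// cmodel_demo.cc: the call sequences I'd expect (names invented).
//
//   g++ -O2 -fno-pic -S cmodel_demo.cc                  # small code model (default)
//   g++ -O2 -fno-pic -S -mcmodel=large cmodel_demo.cc   # large code model

extern long helper(long x);   // defined in some other, possibly far-away, TU

long plain_call(long x) {
    // small:  call helper           (rel32, +/- 2 GiB reach)
    // large:  movabs $helper, %rax
    //         call *%rax            (64-bit absolute, any distance)
    return helper(x) + 1;         // the "+ 1" keeps this a real call
}

long tail_call(long x) {
    // Here the call is in tail position, so a jump is emitted instead:
    // small:  jmp helper            (still rel32)
    // large:  movabs $helper, %rax
    //         jmp *%rax
    return helper(x);
}
```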
(I wonder, but have no particular insight into, whether LTO builds can do smarter things here -- most calls are local, but the handful of far calls could use the more expensive spelling.)
at some point surely some dynamic linking is warranted
However, Google, Meta, and ByteDance have run into the x86-64 relocation distance issue with their huge C++ server binaries. To my knowledge, industry users in other domains haven't hit this problem.
To address this, Google adopted the medium code model approximately two years ago for its sanitizer and PGO instrumentation builds. CUDA fat binaries also caused problems. I'd argue that linker script `INSERT BEFORE/AFTER` for orphan sections (https://reviews.llvm.org/D74375) served as a key mitigation.
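For anyone unfamiliar with the medium code model: roughly, code and small data keep the cheap RIP-relative addressing, while objects above a size threshold move into separate .ldata/.lbss/.lrodata sections reached with 64-bit absolute addresses. A small sketch of my understanding (not Google's actual setup; sizes and names invented):

```c++
// mcmodel_medium_demo.cc: how the medium code model treats big objects.
//
//   g++ -O2 -mcmodel=medium -mlarge-data-threshold=65536 -S mcmodel_medium_demo.cc
//
// Objects above the threshold land in .ldata/.lbss/.lrodata, which the linker
// may place far from .text and which are reached via 64-bit absolute
// relocations; code and small data keep small-code-model addressing.

static char huge_buffer[1 << 20];  // above the threshold: goes to .lbss
static long counter;               // small: stays in ordinary .bss

long touch(long i) {
    huge_buffer[i & ((1 << 20) - 1)] = 1;  // 64-bit absolute addressing
    ++counter;                             // 32-bit RIP-relative addressing
    return huge_buffer[0] + counter;       // read back so nothing is optimized out
}
```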
I hope that a range extension thunk ABI, similar to what AArch64/Power have, gets defined for the x86-64 psABI. It would be better than the long branch pessimization we currently get with -mcmodel=large.
---
It seems that nobody has run into this .eh_frame_hdr implementation limitation yet:
* `.eh_frame_hdr -> .text`: GNU ld and ld.lld only support 32-bit offsets (`table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4;`) as of Dec 2025.
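For context, the limit falls out of the binary search table in .eh_frame_hdr: with `DW_EH_PE_datarel | DW_EH_PE_sdata4`, every entry is a signed 32-bit offset from the start of the section, so functions more than ~2 GiB away can't be encoded. Roughly (struct and field names are mine, not the spec's):

```c++
// eh_frame_hdr_sketch.cc: rough shape of .eh_frame_hdr, just to show where
// the 32-bit limit comes from. See the LSB spec and the unwinder sources for
// the real layout.

#include <cstdint>

struct EhFrameHdrPrefix {
    uint8_t version;           // currently 1
    uint8_t eh_frame_ptr_enc;  // encoding of the pointer to .eh_frame
    uint8_t fde_count_enc;     // encoding of the FDE count
    uint8_t table_enc;         // encoding of the search table entries;
                               // in practice DW_EH_PE_datarel | DW_EH_PE_sdata4
    // (the encoded eh_frame_ptr and fde_count follow, then the table)
};

// With datarel | sdata4, each table entry is a pair of *signed 32-bit*
// offsets relative to the start of .eh_frame_hdr, so a function that lives
// more than ~2 GiB away from .eh_frame_hdr simply cannot be represented.
struct SearchTableEntry {
    int32_t initial_location;  // function start, as a datarel offset
    int32_t fde_address;       // its FDE in .eh_frame, as a datarel offset
};
```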
You can use thunks/trampolines. lld can generate them for some architectures, presumably also for x86-64, though I don't know why it didn't in your case.
But, as with the large code model, adding trampolines can be expensive, both in icache behavior and in raw execution time if a trampoline sits on a particularly hot path.
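For anyone who hasn't seen one: a range-extension thunk is just a linker-inserted stub near the caller that does the long jump on its behalf. A hand-written stand-in to show where the cost comes from (names invented; a real thunk is a few instructions of asm emitted by the linker, not C++):

```c++
// thunk_sketch.cc: a hand-written stand-in for a linker-generated
// range-extension thunk. Real thunks are emitted by the linker for targets
// like AArch64 when a branch target is out of range.

extern void far_callee();    // imagine this ends up >2 GiB away in .text

// The "thunk": a stub placed near the caller that forwards to the real
// callee. A linker thunk would do the forwarding with an absolute or
// GOT-loaded 64-bit jump: extra instructions, an extra i-cache line, and an
// extra branch on what might be a hot path, which is the cost noted above.
void far_callee_thunk() {
    far_callee();            // tail call: compiles to a jump, not a call
}

void hot_function() {
    // The original call site keeps its cheap rel32 call, because the thunk
    // itself is nearby; only the thunk pays for the long jump.
    far_callee_thunk();
}
```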
Also, we, as an industry of software engineers, need to re-examine the hard limits we assumed would never be reached, such as the .text size limit.
Anyway, very good read.