Here's a way weirder example:
volatile int x = 5;
printf("%d in hex is 0x%x.\n", x, x);
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.
So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!
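One hedged way to sidestep the question, whatever the exact sequencing wording turns out to be: read the volatile object exactly once into an ordinary local and pass the local twice. The helper name `format_hex_once` is made up for illustration, and the cast to `unsigned` is there only because `%x` formally expects an unsigned argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats *p the way the printf above does,
 * but performs only one volatile read. Returns snprintf's result. */
int format_hex_once(char *buf, size_t len, const volatile int *p)
{
    assert(buf != NULL && p != NULL);
    int snapshot = *p; /* the single volatile access */
    return snprintf(buf, len, "%d in hex is 0x%x.", snapshot, (unsigned)snapshot);
}
```

With only one access to x in the whole expression, there is nothing left to be unsequenced against.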
It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
-Denial: "I know what signed overflow does on my machine."
-Anger: "This compiler is trash! Why doesn't it just do what I say!?"
-Bargaining: "I'm submitting this proposal to wg14 to fix C..."
-Depression: "Can you rely on C code for anything?"
-Acceptance: "Just don't write UB."
The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.
It’s not. All that matters is what C compilers actually do and what real C programs expect.
This is a good thing. It creates a culture where the two sides meet each other where they’re at.
- Making a Turing machine have deterministic and predictable results is hard.
- Modern hardware is complex and getting all hardware to behave the same way requires a strong mathematical abstraction.
C was never intended to be a fully defined mathematical abstraction. It was a language which was easy to write a compiler for. That's its original strength. Trying to make it something it isn't is the problem. Either choose a language which does have such abstractions or understand the drawbacks of the tool you are using.
Right tool for the right job.
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"
Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.
Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.
The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance, not a blocker.
As for UB, the compiler has the final say. Nobody should write nontrivial C without understanding their compiler, the same as nobody should write C without understanding their text editor.
Code in other languages breaks between versions; in C there are projects with code from every version at once!
Looking at it another way, work put into a C compiler enables you to write nontrivial code.
Part of the reason for all the UB in OpenBSD is that UBSan doesn't run on that platform. When I ported OpenBSD's httpd to Linux, I found that UBSan tripped before the server even came up because the config flag parsing shifts into the MSB of a signed integer.
I tried to contribute back a patch (just make the flag bitfield unsigned), but it was ignored. I think if UBSan ran natively on OpenBSD, then there would be a lot more of these patches, and the maintainers would have to take an official stance on whether they think these bugs matter.
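For readers who have not seen this class of bug, here is a minimal sketch of the unsigned-bitfield fix. It is not the httpd code; `set_flag` and the 32-bit flag width are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* With a 32-bit signed int, 1 << 31 shifts into the sign bit, which is UB.
 * Doing the shift on an unsigned 32-bit value is well defined. */
uint32_t set_flag(uint32_t flags, unsigned bit)
{
    assert(bit < 32); /* shifting by the full width or more is UB even when unsigned */
    return flags | ((uint32_t)1 << bit);
}
```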
i = i++
The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.
I mean, you have to go out of your way and use a cast to get the UB in the first example.
For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.
For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
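A rough sketch of what "picking a conversion" could look like with the range check included. `float_to_int_checked` is a made-up name, the sketch truncates toward zero rather than rounding (calling `roundf` or `ceilf` first would slot in before the range check), and it assumes `INT_MIN` is a negated power of two so that it converts to `float` exactly.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Truncates f toward zero into *out. Returns false, leaving *out alone,
 * when f is NaN or outside the range int can hold. */
bool float_to_int_checked(float f, int *out)
{
    assert(out != NULL);
    const float lo = (float)INT_MIN;  /* assumed exactly representable */
    const float hi = -(float)INT_MIN; /* one past INT_MAX, also exact */
    if (!(f >= lo && f < hi)) {       /* written so that NaN fails the test */
        return false;
    }
    *out = (int)f;
    return true;
}
```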
> For all you know the compiler has no internal way to even express your intention here.
I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?
> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.
I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.
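As far as I can tell the standard only says NULL expands to an implementation-defined null pointer constant, which may be a plain integer `0` as well as `(void *)0`, so the worry in the quote applies to variadic calls, where no prototype converts the argument. A small sketch with a made-up `count_strings` function:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical execl-style function: counts its string arguments up to
 * a terminating null pointer. */
size_t count_strings(const char *first, ...)
{
    assert(first != NULL); /* at least one string is required */
    va_list ap;
    size_t n = 0;
    va_start(ap, first);
    for (const char *s = first; s != NULL; s = va_arg(ap, const char *)) {
        n++;
    }
    va_end(ap);
    return n;
}
```

Called as `count_strings("a", "b", (const char *)NULL)`, the cast guarantees that a pointer is what gets passed, even on an implementation where `NULL` is defined as `0`.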
I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
struct foo {int i;};
int func(struct foo *x) {return x->i;}
int main() {
int (*funcptr)(void*) = (int (*)(void*)) &func;
struct foo foo = { 42 };
return funcptr(&foo);
}
While this is all kosher per the language lawyers:
struct foo {int i;};
int func(void *x) {return ((struct foo *)x)->i;}
int main() {
int (*funcptr)(void*) = &func;
struct foo foo = { 42 };
return funcptr(&foo);
}
Although many newer languages are safer (with the exclusion of Rust, primarily by being slower), the same kinds of issues that are there in C are there in these languages; their effects are just harder to see.
People complain about C as though they know how to fix it.
First let me state the case for C. It’s meant to be used as a systems language that’s as close to assembly as possible while remaining portable (compared to assembly). As such it’s the first high-level language developed for any new processor.
Given the above predicate: Isn’t everything described in the article as it should be?
Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.
For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?
> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).
Is "nontrivial" defined?
How would one identify "nontrivial" C code?
Is there an objective measure (defined)?
Or is it a matter of personal opinion that could vary from person to person (undefined)?
3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.
By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.
https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...
C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.
C is great for writing low-level system code where you need to optimize performance down to the last cycle. It not abstracting away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.
C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.
Sigh. s/sizeof(int)/_Alignof(int)/.
There are good reasons for an implementation to have sizeof(int) = _Alignof(int) and not a mere multiple of it, but if you are going to discuss subtle points and UB, just stick to the language guarantees.
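To make the substitution concrete, here is a sketch of an alignment test written against `_Alignof` rather than `sizeof`. The name `is_aligned` is hypothetical, and the check leans on the usual (implementation-defined) assumption that converting a pointer to `uintptr_t` preserves the low address bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of align, where align is a power of two
 * such as _Alignof(int). */
bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A caller would write `is_aligned(ptr, _Alignof(int))`, which stays correct even on an implementation where `sizeof(int)` is a larger multiple of the alignment.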
> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.
You don't program in C on such a machine. Or maybe memory is virtualized, and it does not matter that your object lives at physical address zero, as long as you can map a non-zero virtual address to it.
> So how do you print an uid_t?
if ((uid_t)-1 < (uid_t)0) {
// uid_t is signed
printf("%" PRIdMAX, (intmax_t)id);
} else {
// uid_t is unsigned
printf("%" PRIuMAX, (uintmax_t)id);
}
> It’s not rare for the denominator to come from untrusted input.
It's not rare for the array index to come from untrusted input.
It's not rare for the supposedly valid UTF-8 string to come from untrusted input.
...
Why single out division? This problem affects every partially defined operation. In the case of division at least, everyone learned in school that thou shalt not divide by zero. Adding two untrusted integers and forgetting that signed overflow is UB, not defined as a modulo? Your average programmer is much less likely to see that coming.
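Both cases can be guarded the same way, by testing the operands before the operation rather than inspecting the result afterwards. A minimal sketch, with `add_checked` and `div_checked` as made-up names:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores a + b in *sum and returns true, or returns false if the
 * mathematical result does not fit in an int. */
bool add_checked(int a, int b, int *sum)
{
    assert(sum != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}

/* Stores num / den in *quot and returns true, or returns false for the
 * two undefined cases: division by zero and INT_MIN / -1. */
bool div_checked(int num, int den, int *quot)
{
    assert(quot != NULL);
    if (den == 0 || (num == INT_MIN && den == -1)) {
        return false;
    }
    *quot = num / den;
    return true;
}
```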
> unsigned char a = 0xff;
> unsigned char b = 1;
> unsigned char zero = 0;
> bool overflowed = (a + b) == zero;
>
> unsigned char a = 0x80;
> uint64_t b = a << 24;
Please. Convert your operands to wide enough types before the operation. Convert your results back to narrow enough types to compensate for integer promotion to wider types than you would have liked. Do that consistently, and you're good.
Here:
unsigned char a = 0xff;
unsigned char b = 1;
unsigned char zero = 0;
bool overflowed = (unsigned char)(a + b) == zero;
unsigned char a = 0x80;
uint64_t b = (uint32_t)a << 24;
> bool parse_packet(const uint8_t* bytes) {
> const int* magic_intp = (const int*)bytes; // UB!
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but the cast itself is not, since uint8_t is ~ char and char* can be cast to and from any type.
You might try to argue that uint8_t is not necessarily char. It is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is.)
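Whichever way the cast question is settled, the pointer conversion can be avoided altogether by copying the bytes out. The sketch below assumes a 32-bit magic value (the quoted code uses `int`), and both function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copies four bytes into a uint32_t in the host's native byte order.
 * memcpy has no alignment requirement on its source. */
uint32_t load_u32_native(const uint8_t *bytes)
{
    assert(bytes != NULL);
    uint32_t v;
    memcpy(&v, bytes, sizeof v);
    return v;
}

/* Decodes four bytes as little-endian regardless of the host. */
uint32_t load_u32_le(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint32_t)bytes[0]
         | (uint32_t)bytes[1] << 8
         | (uint32_t)bytes[2] << 16
         | (uint32_t)bytes[3] << 24;
}
```

For a wire format the explicit decode is probably the better fit, since the result does not depend on the machine it runs on.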
```
int isxdigit(int c) {
    if (c == EOF) {
        return false;
    }
    return some_array[c];
}
```
If you write code like this, then everything in programming is UB.
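For comparison, a table lookup with the index range-checked first might look like the sketch below. `checked_isxdigit` and `hex_table` are hypothetical names, not anything from the article or from a real libc.

```c
#include <assert.h>
#include <limits.h>

/* One slot per unsigned char value; unlisted entries are zero. */
static const unsigned char hex_table[UCHAR_MAX + 1] = {
    ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
    ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1,
    ['a'] = 1, ['b'] = 1, ['c'] = 1, ['d'] = 1, ['e'] = 1, ['f'] = 1,
    ['A'] = 1, ['B'] = 1, ['C'] = 1, ['D'] = 1, ['E'] = 1, ['F'] = 1,
};

static_assert(sizeof hex_table / sizeof hex_table[0] == UCHAR_MAX + 1,
              "every unsigned char value must have a slot");

/* Returns 1 for hexadecimal digits and 0 for everything else,
 * including EOF, other negative values and values above UCHAR_MAX. */
int checked_isxdigit(int c)
{
    if (c < 0 || c > UCHAR_MAX) {
        return 0;
    }
    return hex_table[c];
}
```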
the only people complaining about being able to do awful things are people that do awful things
I stopped reading there. If you have decades of experience in C/C++ and don't know what that means (and that it's arch specific), I'll assume those decades were mostly the same year over and over.
C/C++ are horrible languages, but they deserve better opponents than that.
Shrug.
(I hope casting fear is not UB)
What a contradiction. Strong evidence that standard-driven programming language development is much worse than implementation-driven development. Standards should be used for data types and external interfaces/protocols, not programming languages.
So Linus was right? But for a second reason too:
C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do _nothing_ but keep the C++ programmers out, that in itself would be a huge reason to use C.
That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)
Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.
No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:
- eschewing boomer loops in favor of ranges
- using RAII with smart pointers
- move semantics
- using STL containers instead of raw arrays
- borrowing using spans and string views
These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)
Unaligned pointer accesses are UB because different systems handle them differently. This 'should' be to allow the program to be portable by doing what the system normally does.
Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."
int c = abs(a) + abs(b); if (a > c) // overflow
Is UB because some system might do overflow differently. In practice every system wraps around.
That should be a valid check, instead it gets optimised away because it 'can't' happen.
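A version of that check the optimiser has no licence to remove tests the operands before adding instead of testing the sum afterwards. This is only a sketch, and `abs_sum_checked` is a made-up name.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores |a| + |b| in *out and returns true, or returns false if any
 * step would overflow an int. */
bool abs_sum_checked(int a, int b, int *out)
{
    assert(out != NULL);
    if (a == INT_MIN || b == INT_MIN) {
        return false; /* negating INT_MIN already overflows */
    }
    int abs_a = a < 0 ? -a : a;
    int abs_b = b < 0 ? -b : b;
    if (abs_a > INT_MAX - abs_b) {
        return false; /* the sum would exceed INT_MAX */
    }
    *out = abs_a + abs_b;
    return true;
}
```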
C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
Everything else is a waste of time!
The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.
Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.