Hacker News

Everything in C is undefined behavior

468 points by lycopodiopsida

by muvlon

17 subcomments

Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.
Here's a way weirder example:
```
  volatile int x = 5;
  printf("%d in hex is 0x%x.\n", x, x);
```
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.
So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!
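Whether or not one accepts that reading of the standard, the question can be sidestepped by reading the volatile object once into a plain local and passing the local twice. The following is a minimal sketch of that idea; `format_hex` is a hypothetical helper name, and it assumes a single read of `x` is what the caller actually wants.

```c
#include <stdio.h>

/* Hypothetical helper: formats a value read from a volatile object.
   The volatile object is read exactly once, into a plain local,
   before the formatting call is evaluated. */
int format_hex(volatile int *src, char *buf, size_t len) {
    int snapshot = *src;   /* the only volatile access */
    return snprintf(buf, len, "%d in hex is 0x%x.", snapshot, (unsigned)snapshot);
}
```

Called with a pointer to a `volatile int` holding 5, this writes "5 in hex is 0x5." into the buffer.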

by beeforpork

6 subcomments

The UB in unaligned pointers is even worse: an unaligned pointer in itself is UB, not only an access to it. So even implicit casting a void*v to an int*i (like 'i=v' in C or 'f(v)' when f() accepts an int*) is UB if the cast pointer is not aligned to int.
It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
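A common way to stay clear of the problem is to never form the int* at all and copy the bytes instead. This is only a sketch; `load_int` is a hypothetical name, and it assumes the bytes at `src` really hold an int in the host's own representation.

```c
#include <string.h>

/* Hypothetical helper: reads an int from a position that may not be
   suitably aligned for int. No int* pointing at src is ever created;
   memcpy works on bytes, which have no alignment requirement. */
int load_int(const void *src) {
    int value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

Compilers for targets that tolerate unaligned loads typically turn this into a single load instruction, so the portable form usually costs nothing.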

by quelsolaar

7 subcomments

The 5 stages of learning about UB in C:
-Denial: "I know what signed overflow does on my machine."
-Anger: "This compiler is trash! why doesn't it just do what I say!?"
-Bargaining: "I'm submitting this proposal to wg14 to fix C..."
-Depression: "Can you rely on C code for anything?"
-Acceptance: "Just don't write UB."

by greysphere

5 subcomments

The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. If you are going to be that generous, every function call is UB because it could exceed stack space, which is basically true in any language (up to the equivalent definition of UB in that language). I feel like C has enough actual rough edges that deserve attention that sensationalism like this muddies folks' attention (particularly novices') and can end up doing more harm than good.

by bestouff

6 subcomments

The problem of UB is not really that it may crash in some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).
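A small illustration of the "removing code" effect, as a sketch rather than a guaranteed outcome: what a particular compiler does with the first function depends on the compiler and its optimization level. Both function names are hypothetical.

```c
#include <stddef.h>

/* Risky shape: p is dereferenced before it is tested. Because *p is UB
   when p is null, an optimizer is allowed to assume p != NULL and drop
   the check that follows. */
int first_risky(const int *p) {
    int v = *p;
    if (p == NULL) return -1;   /* may be removed as unreachable */
    return v;
}

/* Safer shape: test first, dereference only where p is known valid. */
int first_checked(const int *p) {
    if (p == NULL) return -1;
    return *p;
}
```

Only `first_checked` may be called with a null pointer; calling `first_risky(NULL)` is exactly the UB being discussed.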

by parasti

15 subcomments

I have never in my 20 years of writing C heard so much about undefined behavior as I have in the past 6 months on Hacker News. It has never entered the conversation. You write the code. If it doesn't work, you debug it and apply a fix or a workaround. Why does the idea of undefined behavior in C get to the front page so consistently?

by jb1991

2 subcomments

Some of the C++ code in this article has not been idiomatic in over a decade, and would be considered a code smell today. The language has evolved into quite a different language than when it was first created. As soon as I saw all of those raw pointers and direct pointer access, it was clear that at least part of this article should be taken with a grain of salt.
The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.

by pizlonator

1 subcomments

The problem is incorrectly assuming that the spec is meaningful in some kind of rigorous way.
It’s not. All that matters is what C compilers actually do and what real C programs expect.
This is a good thing. It creates a culture where the two sides meet each other where they’re at

by debugnik

2 subcomments

As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.

by maple3142

2 subcomments

Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, there are absolutely no requirements on the behavior of P'.

by rom1v

1 subcomments

A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...

by hunterpayne

0 subcomment

What all these C programmers are pointing out is 2 fold:
- Making a Turing machine have deterministic and predictable results is hard.
- Modern hardware is complex and getting all hardware to behave the same way requires a strong mathematical abstraction.
C was never intended to be a fully defined mathematical abstraction. It was a language which was easy to write a compiler for. That's its original strength. Trying to make it something it isn't is the problem. Either choose a language which does have such abstractions or understand the drawbacks of the tool you are using.
Right tool for the right job.

by psim1

0 subcomment

I like the ideas of this article but would not use SPARC as a main badguy in my examples. A naive and probably popular takeaway would be, "Thank goodness I am not writing for SPARC and don't need to worry about these SPARC architectural concerns!"

by __0x01

4 subcomments

> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"

by rurban

1 subcomments

Very bad advice. Of course good new LLMs know about UB, but you still need to use UBSan (i.e. -fsanitize=undefined), and not your LLM.

by mjs01

1 subcomments

Integer promotion seems to be the source of many signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
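To make the question concrete, here is a sketch of the classic case, assuming a typical platform with 16-bit `unsigned short` and 32-bit `int`. The function name `mul_u16` is hypothetical.

```c
#include <stdint.h>

/* Multiplies two 16-bit values. Written as plain `a * b`, both operands
   would be promoted to (signed) int on a platform with 32-bit int, and
   65535 * 65535 would overflow int. Converting one operand to uint32_t
   first keeps the whole multiplication unsigned. */
uint32_t mul_u16(uint16_t a, uint16_t b) {
    return (uint32_t)a * b;
}
```

With the cast, `mul_u16(65535, 65535)` yields 4294836225; without it, the same inputs would hit signed overflow even though every declared type is unsigned.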

by JonChesterfield

0 subcomment

Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.
Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.
Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.
The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
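For reference, the "bitcast as memcpy" idiom mentioned above looks roughly like this. It is a sketch: `float_bits` is a hypothetical name, and the values in the usage note assume IEEE 754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Returns the object representation of a float as a 32-bit integer.
   Copying the bytes avoids reading a float through a uint32_t lvalue,
   which the aliasing rules do not allow. */
uint32_t float_bits(float f) {
    _Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On an IEEE 754 platform, `float_bits(1.0f)` is 0x3f800000, and mainstream compilers generally reduce the memcpy to a register move.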

by casey2

0 subcomment

And that's a good thing. UB is another mechanism to speed up the development of compilers, many other languages fall trap to over defining while we lack the methods to solve such problems cleanly (believe me, the modern c++ people have tried). Usually this is the case because they believe strongly that their methods work despite evidence.
As for UB, the compiler has the final say. Nobody should write nontrivial c without understanding their compiler, the same as nobody should write c without understanding their text editor.
Code in other languages breaks between versions, in c there are projects with code from every version at once!
Looking at it another way, work put into a c compiler enables you to write nontrivial code.

by bkallus

0 subcomment

> the OpenBSD project has not been very receptive in the past for bug reports, my sense of “this is probably fine, in practice”, and that if OpenBSD wants to weed out UB from their code base, then that’s a major project that should be done in a better way than me just being the middle man between the LLM and them for a patch here and there.
Part of the reason for all the UB in OpenBSD is that UBSan doesn't run on that platform. When I ported OpenBSD's httpd to Linux, I found that UBSan tripped before the server even came up because the config flag parsing shifts into the MSB of a signed integer.
I tried to contribute back a patch (just make the flag bitfield unsigned), but it was ignored. I think if UBSan ran natively on OpenBSD, then there would be a lot more of these patches, and the maintainers would have to take an official stance on whether they think these bugs matter.
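The shape of that fix, sketched with made-up flag names (these are not the actual httpd identifiers), assuming a 32-bit `int`:

```c
#include <stdint.h>

/* Hypothetical flag bits; the names are illustrative only.
   With a 32-bit int, `1 << 31` shifts into the sign bit, which is what
   UBSan reports. Building the masks from an unsigned constant avoids it. */
#define FLAG_LOW   (UINT32_C(1) << 0)
#define FLAG_HIGH  (UINT32_C(1) << 31)

/* The flag word itself is unsigned as well. */
uint32_t set_flag(uint32_t flags, uint32_t flag) { return flags | flag; }
int has_flag(uint32_t flags, uint32_t flag) { return (flags & flag) != 0; }
```

Setting `FLAG_HIGH` on an empty flag word gives 0x80000000 with no signed arithmetic involved.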

by weinzierl

3 subcomments

A fun one that'd fit the list would be sequence point violations like
```
    i = i++
```

by commandlinefan

1 subcomments

A lot of this stems from trying to insist that char just means "small" and not "8 bits" and that int means "bigger than that" and not "32 bits". In fairness, K&R dealt with an era where 9 bit architectures existed, but char is 8 bits now. Everywhere.

by codeflo

1 subcomments

> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.
The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.

by lelanthran

1 subcomments

I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?
I mean, you have to go out of your way and use a cast to get the UB in the first example.
For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.
For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
> For all you know the compiler has no internal way to even express your intention here.
I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?
> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.
I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.
I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
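On the floating-point conversion point, a range-checked conversion can be sketched as follows. It uses `double` rather than `float` so that the bounds are exactly representable, and it assumes a 32-bit `int` and IEEE 754 doubles; `double_to_int` is a hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>

/* Converts d to int by truncation toward zero, but only when the
   truncated value fits in int. Returns false otherwise. */
bool double_to_int(double d, int *out) {
    /* NaN fails both comparisons, so it is rejected here too. */
    if (!(d > (double)INT_MIN - 1.0 && d < (double)INT_MAX + 1.0)) return false;
    *out = (int)d;   /* in range by the check above, so the conversion is defined */
    return true;
}
```

Rounding to nearest, floor, or ceiling would need an explicit call such as the math.h functions mentioned above before the range check.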

by amiga386

5 subcomments

Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"

    struct foo {int i;};
    int func(struct foo *x) {return x->i;}
    int main() {
        int (*funcptr)(void*) = (int (*)(void*)) &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

While this is all kosher per the language lawyers:

    struct foo {int i;};
    int func(void *x) {return ((struct foo *)x)->i;}
    int main() {
        int (*funcptr)(void*) = &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

by akiarie

3 subcomments

C is still, by far, the simplest language that we have.
Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.
People complain about C as though they know how to fix it.

by tomcam

1 subcomments

I fear I will be downvoted into oblivion but I also want to learn from this.
First let me state the case for C. It’s meant to be used as a systems language that’s as close to assembly as possible while remaining portable (compared to assembly). As such it’s the first high-level language developed for any new processor.
Given the above predicate: Isn’t everything described in the article as it should be?
Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.
For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?

by wyldfire

0 subcomment

Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.

by keyle

0 subcomment

When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.

by sltr

0 subcomment

For a deep dive on UB with printf, see https://srs.fyi/see-conversions/
> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).

by danborn26

1 subcomments

The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.

by 1vuio0pswjnm7

0 subcomment

"My point is that ALL nontrivial C and C++ code has UB."
Is "nontrivial" defined
How would one identify "nontrivial" C code
Is there an objective measure (defined)
Or is it a matter of personal opinion that could vary from person to person (undefined)

by bvrmn

0 subcomment

I really like Zig's approach to UB, especially that alignment is part of the type, and all those wordy builtins for conversions. Staring at them makes you think about what you are doing wrong with your data model when it now requires 3 lines of casting expressions.

by elnatro

3 subcomments

Is there a way to avoid undefined behavior in C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?

by kajaktum

0 subcomment

I want a language that is a group of bit (0,1) and the xor operator. Everything else is built on top of that.

by fjfaase

2 subcomments

Is comparing a signed integer with an unsigned integer UB? I recently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.
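For context, a mixed comparison of `int` and `unsigned` is defined behavior, but the signed operand is converted to unsigned first, which is a likely source of a surprising answer. A sketch with hypothetical function names:

```c
#include <stdbool.h>

/* Spells out what a plain `s < u` does implicitly: s is converted to
   unsigned before the comparison, so -1 becomes UINT_MAX. */
bool less_naive(int s, unsigned u) { return (unsigned)s < u; }

/* Handles the negative case before the conversion can change the value. */
bool less_safe(int s, unsigned u) { return s < 0 || (unsigned)s < u; }
```

`less_naive(-1, 1u)` is false while `less_safe(-1, 1u)` is true. GCC and Clang can flag the implicit form with -Wsign-compare.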

by 0x20cowboy

0 subcomment

Life is undefined behaviour.

by veltas

4 subcomments

From the ANSI C standard:
```
  3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.  Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
```
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.
By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.

by raluk

2 subcomments

In C / C++ there are two kinds of undefined behaviour. One is where the standard explicitly says that something is UB. The other is everything else that the standard does not define.

by y42

0 subcomment

shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".
https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...

by QuiEgo

0 subcomment

C does not abstract differences in underlying hardware well. Systems programmers know if they have an architecture that can't handle unaligned accesses or that the address they are doing load/stores from is a mmio register. Systems programmers know the difference between a virtual address and a physical address and have debugged MPU faults or MMU table walks and page faults more times than they want to think about.
C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.
C is great for writing low-level system code where you need to optimize performance down to the last cycle. It not abstracting away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.
C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.

by el_pollo_diablo

0 subcomment

> probably meaning on an address that’s a multiple of sizeof(int), but who knows
Sigh. s/sizeof(int)/_Alignof(int)/.
There are good reasons for an implementation to have sizeof(int) = _Alignof(int) and not a mere multiple of it, but if you are going to discuss subtle points and UB, just stick to the language guarantees.
> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.
You don't program in C on such a machine. Or maybe memory is virtualized, and it does not matter that your object lives at physical address zero, as long as you can map a non-zero virtual address to it.
> So how do you print an uid_t?
```
    if ((uid_t)-1 < (uid_t)0) {
        // uid_t is signed
        printf("%" PRIdMAX, (intmax_t)id);
    } else {
        // uid_t is unsigned
        printf("%" PRIuMAX, (uintmax_t)id);
    }
```
> It’s not rare for the denominator to come from untrusted input.
It's not rare for the array index to come from untrusted input.
It's not rare for the supposedly valid UTF-8 string to come from untrusted input.
...
Why single out division? This problem affects every partially defined operation. In the case of division at least, everyone learned in school that thou shalt not divide by zero. Adding two untrusted integers and forgetting that signed overflow is UB, not defined as a modulo? Your average programmer is much less likely to see that coming.
```
    > unsigned char a = 0xff;
    > unsigned char b = 1;
    > unsigned char zero = 0;
    > bool overflowed = (a + b) == zero;
    >
    > unsigned char a = 0x80;
    > uint64_t b = a << 24;
```
Please. Convert your operands to wide enough types before the operation. Convert your results back to narrow enough types to compensate for integer promotion to wider types than you would have liked. Do that consistently, and you're good.
Here:
```
    unsigned char a = 0xff;
    unsigned char b = 1;
    unsigned char zero = 0;
    bool overflowed = (unsigned char)(a + b) == zero;

    unsigned char a = 0x80;
    uint64_t b = (uint32_t)a << 24;
```

by justmarc

0 subcomment

The art is actually making sure it all stays defined behavior

by alper

0 subcomment

Isn't the article mostly saying that SPARC sucks?

by saltyoldman

0 subcomment

Probably not "everything": the vast, vast, vast majority of everything you are looking at on your screen right now is written in C.

by DostLeFan

0 subcomment

Very interesting article. I'm in love with C++, and I cannot say that I'm a good developer, but interesting to discover where UB can be. (Sorry I'm not a good english speaker)

by dmitrygr

3 subcomments

I stopped reading about here:
```
    > bool parse_packet(const uint8_t* bytes) {
    >   const int* magic_intp = (const int*)bytes;   // UB!
```
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.
You might try to argue that uint8_t is not necessarily char. It is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is)

by up2isomorphism

0 subcomment

U just need to read the title and 5 lines to know this must be a rust guy.

by stackedinserter

0 subcomment

How can it be valid implementation of isxdigit?
```
int isxdigit(int c) {
    if (c == EOF) { return false; }
    return some_array[c];
}
```
If you write code like this, then everything in programming is UB.
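A defensive variant that checks the index before the lookup could look like the sketch below. The names `my_isxdigit` and `is_hex` are hypothetical, and this is not how any particular libc does it; the standard only requires the argument to be EOF or representable as unsigned char, so the extra check goes beyond what an implementation must provide.

```c
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical lookup table: one entry per unsigned char value,
   nonzero for hexadecimal digits, zero elsewhere. */
static const unsigned char is_hex[UCHAR_MAX + 1] = {
    ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
    ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1,
    ['a'] = 1, ['b'] = 1, ['c'] = 1, ['d'] = 1, ['e'] = 1, ['f'] = 1,
    ['A'] = 1, ['B'] = 1, ['C'] = 1, ['D'] = 1, ['E'] = 1, ['F'] = 1,
};

int my_isxdigit(int c) {
    /* Rejects EOF and every other value outside the table. */
    if (c < 0 || c > UCHAR_MAX) return 0;
    return is_hex[c];
}
```

The range check costs two comparisons per call, which is the trade-off between this and the unchecked version.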

by my-next-account

2 subcomments

Hello, it's me. I'm not afraid of UB.

by fithisux

0 subcomment

UB can also have impact in logical cohesion of codebase.

by synergy20

0 subcomment

If C is more UB-unsafe than it seems, what is the solution here?

by cracki

1 subcomments

We know. This is not news.

by VimEscapeArtist

0 subcomment

Wait until he discovers PowerShell ;D

by NooneAtAll3

1 subcomments

feels like https://xkcd.com/1499/
the only people complaining about being able to do awful things are people that do awful things

by groby_b

0 subcomment

"not correctly aligned (probably meaning on an address that’s a multiple of sizeof(int), but who knows)"
I stopped reading there. If you have decades of experience in C/C++ and don't know what that means (and that it's arch specific), I'll assume those decades were mostly the same year over and over.
C/C++ are horrible languages, but they deserve better opponents than that.

by SanjayMehta

0 subcomment

I used to teach C programming and one time I got anonymous feedback: "when this instructor doesn't know the answer he says "it's compiler dependent.""
Shrug.

by jraph

2 subcomments

Yet another push to use LLMs after casting fear. Now it should be illegal not to use LLMs. A good start of the day.
(I hope casting fear is not UB)

by pphysch

0 subcomment

It's also worth highlighting that C is perhaps the most officially standardized programming language in history.
What a contradiction. Strong evidence that standard-driven programming language development is much worse than implementation-driven development. Standards should be used for data types and external interfaces/protocols, not programming languages.

by EGreg

0 subcomment

a good case can be made that use of C++ is a SOX violation
So Linus was right? But for a second reason too:
C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do _nothing_ but keep the C++ programmers out, that in itself would be a huge reason to use C.
That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)

by stackghost

7 subcomments

Anyone who uses the construction "C/C++" doesn't write modern C++, and probably isn't very familiar with the recent revisions despite TFA's claims of writing it every day for decades.
Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.
No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:
- eschewing boomer loops in favor of ranges
- using RAII with smart pointers
- move semantics
- using STL containers instead of raw arrays
- borrowing using spans and string views
These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)

by JayJSpringpeace

0 subcomment

[flagged]

by jim33442

0 subcomment

[dead]

by creatorsstack

0 subcomment

[flagged]

by ivandotcodes

0 subcomment

[dead]

by jdw64

0 subcomment

[dead]

by tenego

0 subcomment

[flagged]

by rahadbhuiya

0 subcomment

[dead]

by black_13

0 subcomment

[dead]

by nurettin

0 subcomment

[dead]

by llggbbtt

0 subcomment

[flagged]

by nokeya

2 subcomments

Ok, and?

by Webhix

0 subcomment

maybe rewrite this in go?)

by benj111

0 subcomment

The issue for me with posts like this is that it misses the issue.
Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.
Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."
int c = abs(a) + abs(b); if (a > c) // overflow
Is UB because some system might do overflow differently. In practice every system wraps around.
That should be a valid check, instead it gets optimised away because it 'can't' happen.
C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
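For comparison, a version of that check the compiler is not entitled to remove tests the operands before adding, so the overflow never happens. This is a sketch with a hypothetical name, `abs_sum`.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Computes abs(a) + abs(b) into *out, or returns false if the result
   would not fit in int. Every check runs before the operation it guards. */
bool abs_sum(int a, int b, int *out) {
    if (a == INT_MIN || b == INT_MIN) return false;  /* abs(INT_MIN) can itself overflow */
    int x = abs(a), y = abs(b);
    if (x > INT_MAX - y) return false;               /* x + y would exceed INT_MAX */
    *out = x + y;
    return true;
}
```

GCC and Clang also offer `__builtin_add_overflow` for the addition step, at the cost of relying on a compiler extension.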

by logicchains

0 subcomment

The concept of undefined behaviour is also a very useful lens for understanding LLM-based coding. Anything you don't explicitly specify is undefined behavior, so if you don't want the LLM to potentially pick a ridiculous implementation for some aspect of an application, make sure to explicitly specify how it should be implemented.

by reinhash

0 subcomment

Rust.

by mbrock

1 subcomments

most languages don't even HAVE a specification so in most languages literally EVERYTHING is undefined behavior

by grougnax

0 subcomment

Use Rust!

by liamd1988

0 subcomment

When using C, keep using char* and don't mess with int*

by momo26

1 subcomments

Debugging in C is soooo hard. When I was writing Malloc Lab in my systems course, there were countless undefined-behavior and out-of-range errors :(

by bullen

0 subcomment

Everything in Java is defined behaviour, you need a VM with GC to remain sane.
Everything else is a waste of time!

by ricardobeat

0 subcomment

I’ve been heavily invested in https://c3-lang.org/ the past couple months. How does it look from this perspective to someone with C experience?

by nullpwr

1 subcomments

Excellent post. But it's addressed to the wrong people.
The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.
Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.