It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
That said, if we eliminated the 70% of bugs that are memory safety issues, the signal-to-noise ratio for finding these deep logic bugs would improve dramatically. We spend so much time tracing segfaults that we miss the subtle corruption bugs.
Our bug dataset was much smaller, though, since we unfortunately had to pinpoint every bug-introducing commit ourselves. It's nice to see the Linux project uses proper "Fixes: " tags.
IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good. Unless the time represents how long it takes to reproduce and resolve a known bug, but in that case I would not say that the bug "hides" in the kernel.
The median lifetimes are fascinating. Race conditions at 5.1 years vs null-deref at 2.2 years makes intuitive sense - the former needs specific timing to manifest, while the latter will crash obviously once you hit the code path. The ones that need rare conditions to trigger are the ones that survive longest.
Just worth noting that extrapolating a 2-year average from only 28% of fix commits is a significant leap.
It's not uncommon for the bugs they found to be rediscovered 6-7 years later.
My Pixel 8 runs a stable minor release of kernel 6.1, which was released more than 4 years ago. Yes, fixes get backported to it, but the new features in 6.2 -> 6.19 stay unused on that hardware. All the major distros suffer from the same problem; most people are not running recent kernels in production.
Most hyperscalers are running old kernel versions to which they backport fixes. If you go to Linux conferences, you hear folks from big companies mentioning 4.xx, even 3.xx kernels, in 2025.
That seems frightening at first. However, the more I consider it, the more it seems... predictable.
The mental model that I find useful:
Users discover surface bugs.
Deep bugs appear only under infrequent combinations of conditions.
For some bugs to show up, new context is required.
I've observed a few patterns:
Bugs involving undefined behavior can stay hidden indefinitely.
Uncommon hardware or timing conditions matter more than logic errors.
Security flaws frequently exist long before anyone learns to exploit them.
I'm curious what other people think of this:
Do persistent bugs indicate stability or failure?
What typically leads to their discovery?
To what extent do you trust "well-tested" code?
I have a server with many peripherals and multiple GPUs. Now, I can use vfio and vfio-pci to memory-map and access their registers in user space. My question is, how could I start with kernel driver development? And I specifically mean the dev setup.
Would it be a good idea to use vfio, with or without a VM, to write and test drivers? What's the best way to debug, reload, and test changes to an existing driver?
John Gall, The Systems Bible
On a related note, I'm seeing a correlation between "level of hoopla" and a "level of attention/maintenance." While it's hard to distinguish that correlation from "level of use," the fact that CAN is so far down the list suggests to me that hoopla matters; it's everywhere but nobody talks about it. If a kernel bug takes down someone's datacenter, boy are we gonna hear about it. But if a kernel bug makes a DeviceNet widget freak out in a factory somewhere? Probably not going to make the front page of HN, let alone CNN.
One criticism of Rust (and, no, I'm not saying "rewrite it in Rust", to be clear) is that the borrow checker can be hard to use whereas many C++ engineers (in particular, for some reason) seem to argue that it's easier to write in C++. I have two things to say about that:
1. It's not easier in C++. Nothing is. C++ simply allows you to make mistakes without telling you. Getting things correct in C++ is just as difficult as in any other language, if not more so due to the language's complexity; and
2. The Rust borrow checker isn't hard or difficult to use. What you're doing is hard and difficult to do correctly.
This is why I favor cooperative multitasking and battle-tested concurrency abstractions whenever possible. For example, the cooperative async/await of Hack, and the PHP/Hack model of a single thread handling a request and then discarding everything, is virtually ideal (IMHO) for serving Web traffic.
I remember reading about Google's work on various C++ tooling, including Valgrind-based tools, which exposed concurrency bugs in their own code that had lain dormant for up to a decade. And that's Google, with thousands of engineers, some very talented at that.
Impressive results on the model; I'm surprised they improved it with very simple heuristics. Hopefully this tool will be made available to kernel developers and integrated into their workflow.
One bug is all it takes to compromise the entire system.
The monolithic UNIX kernel was a good design in the '60s; today, we should know better[0][1].
Am I the only unreasonable maniac who wants a very long-term stable, seL4-like capability-based, ubiquitous, formally-verified μkernel that rarely/never crashes completely* because drivers are just partially-elevated programs sprinkled with transaction guards and rollback code for critical multiple resource access coordination patterns? (I miss hacking on MINIX 2.)
* And you'd never need to reboot or interrupt server/user desktop activity, because the core μkernel basically never changes, since it's tiny and proven correct.