It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
That said, if we eliminated the 70% of bugs that are memory safety issues, the signal-to-noise ratio for finding these deep logic bugs would improve dramatically. We spend so much time tracing segfaults that we miss the subtle corruption bugs.
Our bug dataset was much smaller, though, since we unfortunately had to pinpoint every bug-introducing commit ourselves. It's nice to see the Linux project uses proper "Fixes: " tags.
IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good. Unless the time represents how long it takes to reproduce and resolve a known bug, but in that case I would not say that the bug "hides" in the kernel.
The median lifetimes are fascinating. Race conditions at 5.1 years vs null-deref at 2.2 years makes intuitive sense - the former needs specific timing to manifest, while the latter will crash obviously once you hit the code path. The ones that need rare conditions to trigger are the ones that survive longest.
Just worth noting that extrapolating a 2-year average from only 28% of fix commits is a significant leap.
It's not uncommon for the bugs they found to be rediscovered 6-7 years later.
My Pixel 8 runs a stable minor release of kernel 6.1, which was released more than 4 years ago. Yes, fixes get backported to it, but the new features in 6.2 -> 6.19 stay unused on that hardware. All the major distros suffer from the same problem; most people are not running recent kernels in production.
Most hyperscalers are running old kernel versions to which they backport fixes. If you go to Linux conferences, you hear folks from big companies mentioning 4.xx, even 3.xx kernels, in 2025.
That seems frightening at first. However, the more I consider it, the more it seems... predictable.
The mental model that I find useful:
Users discover surface bugs.
Deep bugs appear only under infrequent combinations of conditions.
For some bugs to show up, new context is required.
I've observed a few patterns:
Bugs involving undefined behavior can stay hidden indefinitely.
Uncommon hardware or timing conditions matter more than logic errors.
Security flaws frequently exist long before anyone learns to exploit them.
I'm curious what other people think of this:
Do persistent bugs indicate stability or failure?
What typically leads to their discovery?
To what extent do you trust "well-tested" code?
I have a server with many peripherals and multiple GPUs. Now, I can use vfio and vfio-pci to memory-map and access their registers in user space. My question is, how could I start with kernel driver development? And I specifically mean the dev setup.
Would it be a good idea to use vfio, with or without a VM, to write and test drivers? What's the best way to debug, reload, and test changes to an existing driver?
John Gall, The Systems Bible
On a related note, I'm seeing a correlation between "level of hoopla" and a "level of attention/maintenance." While it's hard to distinguish that correlation from "level of use," the fact that CAN is so far down the list suggests to me that hoopla matters; it's everywhere but nobody talks about it. If a kernel bug takes down someone's datacenter, boy are we gonna hear about it. But if a kernel bug makes a DeviceNet widget freak out in a factory somewhere? Probably not going to make the front page of HN, let alone CNN.
One criticism of Rust (and, no, I'm not saying "rewrite it in Rust", to be clear) is that the borrow checker can be hard to use whereas many C++ engineers (in particular, for some reason) seem to argue that it's easier to write in C++. I have two things to say about that:
1. It's not easier in C++. Nothing is. C++ simply allows you to make mistakes without telling you. Getting things correct in C++ is just as difficult as in any other language, if not more so due to the language's complexity; and
2. The Rust borrow checker isn't hard or difficult to use. What you're doing is hard and difficult to do correctly.
This is why I favor cooperative multitasking and battle-tested concurrency abstractions whenever possible. For example, the cooperative async/await of Hack, and the PHP/Hack model of a single thread handling a request and then discarding everything, is virtually ideal (IMHO) for serving Web traffic.
I remember reading about Google's work on various C++ tooling, including Valgrind-based tools, which exposed concurrency bugs in their own code that had lain dormant for up to a decade. And that's Google, with thousands of engineers, some very talented at that.
Impressive results on the model; I'm surprised they improved it with very simple heuristics. Hopefully this tool will be made available to kernel developers and integrated into their workflow.
One bug is all it takes to compromise the entire system.
The monolithic UNIX kernel was a good design in the '60s; today, we should know better[0][1].
Am I the only unreasonable maniac who wants a very long-term stable, seL4-like capability-based, ubiquitous, formally-verified μkernel that rarely/never crashes completely* because drivers are just partially-elevated programs sprinkled with transaction guards and rollback code for critical multiple resource access coordination patterns? (I miss hacking on MINIX 2.)
* And you'd never need to reboot or interrupt server/user desktop activity, because the core μkernel basically never changes, since it's tiny and proven correct.