Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.