- I’d really, really like to know what microcontroller family this was found on. Assuming that this is a safety processor (lockstep, ECC, etc.), it suggests that ECC was insufficient for the level of bit flips they’re seeing, and if the concern is data corruption rather than unintended restart, it means there are enough flips in one word to be undetectable. The environment they’re operating in isn’t that different from everyone else’s, so unless they ate some margin elsewhere (bad voltage corner or something), this can definitely be relevant to others. It would also be interesting to know whether it’s NVM or SRAM that’s affected.
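To make the "enough flips in one word to be undetectable" point concrete, here is a minimal sketch of a toy SECDED code (Hamming(7,4) plus an overall parity bit). It is an illustration only: the function names `encode` and `decode` are made up for this example, and real memory ECC uses much wider codewords than 8 bits. The idea it shows is general, though: one flip is corrected, two flips are detected, and three flips can be "corrected" into the wrong data without any error being reported.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (index 0 = overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d      # data bits at non-power-of-two positions
    for p in (1, 2, 4):                         # Hamming parity bits
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word) % 2                     # overall parity makes the total even
    return word


def decode(word):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i                       # XOR of the positions of all set bits
    odd = sum(word) % 2
    if syndrome == 0 and odd == 0:
        status = "ok"
    elif odd == 1:
        # Odd number of flips: the decoder ASSUMES exactly one and fixes it.
        word[syndrome] ^= 1                     # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        return "uncorrectable", None            # even number of flips, detected only
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out
```

With this toy code, `decode(flip(encode(0), 3, 5, 6))` reports "corrected" but returns 7 instead of 0, which is the silent-corruption case the comment is worried about.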
- The design of the system is very interesting, particularly how it expects to handle errors.
In '90s telco, you used to have a pair of systems, and if they disagreed, they would decide which side was bad and disable it.
In modern cloud, you accept there are errors. There's another request in ~10+ms. You only look when the error rate becomes commercially important.
My understanding of spacecraft is that there would be 3 independent implementations and they would vote.
The plane has a matrix of sensors and systems, allowing faults to be bubbled up and bad elements disabled independently.
The ADIRU does compare values to detect failures (median of 3 sensors), but they could only detect errors that last >1s. The flight computer used the raw data - because the sensors aren't interchangeable (they won't have consistent readings in all flight modes)!
Very nifty.
One thing, they say "memorisation period", I don't think it's a memorisation period? From my reading of the algorithm, it should be more "last value retention period"? Or "sensor spurious fault reading delay"?
Section 2.1 A330/A340 flight control system design
"AOA computation logic"
https://www.atsb.gov.au/sites/default/files/media/3532398/ao...
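A rough sketch of what "compare against the median, and retain the last value for a while when something looks off" could look like. This is NOT the actual A330/A340 logic from the report; the class name `SpikeFilter`, the use of the mean of sensors 1 and 2 as the output, the threshold, and counting the hold period in samples are all assumptions made for illustration.

```python
from statistics import median


class SpikeFilter:
    """Hypothetical last-value-retention filter over three redundant sensors."""

    def __init__(self, threshold, hold_samples):
        self.threshold = threshold        # max allowed deviation from the median
        self.hold_samples = hold_samples  # how long to keep the last good value
        self.last_good = None
        self.hold_left = 0

    def update(self, s1, s2, s3):
        mid = median([s1, s2, s3])
        candidate = (s1 + s2) / 2         # assumed output: mean of sensors 1 and 2
        deviates = (abs(s1 - mid) > self.threshold
                    or abs(s2 - mid) > self.threshold)
        # Start a retention period when an input disagrees with the median.
        if deviates and self.hold_left == 0 and self.last_good is not None:
            self.hold_left = self.hold_samples
        if self.hold_left > 0:
            self.hold_left -= 1
            return self.last_good         # ignore inputs while holding
        self.last_good = candidate
        return candidate
```

For example, with `SpikeFilter(threshold=5.0, hold_samples=3)`, a single spike of 50.0 on sensor 1 is masked: the filter keeps returning the previous good value for three samples and then goes back to live data. The interesting design question, which the comment hints at, is what should happen when a spike is present at the moment the retention period ends.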
- The Aviation Herald has more technical details:
https://avherald.com/h?article=52f1ffc3&opt=0
by nickdothutton
2 subcomments
- I’d just like to point out that if you are in the computing industry long enough, you will get to see a few such incidents under different circumstances, and not only in industries like aerospace. Mostly things like ECC save your a*. Sometimes your software will be able to recognise a temporary spurious reading and disregard it because you had enough alternative checking logic, or, in the case of realtime and safety-critical systems, maybe your systems can even take a vote between them. Got caught out by (CPU cache line) bit flips in the 90s, months of pain trying to track it down. Some of you will know :-)
- The aerospace industry has had countermeasures in place against bit flips for a long time, often thanks to redundancy.
Airbus/Thales's fix in this case appears to add more error checking, and to restart the misbehaving component.
https://bea.aero/fileadmin/user_upload/BEA2024-0404-BEA2025-...
("internal monitoring of the component that caused the failure;
- an automatic restart mechanism for this component as soon as the failure
is detected")
- Has BOFH vibes
"It's friday, so I get into work early, before lunch even. The phone rings. Shit!
I turn the page on the excuse sheet. "SOLAR FLARES" stares out at me. I'd better read up on that..."
by supernova87a
1 subcomments
- I wonder how the incident was diagnosed? Does the FDR record low level errors that might've contributed to this? I thought that it only recorded certain input parameters and high-level flight metrics but I'm no expert.
If a radiation event caused some bit-flip, how would you realize that's what triggered an error? Or maybe the FDR does record when certain things go wrong? I'm thinking like, voting errors of the main flight computers?
Anyway, would be very interested to know!
- There's a great postmortem about what might have been a similar SEU (single-event upset, i.e. a bit flip) here: https://www.atsb.gov.au/sites/default/files/media/3532398/ao...
by rossjudson
0 subcomment
- My armchair guess is that they had a new control pathway not properly participating in their integrity hand-off protocols, doing some kind of transformation outside of that protection.
I once saw some HW engineers go nuts trying to find out why a storage device had an error rate several orders of magnitude higher than the extremely low error rate they expected (and was triggering data corruption errors). It turned out to be one extremely deep VHDL-based control area for an FPGA that didn't properly do integrity checking. You'd have to flip a bit at an incredibly precise point in time for an error to occur, but that's what was happening. When all the math was said and done, that FPGA control path integrity miss exactly accounted for the higher error rate.
- We flew too close to the sun
by joelthelion
8 subcomments
- Do they really need to ground the entire fleet for that? One incident for ten thousand planes in the air for years. I'd think that giving airlines two months to fix it would be sufficient.
by owenthejumper
0 subcomment
- A friend works at Jetblue. They are scrambling hard to do the updates.
- I've noticed that some carriers seem to be suggesting that there might be no impact to flights, but isn't this an immediate grounding for each aircraft until the update is made?
How is it possible that this wouldn't impact upon flight schedules?
by 1970-01-01
0 subcomment
- They said the same thing at Toyota when the unintended accel problem was in the news, but never found a real world example. There are a lot more old Toyotas still on the road than Airbuses in the air, so distance to the sun makes all the difference here? I wonder if they only see issues when flying near the north pole?
- What if future aircraft had "OTA" updates to software... using this as an example of avoidable downtime.
OTA updates to cars makes me feel uneasy -- not knowing what new bugs it might introduce.
- This video shows the A320 computer and how the computer cooling system works
https://www.youtube.com/watch?v=HQuc_HhW6VA
by nubinetwork
2 subcomments
- Why would a CME disrupt a single brand and model of aircraft, when the entire planet is covered in computers that almost never have bitflip issues when a CME rolls through every few months?
- From newspaper reporting on this, they are rolling back a software update. I wonder what the original reason for the update was. How often is flight computer software updated, and why?
by ChrisArchitect
0 subcomment
- More discussion: https://news.ycombinator.com/item?id=46082296
- Intense solar radiation will be at a peak, since it is NOW the peak of the 11 year sunspot cycle.
Good related reading on this page ....
https://en.wikipedia.org/wiki/Radiation_hardening
... includes a range of mitigation effects.
(I would be interested to find out how they actually test these systems. What combinations of hardware hardening and software logic? Also, do they actually subject the system to radiation as part of the testing?)
by raverbashing
1 subcomments
- Apparently the fix is reverting to a previous version of the SW (see https://avherald.com/h?article=52f1ffc3&opt=0 )
Curious what a SW change might have done in terms of resiliency. Maybe an incorrect memory setting, or some code path that is not calculating things redundantly?
- Solar radiation like solar wind, or sunlight? They don’t say.
- This is in response to JetBlue flight 1230 from Cancun to Newark on October 30, 2025, where a cosmic ray of some kind flipped a bit and caused a dangerous situation. At the time there was a minor (G1) geomagnetic storm, meaning more cosmic rays than normal. The Planetary K-index was at 5. These are somewhat elevated numbers: enough to produce a visible aurora in Canada, but probably not even in the northernmost US. But this level of space weather is also very common. We hit G1 or higher about once a week. That's the really damning part. If it had happened in a G4 or G5 storm, then the engineers might have responded "we can't fix everything", but this level of reliability is clearly unacceptable.
by rishabhaiover
0 subcomment
- I hope Airbus only uses Honeywell or Collins in their newer planes.
- Following the Airbus A320 emergency airworthiness action, everyone will be talking about the ELAC (Elevator Aileron Computer) manufactured by Thales, which caused a sudden pitch-down without pilot input on JetBlue 1230 back in October.
So here’s everything you need to know about ELAC.
The ELAC System in the Airbus A320: The Brains Behind Pitch and Roll Control
https://x.com/Turbinetraveler/status/1994498724513345637
- This is one of the rare cases where, IMO, it makes sense to use a modified title as you've done here.
- I was traveling during this entire ordeal. My flight got delayed by 7 hours. Insane day, just now boarding my flight. American Airlines was in shambles today.