Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they do know, how is this not an urgent P0 for AWS? It would mean that one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case here, such as a dependence on in-progress transactions? I could see the failover waiting for in-flight transactions to close, hitting a timeout, and then proceeding with the other part of the failover by accident. That would explain why the issue doesn't seem more widespread.
We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so a dozen or so times a month. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.
Just one more thing to worry about I guess…
I noticed another interesting (and still unconfirmed) bug with Aurora PostgreSQL around their Zero Downtime Patching (ZDP).
During an Aurora minor version upgrade, ZDP preserves sessions across the engine restart, but it appears to also preserve stale per-session execution state (including the internal statement timer). After ZDP, I’ve seen very simple queries (e.g. a single-row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` in far less time than the configured statement_timeout GUC, and only in the brief window right after ZDP completes.
My working theory is that when the client reconnects (e.g. via PG::Connection#reset), Aurora routes the new TCP connection back to a preserved session whose “statement start time” wasn’t properly reset, so the new query inherits an old timer and gets canceled almost immediately even though it’s not long-running at all.
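If that theory holds, a crude client-side band-aid (not a fix) is to treat a `PG::QueryCanceled` that arrives implausibly fast as suspect and retry once on a brand-new connection rather than a reset. Here's a minimal sketch with the pg gem; the connection details, the `lookup_user` helper, and the `users` query are placeholders I made up, not anything Aurora-specific:

```ruby
require "pg"

# Placeholder connection details for the sketch.
CONN_PARAMS = {
  host:   "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
  dbname: "app",
  user:   "app"
}.freeze

def lookup_user(conn, id)
  conn.exec_params("SELECT id, email FROM users WHERE id = $1", [id])
rescue PG::QueryCanceled
  # A genuine statement_timeout on a single-row lookup is implausible, so treat
  # it as the suspected stale preserved session and retry once on a brand-new
  # connection (avoiding PG::Connection#reset, which might land back on the
  # same preserved session if the theory above is right).
  fresh = PG.connect(CONN_PARAMS)
  begin
    fresh.exec_params("SELECT id, email FROM users WHERE id = $1", [id])
  ensure
    fresh.close
  end
end
```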
I think the promotion fails to happen, an external watchdog notices that it didn’t, and it then kills everything ASAP because of the cluster state mismatch.
The message about the storage subsystem going away comes after the other Postgres process was kill -9’d.
They built the storage layer in total isolation, and they made sure it guaranteed correctness for exclusive writer access. When the upstream service failed to uphold its own guarantees, the data layer was still protected.
Good job AWS engineering!
Is there a case number we can reference when reaching out to AWS about this recommendation?
I find this approach very compelling. MSSQL has something similar with its Hyperscale offering. It's probably the only service in Azure that I would actually use.