Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they do know, how is this not an urgent P0 for AWS? It would mean that one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case here, such as a dependence on in-progress transactions? I could see the failover waiting for in-flight transactions to close, hitting a timeout, and then proceeding with the other part of the failover by accident. That would explain why the issue doesn't seem more widespread.
We have done a similar operation routinely on databases under pretty write-intensive workloads (tens of thousands of inserts per second). It is so routine that we have automation to adjust to planned changes in volume, and we do so a dozen or so times a month. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.
Just one more thing to worry about I guess…
I noticed another interesting (and still unconfirmed) bug with Aurora PostgreSQL around their Zero Downtime Patching (ZDP).
During an Aurora minor version upgrade, ZDP preserves sessions across the engine restart, but it appears to also preserve stale per-session execution state (including the internal statement timer). After ZDP, I’ve seen very simple queries (e.g. a single-row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` in far less time than the configured statement_timeout GUC, and only in the brief window right after ZDP completes.
My working theory is that when the client reconnects (e.g. via PG::Connection#reset), Aurora routes the new TCP connection back to a preserved session whose “statement start time” wasn’t properly reset, so the new query inherits an old timer and gets canceled almost immediately even though it’s not long-running at all.
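If that theory holds, a crude client-side band-aid (not a fix) is to treat a `PG::QueryCanceled` that arrives implausibly fast as suspect and retry once on a brand-new connection rather than a reset. Here's a minimal sketch with the pg gem; the connection details, the `lookup_user` helper, and the `users` query are placeholders I made up, not anything Aurora-specific:

```ruby
require "pg"

# Placeholder connection details for the sketch.
CONN_PARAMS = {
  host:   "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
  dbname: "app",
  user:   "app"
}.freeze

def lookup_user(conn, id)
  conn.exec_params("SELECT id, email FROM users WHERE id = $1", [id])
rescue PG::QueryCanceled
  # A genuine statement_timeout on a single-row lookup is implausible, so treat
  # it as the suspected stale preserved session and retry once on a brand-new
  # connection (avoiding PG::Connection#reset, which might land back on the
  # same preserved session if the theory above is right).
  fresh = PG.connect(CONN_PARAMS)
  begin
    fresh.exec_params("SELECT id, email FROM users WHERE id = $1", [id])
  ensure
    fresh.close
  end
end
```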
I think the promotion fails to happen, an external watchdog notices that it didn’t, and it then kills everything ASAP because of the cluster state mismatch.
The message about the storage subsystem going away comes after the other Postgres process was kill -9’d.
They built the storage layer in total isolation, and they made sure it guaranteed correctness for exclusive writer access. When the upstream service failed to uphold its own guarantees, the data layer was still protected.
Good job AWS engineering!
Is there a case number we can reference when reaching out to AWS about this recommendation?
I find this approach very compelling. MSSQL has something similar with its Hyperscale offering. It's probably the only service in Azure that I would actually use.