"0.05% of domains" is a vanity metric -- what matters is how many requests were mis-served cross-user. "Cache-Control was respected where provided" is technically true but misleading when most apps don't set it because CDN was off. The status page is more honest here too: they confirmed content without cache-control was cached.
They call it a "trust boundary violation" in the last line but the rest of the post reads like a press release. No accounting of what data was actually exposed.
- Why were they making CDN changes in prod? With their recent $100M funding, they could afford a separate environment to test CDN changes. Did their engineering team even understand surrogate keys well enough to feel confident rolling out a change in prod? I don't think they're beating the AI allegations on figuring out CDN configs; a human would not be this confident testing surrogate keys in prod.
- During and after the incident, the comms have been terrible. The initial blog post buried the lede (and didn't even have "Incident Report" in the title). They only updated it after negative feedback from their customers. I still get the impression they're trying to minimise this, which is pretty dodgy. As other comments mentioned, the post is vague.
- They didn't immediately notify customers about the security incident (people learned about it from their users). They apparently emailed affected customers only, many hours later. Some people who were affected still haven't been emailed, and they seem to have gone radio silent lately.
- Their founder on twitter keeps using their growth as an excuse for their shoddy engineering, especially lately. Their uptime for what's supposed to be a serious production platform is abysmal; they've clearly prioritised pushing features over reliability (https://status.railway.com/), and the issues I've outlined here have little to do with growth and more to do with company culture.
Honestly, I don't think railway is cut out for real production work (let alone compliance deployments), at least nothing beyond hobby projects.
Their forum is also getting heated: customers have lost revenue, had medical data leaked, etc., with no proper followup from the Railway team.
https://station.railway.com/questions/data-getting-cached-or...
From the outside, it looks like "just a cache misconfiguration," but in reality the problem is more insidious because it's distributed across multiple layers:
- application logic (authentication limitations)
- CDN behavior -> infrastructure
- default settings that users rely on (no cache headers because the CDN was disabled)
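One defense-in-depth fix at the "defaults users rely on" layer is to set an explicit Cache-Control on every authenticated response, so that no upstream layer (CDN, proxy) can legally cache it, regardless of which layers are enabled. A minimal framework-agnostic sketch as WSGI middleware; the wrapped app and the chosen header value are illustrative assumptions, not Railway's actual stack:

```python
# Hypothetical WSGI middleware: force Cache-Control: private, no-store
# on every response, overriding whatever the app (or its defaults) set.

def no_store_middleware(wsgi_app):
    def wrapper(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Drop any existing Cache-Control, then pin a safe value.
            headers = [(k, v) for k, v in headers
                       if k.lower() != "cache-control"]
            headers.append(("Cache-Control", "private, no-store"))
            return start_response(status, headers, exc_info)
        return wsgi_app(environ, patched_start)
    return wrapper
```

The point of doing this at the middleware layer is exactly the failure mode described above: the app no longer depends on the CDN being configured (or disabled) correctly.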
The hardest part of debugging these cases isn't identifying what happened, but realizing where the model is flawed: everything appears correct locally, the logs don't report any issues, yet users see completely different data.
I've seen similar cases where developers spent hours debugging the application layer before even considering that something upstream was silently changing the behavior.
These are the kind of incidents where the debugging path is anything but linear.
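When everything looks correct locally, one cheap first check on the upstream hypothesis is the response headers themselves: shared caches commonly add `Age`, and CDNs typically add hit/miss markers (`X-Cache`, `CF-Cache-Status`, and similar). A rough heuristic, with the caveat that these header names are common conventions rather than a guarantee for any particular CDN:

```python
# Heuristic: did this response plausibly come from a cache, not the origin?
# Header names are conventional (Age per RFC 9111; X-Cache / CF-Cache-Status
# are vendor habits), so treat a negative result as inconclusive.

def looks_cached(headers: dict) -> bool:
    h = {k.lower(): v for k, v in headers.items()}
    # A nonzero Age means some cache held this response for a while.
    if int(h.get("age", "0") or "0") > 0:
        return True
    # Vendor hit/miss markers.
    status = (h.get("x-cache", "") + " " + h.get("cf-cache-status", "")).upper()
    return "HIT" in status

looks_cached({"Age": "512"})       # True
looks_cached({"X-Cache": "MISS"})  # False
```

Comparing these headers between two different users' sessions would have surfaced the cross-user serving much faster than debugging the application layer.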
There are dozens of contradictions, like first they say:
“this may have resulted in potentially authenticated data being served to unauthenticated users”
and then just a few sentences later say
“potentially unauthenticated data is served to authenticated users”
which is the opposite. Which one is it?
Am I missing something, or is this article poorly reviewed?
(also looks like two versions of the 'postmortem' are published at https://blog.railway.com/engineering)
I think this is their first major security incident. Good that they are transparent about it.
If possible (@justjake), it would be helpful to understand whether there was a QA/test process before the release was pushed. I presume there was, so the question is why this was not caught. Was this just an untested part of the codebase?
I think that's already best practice in most API designs anyway?