FRESH

Hacker News

Home

Decisions that eroded trust in Azure – by a former Azure Core engineer

828 points by axelriet

by branko_d

11 subcomments

I think this is especially problematic (from Part 4 at https://isolveproblems.substack.com/p/how-microsoft-vaporize...):
"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without the full support from the management who do not fully understand the problem nor are incentivized to understand it.

by breppp

0 subcomment

I highly sympathize with the author and as a former user of Azure I agree it's a terrible mess.
However, the author has committed magnificent career suicide. If you are in a dysfunctional environment you don't go from issue to issue and escalate each one, proactively finding problematic issues.
You rather find the underlying issues (e.g. crashes not assigned) prioritize them and fix them.
By constantly whistle blowing on separate issues to as high as the board, he is not trying to improve by evolution but by revolution and in revolutions heads roll

by yoyohello13

8 subcomments

I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely being held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services it's nearly impossible to figure out what service you actually want/need without consultants, and when you finally get the services up who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.

by vintagedave

13 subcomments

What are we reading here? These are extraordinary statements. Also with apparent credibility. They sound reasonable. Is this a whistleblower or an ex employee with a grudge? The appearance is the first. Is it? They’ve put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I previously had little experience with US corporate norms. What is normal here for such a level of concern?
More, why is this public not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?

by petterroea

3 subcomments

A business man at a prior employer sympathetic with my younger, naive "Microsoft sucks" attitude told me something I remember to this day:
Microsoft is not a software company, they have never been experts at software. They are experts at contracts. They lead because their business machine exceeds at understanding how to tick the boxes necessary to win contract bids. The people who make purchasing decisions at companies aren't technical and possibly don't even know a world outside Microsoft, Office, and Windows, after all.
This is how the sausage is made in the business world, and it changed how I perceived the tech industry. Good software (sadly) doesn't matter. Sales does.
This is why most of Norway currently runs on Azure, even though it is garbage, and even though every engineer I know who uses it says it is garbage. Because the people in the know don't get to make the decision.

by Anon1096

17 subcomments

The post is so dramatized and clearly written by someone with a grudge such that it really detracts from any point that is trying to be made, if there is any.
From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it rather than trying to lay out all the dirty laundry to an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really? Azure is still chugging along apparently despite everything being mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.

by Manouchehri

5 subcomments

I've seen Azure OpenAI leak other customer's prompt responses to us under heavy load.
https://x.com/DaveManouchehri/status/2037001748489949388
Nobody seems to care.

by pRusya

1 subcomments

It's a nice read. Thank you for sharing this.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025 —most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.

by schlauerfox

3 subcomments

"For fiscal 2025, Microsoft CEO Satya Nadella earned total pay of $96.5 million, up 22% from a year earlier." -CNBC.com
and
"I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut

by OldOneEye

0 subcomment

Some previous colleague of mine has to work with Azure on their day to day, and everything explained in this article makes a lot of sense when I get to hear about their massive rantings of the platform.
12 years ago I had to choose whether to specialize myself in AWS, GCP or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow and click-ops methodology. This article confirms my suspicions at that time, and my colleague experience.

by CalRobert

8 subcomments

What makes anyone start a new project and think “I know, I’ll use Azure!”? I really don’t get it. Do they have a great sales org? Is it because a phb thinks “well they made Office so it must be good”?
I interviewed with a Dutch energy company migrating infra from AWS -to- Azure and I have no idea what would make them do that (aside from inertia, but then why use Azure in the first place?)
And for some reason Azure usage is rampant in Europe.

by einrealist

0 subcomment

Axel's engagement with the issue and refusal to give up is admirable. It also demonstrates that code and architecture remain important even in an era when managers believe these subjects can now be handled by LLMs. Imagine if LLMs were mandated for use in such an environment, further distancing SWEs from the code and overarching architectural choices. I am not saying that it can't work. But friction and maturity through experience really matters.
Also explains perfectly why I never met an engineer who was eager to run workloads on Azure. In orgs I worked, either the use of Azure was mandated by management (probably good $$ incentives) or through Microsoft leaning into the "Multi-Cloud for resilience" selling point, to get Orgs shift workloads from competitors.
Its also huge case for open (cloud) stack(s).

by nope1000

7 subcomments

> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
That is quite scary

by _pdp_

5 subcomments

The personal account makes a lot of sense, although I could easily see why the OP was not successful. Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
The second thing is that this series of blog posts (whether true or not, but still believable) provides a good introduction to vibe coders. These are people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" due to LLMs.
Writing the actual code itself (fully or partially) maybe yes. But understanding the complexity of the system and working with organisational structures that support it is a completely different ball game.

by lokar

0 subcomment

This reads pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he described. It took constant force of will, fighting, escalation, etc to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.

by ludwigvan

3 subcomments

I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality, slowness. I was in GitHub forums, helping other customers suffering from lack of basic functionality, incredible prices with abysmal performance. This article explains a lot honestly.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.

by luke5441

1 subcomments

On a leadership level it seems problematic that they ghosted the feedback. Direcly this leads to people like Axel who feel ownership of the problem to break NDAs and create company harming posts. In my experience they at least respond with corp speak platitudes meaning that they got the feedback and don't understand it or ignore it, but have been taught to always ask for feedback and answer it (but incentives are to ask for feedback, then ignore it).

by ChicagoDave

0 subcomment

I was a career Microsoft stack developer until Azure. Comparing it to AWS immediately forced me to make a decision to move away from their stack and towards AWS.
Just the networking and security infrastructure was complete trash compared to how those things worked in AWS.
Not one regret in my decision.

by kshri24

0 subcomment

The only time I used Azure was for setting up Microsoft as a provider for authentication. Put me through a never-ending loop of asking for a Government of India issued document that was already submitted. Human support was non-existent. Decided never to use Azure in any product after that horrible experience.
If you cannot even get auth right I shudder to think what the rest of the product will be like to deal with should issues arise.

by nosefrog

0 subcomment

We run 1000s of machines in Azure. It's garbage. Very few features work. Nodes are always having strange issues, especially on the networking side. And the worst part is that Azure support has 0 interest in actually debugging things. We just got out of an outage today caused by the insanely slow SSDs that they attach to their postgres dbs by default.

0 subcomment

by arccy

2 subcomments

from part 2:
> Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
Rust really going for the node ecosystem's crown in package number bloat

by andyjohnson0

1 subcomments

This reads like it was written by the Cleverest Person in the Room. I have to use Azure Devops at work, and some of the critique of Azure rings true for me, but the author-centric presentation was quite off-putting.

by throwawayslop12

1 subcomments

Power Platform is of the same quality, I’d avoid it if possible.
I was a principal engineer in the Power Platform org and it always felt like a disorganized mess. Multiple reorganizations per year, changing priorities and service ownership.

by egorfine

0 subcomment

> Cutler’s intent was to produce a system with the same level of quality, unshakable reliability, and attention to detail he was famous for in his work on VMS and NT.
I'm not sure whether this is serious or irony.

by egorfine

0 subcomment

> was maintaining in-memory caches containing unencrypted tenant data, all mixed in the same memory areas, in violation of all hostile multi-tenancy security guidelines
Splitting caches to different isolated memory areas will not make shareholders happy, will not lead to promotion and will not even move the project forward.
Simply put, designing secure software is detrimental in that environment.

by Frannky

2 subcomments

I was always very curious why people are using azure. Clunky difficult to setup and crazy prices. I know a person being very happy with them because of the credits they gave it to him. I felt I probably don't have a model that explains what is going on there and that would be cool to know why people pay them vs the competion

by alex0ptr

0 subcomment

I’ve been working with Azure and Azure Germanyfor the past years and have a strong history with AWS.
I cannot count how many times disks were not attaching during AKS rescheduling. We build polling where we polled Entra Id for minutes until it became “eventually” consistent - not trusting a service principal until it was fetched at least one minute consistently. The slowness of Azure Functions was unbearable. On Azure germany IoT Hubs had to be “rebooted” by support constantly - which was a shocking statement in itself. The docs always lying or leaving out critical parts. The whole Premium vs Standard stuff is like selling windows licenses. The role model and UI is absolutely inconsistent.
The stability, consistency of IAM, and speed of AWS in comparison makes me truly wonder how anyone stays with Azure. One reason might be that the Windows instances are significantly cheaper though..

by outworlder

0 subcomment

Well, part 3 at least explains something I've observed; the platform is incredibly unstable. The same calls, with the same parameters, will often randomly fail with HTTP 400 errors, only to succeed later(hopefully without involving support). That made provisioning with terraform a nightmare.
I won't even dive too much into all the braindead decisions. Mixing SKUs often isn't allowed if some components are 'premium' and others are not, and not everything is compatible with all instances. In AWS, if I have any EBS volume I can attach it to any instance, even if it is not optimal. There's no faffing about "premium SKUs". You won't lose internet connectivity because you attached a private load balancer to an instance. Etc...
At my company, I've told folks that are trying to estimate projects on Azure to take whatever time they spent on AWS or GCP and multiply by 5, and that's the Azure estimate. A POC may take a similar amount of time as any other cloud, but not all of the Azure footguns will show themselves until you scale up.

by manmal

0 subcomment

For some reason, MS is still doing well. I’m not sure what conclusions I should draw from that, other than big businesses are hard to kill?

by auggierose

0 subcomment

I tried to use Azure once (more than 5 years ago), and the signing up kept crashing on me for hours. Never used it again since then. Some things are obvious.

by abtinf

0 subcomment

So this is why GitHub is having so many problem…

by bradleyankrom

1 subcomments

My most memorable anecdote from working in Azure is that they had two products named Purview and the internal MS people I talked to never figured out which one I was trying to use.

by gnabgib

1 subcomments

Title: How Microsoft Vaporized a Trillion Dollars

by xxxboxxx

0 subcomment

We signed up to go all-in on Azure because our CEO got an xbox to take home to his kids.

by Dansvidania

0 subcomment

I have been in a Microsoft adjacent company (meaning lots of people bounced to and from Microsoft to it) and all this makes a lot of sense. The almost ideological “everything in house” and politically oriented philosophy they had fits like a glove. Some of the ex Microsoft people hated it, some of them missed it. But the picture they made was pretty bleak.
Given how windows is going what’s described in the article doesn’t seem so shocking either. Even though they need not be correlated products, I can’t help but seeing a similar shortsightedness in the playbooks they are adopting.

by jamwhite

2 subcomments

This read was a blast from the past. I'm not going to comment on much from OP and instead give a little of my experience there.
Straight out of college in 2017 I joined the Compute Fabric Controller (FC) org as a SWE on an absolutely wonderful team that dealt with mostly container management, VM and Host fault handling & repair policies, and Fabric to Host communication with most of our code in the FC. I drove our team's efforts on the never ending "Unhealthy" node workstream, the final catch-all bucket in the Host fault handler mentioned in OP. I also did heavy work in optimizing repair policies, reactive diagnostics for improved repairs and offline analysis, OS and HW telemetry ingestion from the Host like SEL events into the repair manager in real time, wrote the core repair manager state machine in the new AZ level service that we decoupled from the Fabric, drove Kernel Soft Reboot (KSR)/Tardigrade as a repair action for minimal VM impact for some host repairs, and helped stand up into eventually owning a new warm path RCA attribution service to help drive the root underlying causes of reliability issues and feed some offline analysis back into the live repair manager.
The work was difficult but also really really interesting. For example, Balancing repair policies around reliability is tricky. There's a constant fight in repair policies in grey situations between minimizing total VM downtime vs any VM interruptions/reboots/heals at all, because the repair controller doesn't have perfect information. If telemetry is pointing to VMs being degraded or down on the host, yet in reality they're not, we are the ones inducing the VM downtime by performing an impactful repair. If we wait a little while before taking an impactful repair action, it may be a transient issue that will resolve itself in the moment, at which point we can do much less impactful repairs after like Live Migration if the host is healthy enough. On the flipside, if some telemetry is saying the VMs are up yet they're down in reality and we just don't know it yet, taking time to collect diagnostics and then take a repair action(s) leads to only more overall total downtime.
When I joined in 2017 our team was 7 or 8 people including myself, yet had enough work for at least double that amount of people. On-call was a nightmare the first 2 years. Building Azure back then was like trying to build a car while already sitting behind the steering wheel of that car as it was already barreling down the highway. Everyone on my immediate team the first couple years were a joy to work with, highly competent, hard working, and all of us working absurd hours. For me 60hrs/wk was avg, with many weeks ~80 and a few weeks ~100. Other than the hours though, it was a splendid team environment and I'd like to think we had good engineering culture within our team, though maybe I'm biased. Engineering culture and quality did however vary substantially between orgs and teams. We were heavily under resourced and always needed more headcount, as did nearly every other team in Azure Compute. That never changed during my tenure even though my team's size ballooned to ~20 by 2020, and eventually big enough to where we had to split the team. There was high turnover from the lack of headcount and overwork which was somewhat alleviated by lowering the hiring bar... which obviously opened up another can of worms. This resourcing issue might explain, in part, why Azure is the way it is. We were always playing catchup as a result of the woes of chronic understaffing for years. I eventually burnt out which turned into spiraling mental health, physical health issues, constant panic attacks, and then a full blown mental health crisis after which I took LOA and eventually left the company. I came back briefly for a bit during LOA, and learned that the RCA service I'd built with the help of a coworker (who also left Azure) and was only a small part of our overall workload, had turned into a full fledged team of 9 people dedicated to working on that service in my absence. I know that stating some of this might affect my employment in the future but I don't really care. I know I'm not alone in experiencing burnout working in Azure. It wasn't my manager's fault either, he was amazing. He'd often ask and I would incorrectly yet confidently reassure him that I wasn't burning out but I simply didn't notice the signs. Things are better now though and I'm just happy to be here.
Kudos to the many brilliant people I worked alongside there, I hope you're all doing great.

by physhster

0 subcomment

Thanks for that, now I have a rock-solid argument when people say "oh we're already Microsoft customers, we'll just use Azure, it's easier, and they have Active Directory!!"

by Bjartr

0 subcomment

What a fascinating view into how the sausage is made

by plantain

6 subcomments

I just do not understand how Azure has the scale it does. You only need to login and click around for a bit to see this is not a coherent system designed by competent people. Let alone try and actually build something on it.
Who are the customers? Who is buying this shit?

by thelastgallon

1 subcomments

> That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
> My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
> I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node
This is most corporates. I'm sure this was celebrated as as a successful project and congratulations to everyone, along with big bonuses, RSU, raises, and promotions, mostly to other orgs to bring this kind of 'success' to other projects (or other companies). These people mostly are gone in less than 2 years. They continue to take 'wins'.
The VPs are dumb as shit, but they need 'successful' projects that have fancy names that they can present to their exec team.
The 173 agents are to give wins to a large number of people and teams, all these people contributed to this successful project.
If it continues, there will be a lessons learned powerpoint, followed by 10x growth in headcount, promotions to everyone and double down. 270 people can deliver a baby in 1 day and all that.

by h6d_100c

0 subcomment

This makes it extra silly to trust that Github won't train on your private repos, if they haven't already - just by accident

by debarshri

0 subcomment

Substack is having its moment. First, deepdelver, now this.

by sakesun

1 subcomments

> Few engineers could reliably build the software locally
I've just listened to Longhorn story on Monday and have heard the same thing.

by Yoofie

0 subcomment

> Few engineers could reliably build the software locally; debugger usage was rare (I ended up writing the team's first how-to guide in 2024); and automated test coverage sat below 40%.
A key clue and explains why so much of what Microsoft puts out is garbage. Wow.

0 subcomment

by purpleidea

0 subcomment

Microsoft Azure has always been a clown show. I've found so many obvious bugs. The quality is not there and never will be. No serious companies rely on it. Use virtually any other vendor or host it yourself.

0 subcomment

by acedTrex

0 subcomment

This is an insanely blunt look into some serious issues with microsoft.

by physhster

0 subcomment

This reads like Google culture too...

0 subcomment

by goodpoint

0 subcomment

"isolveproblems", really?

by momo_dev

0 subcomment

i run fastapi APIs on linode with cloudflare in front and honestly the simplicity is underrated. predictable billing, docs that match reality, no surprise platform regressions. for a straightforward API workload the hyperscaler tax doesn't make sense unless you genuinely need their scale

0 subcomment

by jimbobimbo

0 subcomment

When things must be shipped quickly, shit breaks and corners are cut; large orgs are full of disfunction. Not sure if such insight was worth of setting your own career on fire.

by g_host

1 subcomments

"The company formalized the idea that defects could be fixed through human intervention on live production systems" (From Part 5).
Uh...yeah. I think we all realized that years ago.

by vsgherzi

1 subcomments

I've said it before and I'll say it again. I'm glad rust has good package management I really am. However given that aspect, it ends up forming a dependency heavy culture. In situations like this it's hard to use dependencies due to the amount of transitive dependencies some items pull in. I really which this would change. Of course this is a social problem so I don't expect a good answer to come of this....

by nalekberov

0 subcomment

At this point, it’s very clear that people nowadays choose Rust mostly to be part of the cult rather than clearly understanding its shortcomings and advantages over languages such as C, C++. It has gotten to the point that some devs after watching a YouTube video criticizing C++ for two hours, announce C++ the worst programming language. Unfortunately, such people become decision makers at giant tech companies too.

by g_host

0 subcomment

"The company formalized the idea that defects could be fixed through human intervention on live production systems"
Uh...yeah. I think we all realized that years ago.

by brcmthrowaway

1 subcomments

What an epic takedown.
Microsoft should have promoted this guy instead of laying him off.
Did Microsoft really lose OpenAI as a customer?

by xyst

0 subcomment

til: there’s individuals/people that "trusted" azure at all
I only used that shit platform because some Microsoft consultant convinced idiotic C-suite that Azure was the future.

by axelriet

3 subcomments

A former Azure Core engineer’s 6-part account of the technical and leadership decisions that eroded trust in Azure.

by summitwebaudit

0 subcomment

[dead]

by patrickRyu

0 subcomment

[flagged]

by ryguz

0 subcomment

[dead]

by JackSmith_YC

0 subcomment

[dead]

by gamblor956

1 subcomments

TLDR: It turns out that Nadella despite being an engineer is actually quite bad at managing engineering. Who would have thought?

0 subcomment

by ok123456

0 subcomment

[flagged]

by shmoil

1 subcomments

[flagged]

by andrewstuart

1 subcomments

Any complex system - and these cloud systems must be immensely complex - accumulate cruft and bloat and bugs until the entire thing starts to look like an old hotel that hasn’t been renovated in 30 years.

by pavlov

1 subcomments

The first couple of paragraphs felt like a parody of a guy who goes to a diner and gets upset the waitress doesn’t address him as Dr.
It didn’t get any better.

by elankart

1 subcomments

I've worked in Windows for many many years, no idea who this guy is. He is randomly name dropping. He wants attention.