Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.


>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.

Not very many people realize that there are some services that still run only in us-east-1.


Call it the aws holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.


imagine if the electricity supplier too that stance.


That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.

The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.


> But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.

it is fine because the electricity supplier is so good today that people don't see it going down as a risk.

Look at south africa's electricity supplier for a different scenario.


But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power and since its too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in


No, that's not the stance for electrical utilities (at least in most developed countries, including the US): the vast majority of weather events cause localized outages (the grid as a whole has redundancies built in; distribution to (residential and some industrial) does not. It expects failures of some power plants, transmission lines, etc. and can adapt with reserve power, or, in very rare cases by partial degradation (i.e. rolling blackouts). It doesn't go down fully.


Spain and Portugal had a massive power outage this spring, no?


Yeah, and it has a 30 page Wikipedia article with 161 sources (https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...). Does that seem like a common occurrence?


> Sometimes weather or a car wreck takes out power

Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.

Anything on a metro-sized level is pretty much unheard of, and will be treated as serious as a plane crash. They can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.

Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.


I think this is right, but depending on where you live, local weather-related outages can still not-infrequently look like entire towns going dark for a couple days, not streets for hours.

(Of course that's still not the same as a big boy grid failure (Texas ice storm-sized) which are the things that utilities are meant to actively prevent ever happening.)


The electric grid is much more important than most private sector software projects by an order of magnitude.

Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.


What if the electricity grid depends on some AWS service?


That would be circular dependency.


The grid actually already has a fair number of (non-software) circular dependencies. This is why they have black start [1] procedures and run drills of those procedures. Or should, at least; there have been high profile outages recently that have exposed holes in these plans [2].

1. https://en.wikipedia.org/wiki/Black_start 2. https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...


And?


It doesn't though? Weird what if


You'd be surprised. See. GP asks a very interesting question. And some grid infra indeed relies on AWS, definitely not all of it but there are some aspects of it that are hosted by AWS.


I worked for an energy tech startup that did consulting for big utility companies. It absolutely does.


do you know for sure? And if not yet, you can bet someone will propose it in the future. So not a weird what if at all


This is already happening. I have looked at quite a few companies in the energy space this year, two of them had AWS as a critical dependency in their primary business processes and that could definitely have an impact on the grid. To their defense: AWS presumably tests their fall-back options (generators) with some regularity. But this isn't a farfetched idea at all.


Isn't that basically Texas?


Texas is like if you ran your cloud entirely in SharePoint.


Let's not insult SharePoint like that.

It's like if you ran you cloud on an old dell box in your closet while your parent company is offering to directly host it in AWS for free.


Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.


Fortunately nearly all services running on AWS aren't as important as the electric utility, so this argument is not particularly relevant.

And regardless, electric service all over the world goes down for minutes or hours all the time.


Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?

Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.

Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.

This is why "risk assessments" are a thing.


> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".


> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.

In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"


> nobody of any competency runs stuff on AWS complacently.

I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management ; and that's especially true when compared to TSOs.

That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.

> As the meme goes "one does not simply just run something on AWS"

The meme has currency for a reason, unfortunately.

---

That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment ; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.


Nobody was suggesting that loss of utilities is a result of bad risk management. We are saying that all competent businesses run risk management and for most businesses, the cost of AWS being down is less than the cost of going multi cloud.

This is particularly true when Amazon hand out credits like candy. So you just need to moan to your AWS account manager about the service interruption and you’ll be covered.


> imagine if the electricity supplier too that stance.

Imagine if the cloud supplier was actually as important as the electricity supplier.

But since you mention it, there are instances of this and provisions for getting back up and running:

* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...

* https://en.wikipedia.org/wiki/Northeast_blackout_of_2003


and how many times have aws gone down majorly like that? I think you wouldn't be able to count it.


* https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...

As someone who lives in Ontario, Canada, I got hit by the 2003 grid outage, which is once in >20 years. Seems like a fairly good uptime to me.

(Each electrical grid can perhaps be considered analogous to a separate cloud provider. Or perhaps, in US/CA, regions:

* https://en.wikipedia.org/wiki/North_American_Electric_Reliab...

)


It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.


Well technically AWS has never failed in wartime.


I don't understand, peacetime?


Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.


Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.

AWS (over-)reliance is insane...


What about being actively attacked by multinational state or an empire? Does it count or not?

Why people keep using "nation-state" term incorrectly in HN comments is beyond me...


I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?


As someone with a political science degree whose secondary focus was international relations, "Nation-state" has a number of different, definitions, an (despite the fact that dictionaries often don't include it), one of the most commonly encountered for a very long time has been "one of the principle subjects of international law, held to possess what is popularly, but somewhat inaccuratedly, referred to as Westphalian sovereignty" (there is a historical connection between this use and the "state roughly correlating with single nation" sense that relates to the evolution of “Westphalian sovvereignty” as a norm, but that’s really neither here nor there, because the meaning would be the meaning regardless of its connection to the other meaning.)

You almost never see the definition you are referring used except in the context of explicit comparison of different bases and compositions of states, and in practice there is very close to zero ambiguity which sense is meant, and complaining about it is the same kind of misguided prescriptivism as (also popular on HN) complaining about the transitive use of "begs the question" because it has a different sense than the intransitive use.


It sounds more technical than “country” and is therefore better


To me it sounds more like saying regime instead of government, gives off a sense of distance and danger.


Not really: nations state level actor: a hacker group funded by a country, not necessarily directly part of that country's government but at the same time kept at arms length for deniability purposes. For instance, hacking groups operating from China, North Korea, Iran and Russia are often doing this with the tacit approval and often funding from the countries they operate in, but are not part of the 'official' government. Obviously the various secret services in so far as they have personnel engaged in targeted hacks are also nation state level actors.


It could be a multinational state actor, but the term nation-state is the most commonly used, regardless of accuracy. You can argue over whether of not the term itself is accurate, but you still understood the meaning.


It makes a lot more sense if they had a typo of peak


Its a different kind of outage when the government disconnects you from the internet. Happens all the time, just not yet in the US.


> Not very many people realize that there are some services that still run only in us-east-1.

The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.

During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.

Ironically, our observability provider went down.


> there are some services that still run only in us-east-1.

What are those ?


I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.

It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.


> Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.

IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.


> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

Absurd claim.

Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.


It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.

Of course there are cases when multi cloud makes sense, but it is in the minority. The absurd claim is that most companies should worry cloud outages and plan for AWS to go offline for ever.


> If your company is in anything finance-adjacent or critical infrastructure

GP said:

> most companies

Most companies aren't finance-adjacent or critical infrastructure


> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing

That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.

But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.

Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.


> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

This describes, what, under 1% of companies out there?

For most companies the cost of being multi-region is much more than just accepting with the occasional outage.


I worked for a fortune 500, twice a year we practiced our "catastrophe outage" plan. The target SLA for recovering from a major cloud provider outage was 48 hours.


I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.


Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.


Exactly this!

One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.

And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.


It is like discussing zombie apocalypse. People who are invested in bunkers will hardly understand those who are just choosing death over living in those bunkers for a month longer.


> Planning for an AWS outage […]

What about if your account gets deleted? Or compromised and all your instances/services deleted?

I think the idea is to be able to have things continue running on not-AWS.


This. I wouldn't try to instantly failover to another service if AWS had a short outage, but I would plan to be able to recover from a permanent AWS outage by ensuring all your important data and knowledge is backed up off-AWS, preferably to your own physical hardware and having a vague plan of how to restore and bring things up again if you need to.

"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.


This is planning future based on best things in the past. Not completely irrational and if you can't afford plan B, okayish.

But thinking AWS SLA is granted forever, and everyone should put all its eggs in it because "everyone do it" is neither wise not safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.

Nothing is forever. Not the Roman empire, not Inca empire, not china dynasties, not USA geological supremacy. That's not a question of if but when. It doesn't need to be through a lot of suffering, but if we don't systematically organise for a humanity which spread well being for everyone in a systematically resilient way, we will make it through more lot of tragic consequences when this or that single point of failure finally falls.


Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.

Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.


This depends on the scale of company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses its a waste to plan for a once per 3-10 year event. If you’re a large or legacy firm the above costs are trivial and in some cases it may become a fiduciary risk not to take it seriously.


And if you're in a regulated industry it might even be a hard requirement.


Using AWS instead of a server in the closet is step 1.

Step 2 is multi-AZ

Step 3 is multi-region

Step 4 is multi-cloud.

Each company can work on it's next step, but most will not have positive EROI going from 2 to 3+


Multi-cloud is a hole in which you can burn money and not much more


We started that planning process at my previous company after one such outage but it became clear very quickly that the costs of such resilience would be 2-3x hosting costs in perpetuity and who knows how many manhours. Being down for an hour was a lot more palatable to everyone


What if AWS dumps you because your country/company didn't please the commander in chief enough?

If your resilience plan is to trust a third party, that means you don't really care about going down, does it?

Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.


I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.

Lesson here is that your approach will depend on your industry and peers. Every market will have their won philosophy and requirements here.


Sure, if your blog or whatever goes down who cares. But otherwise you should thinking about disaster planning and resilience.

AWS US-East 1 has many outages. Anything significant should account for that.


My website running on an old laptop in my cupboard is doing just fine.


I have this theory of something I call “importance radiation.”

An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.

Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.


That's a great concept. It explains a lot, actually!


When your laptop dies it's gonna be a pretty long outage too.


I will find another one


> to the tune of a few hours every 5-10 years

I presume this means you must not be working for a company running anything at scale on AWS.


That is the vast majority of customers on AWS.


Ha ha, fair, fair.


> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".


Isn't the endpoint of that kind of thinking an even more centralized and fragile internet?


To be clear, I'm not advocating for this or trying to suggest it's a good thing. That's just reality as I see it.

If my site's offline at the same time as the BBC has front page articles about how AWS is down and it's broken half the internet... it makes it _really_ easy for me to avoid blame without actually addressing the problem.

I don't need to deflect blame from my customers. Chances are they've already run into several other broken services today, they've seen news articles about it, and all from third parties. By the time they notice my service is down, they probably won't even bother asking me about it.

I can definitely see this encouraging more centralization, yes.


More like 2-3 times per year and this is not counting smaller outages or simply APIs that don't do what they document.


> APIs that don’t do what they document

Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.

This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.


While I don't know your specific case, I have seen it happen often enough that there are only two possibilities left:

  1. they are idiots 
  2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).


Telefonica is moving it 5G core network to AWS

https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...

A few hours could be a problem.

Not to mention it creates valuable a single point of failure for a hostile attack.


> tune of a few hours every 5-10 years

You know that’s not true, is-east-1 last one was 2 years ago? But other services have bad days and foundational one drag others a long


It’s even worse than that - us-east-1 is so overloaded, and they have roughly 5+ outages per year on different services. They don’t publish outage numbers so it’s hard to tell.

At this point, being in any other region cuts your disaster exposure dramatically


We don’t deploy to us-east but still so many of our API partners and 3rd party services were down a large chunk of the service was effectively down. Including stuff like many dev tools


Depends on how serious you are with SLA's.


Been doing this for about 8 years and I've worked through a serious AWS disruption at least 5 times in that time.


Depends on the business. For 99% of them this is for sure the right answer.


In before meteor strike takes a AWS region and they cant restore data.


It seems like this can be mostly avoided by not using us-east-1.


Maybe; but Parlar had no plan and are now nothing....because AWS decided to shut them off. Always have a good plan...


Thank you for illustrating my point. You didn't even bother to read the second paragraph.


> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt


Is that also your contingency plan for 'user uploads objectionable content and alerts Amazon to get your account shut down'?

Make sure you let your investors know.


If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.


> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.

Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.


"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.

I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.


The point is, many companies do need those nines and they count on AWS to deliver and there is no backup plan if they don't. And that's the thing I take issue with, AWS is not so reliable that you no longer need backups.


My experience is that very few companies actually need those 9s. A company might say they need them, but if you dig in it turns out the impact on the business of dropping a 9 (or two) is far less than the cost of developing and maintaining an elaborate multi-cloud backup plan that will both actually work when needed and be fast enough to maintain the desired availability.

Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.


HN denizens are more often than not founders of exactly those companies that do need those 9's. As I wrote in my original comment: the founders are usually shocked at the thought that such a thing could happen and it definitely isn't a conscious decision that they do not have a fall-back plan. And if it was a conscious decision I'd be fine with that, but it rarely is. About as rare as companies that have in fact thought about this and whose incident recovery plans go further than 'call George to reboot the server'. You'd be surprised how many companies have not even performed the most basic risk assessment.


We all read it.. AWS not coming back up is your point on nat having a backup plan?

You might as well say the entire NY + DC metro losses power and "never comes back up" What is the plan around that? The person replying is correct, most companies do not have a actionable plan for AWS never coming back up.

I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).


I get you. I am with you. But isn't money/resources always a constraint to have a solid backup solution?

I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!

I've got to admit though, whenever I hear about having a backup plan I think having apples to apples copy elsewhere which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffice.

Also I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi cloud isn't worth it. He worked for AWS (at the time at least) so I should have probably reached to the salt dispenser.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: