One unexpected upside of moving from a DC to AWS is that when a region is down, customers are far more understanding. Instead of being upset, they often shrug it off since nothing else they needed/wanted was up either.
This is a remarkable and unfair truth. I have had this experience with Office365...when they're down a lot of customers don't care because all their customers are also down.
It took me so long to realise this is what's important in enterprise. Uptime isn't important, being able to blame someone else is what's important.
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault.
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done; it's such a complex outcome that even the amazing boffins at AWS/Google/Microsoft/Cloudflare/etc. can't cope.
If the CEO is down at a different time than the CEO's buddies, then it's Dave/Charlie/Bertie/Alice who can't cope, and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been for as long as the owner and the CTO have been separate people.
A slightly less cynical view: execs have a hard filter for “things I can do something about” and “things I can’t influence at all.” The bad ones are constantly pushing problems into the second bucket, but there are legitimately gray area cases. When an exec smells the possibility that their team could have somehow avoided a problem, that’s category 1 and the hammer comes down hard.
After that process comes the BS and PR step, where reality is spun into cotton candy that makes the leader look good no matter what.
As they say, every cloud outage has a silver lining.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one or know about it in advance, making it extra effective
* Quickly map which of the things you use depend on a single AWS region with no capability to change or re-route (a rough sketch of one way to build that map follows this list)
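On that last point, one rough way to build the map ahead of time, as a sketch: resolve the hostnames of the external services you depend on and check which AWS region their IPs fall into using Amazon's published ip-ranges.json. The hostnames below are hypothetical placeholders; this only catches dependencies that resolve directly to AWS addresses, so anything behind a CDN or a vendor's own proxy won't show up.

```python
# Sketch: map external dependencies to AWS regions using Amazon's published IP ranges.
# The hostnames below are placeholders -- substitute the services you actually rely on.
import ipaddress
import json
import socket
import urllib.request

DEPENDENCIES = ["api.example-vendor.com", "webhooks.example-partner.io"]  # hypothetical

# Amazon publishes its IP ranges, tagged with region and service, at this URL.
with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
    prefixes = json.load(resp)["prefixes"]

networks = [(ipaddress.ip_network(p["ip_prefix"]), p["region"]) for p in prefixes]

for host in DEPENDENCIES:
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(host))
    except OSError:
        print(f"{host}: could not resolve")
        continue
    regions = {region for net, region in networks if ip in net}
    print(f"{host} -> {', '.join(sorted(regions)) or 'not in AWS IP ranges'}")
```

The outage itself is still the more thorough audit, of course, since it also exposes the indirect dependencies you didn't know you had.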
This still happens in some places. In various parts of Europe there are legal obligations not to email employees out of hours if it is avoidable. Volkswagen famously adopted a policy in Germany of only delivering new email to most of their employees from 30 minutes before the start of the working day until 30 minutes after the end, with delivery switched off at weekends too. You can leave work on Friday and know you won't be receiving further emails until Monday.
I was once told that our company went with Azure because when you tell the boomer client that our service is down because Microsoft had an outage, they go from being mad at you, to accepting that the outage was an act of god that couldn’t be avoided.
Is us-east-1 really only as unstable as the other regions? My impression was that Amazon deploys changes to us-east-1 first, so it's the most unstable region.
I've heard this so many times and not seen it contradicted, so I started saying it myself. Even my last Ops team wanted to run some things in us-east-1 to get advance warning before the same changes broke us-west-1.
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
I do know from previous discussions that some companies are in us-east-1 because of business partnerships with other inhabitants, and if one moves out, the costs and latency go up. So they are all stuck in this boat together.
Still, if you can find a place in your code where crossing a region hurts less, it would make some sense to move some of your services to a different region (rough sketch below).
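To be concrete, for the pieces that can tolerate the hop, the move is often just a region parameter on the client. A minimal boto3 sketch, where the resource names and the choice of regions are made up for illustration:

```python
# Sketch: keep latency-sensitive calls near the partners in us-east-1, but home
# latency-tolerant background work in another region so it survives a us-east-1 outage.
# Resource names and regions here are hypothetical.
import boto3

# Latency-sensitive path stays with the partners in us-east-1 ...
orders = boto3.client("dynamodb", region_name="us-east-1")

# ... while a background job queue that can absorb tens of milliseconds of
# cross-region latency lives in us-west-2.
jobs = boto3.client("sqs", region_name="us-west-2")
queue_url = jobs.get_queue_url(QueueName="nightly-report-jobs")["QueueUrl"]
jobs.send_message(QueueUrl=queue_url, MessageBody='{"report": "daily-summary"}')
```

Whether this actually buys you anything depends on where the queue's consumers run; if they also sit in us-east-1, you've only changed which outage takes you down.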
While your business partners will understand that you’re down while they’re down, will your customers? You called yesterday to say their order was ready, and now they can’t pick it up?
Check the URL; we had an issue a couple of years ago with WorkSpaces. US East was down but all of our stuff was in the EU.
It turned out the default URL was hardcoded to the us-east interface, and simply going to WorkSpaces and editing the URL to point at the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
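If a hardcoded region in the URL is the culprit, the workaround described above is mechanical. A sketch assuming the common console-style pattern where the region shows up as a subdomain and/or a ?region= query parameter (the exact URL isn't given above, so the path here is purely illustrative):

```python
# Sketch: rewrite the region baked into an AWS console-style URL.
# The path and the URL pattern here are assumptions for illustration.
import re

def pin_region(url: str, region: str) -> str:
    # Swap a region-prefixed console subdomain, if present.
    url = re.sub(r"https://[a-z0-9-]+\.console\.aws\.amazon\.com",
                 f"https://{region}.console.aws.amazon.com", url)
    # Swap a ?region= query parameter, if present.
    return re.sub(r"region=[a-z0-9-]+", f"region={region}", url)

print(pin_region(
    "https://us-east-1.console.aws.amazon.com/workspaces/home?region=us-east-1",
    "eu-west-1",
))
# https://eu-west-1.console.aws.amazon.com/workspaces/home?region=eu-west-1
```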