Earlier this week, Amazon Web Services’ S3 storage service suffered an outage that affected many websites (including the popular sites used to check whether a website is down for everyone or just you!).
S3 is experiencing high error rates. We are working hard on recovering.
— Amazon Web Services (@awscloud) February 28, 2017
Unsurprisingly, this led to a lot of discussion about designing for failure – or not, it would seem, in many cases – including the architecture behind Amazon’s own status pages:
The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.
— Amazon Web Services (@awscloud) February 28, 2017
The Amazon and Azure models are slightly different, but in the past we’ve seen outages to the Azure identity system (for example) impact other Microsoft services (Office 365). When that happened, Microsoft’s Office 365 status page didn’t update because of a caching/CDN issue. It seems Amazon didn’t learn from Microsoft’s mistakes!
Randy Bias (@RandyBias) is a former Director of the OpenStack Foundation and a respected expert on many cloud concepts. Randy and I exchanged many tweets on the topic of the AWS outage but, after multiple replies, I thought a blog post might be more appropriate. You see, I hold the view that not all systems need to be highly available. Sometimes, failure is OK. It all comes down to requirements:
@randybias Depends what the system is. Not everything needs to be highly available. There’s a requirements/cost/risk trade-off
— Mark Wilson (@markwilsonit) March 1, 2017
And, as my colleague Tim Siddle highlighted:
@markwilsonit DR/multi-region/multi-cloud is expensive – and it’s always a requirement, until the cost is laid bare…..
— Tim Siddle (@tim_siddle) February 28, 2017
I agree. 100%.
@markwilsonit and of course, much depends on the application architecture itself
— Tim Siddle (@tim_siddle) February 28, 2017
So, what does that architecture look like? Well, it will vary according to the provider:
- For AWS we need to think about regions and availability zones. Each region is made up of a number of availability zones (at least two, according to the AWS glossary) – there’s a quick way to enumerate these, sketched just after this list.
- For Azure there are more regions and these are paired for availability (for example when using geo-redundant storage). In addition, each region will consist of multiple datacentre facilities.
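For illustration, the region/availability zone structure on the AWS side can be queried from the API. Here’s a minimal sketch using boto3 (assuming Python 3 and AWS credentials with permission to call the EC2 describe operations):

```python
import boto3

# Enumerate the regions visible to this account, then the availability
# zones within each one.
ec2 = boto3.client("ec2", region_name="eu-west-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in regions:
    zones = boto3.client("ec2", region_name=region).describe_availability_zones()
    names = [az["ZoneName"] for az in zones["AvailabilityZones"]]
    print(f"{region}: {', '.join(names)}")
```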
So, if we want to make sure our application can survive a region failure, there are ways to design around this. Just be ready for the solution we sold to the business on the basis of commodity cloud services to start looking rather expensive. Just as on-premises we typically have two datacentres with resilient connections, we’ll want to do the same in the cloud. But, just as not every system runs across both datacentres on-premises, the same may be true in the cloud. If it’s a service for which some downtime can be tolerated, then we might not need to worry about a multi-region architecture. In cases where we’re not at all concerned about downtime we might not even use an availability set…
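Where the requirement really is to ride out a regional failure, one of the building blocks for an S3-dependent application is cross-region replication into a bucket in a second region. A minimal sketch with boto3 follows – the bucket names and IAM role ARN are hypothetical, and the application still needs its own logic (or DNS failover) to read from the replica when the primary region is down:

```python
import boto3

# Hypothetical names – substitute your own buckets and replication role.
SOURCE_BUCKET = "my-app-data"            # primary region, e.g. us-east-1
REPLICA_BUCKET = "my-app-data-replica"   # second region, e.g. eu-west-1
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-replication-role"

s3 = boto3.client("s3")

# Versioning must be enabled on both buckets before replication will work.
for bucket in (SOURCE_BUCKET, REPLICA_BUCKET):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new object in the source bucket to the replica bucket.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "copy-everything",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{REPLICA_BUCKET}"},
            }
        ],
    },
)
```

Keeping a second copy of the data is the easy part; knowing when to switch to it, and paying for the duplicated storage and traffic, is where the cost and complexity start to bite.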
At other times – for example, if the application is a web service for which an outage would cause reputational or financial damage – there may be a requirement for higher availability. That’s where so many of the services impacted by Tuesday’s AWS outage went wrong:
No one claims 100% up time, FOR A REASON
— Jeorry Balasabas (@jeorryb) February 28, 2017
And understand it when designing cloud solutions, still your responsibility to deliver resilience, can’t abdicate that to someone else https://t.co/VGdunBJqSH
— Paul Stringfellow (@techstringy) February 28, 2017
Amazon’s S3 outage is not just a case of getting what you paid for it’s also about getting what you designed for. Availability isn’t cheap.
— Mark Twomey (@Storagezilla) February 28, 2017
Of course, we might spread resources around regions for other reasons too – like placing them closer to users – but that comes back to my point about requirements. If there’s a requirement for fast, low-latency access then we need to design in the dedicated links (e.g. AWS Direct Connect or Azure ExpressRoute) and we’ll probably have more than one of them too, each terminating in a different region, with load balancers and all sorts of other considerations.
Because a cloud provider could itself be a single point of failure, many people are advocating multi-cloud architectures. But, if you think multi-region is expensive, get ready for some seriously complex architecture and associated costs in a multi-cloud environment. Just as many enterprises use a single managed services provider on-premises (albeit one with multiple datacentres), many of us will continue to use a single cloud provider in the cloud. Designing for failure does not necessarily mean multi-cloud.
Of course, a single-cloud solution has its risks. Randy is absolutely spot on in his reply below:
@markwilsonit Public clouds are walled gardens and create significant points of lock-in. Long term AWS is no different than Oracle software.
— Randy Bias (@randybias) March 1, 2017
It could be argued that one man’s “lock-in” is another’s “making the most of our existing technology investments”. If I have a Microsoft Enterprise Agreement, I want to make sure that I use the software and services that I’m paying for. And running a parallel infrastructure on another cloud is probably not doing that. Not unless I can justify to the CFO why I’m running redundant systems just in case one goes down for a few hours.
That doesn’t mean we can avoid designing with the future in mind. We must always have an exit strategy and, where possible, think about designing systems with a level of abstraction to make them cloud-agnostic.
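As a trivial illustration of that kind of abstraction (the interface and class names here are hypothetical), the application can code against a small storage interface rather than calling S3 directly, so that changing provider becomes an implementation swap rather than a rewrite:

```python
from abc import ABC, abstractmethod

import boto3


class ObjectStore(ABC):
    """Minimal storage abstraction the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3ObjectStore(ObjectStore):
    """Amazon S3 implementation; an Azure Blob Storage (or on-premises)
    implementation could sit behind the same interface."""

    def __init__(self, bucket: str):
        self._bucket = bucket
        self._s3 = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()
```

It’s a simplification, of course – real applications lean on provider-specific features that are harder to abstract away – but it keeps the exit strategy honest.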
Ultimately though it all comes back to requirements – and the ability to pay. We might like an Aston Martin but if the budget is more BMW then we’ll need to make some compromises – with an associated risk, signed off by senior management, of course.
[Updated 2 March 2017 16:15 to include the Mark Twomey tweet that I missed out in the original edit]