Designing for failure does not necessarily mean multi-cloud


Earlier this week, Amazon Web Services’ S3 storage service suffered an outage that affected many websites (including popular sites to check if a website is down for everyone or just you!).

Unsurprisingly, this led to a lot of discussion about designing for failure – or not, it would seem in many cases, including the architecture behind Amazon’s own status pages:

The Amazon and Azure models are slightly different, but in the past we’ve seen outages to the Azure identity system (for example) impact other Microsoft services, such as Office 365. When that happened, Microsoft’s Office 365 status page didn’t update because of a caching/CDN issue. It seems Amazon didn’t learn from Microsoft’s mistakes!

Randy Bias (@RandyBias) is a former Director at OpenStack and a respected expert on many cloud concepts. Randy and I exchanged many tweets on the topic of the AWS outage but, after multiple replies, I thought a blog post might be more appropriate. You see, I hold the view that not all systems need to be highly available. Sometimes, failure is OK. It all comes down to requirements:

And, as my colleague Tim Siddle highlighted:

I agree. 100%.

So, what does that architecture look like? Well, it will vary according to the provider:

So, if we want to make sure our application can survive a region failure, there are ways to design around this. Just be ready for the solution we sold to the business, based on commodity cloud services, to start looking rather expensive. Just as on-premises we typically have two datacentres with resilient connections, we’ll want the same in the cloud. But, just as not every system runs in both datacentres on-premises, the same may be true in the cloud. If it’s a service for which some downtime can be tolerated, we might not need a multi-region architecture; and where we’re not at all concerned about downtime, we might not even use an availability set.
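
To make that concrete, here’s a minimal sketch of one way to survive an S3 region failure: cross-region replication configured with boto3. The bucket names, regions and IAM role ARN below are placeholders, and a real deployment would also need the replica bucket, role and policies set up first.

```python
import boto3

# Hypothetical names – substitute your own buckets, regions and role.
PRIMARY_BUCKET = "my-app-data-primary"
REPLICA_BUCKET = "my-app-data-replica"
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-replication"

s3 = boto3.client("s3", region_name="eu-west-1")

# Replication requires versioning on the source bucket
# (and on the destination bucket, created in another region).
s3.put_bucket_versioning(
    Bucket=PRIMARY_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every object to the bucket in the second region, so a
# regional S3 outage leaves us with a usable copy of the data.
s3.put_bucket_replication(
    Bucket=PRIMARY_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{REPLICA_BUCKET}"},
            }
        ],
    },
)
```

Note that replication alone only protects the data; the application still needs a way to fail over to the replica, which is where DNS-level routing (sketched further down) comes in.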

Other times – say, if the application is a web service for which an outage would cause reputational or financial damage – we may have a requirement for higher availability. That’s where so many of the services impacted by Tuesday’s AWS outage went wrong:

Of course, we might spread resources around regions for other reasons too – like placing them closer to users – but that comes back to my point about requirements. If there’s a requirement for fast, low-latency access then we need to design in the dedicated links (e.g. AWS Direct Connect or Azure ExpressRoute) and we’ll probably have more than one of them too, each terminating in a different region, with load balancers and all sorts of other considerations.
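
As an illustration of that kind of multi-region design, here’s a sketch of latency-based DNS routing with Amazon Route 53 via boto3. The hosted zone ID, record names and load balancer endpoints are invented for the example; attaching health checks to each record would also let a failed region drop out of rotation.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # hypothetical


def latency_record(set_id: str, region: str, endpoint: str) -> dict:
    """Build a latency-routed CNAME so users reach the closest region."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Region": region,  # latency-based routing key
            "TTL": 60,
            "ResourceRecords": [{"Value": endpoint}],
        },
    }


# One record per region; Route 53 answers with the lowest-latency one.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            latency_record("eu-west-1", "eu-west-1", "lb-eu-west-1.example.com"),
            latency_record("us-east-1", "us-east-1", "lb-us-east-1.example.com"),
        ]
    },
)
```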

Because a cloud provider could be one of those single points of failure, many people are advocating multi-cloud architectures. But, if you think multi-region is expensive, get ready for some seriously complex architecture and associated costs in a multi-cloud environment. Just as many enterprises use a single managed services provider on-premises (albeit one with multiple datacentres), so many of us will continue to use a single cloud provider. Designing for failure does not necessarily mean multi-cloud.

Of course, a single-cloud solution has its risks. Randy is absolutely spot on in his reply below:

It could be argued that one man’s “lock-in” is another’s “making the most of our existing technology investments”. If I have a Microsoft Enterprise Agreement, I want to make sure that I use the software and services that I’m paying for. And running a parallel infrastructure on another cloud is probably not doing that. Not unless I can justify to the CFO why I’m running redundant systems just in case one goes down for a few hours.

That doesn’t mean we can avoid designing with the future in mind. We must always have an exit strategy and, where possible, think about designing systems with a level of abstraction to make them cloud-agnostic.
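
As a sketch of what that abstraction might look like (the interface and class names here are illustrative, not any standard library), the application depends on a small interface rather than a provider SDK:

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """What the application codes against – not boto3, not azure-storage-blob."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3BlobStore(BlobStore):
    """One concrete implementation; an AzureBlobStore would mirror it."""

    def __init__(self, bucket: str) -> None:
        import boto3

        self._bucket = bucket
        self._client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self._client.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()


def save_report(store: BlobStore, name: str, payload: bytes) -> None:
    # Application code never mentions a provider, so the exit strategy
    # becomes a new BlobStore implementation, not a rewrite.
    store.put(f"reports/{name}", payload)
```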

Ultimately, though, it all comes back to requirements – and the ability to pay. We might like an Aston Martin but, if the budget is more BMW, we’ll need to make some compromises – with an associated risk, signed off by senior management, of course.

[Updated 2 March 2017 16:15 to include the Mark Twomey tweet that I missed out in the original edit]
