A couple of months ago, Facebook released a whole load of information about its servers and datacentres in a programme it calls the Open Compute Project. At around the same time, I was sitting in a presentation at Microsoft, where I was introduced to some of the concepts behind their datacentres. These are not small operations – Facebook’s platform currently serves around 600 million users and Microsoft’s various cloud properties account for a good chunk of the Internet, with the Windows Azure appliance concept under development for partners including Dell, HP, Fujitsu and eBay.
It’s been a few years since I was involved in any datacentre operations and it’s interesting to hear how times have changed. Whereas I knew about redundant uninterruptible power supplies and rack-optimised servers, the model is now about containers of redundant servers, and the unit of scale has shifted. An appliance used to be a 1U (pizza box) server with a dedicated purpose but these days it’s a shipping container full of equipment!
There’s also been a shift from keeping the lights on at all costs, towards efficiency. Hardly surprising, given that the IT industry now accounts for around 3% of the world’s carbon emissions and we need to reduce the environmental impact. Google’s datacentre design best practices are all concerned with efficiency: measuring power usage effectiveness; measuring and managing airflow; running warmer datacentres; using “free” cooling; and optimising power distribution.
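As an aside, power usage effectiveness (PUE) is just a ratio – the total energy drawn by the facility divided by the energy that actually reaches the IT equipment – so 1.0 is the theoretical ideal. A minimal sketch, with made-up figures purely for illustration:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness = total facility energy / IT equipment energy.

    1.0 is the theoretical ideal; the closer to it, the less energy is lost to
    cooling, power distribution and everything else that isn't compute."""
    return total_facility_kwh / it_equipment_kwh

# Made-up monthly figures, purely for illustration
print(pue(total_facility_kwh=1_500_000, it_equipment_kwh=1_000_000))  # 1.5
```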
So how do Microsoft (and, presumably, others like Amazon) design their datacentres? And how can we learn from them when developing our own private cloud operations?
Some of the fundamental principles include:
- Perception of infinite capacity.
- Perception of continuous availability.
- Driving predictability.
- Taking a service provider approach to delivering infrastructure.
- Resilience over redundancy mindset.
- Minimising human involvement.
- Optimising resource usage.
- Incentivising the desired resource consumption behaviour.
In addition, the following concepts need to be adopted to support the fundamental principles:
- Cost transparency.
- Homogenisation of physical infrastructure (aggressive standardisation).
- Pooling compute resource.
- Fabric management.
- Consumption-based pricing.
- Virtualised infrastructure.
- Service classification.
- Holistic approach to availability.
- Compute resource decay.
- Elastic infrastructure.
- Partitioning of shared services.
In short, provisioning the private cloud is about taking the same architectural patterns that Microsoft, Amazon, et al. use for the public cloud and implementing them inside your own datacentre(s). Think service, not server, to develop an internal infrastructure as a service (IaaS) proposition.
I won’t expand on all of the concepts here (many are self-explanatory), but some of the key ones are:
- Create a fabric with resource pools of compute, storage and network, aggregated into logical building blocks.
- Introduce predictability by defining units of scale and planning activity based on predictable actions (e.g. certain rates of growth).
- Design across fault domains – understand what tends to fail first (e.g. the power in a rack) and make sure that services span these fault domains (see the placement sketch after this list).
- Plan upgrade domains (think about how to upgrade services and move between versions so service levels can be maintained as new infrastructure is rolled out).
- Consider resource decay – what happens when things break? Think about component failure in terms of service delivery and design for it. In the same way that a hard disk has spare sectors that are used as others are marked bad (until eventually too many fail and the disk is replaced), take a unit of infrastructure and leave faulty components in place (but disabled) until a threshold is crossed, after which the whole unit is considered faulty and is replaced or refurbished (see the decay sketch after this list).
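To illustrate the fault domain point, here’s a minimal sketch (the domain names and the round-robin placement are my own assumptions, not Microsoft’s method) of spreading a service’s instances so that no single rack-level failure takes the whole service down:

```python
# Hypothetical fault domains - e.g. racks that share a power feed
FAULT_DOMAINS = ["rack-A", "rack-B", "rack-C"]

def place_instances(service: str, count: int) -> dict[str, list[str]]:
    """Round-robin a service's instances across fault domains, so losing any
    single domain still leaves most of the instances running."""
    placement: dict[str, list[str]] = {domain: [] for domain in FAULT_DOMAINS}
    for i in range(count):
        domain = FAULT_DOMAINS[i % len(FAULT_DOMAINS)]
        placement[domain].append(f"{service}-{i}")
    return placement

print(place_instances("web-frontend", 6))
# {'rack-A': ['web-frontend-0', 'web-frontend-3'],
#  'rack-B': ['web-frontend-1', 'web-frontend-4'],
#  'rack-C': ['web-frontend-2', 'web-frontend-5']}
```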
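And a similarly rough sketch of resource decay – individual failures are simply recorded and the components disabled, and only when a threshold is crossed does the whole unit get flagged for replacement or refurbishment (the 20% threshold and the class name are illustrative assumptions on my part):

```python
class InfrastructureUnit:
    """E.g. a container or rack of commodity servers. Failed components are
    disabled in place; nobody is sent to fix them individually."""

    def __init__(self, name: str, total_components: int, decay_threshold: float = 0.2):
        self.name = name
        self.total = total_components
        self.failed: set[int] = set()
        self.decay_threshold = decay_threshold  # illustrative 20% threshold

    def mark_failed(self, component_id: int) -> None:
        self.failed.add(component_id)  # disable and carry on

    @property
    def needs_replacement(self) -> bool:
        return len(self.failed) / self.total >= self.decay_threshold

unit = InfrastructureUnit("container-42", total_components=100)
for component_id in (3, 17, 64):
    unit.mark_failed(component_id)
print(unit.needs_replacement)   # False - 3% failed, work around the dead parts
unit.failed.update(range(80, 97))
print(unit.needs_replacement)   # True - threshold crossed, replace or refurbish
```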
A smaller company with a small datacentre may still think in terms of server components, while larger organisations may be dealing with shipping containers. Regardless of the size of the operation, the key to success is thinking in terms of services, not servers, and designing public cloud principles into private cloud implementations.
Hi Mark,
I only take issue with one item and that’s Resilience over Redundancy. I would have thought that the recent AWS debacle, Microsoft’s September 2010 outage and Google’s difficulties in service continuity might have taught the industry a lesson. Resilience is about failure resistance. Redundancy is about overcoming the failure of resilience. That’s why you have two ears and two eyes. Until the industry recognises that new isn’t always best and that some well tried and tested old-fashioned principles will always apply, we will all be spending more time apologising.
Hi Robb – I kind of agree with you (i.e. we do need both) but it’s not just about hardware.
If I recall correctly, the point Microsoft was making about resilience over redundancy was not that we don’t need both, but that we no longer care about arrays of redundant disks, etc. – in this world of commodity hardware and container-sized failure units, the redundancy sits elsewhere in the architecture and, if a device fails, we just fail over to the next one (much like bad sectors on a hard disk).
The redundancy is now in software. Services need to be designed to be resilient to failure – maybe even with multiple cloud providers, certainly in a manner that guards against loss of a datacentre for a period (there’s a rough sketch of what I mean below).
Sorry if that isn’t entirely clear from the original post.
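By way of illustration, a very rough sketch of putting the redundancy in software: the client tries each deployment of a service in turn and treats the loss of an entire datacentre (or even a provider) as a normal, recoverable event. The endpoints and the retry logic here are entirely hypothetical – just enough to show the shape of the idea:

```python
import urllib.request
from urllib.error import URLError

# Hypothetical deployments of the same service in different datacentres/providers
ENDPOINTS = [
    "https://service.datacentre-a.example.com",
    "https://service.datacentre-b.example.com",
    "https://service.other-cloud.example.com",
]

def call_service(path: str = "/status") -> bytes:
    """Try each deployment in order; the loss of an entire datacentre becomes
    a failover rather than an outage, as long as one deployment is reachable."""
    last_error = None
    for endpoint in ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint + path, timeout=2) as response:
                return response.read()
        except (URLError, OSError) as error:
            last_error = error  # note the failure and move on to the next site
    raise RuntimeError(f"All deployments unreachable; last error: {last_error}")
```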