Cloud is dead. Long live the cloud!

I’ve seen a few articles recently talking about how organisations are moving workloads out of the cloud and back to their own datacentres. Sometimes they are little more than clickbait. But there is a really important discussion to be had here, so I thought I’d lift the lid on this topic and have a look at what I think is really going on.

The promise of the cloud

Cloud is great for many things. On-demand access to vast amounts of computing and storage resources, on a pay-as-you-go basis. Brilliant. No need for capital investment. Just pay for what you use.

Except that’s not how all businesses work. At least not for all application workloads and data sets.

Possibly the most famous of these “we found cloud expensive and moved back on-prem” articles is David Heinemeier Hansson (@DHH)’s why we’re leaving the cloud post for 37 Signals, written in 2022. In that post, DHH says that renting someone else’s computers didn’t work for his business. He describes 37 Signals as a “medium-sized business with stable growth”. But, I’m willing to bet that most of the readers of this post are not running SaaS applications in AWS for a global audience of B2B and B2C customers. Some will be, but most of my clients are not.

In fact, in his video on kicking cloud to the curb [sic], David Linthicum (@DavidLinthicum) flags that SaaS providers will scale in a repeated pattern, whereas enterprise [and SME] workloads scale differently. Cloud still has a place for most organisations. DHH’s follow-up post (the Big Cloud Exit FAQ) is worth a read too. Just remember that most businesses don’t follow the profile of 37 Signals. And that 37 Signals are still using co-lo facilities (because building new datacentres in 2024 is a very brave move, unless you are a hyperscaler).

But you’re not 37 Signals

In 2024, I would seriously question why anyone is running their own office productivity tools (email, IM, intranet, etc.) on-premises. There are many services that can do this for you on a per-person-per-month basis. And they will have better up-time than you ever did, despite what your former email administrator tells you. Those jokes about “Microsoft 364” whenever there’s a blip in the matrix… how much more did you spend on storage to make sure that you got to even 99.5% availability in your Exchange servers?

But let’s move on past the “low-hanging fruit” that can relatively easily be replaced by SaaS. Let’s have a look at all those other applications that actually run your business: the finance system; the case management system; the modern data platform; the reporting and analytics; the years and years of accumulated unstructured file data where no-one knows what is needed and what is not. (“The business”* says “it’s up to IT to sort out”. IT says “we don’t know what you need”. No-one agrees to blanket retention policies, just in case that file deleted after 3, 7 or 10 years is really important.)

What I’ve seen happen, time and time again, is that almost everything is moved to the cloud. I say almost, because the cloud discovery process often turns up evidence of virtual machines that were created, are no longer used, but are left running. This happens because on-premises infrastructure is seen as “paid for”. There is no cost to leaving things running. Except there is – not just in wasted processor cycles and storage, but in the size of the infrastructure that’s required.

Lifting and shifting without transformation

There are many motivations for cloud migrations but the most common I see is because the datacentre is closing. Maybe it’s the end of an outsource, maybe the site is being sold for redevelopment. But it’s nearly always “we must exit by” a particular date. No time to transform – just transition. We’ll sort it out later. Except “later” never comes. The project to move to the cloud is completed. The team is stood down. The partners are disengaged. “Phase 2” to transform the estate doesn’t have a strong enough business case** and things stay the same.

And then the cloud bills come in. They look a bit steep – especially for IaaS. You’re using more storage than you expected, and those VMs are a little pricey. Some “optimisation” is done to adjust VM sizes. Reserved Instances and other benefits are used to reduce the monthly charge.

Watch the costs rise

A year later, the prices rise. Inflation. Exchange rate variance vs. the $ or the € (depending on your provider’s base currency). That’s OK, it was always expected. Wasn’t it? Another round of “optimisation” happens. A couple of applications are no longer used, replaced by SaaS. Some VMs are switched off.

Rinse and repeat. Rinse and repeat.

A few years on, you’ve still not transformed. Those resources that you “lifted and shifted” to the cloud are, as the old adage goes, the same computers running in someone else’s datacentre.

The CFO looks at the cloud bill and says “how much?!”. It looks astronomical compared with industry norms. They bring in a new Head of IT and tell them that they have to reduce the cloud spend. “We’ll move back on-premises – where it used to cost less”, they agree.

But it’s still the same systems. With the same technical debt. And now it needs power, and water, and expensive servers and storage and… you see where we are going.

Refactor or modernise

Cloud is not a cycle, like in-source/out-source. It’s a business model. And, like all business models, you need to tune the way it’s used to get the best from it. N-tier applications running on VMs in IaaS will generally not be cost-effective. Look at how to move the presentation tier to web services. Can the application be re-factored? Could the database run in PaaS too? Often the challenge is ISV support. But it’s 2024. If your vendor doesn’t have native support for Azure or AWS, maybe it’s time to find a different vendor.

And if you’re moving to the cloud to save money, maybe it’s time to look again at your business case.

Use the cloud for innovation, not to save money

Cloud can save money. But only after the workloads are transformed. And only then with continual optimisation. The trick is to make the effort you put into transformation cost less than the savings you return through efficiencies. We can do this on-prem too, but it normally involves capital spend. And that’s another major advantage of the cloud: once you’re there, you can use it to try out new products and services without a major investment. All that AI innovation that’s happening right now? You can try it out in the cloud, for relatively little effort. Now imagine needing an investment case for the infrastructure to develop new AI models in-house. Cloud gives you agility and flexibility.

And don’t forget about efficiency

To borrow a metaphor from David Linthicum, remember that cloud is a utility. If you leave the heating and lights on at home, you can expect a big bill. It’s no different in the cloud, if you run inefficient infrastructure and applications.
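To put some (entirely illustrative) numbers on that utility metaphor, here’s a quick Python sketch. The hourly rate is hypothetical – substitute your own provider’s pricing:

```python
# Illustrative cost of leaving a forgotten VM running 24x7.
# The rate below is hypothetical - real prices vary by size, region and provider.
HOURLY_RATE = 0.20     # currency units per hour for a mid-sized VM (assumed)
HOURS_PER_MONTH = 730  # average hours in a month

monthly_cost = HOURLY_RATE * HOURS_PER_MONTH
annual_cost = monthly_cost * 12

print(f"Monthly: {monthly_cost:.2f}, Annual: {annual_cost:.2f}")
# One idle VM: 146.00 a month, 1752.00 a year - and estates often have dozens.
```

Multiply that across the evidence of unused VMs that cloud discovery exercises routinely turn up, and the “heating left on” bill adds up quickly.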

Look at the long-term viability and placement of your services, and make right-sizing decisions based on application workloads and datasets. The problem isn’t the cloud – it’s that some people are trying to use it for the wrong things.

* I used this term to be deliberately provocative. I could write a whole separate post on the concept of “the business” vs. “IT”.
** It should have. If properly thought through.

Featured image: author’s own

Amazon Web Services (AWS) Summit: London Recap

This content is 6 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

I’ve written previously about the couple of days I spent at ExCeL in February, learning about Microsoft’s latest developments at the Ignite Tour. A few weeks later, I found myself back at the same venue, this time focusing on Amazon Web Services (AWS) at the London AWS Summit (four years since my last visit).

Even with a predominantly Microsoft-focused client base, there are situations where a multi-cloud solution is required, so it makes sense for me to expand my knowledge to include Amazon’s cloud offerings. I may not have the detail and experience that I have with Microsoft Azure, but I certainly know enough to make an informed choice in my Architect role.

One of the first things I noticed is that, for Amazon, it’s all about the numbers. The AWS Summit had a lot of attendees – 12000+ were claimed, for more than 60 technical sessions supported by 98 sponsoring partners. Frankly, it felt to me that there were a few too many people there at times…

AWS is clearly growing – citing 41% growth comparing Q1 2019 with Q1 2018. And, whilst the comparisons with the industrial revolution and the LSE research showing that 95% of today’s startups would find traditional IT models limiting were all good and valid, the keynote soon switched to focus on AWS’ claims of “more”. More services. More depth. More breadth.

There were some good customer slots in the keynote: Sainsbury’s Group CIO Phil Jordan and Group Digital Officer Clodagh Moriaty spoke about improving online experiences, integrating brands such as Nectar and Sainsbury’s, and using machine learning to re-plan retail space and to plan online deliveries. Ministry of Justice CDIO Tom Read talked about how the MOJ is moving to a microservice-based application architecture.

After the keynote, I immersed myself in technical sessions. In fact, I avoided the vendor booths completely because the room was absolutely packed when I tried to get near. My afternoon consisted of:

  • Driving digital transformation using artificial intelligence by Steven Bryen (@Steven_Bryen) and Bjoern Reinke.
  • AWS networking fundamentals by Perry Wald and Tom Adamski.
  • Creating resilience through destruction by Adrian Hornsby (@adhorn).
  • How to build an Alexa Skill in 30 minutes by Andrew Muttoni (@muttonia).

All of these were great technical sessions – and probably too much for a single blog post but, here goes anyway…

Driving digital transformation using artificial intelligence

Amazon thinks that driving better customer experience requires Artificial Intelligence (AI), specifically Machine Learning (ML). Using an old picture of London Underground workers sorting through used tickets in the 1950s to identify the most popular journeys, Steven Bryen suggested that more data leads to better analytics and better outcomes that can be applied in more ways (in a cyclical manner).

The term “artificial intelligence” has been used since John McCarthy coined it in 1955. The AWS view is that AI is taking off because of:

  • Algorithms.
  • Data (specifically the ability to capture and store it at scale).
  • GPUs and acceleration.
  • Cloud computing.

Citing research from PwC [which I can’t find on the Internet], AWS claims that world GDP was $80Tn in 2018 and is expected to be $112Tn in 2030 ($15.7Tn of which can be attributed to AI).

Data science, artificial intelligence, machine learning and deep learning can be thought of as a series of concentric rings.

Machine learning can be supervised learning (getting better at finding targets); unsupervised (assume nothing and question everything); or reinforcement learning (rewarding high-performing behaviour).
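As a toy illustration of the supervised case (my example, not AWS’), here’s a one-nearest-neighbour classifier in Python: it “learns” from labelled examples and predicts a label for a new data point. Real workloads would use a framework such as SageMaker, but the principle is the same:

```python
# Minimal supervised learning: 1-nearest-neighbour classification.
# Labelled training data (feature vector -> label); the values are illustrative.
training_data = [
    ((1.0, 1.0), "low usage"),
    ((1.2, 0.8), "low usage"),
    ((8.0, 9.0), "high usage"),
    ((9.5, 8.5), "high usage"),
]

def predict(point):
    """Return the label of the closest training example (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_data, key=lambda example: sq_dist(example[0], point))
    return label

print(predict((0.9, 1.1)))  # a point near the "low usage" cluster -> low usage
```

More data gives the model more examples to compare against – which is exactly the “more data leads to better outcomes” cycle from the session.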

Amazon claims extensive AI expertise through its own use of ML:

  • Recommendations Engine
  • Prime Air
  • Alexa
  • Go (checkoutless stores)
  • Robotic warehouses – taking trolleys to packers to scan and pack (using an IoT wristband to make sure robots avoid maintenance engineers).

Every day Amazon applies new AI/ML-based improvements to its business, at a global scale through AWS.

Challenges for organisations are that:

  • ML expertise is rare
  • plus: building and scaling ML technology is hard
  • plus: deploying and operating models in production is time-consuming and expensive
  • equals: a lack of cost-effective, easy-to-use and scalable ML services

Most time is spent getting data ready to get intelligence from it. Customers need a complete end-to-end ML stack and AWS provides that with edge technologies such as Greengrass for offline inference and modelling in SageMaker. The AWS view is that ML prediction becomes a RESTful API call.

With the scene set, Steven Bryen handed over to Bjoern Reinke, Drax Retail’s Director of Smart Metering.

Drax has converted former coal-fired power stations to use biomass: capturing carbon into biomass pellets, which are burned to create steam that drives turbines – representing 15% of the UK’s renewable energy.

Drax uses a systems thinking approach with systems of record, intelligence and engagement.

Systems of intelligence need:

  • Trusted data.
  • Insight everywhere.
  • Enterprise automation.

Customers expect tailoring: efficiency; security; safety; and competitive advantage.

Systems of intelligence can be applied to team leaders, front-line agents (so they already know that a customer has just been online looking for a new tariff), leaders (for reliable data sources), and assistant-enabled recommendations (which are no longer futuristic).

Fragmented/conflicting data is pumped into a data lake, from where ETL and data warehousing technologies are used for reporting and visualisation. But Drax also pulls from the data lake to run analytics for data science (using Inawisdom technology).

The data science applications can monitor usage and see base load, holidays, etc. Then, they can look for anomalies – a deviation from an established time series. This might help to detect changes in tenants, etc. and the information can be surfaced to operations teams.

AWS networking fundamentals

After hearing how AWS can be used to drive insight into customer activities, the next session was back to pure tech. Not just tech but infrastructure (albeit as a service). The following notes cover off some AWS IaaS concepts and fundamentals.

Customers deploy into virtual private cloud (VPC) environments within AWS:

  • For demonstration purposes, a private address range (CIDR) was used – 172.31.0.0/16 (a private IP range from RFC 1918). Importantly, AWS ranges should be selected to avoid potential conflicts with on-premises infrastructure. Amazon recommends using /16 (65536 addresses) but network teams may suggest something smaller.
  • AWS is dual-stack (IPv4 and IPv6) so even if an IPv6 CIDR is used, infrastructure will have both IPv4 and IPv6 addresses.
  • Each VPC should be broken into availability zones (AZs), which are risk domains on different power grids/flood profiles and a subnet placed in each (e.g. 172.31.0.0/24, 172.31.1.0/24, 172.31.2.0/24).
  • Each VPC has a default routing table but an administrator can create and assign different routing tables to different subnets.
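As a quick sketch of that addressing scheme, Python’s standard ipaddress module can carve the example /16 into per-AZ /24 subnets (the AZ names here are just for illustration):

```python
import ipaddress

# The example VPC range from the session (RFC 1918 private space).
vpc = ipaddress.ip_network("172.31.0.0/16")

# Carve out one /24 subnet per availability zone.
azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
subnets = list(vpc.subnets(new_prefix=24))[: len(azs)]

for az, subnet in zip(azs, subnets):
    print(az, subnet, f"({subnet.num_addresses} addresses)")
# eu-west-1a 172.31.0.0/24 (256 addresses)
# eu-west-1b 172.31.1.0/24 (256 addresses)
# eu-west-1c 172.31.2.0/24 (256 addresses)
```

Checking the maths this way before creating anything also helps avoid the on-premises conflicts mentioned above.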

To connect to the Internet you will need a connection, a route and a public address:

  • Create a public subnet (one with public and private IP addresses).
  • Then, create an Internet Gateway (IGW).
  • Finally, create a route so that the default gateway is the IGW (172.31.0.0/16 local and 0.0.0.0/0 igw_id).
  • Alternatively, create a private subnet and use a NAT gateway for outbound-only traffic and direct responses (172.31.0.0/16 local and 0.0.0.0/0 nat_gw_id).

Moving on to network security:

  • Security groups provide a stateful distributed firewall, so a request from one direction automatically sets up permissions for a response from the other (avoiding the need to set up separate rules for inbound and outbound traffic).
    • Using an example VPC with 4 web servers and 3 back end servers:
      • Group into 2 security groups
      • Allow web traffic from anywhere to web servers (port 80 and source 0.0.0.0/0)
      • Only allow web servers to talk to back end servers (port 2345 and source security group ID)
  • Network Access Control Lists (NACLs) are stateless – they are just lists and need to be explicit to allow both directions.
  • Flow logs work at instance, subnet or VPC level and write output to S3 buckets or CloudWatch logs. They can be used for:
    • Visibility
    • Troubleshooting
    • Analysing traffic flow (no payload, just metadata)
      • Network interface
      • Source IP and port
      • Destination IP and port
      • Bytes
      • Condition (accept/reject)
  • DNS in a VPC is switched on by default for resolution and assigning hostnames (rather than just using IP addresses).
    • AWS also has the Route 53 service for customers who would like to manage their own DNS.
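Flow log records are easy to work with precisely because they are metadata-only. Here’s a hypothetical Python sketch that parses a record in the default version-2 format (the sample values are made up):

```python
# Parse a VPC Flow Log record (default version-2 format: metadata only, no payload).
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log(line):
    """Split a space-separated flow log record into a field dictionary."""
    return dict(zip(FIELDS, line.split()))

# A made-up sample record for illustration.
record = parse_flow_log(
    "2 123456789010 eni-0a1b2c3d 10.0.0.5 172.31.16.139 "
    "49761 80 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
print(record["dstport"], record["action"])  # 80 ACCEPT
```

In practice you’d query these with CloudWatch Logs Insights or Athena rather than rolling your own parser, but the field layout is the same.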

Finally, connectivity options include:

  • Peering for private communication between VPCs
    • Peering is 1:1 and can be in different regions but the CIDR must not overlap
    • One VPC owner sends a request, which is accepted by the owner on the other side; then both sides update their routing tables.
    • Peering can get complex if there are many VPCs. There is also a limit of 125 peerings so a Transit Gateway can be used to act as a central point but there are some limitations around regions.
    • Each Transit Gateway can support up to 5000 connections.
  • AWS can be connected to on-premises infrastructure using a VPN or with AWS Direct Connect
    • A VPN is established with a customer gateway and a virtual private gateway is created on the VPC side of the connection.
      • Each connection has 2 tunnels (2 endpoints in different AZs).
      • Update the routing table to define how to reach on-premises networks.
    • Direct Connect
      • AWS services on public address space are outside the VPC.
      • Direct Connect locations have a customer or partner cage and an AWS cage.
      • Create a private virtual interface (VLAN) and a public virtual interface (VLAN) for access to VPC and to other AWS services.
      • A Direct Connect Gateway is used to connect to each VPC
    • Before Transit Gateway customers needed a VPN per VPC.
      • Now they can consolidate on-premises connectivity
      • For Direct Connect it’s possible to have a single tunnel with a Transit Gateway between the customer gateway and AWS.
  • Route 53 Resolver service can be used for DNS forwarding on-premises to AWS and vice versa.
  • VPC Sharing provides separation of resources with:
    • An Owner account to set up infrastructure/networking.
    • Subnets shared with other AWS accounts so they can deploy into the subnet.
  • Interface endpoints make an API look as if it’s part of an organisation’s VPC.
    • They override the public domain name for the service.
    • Using a private link, you can expose only a specific service port, control the direction of communications, and stop caring about IP addresses.
  • Amazon Global Accelerator brings traffic onto the AWS backbone close to end users and then uses that backbone to provide access to services.

Creating resilience through destruction

Adrian Hornsby presenting at AWS Summit London

One of the most interesting sessions I saw at the AWS Summit was Adrian Hornsby’s session on deliberately breaking things to create resilience – which is effectively the infrastructure version of test-driven development (TDD), I guess…

Actually, Adrian made the point that it’s not so much the issues that bringing things down causes as the complexity of bringing them back up.

“Failures are a given and everything will eventually fail over time”

Werner Vogels, CTO, Amazon.com

We may break a system into microservices to scale but we also need to think about resilience: the ability for a system to handle and eventually recover from unexpected conditions.

This needs to consider a stack that includes:

  • People
  • Application
  • Network and Data
  • Infrastructure

And building confidence through testing only takes us so far. Adrian referred to another presentation, by Jesse Robbins, where he talks about creating resilience through destruction.

Firefighters train to build intuition – so they know what to do in the event of a real emergency. In IT, we have the concept of chaos engineering – deliberately injecting failures into an environment:

  • Start small and build confidence:
    • Application level
    • Host failure
    • Resource attacks (CPU, latency…)
    • Network attacks (dependencies, latency…)
    • Region attack
    • Human attack (remove a key resource)
  • Then, build resilient systems:
    • Steady state
    • Hypothesis
    • Design and run an experiment
    • Verify and learn
    • Fix
    • (maybe go back to experiment or to start)
  • And use bulkheads to isolate parts of the system (as in shipping).

Think about:

  • Software:
    • Certificate Expiry
    • Memory leaks
    • Licences
    • Versioning
  • Infrastructure:
    • Redundancy (multi-AZ)
    • Use of managed services
    • Bulkheads
    • Infrastructure as code
  • Application:
    • Timeouts
    • Retries with back-offs (not infinite retries)
    • Circuit breakers
    • Load shedding
    • Exception handing
  • Operations:
    • Monitoring and observability
    • Incident response
    • Measure, measure, measure
    • You build it, you run it
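To illustrate the “retries with back-offs (not infinite retries)” point, here’s a small, self-contained Python sketch of exponential backoff with jitter and a retry cap. The function names are mine, not from Adrian’s talk:

```python
import random
import time

def call_with_backoff(operation, max_retries=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure, wait base_delay * 2^attempt (with jitter) and retry.
    Raises the last exception once max_retries is exhausted - never retries forever."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter, so retries don't synchronise.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)

# Usage sketch: a flaky operation that succeeds on the third call.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_backoff(flaky, sleep=lambda _: None))  # ok
```

A circuit breaker would sit one level above this, refusing calls entirely once failures pass a threshold.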

AWS’ Well Architected framework has been developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications, based on some of these principles.

Adrian then moved on to consider what a steady state looks like:

  • Normal behaviour of system
  • Business metric (e.g. pulse of Netflix – multiple clicks on play button if not working)
    • Amazon extra 100ms load time led to 1% drop in sales (Greg Linden)
    • Google extra 500ms of load time led to 20% fewer searches (Marissa Mayer)
    • Yahoo extra 400ms of load time caused 5-9% increase in back clicks (Nicole Sullivan)

He suggests asking questions about “what if?” and following some rules of thumb:

  • Start very small
  • As close as possible to production
  • Minimise the blast radius
  • Have an emergency stop
    • Be careful with state that can’t be rolled back (corrupt or incorrect data)

Use canary deployments with A/B testing via DNS or similar, routing a small slice of traffic to the chaos experiment (1%) and the rest to normal (99%).
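That canary split can be sketched in a few lines of Python. In practice the weighting would live in DNS or a load balancer rather than application code – this is just to show the idea:

```python
import random
from collections import Counter

def route(rng):
    """Route a request: 1% to the chaos canary, 99% to the normal deployment."""
    return "canary" if rng.random() < 0.01 else "normal"

rng = random.Random(42)  # seeded so the illustration is repeatable
counts = Counter(route(rng) for _ in range(10_000))
print(counts["canary"], counts["normal"])
```

With 10,000 simulated requests, roughly 100 land on the canary – a small blast radius, with an easy emergency stop (set the weight to zero).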

Adrian then went on to demonstrate his approach to chaos engineering, including:

  • Fault injection queries for Amazon Aurora (can revert immediately)
    • Crash a master instance
    • Fail a replica
    • Disk failure
    • Disk congestion
  • DDoS yourself
  • Add latency to network
    • tc qdisc add dev eth0 root netem delay 200ms
  • https://github.com/Netflix/SimianArmy
    • Shut down services randomly
    • Slow down performance
    • Check conformity
    • Break an entire region
    • etc.
  • The Chaos Toolkit
  • Gremlin
    • Destruction as a service!
  • Toxiproxy
    • Sits between components and adds “toxics” to test the impact of issues
  • kube-monkey (for Kubernetes)
  • Pumba (for Docker)
  • Thundra (for Lambda)

Use post mortems for correction of errors – the 5 whys. Also, understand that there is no isolated “cause” of an accident.

My notes don’t do Adrian’s talk justice – there’s so much more that I could pick up from re-watching his presentation. Adrian tweeted a link to his slides and code – if you’d like to know more, check them out:

How to build an Alexa Skill in 30 minutes

Spoiler: I didn’t have a working Alexa skill at the end of my 30 minutes… nevertheless, here’s some info to get you started!

Amazon’s view is that technology tries to constrain us. Things got better with mobile and voice is the next step forward. With voice, we can express ourselves without having to understand a user interface [except we do, because we have to know how to issue commands in a format that’s understood – that’s the voice UI!].

I get the point being made – to add an item to a to-do list involves several steps:

  • Find phone
  • Unlock phone
  • Find app
  • Add item
  • etc.

Or, you could just say (for example) “Alexa, ask Ocado to add tuna to my trolley”.

Alexa is a service in the AWS cloud that understands requests and acts upon them. There are two components:

  • Alexa voice service – how a device manufacturer adds Alexa to its products.
  • Alexa Skills Kit – to create skills that make something happen (and there are currently more than 80,000 skills available).

An Alexa-enabled device only needs to know to wake up, then stream some “mumbo jumbo” to the cloud, at which point:

  • Automatic speech recognition will translate speech to text.
  • Natural language understanding will infer intent (not just the text, but understanding…).

Creating a skill requires two parts: the interaction model (the voice user interface) and the back-end logic that fulfils the request.

Alexa-hosted skills use Lambda under the hood and creating the skill involves:

  1. Give the skill a name.
  2. Choose the development model.
  3. Choose a hosting method.
  4. Create a skill.
  5. Test in a simulation environment.
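For a sense of what the simulator is exercising, here’s a minimal, hypothetical Lambda handler written without the Alexa Skills Kit SDK (real skills would normally use the SDK, which hides this plumbing):

```python
def lambda_handler(event, context):
    """Minimal Alexa skill back-end: respond to a launch or intent request."""
    request_type = event["request"]["type"]
    if request_type == "LaunchRequest":
        speech = "Welcome to the demo skill. Ask me to say hello."
    elif request_type == "IntentRequest":
        speech = "Hello from the demo skill."
    else:
        speech = "Goodbye."
    # Alexa expects a JSON response with an outputSpeech object.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

# Simulate what the test console sends: a cut-down LaunchRequest event.
sample_event = {"request": {"type": "LaunchRequest"}}
print(lambda_handler(sample_event, None)["response"]["outputSpeech"]["text"])
```

The speech recognition and intent matching all happen in the Alexa service; the skill code only ever sees structured JSON like this.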

Finally, some more links that may be useful:

In summary

Looking back, the technical sessions made my visit to the AWS Summit worthwhile but overall, I was a little disappointed, as this tweet suggests:

Would I recommend the AWS Summit to others? Maybe. Would I watch the keynote from home? No. Would I try to watch some more technical sessions? Absolutely, if they were of the quality I saw on the day. Would I bother to go to ExCeL with 12000 other delegates herded like cattle? Probably not…

What-as-a-service?

This content is 12 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

I’ve written previously about the “cloud stack” of -as-a-service models but I recently saw Microsoft’s Steve Plank (@plankytronixx) give a great description of the differences between on-premise,  infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS).

Of course, this is a Microsoft view of the cloud computing landscape and I’ve had other discussions recently where people have argued the boundaries for IaaS or PaaS and confused things further by adding traditional web hosting services into the mix*.  Even so, I think the Microsoft description is a good starting point and it lines up well with the major cloud services offerings from competitors like Amazon and Google.

Not everyone will be familiar with this so I thought it was worth repeating Steve’s description here:

In an on-premise deployment, the owning organisation is responsible for (and has control over) the entire technology stack.

With infrastructure as a service, the cloud service provider manages the infrastructure elements: network, storage, servers and virtualisation. The consumer of the IaaS service will typically have some control over the configuration (e.g. creation of virtual networks, creating virtual machines and storage) but they are all managed by the cloud service provider.  The consumer does, however, still need to manage everything from the operating system upwards, including applying patches and other software updates.

Platform as a service includes the infrastructure elements, plus operating system, middleware and runtime elements. Consumers provide an application, configuration and data and the cloud service provider will run it, managing all of the IT operations including the creation and removal of resources. The consumer can determine when to scale the application up or out but is not concerned with how those instances are operated.

Software as a service provides a “full-stack” service, delivering application capabilities to the consumer, who only has to be concerned about their data.
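One way to visualise Steve’s description is as a simple responsibility matrix – sketched here in Python purely as an illustration of the layer boundaries described above, not an official definition:

```python
# Who manages each layer of the stack, per service model (illustrative).
LAYERS = ["network", "storage", "servers", "virtualisation",
          "operating system", "middleware", "runtime",
          "application", "data"]

PROVIDER_MANAGED = {
    "on-premise": set(),  # the owning organisation manages everything
    "IaaS": {"network", "storage", "servers", "virtualisation"},
    "PaaS": {"network", "storage", "servers", "virtualisation",
             "operating system", "middleware", "runtime"},
    "SaaS": set(LAYERS) - {"data"},  # consumer only worries about their data
}

def consumer_managed(model):
    """Layers the consumer still has to manage under a given model."""
    return [layer for layer in LAYERS if layer not in PROVIDER_MANAGED[model]]

print(consumer_managed("IaaS"))
# ['operating system', 'middleware', 'runtime', 'application', 'data']
```

The IaaS output makes the “management issues persist” point below concrete: everything from the operating system upwards, patches included, is still the consumer’s problem.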

Of course, each approach has its advantages and disadvantages:

  • IaaS allows for rapid migrations, as long as the infrastructure being moved to the cloud doesn’t rely on other components that surround it on-premise (even then, there may be opportunities to provide virtual networks and extend the on-premise infrastructure to the cloud). The downside is that many of the management issues persist as a large part of the stack is still managed by the consumer.
  • PaaS allows developers to concentrate on writing and packaging applications, creating a service model and leaving the underlying components to the cloud services provider. The main disadvantage is that the applications are written for a particular platform, so moving an application “between clouds” may require code modification.
  • SaaS can be advantageous because it allows for on-demand subscription-based application use; however consumers need to be sure that their data is not “locked in” and can be migrated to another service if required later.

Some organisations go further – for example, in the White Book of Cloud Adoption, Fujitsu wrote about Data as a Service (DaaS) and Business Process as a Service (BPaaS) – but IaaS, PaaS and SaaS are the commonly used models.  There are also many other considerations around data residency and other issues but they are outside the scope of this post. Hopefully though, it does go some way towards describing clear distinctions between the various -as-a-service models.

* Incidentally, I’d argue that traditional web hosting is not really a cloud service as the application delivery model is only part of the picture. If a web app is just running on a remote server it’s not really conforming with the broadly accepted NIST definition of cloud computing characteristics. There is a fine line though – and many hosting providers only need to make a few changes to their business model to start offering cloud services. I guess that would be an interesting discussion with the likes of Rackspace…

“5 reasons to avoid Office 365?” Are you really sure about that?

This content is 14 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

It’s not often these days that I feel the need to defend Microsoft. After all, they’re big boys and girls who can fight their own battles. And yes, I’m an MVP but if you ask Microsoft’s UK evangelists (past and present), I’m sure they’ll tell you I’m pretty critical of Microsoft at times too…

So I was amazed yesterday to read some of the negative press about Office 365. Sure, some Microsoft-bashing is to be expected. So is some comparison with Google Apps. But when I read Richi Jennings’ 5 reasons to avoid Microsoft Office 365, I was less than complimentary in my reaction. I did leave a lengthy comment on the blog post, but ComputerWorld thinks I’m a spammer… and it was more than 140 characters, so Richi’s Twitter invitation for constructive comments for his next post (5 reasons to embrace Office 365) was not really going to work either.

Picking up Richi’s arguments against Office 365:

  • On mobility. I’ll admit, there are some issues. Microsoft doesn’t seem to understand touch user interfaces for tablets (at least not until they have their own, next year perhaps?) so the web apps are not ideal on many devices. Even so, I’m using Exchange Online with my iOS devices and the ActiveSync support means it’s a breeze. We don’t have blanket WiFi/3G coverage yet (at least not here in the UK) so it is important to think about offline working and I’m not sure Microsoft has that sorted, but neither does anyone else that I’ve found. Ideally, Microsoft would create some iOS Office apps (OneNote for iPhone is not enough – it’s not a universal app and so is next to useless on an iPad) together with an Android solution too…
  • I don’t see what the issue is with MacOS support (except that the option to purchase a subscription to Office Professional Plus is Windows-only). I’m using Office 365 with Office for Mac and SharePoint integration is not as good as on Windows but there seems nothing wrong with document format fidelity or Outlook connecting to Exchange Online. I’ve used some of the web apps on my Mac too, including Lync.
  • Is £4 a month expensive for a reliable mail and collaboration service? I’m not sure that the P1 option for professionals and small businesses (which that price relates to) is “horribly crippled” either. If the “crippling” is about a lack of support, I left Google Apps because of… a lack of support (after they “upgraded” my Google Apps account but wanted me to change the email address on my then-orphaned “personal” account – and you think Microsoft makes it complex?)
  • Forest federation is a solution that provides clear separation between cloud and on-premises resources. It may be complicated, but so are enterprise requirements for cloud services. If that’s too complex, then you probably don’t need Active Directory integration: try a lower-level Office 365 subscription…
  • As for reliability, yes, there have been BPOS outages. Ditto for Azure. But didn’t Google have some high-profile Gmail outages recently? And Amazon? Office 365 (which was a beta until yesterday) has been pretty solid. Let’s hope that the new infrastructure is an improvement on BPOS, but don’t write it off yet – it’s only just launched! Microsoft is advertising a financially-backed 99.9% uptime agreement.
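As a back-of-envelope illustration (mine, not from Richi’s post or Microsoft’s SLA terms), it’s worth translating those availability percentages into a downtime budget – the gap between 99.5% and 99.9% is bigger than it looks:

```python
# Rough downtime budget implied by an availability figure (illustrative only).

def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Minutes of permitted downtime for a given availability over a period."""
    return (1.0 - availability) * period_minutes

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for sla in (0.995, 0.999):
    print(f"{sla:.1%}: {downtime_minutes(sla, MINUTES_PER_MONTH):.0f} minutes/month")
# 99.5% allows ~216 minutes of downtime a month; 99.9% only ~43 minutes
```

So a financially-backed 99.9% agreement is promising well under an hour of downtime per month – something few in-house mail teams can honestly claim.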

The point of Office 365 is not to move 100% to the cloud but to “bring Office to the cloud” and use it in conjunction with existing IT investments (i.e. local PCs/Macs and Office). If I’m a small business with few IT resources, it lets me concentrate on my business rather than running mail servers, etc. Actually, that’s the sweet spot. Some enterprises may also move to Office 365 (at least in part) but many will continue to run their mail and collaboration infrastructure in-house.

Richi says that, if he were a Microsoft shareholder, he’d be “bitterly disappointed with [yesterday’s] news”. The market seems to think otherwise… whilst Microsoft stock is generally not performing well, it has at least risen over the last couple of days…

[Chart: Microsoft stock price compared with leading US indices over the last 12 months]

To be fair, Richi wasn’t alone, but he was the one with the headline grabbing post… (would it be rude to call it linkbait?)

Over on Cloud Pro, Dennis Howlett wasn’t too impressed either. He quoted Mary Jo Foley’s Office 365 summary post:

“Office 365 is not Office in the cloud, even though it does include Office Web Apps, the Webified versions of Word, Excel, PowerPoint and OneNote. Office 365 is a Microsoft-hosted suite of Exchange Online, SharePoint Online and Lync Online – plus an optional subscription-based version of Office 2010 Professional Plus that runs locally on PCs. The Microsoft-hosted versions of these cloud apps offer subsets of their on-premises server counterparts (Exchange, SharePoint and Lync servers), in terms of features and functionality.”

Yep, that’s pretty much it. Office 365 is not about competing with Office, it’s about extending Office so that:

  • It’s attractive to small and medium-sized businesses, so that they don’t need to run their own server infrastructure.
  • There are better opportunities for collaboration, using “the cloud” as a transport (and, it has to be said, giving people less reason to move to Google Apps).

Dennis says:

“Microsoft has fallen into the trap that I see increasingly among enterprise vendors attempting to migrate their business models into the cloud: they end up with a half baked solution that does little for the user but gives some bragging rights. All the time, they seek to hang on grimly to the old business model, tinkering with it but not taking the radical steps necessary to understand working in the cloud.”

Hmm… many enterprises are not ready to put the data that is most intimately linked to their internal workings into the cloud. They look at some targeted SaaS opportunities; they might use IaaS and PaaS technologies to provide some flexibility and elasticity; they may implement cloud technologies as a “private cloud”. But Office 365 allows organisations to pick and choose the level of cloud integration that they are comfortable with – it might be all (for example, my wife’s small business) or none (for example me, working for a large enterprise), or somewhere in between.

Office 365 has some issues – I’m hoping we’ll see some more development around mobility and web app functionality – but it’s a huge step forward. After years of being told that Windows and Office are dead and that Microsoft has no future, they’ve launched something that positions the company both for software subscriptions (which they’ve been trying to do for years) and for hosting data on-premises, in the cloud, or in a hybrid solution. “The cloud” is not for everyone, but there aren’t many organisations that can’t get something out of Office 365.

Designing a private cloud infrastructure

This content is 14 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

A couple of months ago, Facebook released a whole load of information about its servers and datacentres in a programme it calls the Open Compute Project. At around about the same time, I was sitting in a presentation at Microsoft, where I was introduced to some of the concepts behind their datacentres.  These are not small operations – Facebook’s platform currently serves around 600 million users and Microsoft’s various cloud properties account for a good chunk of the Internet, with the Windows Azure appliance concept under development for partners including Dell, HP, Fujitsu and eBay.

It’s been a few years since I was involved in any datacentre operations and it’s interesting to hear how times have changed. Whereas I knew about redundant uninterruptible power sources and rack-optimised servers, the model is now about containers of redundant servers and the unit of scale has shifted.  An appliance used to be a 1U (pizza box) server with a dedicated purpose but these days it’s a shipping container full of equipment!

There’s also been a shift from keeping the lights on at all costs, towards efficiency. Hardly surprising, given that the IT industry now accounts for around 3% of the world’s carbon emissions and we need to reduce the environmental impact. Google’s datacentre design best practices are all concerned with efficiency: measuring power usage effectiveness; managing airflow; running warmer datacentres; using “free” cooling; and optimising power distribution.
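For those unfamiliar with the metric mentioned above, power usage effectiveness (PUE) is simply the ratio of total facility power to the power actually consumed by IT equipment. The figures in this sketch are made up for illustration:

```python
# Power usage effectiveness: total facility power / IT equipment power.
# A PUE of 1.0 would mean every watt drawn powers IT kit; real facilities
# spend extra on cooling, power distribution losses, lighting, etc.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Return the PUE ratio for a facility."""
    return total_facility_kw / it_equipment_kw

# e.g. a (hypothetical) facility drawing 1,500 kW overall for 1,000 kW of IT load
print(f"PUE = {pue(1500, 1000):.2f}")  # PUE = 1.50
```

The closer to 1.0, the more efficient the datacentre – which is exactly what practices like running warmer and using “free” cooling are chasing.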

So how do Microsoft (and, presumably others like Amazon too) design their datacentres? And how can we learn from them when developing our own private cloud operations?

Some of the fundamental principles include:

  1. Perception of infinite capacity.
  2. Perception of continuous availability.
  3. Drive predictability.
  4. Taking a service provider approach to delivering infrastructure.
  5. Resilience over redundancy mindset.
  6. Minimising human involvement.
  7. Optimising resource usage.
  8. Incentivising the desired resource consumption behaviour.

In addition, the following concepts need to be adopted to support the fundamental principles:

  • Cost transparency.
  • Homogenisation of physical infrastructure (aggressive standardisation).
  • Pooling compute resource.
  • Fabric management.
  • Consumption-based pricing.
  • Virtualised infrastructure.
  • Service classification.
  • Holistic approach to availability.
  • Compute resource decay.
  • Elastic infrastructure.
  • Partitioning of shared services.

In short, provisioning the private cloud is about taking the same architectural patterns that Microsoft, Amazon, et al use for the public cloud and implementing them inside your own datacentre(s): thinking service, not server, to develop an internal infrastructure as a service (IaaS) proposition.

I won’t expand on all of the concepts here (many are self-explanatory), but some of the key ones are:

  • Create a fabric with resource pools of compute, storage and network, aggregated into logical building blocks.
  • Introduce predictability by defining units of scale and planning activity based on predictable actions (e.g. certain rates of growth).
  • Design across fault domains – understand what tends to fail first (e.g. the power in a rack) and make sure that services span these fault domains.
  • Plan upgrade domains (think about how to upgrade services and move between versions so service levels can be maintained as new infrastructure is rolled out).
  • Consider resource decay – what happens when things break?  Think about component failure in terms of service delivery and design for that. In the same way that a hard disk has a number of spare sectors that are used when others are marked bad (and eventually too many fail, so the disk is replaced), take a unit of infrastructure and leave faulty components in place (but disabled) until a threshold is crossed, after which the unit is considered faulty and is replaced or refurbished.
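The resource decay idea can be sketched as a simple threshold model. This is hypothetical code of my own, not anything from a real fabric controller: failed components are disabled in place (like a disk’s bad sectors) and the whole unit is only flagged for replacement once a failure threshold is crossed:

```python
# Hypothetical sketch of the "resource decay" model described above:
# faulty components are disabled but left physically in place; the unit
# is only serviced once the proportion of failures crosses a threshold.

class Unit:
    def __init__(self, components: int, decay_threshold: float = 0.2):
        self.components = components          # e.g. servers in a container
        self.failed: set[int] = set()         # IDs of disabled components
        self.decay_threshold = decay_threshold

    def mark_failed(self, component_id: int) -> None:
        """Disable a faulty component without sending an engineer to swap it."""
        self.failed.add(component_id)

    @property
    def needs_replacement(self) -> bool:
        """True once enough components have decayed to warrant servicing the unit."""
        return len(self.failed) / self.components >= self.decay_threshold

rack = Unit(components=40)       # a rack-sized unit of 40 servers
for server in (3, 11, 27):       # individual failures are simply tolerated...
    rack.mark_failed(server)
print(rack.needs_replacement)    # False - 3/40 is still below the 20% threshold
```

The point is that service delivery, not individual hardware, is the unit you design and plan around – exactly the “services, not servers” mindset.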

A smaller company, with a small datacentre may still think in terms of server components – larger organisations may be dealing with shipping containers.  Regardless of the size of the operation, the key to success is thinking in terms of services, not servers; and designing public cloud principles into private cloud implementations.