Can we process “Big Data” in the cloud?

This content is 13 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

I wrote last week about one of the presentations I saw at the recent Unvirtual conference and this post highlights another one of the lightning talks – this time on a subject that was truly new to me: Big Data.

Tim Moreton (@timmoreton), from Acunu, spoke about using big data in the cloud: making it “elastic and sticky” and I’m going to try and get the key points across in this post. Let’s hope I get it right!

Essentially, “big data” is about collecting, analysing and servicing massive volumes of data. As the Internet of things becomes a reality, we’ll hear more and more about big data (being generated by all those sensors) but Tim made the point that it often arrives without warning: all of a sudden you have a lot of users, generating a lot of data.

Tim explained that key ingredients for managing big data are storage and compute resources but it’s actually about more than that: it’s not just any storage or compute resource because we need high scalability, high performance, and low unit costs.

Compute needs to be elastic so that we can fire up (virtual) cloud instances at will to provide additional resources for the underlying platform (e.g. Hadoop). Spot pricing, such as that provided by Amazon, allows a maximum price to be set, to process the data at times when there is surplus capacity.
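To make the spot pricing idea a little more concrete, here is a minimal sketch in Python. It is not Amazon's API: the function name, the hourly prices and the maximum price are all made up for illustration. It simply shows a batch job running only in the hours when the going rate is at or below the maximum price we are prepared to pay.

```python
def hours_to_run(spot_prices, max_price):
    """Return the hours (as list indices) in which the job would run.

    spot_prices: hypothetical price per instance-hour for each hour.
    max_price:   the most we are prepared to pay per instance-hour.
    """
    return [hour for hour, price in enumerate(spot_prices) if price <= max_price]


# Made-up prices for six consecutive hours (not real Amazon figures).
hourly_prices = [0.12, 0.09, 0.05, 0.04, 0.06, 0.11]
max_price = 0.06

run_hours = hours_to_run(hourly_prices, max_price)
total_cost = sum(hourly_prices[hour] for hour in run_hours)

print("Job runs in hours:", run_hours)
print("Cost per instance: %.2f" % total_cost)
```

The trade-off is that the work only progresses when there is surplus capacity, so this approach suits processing that can wait.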

The trouble with big data and the cloud is virtualisation. Virtualisation is about splitting units of hardware to increase utilisation, with some overhead incurred (generally CPU or IO) – essentially multiple compute resources are combined/consolidated.  Processing big data necessitates combining machines for massive parallelisation – and that doesn’t sit too well with cloud computing: at least I’m not aware of too many non-virtualised elastic clouds!

Then, there’s the fact that data is decidedly sticky.  It’s fairly simple to change compute providers but how do you pull large data sets out of one cloud and into another? Amazon’s import/export involves shipping disks in the post!
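A rough back-of-envelope calculation shows why shipping disks is not as daft as it sounds. The sketch below is my own illustration, not something from Tim's talk: the 50TB data set, the 100Mbps link and the 80% usable bandwidth are all assumed figures.

```python
def transfer_days(dataset_terabytes, link_megabits_per_second, efficiency=0.8):
    """Estimate the days needed to move a data set over a network link.

    efficiency is an assumed fraction of the link speed that is actually
    usable once protocol overhead and contention are taken into account.
    """
    total_bits = dataset_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    seconds = total_bits / usable_bits_per_second
    return seconds / 86400  # seconds in a day


# Hypothetical example: 50TB over a 100Mbps connection.
print("%.1f days" % transfer_days(50, 100))
```

With those assumptions the transfer takes the best part of two months, which is why a courier can sometimes beat the network.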

Tim concluded by saying that there is a balance to be struck. Cloud computing and big data are not mutually exclusive but it is necessary to account for the costs of storing, processing and moving the data. His advice was to consider the value (and the lock-in) associated with historical data, to process data close to its source, and to look for solutions that are built to span multiple datacentres.

[Update: for more information on “Big Data”, see Acunu’s Big Data Insights microsite]

Azure Connect – the missing link between on-premise and cloud

This content is 14 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

Azure Connect offers a way to connect on-premise infrastructure with Windows Azure but it’s lacking functionality that may hinder adoption.

While Microsoft is one of the most dominant players in client-server computing, until recently its position in the cloud seemed uncertain. Since then, we’ve seen Microsoft lay out its stall with both Software as a Service (SaaS) products including Office 365 and Platform as a Service (PaaS) offerings such as Windows Azure joining its traditional portfolio of on-premise products for consumers, small businesses and enterprise customers alike.

Whereas Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) offer virtualised Infrastructure as a Service (IaaS) and Salesforce.com is about consumption of Software as a Service (SaaS), Windows Azure fits somewhere in between. Azure offers compute and storage services, so that an organisation can take an existing application, wrap a service model around it and specify how many instances to run, how to persist data, etc.

Microsoft also provides middleware to support claims based authentication and an application fabric that allows simplified connectivity between web services endpoints, negotiating firewalls using outbound connections and standard Internet protocols. In addition, there is a relational database component (SQL Azure), which exposes relational database services for cloud consumption, in addition to the standard Azure table storage.

It all sounds great – but so far everything I’ve discussed runs on a public cloud service and not all applications can be moved in their entirety to the cloud.

Sometimes it makes sense to move compute operations to the cloud and keep the data on-premise (more on that in a moment). Sometimes, it’s appropriate to build a data hub with multiple business partners connecting to a data source in the cloud but with application components in a variety of locations.

For European CIOs, information security, in particular data residency, is a real issue. I should highlight that I’m not a legal expert, but CIO Magazine recently reported how the Patriot Act potentially gives the United States authorities access to data hosted with US-based service providers – and selecting a European data centre won’t help.  That might make CIOs nervous about placing certain types of data in the cloud although they might consider a hybrid cloud solution.

Azure already provides federated security, application layer connectivity (via AppFabric) and some options for SQL Azure data synchronisation (currently limited to synchronisation between Microsoft data centres, expanding later this year to include synchronisation with on-premise SQL Server) but the missing component has been the ability to connect Windows Azure with on-premise infrastructure and applications. Windows Azure Connect provides this missing piece of the jigsaw.

Azure Connect is a new component for Windows Azure that provides secure network communications between compute instances in Azure and servers on-premise (i.e. behind the corporate firewall). Using standard IP protocols (both TCP and UDP) it’s possible to take a web front end to the cloud and leave the SQL Server data on site, communicating over a virtual private network, secured with IPSec. In another scenario, a compute instance can be joined to an on-premise Active Directory domain so a cloud-based application can take advantage of single sign-on functionality. IT departments can also use Azure Connect for remote administration and troubleshooting of cloud-based computing instances.

Azure Connect is currently in pre-release form and Microsoft is planning to make it available during the first half of 2011. Whilst setup is relatively simple and requires no coding, Azure Connect is reliant on an agent running on the connected infrastructure (i.e. on each server that connects to Azure resources) in order to establish IPSec connectivity (a future version of Azure Connect will be able to take advantage of other VPN solutions). Once the agent is installed, the server automatically registers itself with the Azure Connect relay in the cloud and network policies are defined to manage connectivity. All that an administrator has to do is to enable Windows Azure roles for external connectivity via the service model; enable local computers to initiate an IPSec connection by installing the Azure Connect agent; define network policies; and, in some circumstances, define appropriate outbound firewall rules on servers.

The emphasis on simplicity is definitely an advantage as many Azure operations seem to require developer knowledge and this is clearly targeted at Windows administrators. Along with automatic IPSec provisioning (so no need for certificate servers) Azure Connect makes use of DNS so that there is no requirement to change application code (the same server names can be used when roles move between the on-premise infrastructure and Azure).

For some organisations though, the presence of the Azure Connect agent may be seen as a security issue – after all, how many database servers are even Internet-connected? That’s not insurmountable but it’s not the only issue with Azure Connect.

For example, connected servers need to run Windows Vista, 7, Server 2008, or Server 2008 R2 [a previous version of this story erroneously suggested that only Windows Server 2008 R2 was supported] and many organisations will be running their applications on older operating system releases. This means that there may be server upgrade costs to consider when integrating with the cloud – and it certainly rules out any heterogeneous environments.

There’s also an issue with storage. Windows Azure’s basic compute and storage services can make use of table-based storage. Whilst SQL Azure is available for applications that require a relational database, not all applications have this requirement – and SQL Azure presents additional licensing costs as well as imposing additional architectural complexity. A significant number of cloud-based applications make use of table storage or a combination of table storage and SQL Server – for them, the creation of a hybrid model for customers that rely on on-premise data storage may not be possible.

For many enterprises, Azure Connect will be a useful tool in moving applications (or parts of applications) to the cloud. If Microsoft can overcome the product’s limitations, it could represent a huge step forward for Microsoft’s cloud services in that it provides a real option for development of hybrid cloud solutions on the Microsoft stack, but there is still some way to go.

[This post was originally written as an article for Cloud Pro.]