Can we process “Big Data” in the cloud?

This content is 14 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

I wrote last week about one of the presentations I saw at the recent Unvirtual conference and this post highlights another one of the lightning talks – this time on a subject that was truly new to me: Big Data.

Tim Moreton (@timmoreton), from Acunu, spoke about using big data in the cloud: making it “elastic and sticky” and I’m going to try and get the key points across in this post. Let’s hope I get it right!

Essentially, “big data” is about collecting, analysing and servicing massive volumes of data. As the Internet of Things becomes a reality, we’ll hear more and more about big data (being generated by all those sensors) but Tim made the point that it often arrives without warning: one day you simply have a lot of users, generating a lot of data.

Tim explained that key ingredients for managing big data are storage and compute resources but it’s actually about more than that: it’s not just any storage or compute resource because we need high scalability, high performance, and low unit costs.

Compute needs to be elastic so that we can fire up (virtual) cloud instances at will to provide additional resources for the underlying platform (e.g. Hadoop). Spot pricing, such as that provided by Amazon, lets you set a maximum price you are willing to pay, so that the data is processed only at times when there is surplus capacity and the going rate is at or below that price.
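To make the spot pricing idea a little more concrete, here is a minimal sketch in Python. It is my own simplified model, not anything Tim showed and not Amazon's API: the function name `run_on_spot`, the hourly price list and the maximum price are all made up for illustration, and I'm assuming that work runs in any hour where the market price is at or below the maximum, and is billed at the market price.

```python
from typing import List, Tuple


def run_on_spot(prices: List[float], max_price: float,
                hours_needed: int) -> Tuple[List[int], float]:
    """Simulate a job that only runs while the spot price is acceptable.

    Simplifying assumption: the job runs in every hour where the market
    price is at or below max_price, pays the market price for that hour,
    and stops once it has accumulated hours_needed hours of work.
    """
    hours_run: List[int] = []
    total_cost = 0.0
    for hour, price in enumerate(prices):
        if len(hours_run) == hours_needed:
            break  # job finished, no need to look at later hours
        if price <= max_price:
            hours_run.append(hour)
            total_cost += price
    return hours_run, total_cost


# Hypothetical hourly prices (in dollars per instance-hour)
hourly_prices = [0.12, 0.08, 0.05, 0.20, 0.04, 0.06]
hours_run, total_cost = run_on_spot(hourly_prices, max_price=0.08,
                                    hours_needed=3)
print(hours_run, round(total_cost, 2))
```

The point of the sketch is the trade-off: a lower maximum price means cheaper processing, but the job waits through the expensive hours, so it only suits work that isn't time-critical.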

The trouble with big data and the cloud is virtualisation. Virtualisation is about splitting units of hardware to increase utilisation, with some overhead incurred (generally CPU or IO): essentially, multiple workloads are consolidated onto shared hardware. Processing big data necessitates the opposite, combining machines for massive parallelisation, and that doesn’t sit too well with cloud computing: at least I’m not aware of too many non-virtualised elastic clouds!

Then, there’s the fact that data is decidedly sticky.  It’s fairly simple to change compute providers but how do you pull large data sets out of one cloud and into another? Amazon’s import/export involves shipping disks in the post!
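A quick back-of-an-envelope calculation shows why data is so sticky. The sketch below is my own, with assumed figures throughout: the function name `transfer_days`, the 10 TB data set, the link speeds and the 80% link efficiency are all hypothetical, chosen only to illustrate the arithmetic.

```python
def transfer_days(data_terabytes: float, link_megabits_per_second: float,
                  efficiency: float = 0.8) -> float:
    """Estimate how many days it takes to move a data set over a network link.

    Assumptions: decimal units (1 TB = 10**12 bytes), and only a fraction
    of the nominal link speed (efficiency) is usable in practice.
    """
    total_bits = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    seconds = total_bits / usable_bits_per_second
    return seconds / 86400  # seconds in a day


# Hypothetical example: 10 TB over a 100 Mbit/s link, then over 1 Gbit/s
print(round(transfer_days(10, 100), 1))
print(round(transfer_days(10, 1000), 1))
```

Under those assumptions, 10 TB takes more than eleven days on a 100 Mbit/s link, which is why putting disks in the post can genuinely be the faster option.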

Tim concluded by saying that there is a balance to be struck. Cloud computing and big data are not mutually exclusive but it is necessary to account for the costs of storing, processing and moving the data. His advice was to consider the value (and the lock-in) associated with historical data, to process data close to its source, and to look for solutions that are built to span multiple datacentres.

[Update: for more information on “Big Data”, see Acunu’s Big Data Insights microsite]
