Kate's Comment

Thoughts on British ICT, energy & environment, cloud computing and security from Memset's MD

Big Data: Don’t Believe The Hype

Scarcely a day passes without a headline about “Big Data” and how analysing it is going to lead to huge efficiencies, targeted marketing and large profits. But are there really huge profits to be made out of data analytics (or data mining, as it used to be known), both for companies collecting data and for their new data analytics hired guns? We were recently a finalist in the BCS UK IT Industry Awards for one of my own “big data” projects, and this has given me some insight.

Science is perhaps the prime example of data analytics: genetics, big pharmaceutical companies and programmes like the Large Hadron Collider (LHC) all crunch data sets of almost unimaginable size. However, there are fewer examples in business, unless you’re operating at mega-scale like Amazon, Google or LinkedIn (Graph), or in the big-brand consumer space (e.g. a retail bank, supermarket or video-on-demand service).

So where is the drive coming from? It is a combination of factors: 1) storage has become really cheap; 2) abundant, cheap computing is available to all sizes of business on a pay-as-you-use model, thanks to cloud computing; and 3) as more businesses interact and transact electronically, there is more data to be had by all.

When do you need big data techniques?

In short, I would suggest that there are only two cases where you actually need “big data” techniques such as non-relational databases or progressive ways of reducing data (e.g. Hadoop and elastic map-reduce):

  • Where your data set is so large that you can’t crunch it on one modern machine without giving $millions to Oracle, so you want to spread it across multiple cheap nodes.
  • Where you want to be able to do data analytics in real time on large data sets (e.g. LinkedIn) and need to distribute the database to achieve that.
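
For readers who have not met map-reduce, here is a minimal single-process sketch of the idea in Python. It is an illustration only, not how Hadoop is implemented: the sample readings, the two "partitions" standing in for separate nodes, and all function names are my own assumptions. Each partition is boiled down locally to a small partial result, and only those partial results are combined.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample: (server, cpu_percent) readings split across two "nodes".
PARTITIONS = [
    [("vm1", 20), ("vm2", 70), ("vm1", 40)],
    [("vm2", 90), ("vm3", 10), ("vm1", 60)],
]


def map_partition(rows):
    """Map step: shrink one partition to per-server [sum, count] totals."""
    partial = defaultdict(lambda: [0, 0])
    for server, cpu in rows:
        partial[server][0] += cpu
        partial[server][1] += 1
    return partial


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = {server: list(totals) for server, totals in left.items()}
    for server, (total, count) in right.items():
        entry = merged.setdefault(server, [0, 0])
        entry[0] += total
        entry[1] += count
    return merged


def mean_cpu(partitions):
    """Average CPU per server, computed from the merged partial results."""
    combined = reduce(merge, map(map_partition, partitions), {})
    return {server: total / count for server, (total, count) in combined.items()}


if __name__ == "__main__":
    print(mean_cpu(PARTITIONS))
```

On a real cluster the map step would run on the node holding each partition, so only the small per-server totals travel over the network. That is the whole trick, and it only pays off once the data genuinely will not fit on one machine.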

As a principle, though, I’d say big data is more about moving the intelligence away from the data structure and into software. There is also a lot of talk about structured vs. unstructured data; in fact all machine-generated data is structured in some form! Some may be a bit “dirty”, but cleaning that up is a traditional database administrator (DBA) task, or it may sit in a crude log file which needs structuring, so there are levels of structure.

The same could be said of big data itself; there are different levels and shades of grey. Take my own pet project: the analysis of an hourly log of all our VM and server resource utilisation, which over 3 years has grown to several tens of billions of data points. I first took a crudely structured file (a big CSV dump in fact!) and wedged it into an SQLite database. SQLite is like MySQL or other relational databases but really pared down to the bone, so you need more intelligence in the software. However, it means it can cope with tables containing billions of rows, whereas MySQL and others start to creak. That task took one powerful machine (a £3k Dell server) about two days, but once the data was in the relational structure I could perform analysis on the set in a matter of minutes per query.
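
To give a flavour of the CSV-to-SQLite step, here is a minimal sketch using only Python's standard library. It is not our actual loader: the three-column layout (timestamp, server, CPU percentage), the table name `utilisation`, the batch size and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV layout: timestamp, server, cpu_percent (no header row).
SAMPLE_CSV = """2013-01-01T00:00,vm1,20
2013-01-01T00:00,vm2,70
2013-01-01T01:00,vm1,40
2013-01-01T01:00,vm2,90
"""

INSERT_SQL = "INSERT INTO utilisation VALUES (?, ?, ?)"


def load_csv(conn, csv_file, batch_size=50_000):
    """Stream rows from a CSV file object into SQLite in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS utilisation ("
        "ts TEXT NOT NULL, server TEXT NOT NULL, cpu_percent REAL NOT NULL)"
    )
    batch = []
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines in the dump
            continue
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(INSERT_SQL, batch)
            batch.clear()
    if batch:
        conn.executemany(INSERT_SQL, batch)
    # Create the index after the bulk load; this is usually faster than
    # maintaining it during the inserts.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_server ON utilisation (server)")
    conn.commit()


def mean_cpu_by_server(conn):
    """One example query: average CPU utilisation per server."""
    query = "SELECT server, AVG(cpu_percent) FROM utilisation GROUP BY server"
    return dict(conn.execute(query))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a real data set
    load_csv(conn, io.StringIO(SAMPLE_CSV))
    print(mean_cpu_by_server(conn))
```

Streaming the file in batches keeps memory use flat however large the dump is, which matters far more than raw speed when the load is a one-off job.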

Social data

Some people include social media as “big data” but personally I don’t think it fits. First, humans don’t generate much content. We can only create about 80 bytes per minute (average typing speed), and images and other high-bandwidth content don’t count since you generally can’t analyse them meaningfully (though there are exceptions). Any data source large enough to warrant big data techniques is generally going to be machine-generated.

Second, I personally believe that any organisation trying to apply big data analytics to social media channels has fundamentally missed the point! You can’t use machines to interact with people over social channels; you should have people reading the social feeds and responding appropriately. Even for large companies this is very unlikely to require more than one or two people. There are exceptions of course: for example, I happen to know that some household names which run popular media services scrape Twitter in real time as an early-warning monitoring system for issues with their service’s performance. Again, however, such use cases are limited to companies with tens of millions of real-time users.

Why anyone can analyze all this big data

The “Big Data” analytics vendors walk a fine line: they claim that it’s becoming possible to analyze all this big data, but that only they can do it, and not the average analyst or manager (who actually has the industry knowledge to boot).

This is not true. You don’t need to spend lots of money on big data. First, don’t worry about the storage; storage is very cheap these days (it is a poorly-kept secret that Amazon are making outrageous profits on EC2, for instance) and the sorts of data sets most people are talking about are actually quite “thin” anyway, not images, movies etc. Even a trillion data points taking 10 bytes each would only need 10 TBytes of storage, which is not so large that you can’t fit it on one machine!
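
The storage sum is easy to check for your own numbers. This is a back-of-envelope sketch: the function name is my own, 1 TByte is taken as 10^12 bytes, and it ignores index and file-format overhead.

```python
BYTES_PER_TB = 10**12  # decimal terabyte, as used by disk vendors


def storage_tb(data_points, bytes_per_point):
    """Raw storage needed in TBytes, ignoring indexes and other overhead."""
    return data_points * bytes_per_point / BYTES_PER_TB


if __name__ == "__main__":
    # A trillion data points at 10 bytes each.
    print(storage_tb(10**12, 10))  # 10.0
```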

Second, most competent developer-DBAs are more than capable of manipulating very large data sets, and getting existing staff to tackle such challenges is a great opportunity to expand their skill set.

What I’d say is more exciting is that we now have the capacity to collect what were once very large data sets, of the order of billions of items, and that for 99% of requirements these can be analysed with traditional techniques (relational SQL databases) on commodity server hardware.

Using Standard Tools

I’m also highly skeptical of the need to spend money on expensive tools. We accumulate about 5 million data points per day, which we push into open source SQL databases (MySQL and SQLite). Our development team are more than capable of creating the analysis tools we need without any “big data” or “data scientist” skills, and without spending money on expensive proprietary tools.

If you really do have a requirement for big data techniques then there are plenty of free, open source tools out there for you too. Hadoop is of course the most famous example. Again, I’d encourage you to let existing tech staff play with this technology rather than spending big money on something fancy.

A lot of the issues with modern databases actually arise from the code too. We’ve found that by moving the intelligence out of the database structures and into software models (we’re fans of Django, for example, and eschew stored procedures) we’ve retained the convenience of a modern database without suffering on performance. We also spend a good amount of time on code and database optimisation.

Another example of moving intelligence out of the database is SQLite. SQLite is best known as a little database used on mobile devices and such, but being so light it is also very efficient, and quite capable of handling tables with trillions of rows. We actually use SQLite for testing our code base, again exemplifying the advantages of having intelligence in the software, since we can “plug in” pretty much any old database.
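
Here is a rough sketch of what keeping the intelligence in software can look like. It is not our code and it uses the standard library rather than Django: the table, the sample rows and the function names are hypothetical. The analysis function uses only plain, portable SQL and does the ranking in Python, so it should work with any connection object that follows Python's DB-API, and the test simply plugs in an in-memory SQLite database.

```python
import sqlite3


def busiest_server(conn):
    """Return (server, peak_cpu) for the server with the highest peak."""
    cur = conn.cursor()
    cur.execute("SELECT server, MAX(cpu_percent) FROM utilisation GROUP BY server")
    rows = cur.fetchall()
    # The ranking rule lives here in software, not in a stored procedure,
    # so swapping the database does not mean rewriting the logic.
    return max(rows, key=lambda row: row[1]) if rows else None


def make_test_db():
    """Build a throwaway in-memory SQLite database with hypothetical data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE utilisation (server TEXT, cpu_percent REAL)")
    conn.executemany(
        "INSERT INTO utilisation VALUES (?, ?)",
        [("vm1", 40.0), ("vm2", 90.0), ("vm3", 10.0), ("vm2", 55.0)],
    )
    return conn


if __name__ == "__main__":
    print(busiest_server(make_test_db()))
```

One caveat: parameter placeholder styles differ between database drivers (`?` for SQLite, `%s` for most MySQL drivers), which is one of the details a layer like Django's ORM hides for you.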

In Conclusion

If you are a mega-scale, household-brand online retailer, bank, supermarket or social network then yes, it might be worth investing in novel, non-relational database systems. However, at Memset, with our not inconsiderable expertise, we doubt there are many problems that cannot be cracked with enough RAM in your server, well-written code (often poor database structure or code is to blame for delays!), traditional techniques and a little bit of patience (our analyses sometimes take a few minutes to run)!

Further, we’re unconvinced about the longevity of big data non-relational techniques. Some very smart people in the SQL camp are working on the problem of automatically scaling relational databases across multiple nodes, and this is where I would put my money. SQL is a 30+ year old technology, well understood by the development community, and when those folks crack the scaling problem the use cases for Hadoop and co will diminish greatly.

Finally, don’t go running off to fancy contractors trying to flog you the latest snake oil. Encourage your existing technical team to explore this area and challenge them to start collecting and using bigger data sets – it will be good for them and cheaper for you!