Before we dive into Hadoop and the MapReduce paradigm, let’s take a moment to consider the trends that are driving this Big Data discussion in the first place.
I think there are four key trends. The first is expanding data volumes; the second is the increasing variety of business data types; the third is the economics of high-volume data processing. And finally: hype.
One of the challenges with Big Data is that the word “Big” is highly subjective. So what you think is “Big” may not be the same thing I think is “Big”. Hadoop was designed for data volumes that really wouldn’t fit in traditional data processing environments.
So it is accurate to say that Big Data is really targeted at data sets in the 100 TB to petabyte range. But that's not to say that the technologies and techniques that came out of that research don't apply to smaller data sets. As we go through this material, we'll see there are use cases in the mid-range data sets that fit extremely well with Hadoop and MapReduce.
The second trend is the increasing variety of data. On the right side of this diagram we have highly structured data, like customer data and orders. This type of data lends itself perfectly to relational databases, which were designed to model exactly this kind of traditional transactional information.
But what relational databases weren't designed for is handling things like social media, device records, documents, and logs. These are things whose structure either doesn't exist or changes frequently.
In these areas, our SQL databases and data warehouses have been challenged to model and store the information. However, this unstructured data is exactly the type of data that fits very well into the big data paradigm.
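To make that distinction concrete, here is a minimal sketch (the records and field names are invented purely for illustration) contrasting a fixed, relational-style row with two log-style events whose fields differ from record to record:

```python
# A relational row: every record has the same predeclared columns,
# so the schema can be fixed up front.
order_row = ("ORD-1001", "CUST-42", "2013-05-01", 199.99)

# Two "unstructured" events: the fields vary from record to record,
# so there is no single schema to declare ahead of time.
click_event = {"ts": 1367400000, "user": "CUST-42",
               "action": "click", "url": "/products/widget"}
sensor_event = {"ts": 1367400042, "device_id": "sensor-7",
                "temp_c": 21.5, "firmware": "2.1.3"}
```

A relational table would have to force both events into a single set of columns, mostly filled with NULLs; Hadoop-style systems instead store the raw records and impose structure only when the data is read.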
The third trend deals with the economics of high-volume data processing. If we accept that our data volumes are increasing (and I think we probably do), then we have to decide what technology we're going to use to process those increasing volumes over time.
The processing model most of us use is symmetric multiprocessing (SMP). This is essentially one server: as the data volume goes up and processing demand increases, we add more CPUs, memory, and so on to that one server, or maybe we buy a new server that has more capacity.
In an SMP environment, as we can see on the chart, the cost remains somewhat flat as the data volume increases. So it's very economical, up to a point. But as we push data volumes further, the infrastructure cost increases at a disproportionate rate.
This is intuitively true. A 4-way server with 32 GB of RAM is relatively cheap; a 32-way server with several terabytes of RAM is very expensive. So as the data volumes begin to demand high-end, large-scale SMP hardware, the hardware costs climb steeply.
The alternative to SMP is Massively Parallel Processing (MPP). In an MPP environment we scale out across many servers rather than up within one, and we see a relatively linear relationship between data volume and infrastructure cost: as the data volume increases, the cost grows roughly in proportion.
At small and medium data volumes, it may actually cost us more to process data with MPP than with SMP. But there is a point of inflection beyond which MPP cost continues at a linear rate while SMP cost increases geometrically. For this reason, when we consider "Big Data technologies," it is at the larger data volumes that we get the best cost per unit of data compared with SMP.
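To see the shape of that argument, here is a minimal sketch in Python. The cost functions and every coefficient in them are invented for illustration only; what matters is the two curve shapes (one super-linear, one linear) and the crossover between them:

```python
# Illustrative only: toy cost models for SMP (scale-up) vs. MPP (scale-out).
# Every coefficient here is invented; only the curve shapes matter.

def smp_cost(tb: float) -> float:
    """Scale-up: cheap at first, but cost grows super-linearly because
    larger volumes force ever bigger, premium-priced single servers."""
    return 5_000 + 40 * tb ** 1.8

def mpp_cost(tb: float) -> float:
    """Scale-out: a higher entry price, but roughly linear growth,
    since we add similar commodity nodes as the volume grows."""
    return 100_000 + 900 * tb

for tb in (1, 10, 50, 100, 500, 1_000):
    s, m = smp_cost(tb), mpp_cost(tb)
    winner = "SMP" if s < m else "MPP"
    print(f"{tb:>5} TB  SMP=${s:>12,.0f}  MPP=${m:>12,.0f}  cheaper: {winner}")
```

With these made-up numbers the crossover lands somewhere between 100 TB and 500 TB; where it lands in practice depends entirely on real hardware pricing, which is exactly the evaluation each organization has to make.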
This relationship between SMP and MPP applies not just to Hadoop and non-SQL technologies but to our SQL products as well. For example, SQL Server is available in an SMP architecture, which is the version most of us use most of the time. It's also available in an MPP version called "Parallel Data Warehouse."
Other MPP SQL environments include Netezza and Teradata.
The final trend? Big data is being hyped. We’re used to this in the technology industry. When new technologies emerge, they tend to get a lot of hype. Everyone jumps on the bandwagon. We need to separate that hype from reality.
There is a lot of hype in the press. For example, a Forbes article summarized an Oracle study under the headline "The Deadly Cost of Ignoring Big Data: $71.2 Million per Year." That figure referred to opportunity cost: a company that didn't use the technology would forgo that much revenue. Certainly compelling, if it's true.
IBM said that "Every day we create 2.5 quintillion bytes of data... so much that 90% of the data in the world today has been created in the last two years alone." [2.5 quintillion bytes is 2.5 × 10^18 bytes, roughly 2.5 exabytes, or 2.5 million terabytes, every day; it's certainly a lot.] Amazing!
Gartner, though, says that Big Data is one of the most hyped terms in the market today, and went on to explain why it's over-hyped. IT World counters that Gartner is "dead wrong" and that expectations aren't inflated enough.
So if Big Data is hyped, the question is whether the hype is justified. Certainly if a disruptive technology solved all the problems on the previous slide, we'd want it even if it were over-hyped.
The NY Times says big data is about finding patterns in the noise of vast unstructured data sets. We certainly want that. Can you imagine what it’d be like to browse the web without a search engine? Search engines wouldn’t exist without big data.
Forrester says, “In this information age, the firms that best turn information to their advantage will dominate their competition.” Well, of course we want to do that! They go on to say that “big data will play a big part in helping them do it.”
McKinsey says, “The use of big data will become a key basis of competition and growth for individual firms.” Of course we all want to have the “key basis” in our organizations too!
So, in summary, these are the four trends I think are most driving Big Data and putting it on our roadmap.
1. Expanding data volumes. Clearly if we have vast amounts of information that don’t fit our traditional systems we need to look for a different technology to help.
2. Increasing variety of business data types. In truth, we've always had a variety of data types in our businesses, but much of it never really went into our data warehouses and was never made available to our users to analyze, because we didn't have the technology to do it. Maybe big data is a way to do that?
3. Economics of data processing at high volume. Hadoop and other Big Data technologies were designed with high volumes in mind, so if we have a high volume of data, we need to evaluate which of these technologies fit our needs.
4. Hype. New technologies attract inflated expectations, so we need to separate the hype from what Big Data can actually deliver before we commit.