In the last lesson we introduced the idea of MPP, or massively parallel processing. What is this?
Well, in general an MPP model takes a very large data set and divides it into partitions. Let’s say we take a year’s data and divide it into 12 partitions–one for each month.
Each of those partitions is distributed to an independent computing node. Each node usually has its own RAM, CPUs and storage.
So in our month-by-month example we might have 12 nodes–12 servers, each processing only one month’s data. In this way no server has to be big enough to process the entire year. Each only has to process one month’s worth of data.
Then we’ll have a control node of some kind. That control node will marshal all the other nodes and coordinate them. They’ll all work on their own “piece” of the query.
The control node will consolidate the work that all the other nodes are doing and return that to the client. From the client’s point of view the database looks like it resides on one server. Actually, in this yearly example, it may be 13 servers.
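To make that pattern concrete, here’s a toy sketch in Python. It’s purely illustrative: a process pool on one machine stands in for the 12 worker nodes, and the function names and fake data are my own, not any particular MPP product’s API.

```python
# Toy illustration of the MPP pattern: partition, distribute, consolidate.
# A process pool on one machine stands in for 12 worker nodes; in a real
# MPP system each partition would live on its own server with its own
# RAM, CPUs, and storage.
from multiprocessing import Pool
import random

def partition_by_month(rows):
    """Control node: split a year's rows into 12 monthly partitions."""
    partitions = [[] for _ in range(12)]
    for month, amount in rows:
        partitions[month - 1].append(amount)
    return partitions

def node_total(partition):
    """Worker node: compute a partial result over its own month only."""
    return sum(partition)

if __name__ == "__main__":
    # Fake sales data: (month, amount) pairs covering one year.
    rows = [(random.randint(1, 12), random.random() * 100)
            for _ in range(10_000)]
    partitions = partition_by_month(rows)

    # 12 workers each process one month; the control node consolidates
    # the partial results and returns a single answer to the client.
    with Pool(processes=12) as pool:
        partials = pool.map(node_total, partitions)
    print("Yearly total:", round(sum(partials), 2))
```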
SQL-based MPP is well established. Microsoft has a product called Parallel Data Warehouse, which is based on SQL Server. Teradata has been in this space for many years, and Netezza has also been on the market for years.
Non-SQL MPP is also well established, mainly among large Internet properties. Google, Amazon, Yahoo, etc., typically use non-SQL MPP underlying their systems. That’s how they process such vast amounts of information. They don’t have one server or “mainframe”. They have farms of distributed computing nodes working together.
Let’s compare SMP and MPP in a little more “real world” context. This is an SMP dump truck! So if I wanted to move 400 tons of coal, I could buy this dump truck, and it would do the job.
This obviously is the kind of truck used in large mining operations. The approximate cost of this dump truck is $5M. This dump truck is at capacity. I can’t expand this truck at all. If I want to move more coal, I have to buy another one at $5M, and then I can begin to use its capacity. Very expensive!
The alternative to my SMP dump truck would be a cluster of dump trucks. Or an “MPP Dump Truck Cluster”. The seven dump trucks in this cluster carry 25 tons each, so working together they can move 175 tons. The cost of all seven will be about $1M.
So you can see that if I double that to 14 dump trucks in my cluster I can move about 350 tons of coal for about $2M. So in aggregate I’m spending less than I would if I invested in an SMP dump truck.
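Just to check that economics with a quick back-of-the-envelope calculation (all of these figures come from the example above):

```python
# Cost per ton of hauling capacity, using the figures from the example.
smp_cost, smp_tons = 5_000_000, 400          # one giant SMP dump truck
mpp_cost, mpp_tons = 1_000_000, 7 * 25       # seven 25-ton trucks

print(smp_cost / smp_tons)   # 12500.0 -> about $12,500 per ton
print(mpp_cost / mpp_tons)   # ~5714.3 -> about $5,714 per ton
# Doubling the cluster (14 trucks, 350 tons, ~$2M) keeps the same
# per-ton cost, and capacity grows in small, affordable steps.
```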
The same basic economics apply to Hadoop clusters. Hadoop clusters generally harness many low-cost individual servers to do the work that might otherwise require a single very large, expensive server.
This is a photo of a 63-node Hadoop cluster that was installed at Boise State University. If you look closely, these are just PCs–ones that you might find under a desk. Maybe that’s where these came from?
But the Hadoop software is marshalling these together and making them work as a coordinated whole. Why would you do this?
Probably one reason would be cost. Here we’ll compare a high-end SMP server on the left, and a PC-based cluster on the right.
The high-end SMP server can be ordered with 32 cores–which will max it out. We can’t add another core to that server. If we want another core, we have to buy another server.
On the PC cluster side, if we price it out (in 2012) we can get a PC with a quad-core Intel i5. If we order 63 of those we’ll have 252 cores.
The high-end SMP server can be configured with a maximum of 16TB internal storage. We could hook up external SAN storage, but that would add cost. So we’ll just configure with the maximum that we can fit inside the server’s cabinet. That’s 16TB–and it’s maxed out.
On the PC-based cluster side, we could put a 1TB drive in each of the 63 machines and have 63TB of storage. So again, we’ve got some more capacity on the right.
The SMP server can be ordered with a maximum of 2TB RAM. We can’t add more than that. On the PC side we can order PCs with 32GB RAM, and 63 of those will give us roughly 2TB RAM (about the same amount as the SMP server).
The cost comparison shows very different outcomes. The SMP server configured like this will (in 2012) cost about $157,000. The PC-based cluster will cost about $44,100.
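Here’s where those totals come from, spelled out. Note that the roughly $700-per-node price isn’t stated directly; it’s implied by the quoted total ($44,100 across 63 nodes):

```python
# The 2012 cluster math spelled out. The ~$700 per-node price is
# implied by the quoted total ($44,100 / 63 nodes), not stated directly.
nodes = 63
cores_per_node, disk_tb_per_node, ram_gb_per_node, cost_per_node = 4, 1, 32, 700

print("cores:", nodes * cores_per_node)            # 252 vs. 32 in the SMP server
print("storage (TB):", nodes * disk_tb_per_node)   # 63 vs. 16 in the SMP server
print("RAM (GB):", nodes * ram_gb_per_node)        # 2016, roughly the SMP's 2TB
print("cost ($):", nodes * cost_per_node)          # 44,100 vs. 157,000
```

That per-node figure also explains the scaling cost coming up next: 10 more nodes at about $700 each is roughly $7K.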
So if–and that’s an “if”–we could process the same workload on the cluster that we can on the high-end SMP server, we’ll save a lot of money!
Another advantage to a Hadoop cluster model is the cost of scale. If we wanted to add another 10 nodes to our PC-based Hadoop cluster, we would just need to buy 10 more nodes at a cost of about $7K. This would scale up every element of the system without needing to replace the hardware already purchased.
From a systems management point of view, though, you might look at the SMP server and decide you’d rather have it, because it may be easier to manage than a bunch of PCs sitting on bread racks. And that might be right.
So it’s important to point out that you can buy this kind of Hadoop cluster in different packages. One way is what we just looked at–building your own from commodity hardware. But you have to manage them all as individual pieces of hardware.
If you’re worried about managing a lot of nodes on a bread rack (or in a server rack), and you’d rather manage just one piece of hardware with one support contract, you can buy a pre-integrated appliance like this one from EMC. This will probably cost more than building your own system from commodity hardware, but it does have turn-key vendor support.
A third approach is to use cloud-based Hadoop clusters. In this way you can rent the number of nodes you need for only the amount of time you need them. This can have significant benefits in terms of hardware depreciation. It also allows much more agility in sizing the hardware needed for your Hadoop cluster. This works well as long as the data you need can be moved into the cloud to be processed by a Hadoop cluster.