A fundamental difference between Hadoop and the traditional SQL engines we’re used to is “Schema on Write” vs. “Schema on Read”.
Schema on Write should be pretty familiar to us if we’ve ever used a SQL database product before. Our first step in loading data is to create a table; we create the schema for it. Maybe it’s a CREATE TABLE statement that creates a customer table.
Once the table exists, we can load data into it. In this case we’ll bulk load data from a text file. We know the structure of the text file: it has the same structure as the table we created, so we can simply load it in with a statement similar to this one.
Then, once that data is loaded, we can query it with a SELECT statement. That’s Schema on Write.
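The slide statements themselves aren’t reproduced here, but the three schema-on-write steps can be sketched with SQLite from Python. The customer columns and the tab-delimited text are hypothetical stand-ins; in a real engine step 2 would be a server-side bulk load such as LOAD DATA INFILE or BULK INSERT.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: Schema on Write -- the table (and its schema) must exist first.
conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT,
        city    TEXT
    )
""")

# Step 2: bulk-load a delimited text file whose structure matches the table.
custfile = io.StringIO("1\tAlice\tChicago\n2\tBob\tDenver\n")
rows = csv.reader(custfile, delimiter="\t")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

# Step 3: query with an ordinary SELECT.
for cust_id, name, city in conn.execute("SELECT * FROM customer ORDER BY cust_id"):
    print(cust_id, name, city)
```

The point to notice is the ordering: the CREATE TABLE had to come before any data could be loaded.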
But here’s the important point: in SQL, you can’t add data to a table until you create the table, and you can’t create the table until you understand the schema of the data that’s going to go into it.
There’s an implication to this: what if the data changes? Maybe that text file of customers we’re receiving changes; maybe someone added some fields or changed some data types. What do we do?
Normally we’d have to drop the table and reload the data. That’s fine if it’s a small set of data and there aren’t any foreign keys. But if there are foreign keys, and we have, say, 500 TB of customer data, that could be a real problem. It could take days to reload that data.
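A hypothetical SQLite sketch makes the pain concrete: once the incoming file grows a fourth field, the load no longer matches the table, and the standard remedy is to drop the table, recreate it with the new schema, and reload everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Alice', 'Chicago')")

# The feed changes: a new 'state' field appears in the incoming file.
new_row = (2, "Bob", "Denver", "CO")
try:
    conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", new_row)
except sqlite3.OperationalError as err:
    print("load failed:", err)  # four values for a three-column table

# The schema-on-write remedy: drop, recreate with the new schema, reload all rows.
conn.execute("DROP TABLE customer")
conn.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT, city TEXT, state TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Alice', 'Chicago', NULL)")  # old data, reloaded
conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", new_row)
```

With one row the reload is instant; with 500 TB and foreign keys pointing at the table, that DROP-and-reload is the multi-day problem described above.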
Hadoop and other Big Data technologies generally use Schema on Read. What is that? Well, Schema on Read follows a different sequence from the one on the previous slide.
First, we load the data. This is an HDFS command to load data into the equivalent of a customer table in Hadoop, so we would write this command and execute it.
What that will do is reach out, find all the text files in this folder that begin with “custfile” and end with “txt”, and pull them into HDFS. As that happens, the data will be distributed across the nodes and replicated.
Then we can go straight to querying the data. This is a command that we might use to query a customer table.
The significance of this is that we never really created a schema at all. In Hadoop the data structure is interpreted as the data is read, in this case by a Python script. The data schema is whatever the mapper decides it is.
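With Hadoop Streaming, that query would be a `hadoop jar` invocation passing a `-mapper` script. A minimal hypothetical mapper shows the key idea: the schema lives only in this script and is applied as each raw line is read.

```python
import sys

def map_line(line):
    """Impose the schema at read time: the fields mean whatever this script says."""
    cust_id, name, city = line.rstrip("\n").split("\t")[:3]
    # Emit a key<TAB>value pair, the usual Hadoop Streaming contract.
    return f"{city}\t{name}"

def main():
    # In a real Streaming job, the raw text arrives on stdin, one record per line.
    for line in sys.stdin:
        print(map_line(line))

# Sample records, parsed only at the moment they are read -- no table ever existed.
for raw in ["1\tAlice\tChicago", "2\tBob\tDenver"]:
    print(map_line(raw))
```

If tomorrow we decided the third field is really a region code rather than a city, we would change this script, not the data.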
As we load data we needn’t be as concerned about the structure of that data, or even whether its structure will change in the future; we can adjust for it in the mapper script later.
So as we’re analyzing data and trying to find insights, if we find we need to reinterpret the structure of the data, we don’t need to reload it.
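Reinterpreting the data then amounts to editing the mapper. A hypothetical sketch: the feed gains a fourth `state` field, and instead of reloading anything we simply make the parser tolerate both shapes of record.

```python
def map_line(line):
    """Handle both the old 3-field and the new 4-field record layouts."""
    fields = line.rstrip("\n").split("\t")
    cust_id, name, city = fields[0], fields[1], fields[2]
    state = fields[3] if len(fields) > 3 else "UNKNOWN"  # new optional field
    return f"{state}\t{name}"

# Old-format and new-format lines can sit side by side in the same files;
# no data was dropped or reloaded -- only this script changed.
print(map_line("1\tAlice\tChicago"))    # old layout
print(map_line("2\tBob\tDenver\tCO"))   # new layout
```

Compare this to the schema-on-write case, where the same change meant dropping the table and reloading every row.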