If you follow “Big Data” in the industry press at any level, I’m sure you’ve seen this term “NoSQL”. What is this “NoSQL” thing anyway?
Well, to me it’s kind of inflammatory, isn’t it? Having SQL with a red cross through it… But really, NoSQL isn’t supposed to mean “we hate SQL”. It just means the underlying data access doesn’t use SQL. Hadoop and MapReduce, at their most basic level (when they’re dealing with the data itself), don’t use SQL; they use other data access mechanisms.
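To make “other data access mechanisms” a little more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern in plain Python. It is not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real framework would run these steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data store"]))
```

Notice there is no query language anywhere in it. The developer writes the logic for reading and combining records by hand.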
So while that’s true, it doesn’t really mean “without SQL” either. Hive, for example, is a wildly popular way to query data stored in HDFS. It was developed at Facebook to let SQL analysts query the data that Hadoop stores.
It puts a SQL layer on top of HDFS. And it’s fair to say that so much access to Hadoop and other MapReduce data is done using SQL that a term like “NoSQL” may not really be the best fit.
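Here is a rough sketch of what a SQL layer gives the analyst: you state the result you want and leave the mechanics to the engine. It uses Python’s built-in sqlite3 module purely as a stand-in, since running Hive needs a cluster. The page_views table and the page_hits function are hypothetical names, and Hive’s own dialect differs in its details.

```python
import sqlite3


def page_hits(rows):
    """Count views per page with one declarative query.

    sqlite3 stands in for a SQL-on-Hadoop layer here; the table and
    column names are assumptions made for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page, COUNT(*) AS hits "
        "FROM page_views "
        "GROUP BY page "
        "ORDER BY hits DESC, page"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]
    print(page_hits(sample))
```

The grouping and counting are described in one statement rather than coded step by step, which is why analysts who already know SQL find this kind of layer so approachable.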
Maybe a way to think about it, and some have suggested this, is that the term should be taken to mean “Not Only SQL”, because technologies like Hadoop do extend traditional SQL in different ways and do some things SQL can’t do.
But in the end we’re probably going to use a lot of SQL when we use Hadoop in our solutions. Let’s consider an example where some of our data sources are structured and others are unstructured.
Perhaps we’d like to consolidate these, and we’ll use Hadoop or some other NoSQL technology to do it. We might use Integration Services (SSIS) from the SQL Server platform to pull structured data from our line-of-business applications and put it into our data warehouse. We’re pretty accustomed to doing this.
We may occasionally also use SSIS to pull out things like sensor logs or call records. These are things we can often figure out how to model and pull into a data warehouse.
But we might also use a NoSQL technology like Hadoop to pull in information that isn’t that easy to model in a relational data warehouse, so we can preprocess it and turn it into something we can store there.
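As a small illustration of that preprocessing step, the sketch below turns messy log lines into tidy (date, action, count) rows that a relational table could hold. The log format, the regular expression, and the function names are all assumptions invented for this example. In practice this logic would run as a job over far more data than fits in one Python list.

```python
import re
from collections import Counter

# Hypothetical log layout: a date, a time, then free text that contains
# "user=<name>" followed somewhere later by "action=<name>".
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \S+ .*?user=(?P<user>\w+).*?action=(?P<action>\w+)"
)


def parse_line(line):
    """Return (date, user, action), or None if the line does not fit."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    return match.group("date"), match.group("user"), match.group("action")


def daily_action_counts(lines):
    """Boil raw lines down to sorted (date, action, count) rows."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:  # silently skip lines we cannot interpret
            date, _user, action = parsed
            counts[(date, action)] += 1
    return sorted((date, action, n) for (date, action), n in counts.items())


if __name__ == "__main__":
    raw = [
        "2013-04-02 10:15:01 session started user=alice action=login",
        "### corrupted entry ###",
        "2013-04-02 10:17:44 user=bob note='slow page' action=login",
        "2013-04-03 08:00:12 user=alice action=purchase",
    ]
    for row in daily_action_counts(raw):
        print(row)
```

The output has fixed columns and predictable types, which is exactly what the relational side needs.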
Once we’ve done that, we might use technologies like ODBC or Sqoop to pull that information out of the MapReduce environment in a structured format and model it into our data warehouse.
Ultimately, that enables our users to use SQL to analyze data that made it into the data warehouse directly alongside information that Hadoop boiled down into something that could be easily modeled and analyzed in the relational data warehouse.
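To picture that end state, here is a hedged sketch in which sqlite3 plays the part of the data warehouse. One table holds rows loaded directly from a line-of-business system, the other holds tab-separated output of the kind a Hadoop job might produce, and a single SQL query joins the two. The table names, column names, and sample values are hypothetical, and the actual transfer step (Sqoop or ODBC) is not shown.

```python
import csv
import io
import sqlite3


def build_warehouse(sales_rows, hadoop_output_tsv):
    """Load both sources into one in-memory database (a stand-in warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.execute("CREATE TABLE web_visits (customer TEXT, visits INTEGER)")
    # Structured data that arrived directly from a line-of-business system.
    conn.executemany("INSERT INTO sales VALUES (?, ?)", sales_rows)
    # Tab-separated "customer<TAB>visits" lines, assumed to come from a Hadoop job.
    reader = csv.reader(io.StringIO(hadoop_output_tsv), delimiter="\t")
    conn.executemany(
        "INSERT INTO web_visits VALUES (?, ?)",
        ((row[0], int(row[1])) for row in reader if row),
    )
    return conn


def revenue_and_visits(conn):
    """Analyze both sources together with ordinary SQL."""
    return conn.execute(
        """
        SELECT s.customer, SUM(s.amount) AS revenue, v.visits
        FROM sales AS s
        JOIN web_visits AS v ON v.customer = s.customer
        GROUP BY s.customer, v.visits
        ORDER BY s.customer
        """
    ).fetchall()


if __name__ == "__main__":
    warehouse = build_warehouse(
        [("alice", 10.0), ("alice", 5.0), ("bob", 7.5)],
        "alice\t3\nbob\t1\n",
    )
    print(revenue_and_visits(warehouse))
```

From the analyst’s point of view, nothing in the final query reveals which table started life in Hadoop.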