The most common way to connect traditional Business Intelligence tools to data stored in Hadoop is to use Apache Hive. What is Hive? In short, in Hadoop 1.0, Hive is a middleware layer that lets us write queries in SQL syntax, which are then rewritten as MapReduce jobs and run by the Hadoop MR subsystem. For a more detailed description of Hive, see the previous video lesson on the underlying process Hive uses to run SQL queries.

The overall process is not fundamentally different from how we would use Hive on any Hadoop distribution. The biggest change is the Azure-specific Hive web console, which we’ll use in this video.

While Hive might seem like the “silver bullet” that eliminates the need for any other way to access data in Hadoop, it should be considered one tool in a toolbox, not the be-all and end-all. Why? The answer lies in the fact that SQL is traditionally a “schema on write” technology, meaning that before any query is formed, the underlying table schemas need to be described to the database. Hive respects this by requiring tables to be defined before queries are executed. This means that the data we query with Hive must be “structured”, and all the individual HDFS files that make up a table must have the same field layout.
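To make the "same field layout" requirement concrete, here is a minimal sketch in plain Python. It is an illustration only, not Hive's implementation: the `SCHEMA` columns and the `read_row` helper are hypothetical, and returning `None` for a field that cannot be parsed is an assumption meant to loosely mimic a text-backed table.

```python
# Hypothetical table schema, declared up front: (column name, type) pairs.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]


def read_row(line, schema=SCHEMA, delimiter="\t"):
    """Apply the declared schema to one delimited line of text."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for position, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(fields[position])
        except (IndexError, ValueError):
            # Missing or unparseable field: treat it as a null (assumption).
            row[name] = None
    return row


# A line that matches the declared layout parses cleanly.
matching = read_row("42\tUS\t19.99")

# The same data with the first two fields swapped: the schema is still
# applied by position, so the values land in the wrong columns.
swapped = read_row("US\t42\t19.99")
```

In this sketch `matching` comes back with all three columns populated, while `swapped` ends up with a null `user_id` and the string `"42"` in `country`. Nothing fails outright; the row is simply wrong, which is why every file behind a table needs the same layout.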

If we have data with unknown structure, or with structure that is inconsistent from one file to the next within a table, Hive will accept our query but won’t return the results we expect. For these scenarios we may still need to use other methods, such as C#, Java or Python MapReduce, Pig, or other alternatives. In a production system we will probably combine the two approaches: MapReduce to turn unstructured HDFS data into structured data, which Hive can then query effectively.
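As a rough sketch of that first step, the Python below shows a map-only pass that turns free-form lines into tab-delimited rows with a fixed field order. The log format, the `PATTERN` regular expression, and the `normalize` and `run_mapper` names are all hypothetical, chosen only for illustration; real input would need its own parsing rules.

```python
import io
import re

# Hypothetical free-form input, e.g.:
#   "2013-05-01 user=42 country=US spent 19.99"
PATTERN = re.compile(
    r"user=(?P<user_id>\d+)\s+country=(?P<country>\w+)\s+spent\s+(?P<amount>[\d.]+)"
)


def normalize(line):
    """Return one tab-delimited row, or None if the line cannot be parsed."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return "\t".join(
        [match.group("user_id"), match.group("country"), match.group("amount")]
    )


def run_mapper(lines, out):
    """Write a structured row for every parseable input line; skip the rest."""
    for line in lines:
        row = normalize(line)
        if row is not None:
            out.write(row + "\n")


# Small in-memory demonstration.
sample = [
    "2013-05-01 user=42 country=US spent 19.99\n",
    "corrupted line with no recognizable fields\n",
    "2013-05-02 user=7 country=DE spent 5.00\n",
]
buffer = io.StringIO()
run_mapper(sample, buffer)
structured = buffer.getvalue()
```

Run as a Hadoop Streaming mapper, the same function would be called as `run_mapper(sys.stdin, sys.stdout)`, since streaming jobs read records from standard input and write results to standard output. Every row it emits has the same three fields in the same order, which is the consistent layout a Hive table definition depends on.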





