One of the reasons to use Hadoop as part of your data warehouse strategy is to take advantage of its ability to process data in a distributed way: massively parallel processing, or MPP. Another is to leverage its “schema on read” approach when processing unstructured data.

In data warehousing terms, moving data from a source system into the warehouse is known as ETL, or “Extract/Transform/Load”. In MPP systems, it’s typically more efficient to transpose the T and the L and use an “Extract/Load/Transform” (ELT) pattern instead. Why? Because loading the raw data first allows the transformation step to leverage the full breadth of the distributed processing nodes, resulting in superior performance.

Pig, which implements the Pig Latin data flow language for Hadoop, is the most commonly used ELT technology in Hadoop clusters. In this introductory lesson we’ll look at a simple Pig script that processes and transforms IIS web logs. While this is demonstrated on an HDInsight cluster within the Azure cloud platform, the process is identical on any Hadoop cluster.
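The lesson’s script itself isn’t reproduced here, but a sketch along these lines illustrates the general shape of the task. It assumes the default W3C field layout that IIS writes (your site’s logging configuration may differ), a hypothetical input path, and a simple hits-per-URL transformation. Note how the schema is declared at LOAD time, which is Hadoop’s “schema on read” in action:

```
-- Extract + Load: point Pig at the raw, space-delimited IIS log files.
-- The 14 fields below match the default IIS W3C logging configuration;
-- adjust them to whatever fields your server actually logs.
raw_logs = LOAD '/data/iislogs/u_ex*.log' USING PigStorage(' ') AS (
    log_date:chararray, log_time:chararray, s_ip:chararray,
    cs_method:chararray, cs_uri_stem:chararray, cs_uri_query:chararray,
    s_port:int, cs_username:chararray, c_ip:chararray,
    user_agent:chararray, sc_status:int, sc_substatus:int,
    sc_win32_status:int, time_taken:int);

-- W3C log files begin with metadata lines prefixed by '#'; drop those.
clean_logs = FILTER raw_logs
             BY log_date IS NOT NULL AND NOT (log_date MATCHES '#.*');

-- Transform, distributed across the cluster: count requests per URL.
by_uri     = GROUP clean_logs BY cs_uri_stem;
hit_counts = FOREACH by_uri GENERATE group AS uri, COUNT(clean_logs) AS hits;

-- Write the transformed result back to HDFS for downstream consumers.
STORE hit_counts INTO '/data/output/uri_hits' USING PigStorage('\t');
```

On HDInsight you would run this the same way as on any other Hadoop cluster, for example by saving it as a .pig file and submitting it with the pig command-line client.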
