This blog shows how the Hortonworks Pig tutorial can easily be recreated in Talend Big Data.
The tutorial describes how to load a file onto the Hortonworks platform using HCatalog and then execute a Pig script to analyze the data.
The complete tutorial from Hortonworks can be found here:
Downloading example data
For this tutorial, we’ll be using stock ticker data from the New York Stock Exchange from the years 2000-2001. You can download this file here:
The function of HCatalog is to hold the location and metadata of the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data's location and from metadata such as its schema. Additionally, since HCatalog is supported by many tools, such as Hive and Pig, the location and metadata can be shared between them. Through HCatalog's open APIs, other tools like Talend can use this information as well. In this tutorial we will see how we can reference data by name and inherit its location and metadata.
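To give a feel for what such a registration looks like outside Talend, the table could be declared through Hive DDL, which HCatalog then exposes to Pig and other tools. This is a sketch only; the table name and column list are assumptions based on the NYSE sample data from the Hortonworks tutorial:

```sql
-- Sketch: registering the tab-delimited NYSE sample as an HCatalog-visible table.
-- Table name and columns are assumptions, not taken from this blog.
CREATE TABLE nyse_stocks (
  exchange              STRING,
  stock_symbol          STRING,
  `date`                STRING,
  stock_price_open      FLOAT,
  stock_price_high      FLOAT,
  stock_price_low       FLOAT,
  stock_price_close     FLOAT,
  stock_volume          BIGINT,
  stock_price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```

Once registered, both Pig and Hive can refer to the data simply by the name nyse_stocks, without knowing where the files live or how they are delimited.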
Loading the sample data into HCatalog
First we will register the file with HCatalog so that it can be accessed from both Pig and Hive.
We also make a connection to the Hadoop file system and clean up any existing files so that the scenario can be rerun.
Enter the Hadoop distribution and version for the components. Make sure to use the same settings for all components.
Then define the database parameters in the tHCatalogOperation_2 component.
It is important to set the correct table configuration on the advanced tab of the component.
As the imported file is tab-delimited, you need to set the Row Format field to Tab.
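To see why the tab setting matters, here is a small stand-alone sketch (plain Python, outside Talend) showing how one line of the sample file splits into fields only on the tab character. The column names are assumptions based on the NYSE sample schema:

```python
# One raw line from the tab-delimited NYSE sample file (values illustrative).
line = "NYSE\tIBM\t2000-01-03\t112.4\t116.0\t111.9\t115.0\t6542800\t115.0"

# Assumed column names for the NYSE sample layout.
columns = ["exchange", "stock_symbol", "date", "open", "high", "low",
           "close", "stock_volume", "adj_close"]

# Splitting on tab maps each value to its column, just as the
# tab row format lets HCatalog map file fields to the table schema.
record = dict(zip(columns, line.split("\t")))
print(record["stock_symbol"], record["stock_volume"])  # prints: IBM 6542800
```

If the row format were left at its default, the whole line would be read as a single field and the schema mapping would fail.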
The next steps are to upload the file to HDFS and then load it into HCatalog.
After that, we are ready to analyze the data with Apache Pig™.
Pig is a language for expressing data analysis and infrastructure processes. Pig scripts provide a high-level language for creating the MapReduce jobs needed to process data in a Hadoop cluster: each script is translated into a series of MapReduce jobs that are run by the cluster. Pig is extensible through user-defined functions, which can be written in Java and other languages.
We will be using Pig to calculate the average stock volume for the IBM stock.
The basic steps will be:
Step 1: Load the data
Step 2: Filter for records with stock symbol IBM
Step 3: Group and compute the average
Step 4: Store the result
In Pig, the following commands are executed:
a = LOAD 'nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY stock_symbol == 'IBM';
c = GROUP b ALL;
d = FOREACH c GENERATE AVG(b.stock_volume);
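For readers who want to check the logic locally, the four steps above can be mimicked in plain Python over a tab-separated sample. This is a sketch only: the real job runs as MapReduce on the cluster, and the column positions are assumptions based on the NYSE sample layout:

```python
import csv
import io

def average_volume(tsv_text, symbol="IBM", symbol_col=1, volume_col=7):
    """Load, filter, group-all, and average -- mirroring the Pig script."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")   # LOAD
    volumes = [int(r[volume_col]) for r in rows
               if r[symbol_col] == symbol]                     # FILTER BY
    # GROUP ALL + AVG: one group containing every filtered row.
    return sum(volumes) / len(volumes) if volumes else None

# Tiny illustrative sample; only columns 1 and 7 matter here.
sample = ("NYSE\tIBM\t2000-01-03\t0\t0\t0\t0\t1000\t0\n"
          "NYSE\tGE\t2000-01-03\t0\t0\t0\t0\t500\t0\n"
          "NYSE\tIBM\t2000-01-04\t0\t0\t0\t0\t3000\t0\n")
print(average_volume(sample))  # prints: 2000.0
```

Like the Pig script, it filters the rows for one symbol and averages the volume column over the single remaining group.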
In Talend, the same steps are performed by four predefined Pig components. No coding is necessary.
Finally, we read the resulting file and show the result in a tLogRow component.
This tutorial was created using Talend Big Data Sandbox with Hortonworks 2.0 and Talend Big Data Open Studio version 5.6.1.