Hortonworks Tutorial – ‘Pig Basics’ in Talend

This blog post shows how the Hortonworks Pig tutorial can be easily recreated in Talend Big Data.
The tutorial describes how to load a file onto the Hortonworks platform using HCatalog and then execute a Pig script to analyze the data.

The complete tutorial from Hortonworks can be found here:

Downloading example data
For this tutorial, we'll be using stock ticker data from the New York Stock Exchange for the years 2000-2001. You can download this file here:

Apache HCatalog
The function of HCatalog is to hold the location and metadata of the data in a Hadoop cluster. This decouples scripts and MapReduce jobs from the data location and from metadata such as the schema. Additionally, since HCatalog supports many tools, such as Hive and Pig, the location and metadata can be shared between them. Through the open APIs of HCatalog, other tools like Talend can use this location and metadata as well. In this tutorial we will see how we can reference data by name and inherit its location and metadata.
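The decoupling HCatalog provides can be illustrated with a toy sketch in plain Python (this is not the real HCatalog API; the file path and column names are assumptions for illustration): a small catalog maps a table name to its location and schema, so the consuming code only needs the name "nyse_stocks".

```python
# Toy illustration of HCatalog-style decoupling (NOT the real HCatalog API):
# a catalog maps a table name to its storage location and schema, so the
# consuming code never hard-codes either.
catalog = {
    "nyse_stocks": {
        "location": "/user/hue/test/NYSE-2000-2001.tsv",  # hypothetical path
        "schema": ["exchange", "stock_symbol", "date", "stock_volume"],
    }
}

def load_table(name, read_file):
    """Look up location and schema by table name, then parse tab-delimited rows."""
    meta = catalog[name]
    rows = []
    for line in read_file(meta["location"]):
        fields = line.rstrip("\n").split("\t")
        rows.append(dict(zip(meta["schema"], fields)))
    return rows

# Simulated file system for the example.
fake_fs = {"/user/hue/test/NYSE-2000-2001.tsv": ["NYSE\tIBM\t2001-01-02\t100\n"]}
records = load_table("nyse_stocks", lambda path: fake_fs[path])
```

If the table's location or schema changes, only the catalog entry is updated; every reader keeps working unchanged, which is exactly the benefit HCatalog brings to Pig, Hive, and Talend jobs.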

Loading the sample data into HCatalog
First we will register the data with HCatalog so it can be accessed from both Pig and Hive.
We also make a connection to the Hadoop file system and clean up the files so we can rerun the scenario.

HCatalog Components

Enter the Hadoop distribution and version for the components. Make sure to use the same settings for all components.

Hadoop Distribution

Then define the database parameters in the tHCatalogOperation_2 component.

HCatalog Database

In the next component, tHCatalogOperation_1, we will create the table.
HCatalog Database

Then we define the schema of the table, which matches the file we will upload.
HCatalog Table Schema

It is important to set the correct table configuration on the advanced tab of the component.
As the imported file is tab-delimited, you need to set the row format field to a tab value.

HCatalog Table

The following steps include uploading the file to HDFS and then loading it into HCatalog.

Upload File

The tHDFSPut component copies the file from the local directory to a test directory on HDFS.
We reuse the HDFS connection set up in the first subjob.
Upload File

The next step is to load the file into HCatalog.

Upload File

After that we are ready to analyse the data with Apache Pig™.

Apache Pig™
Pig is a language for expressing data analysis and infrastructure processes. Pig scripts provide a high-level language that is translated into a series of MapReduce jobs run by the Hadoop cluster. Pig is extensible through user-defined functions, which can be written in Java and other languages.
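The statement that Pig is translated into MapReduce jobs can be made concrete with a tiny in-memory simulation (plain Python, no Hadoop involved; the sample records are made up): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

# Made-up sample records: (stock_symbol, volume).
records = [("IBM", 100), ("IBM", 300), ("AAPL", 200)]

# Map: emit (stock_symbol, volume) pairs.
mapped = [(symbol, volume) for symbol, volume in records]

# Shuffle: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group (here, total volume per symbol).
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'IBM': 400, 'AAPL': 200}
```

A Pig statement such as GROUP followed by a FOREACH ... GENERATE aggregate compiles down to this same map-shuffle-reduce pattern, just distributed across the cluster.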

Pig Basics

We will be using Pig to calculate the average stock volume for IBM stock.

The basic steps are:
Step 1: Load the data
Step 2: Select all records with the IBM stock symbol
Step 3: Iterate over the records and average the volume
Step 4: Store the result

In a Pig script, the following commands would be executed:

a = LOAD 'nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY stock_symbol == 'IBM';
c = GROUP b ALL;
d = FOREACH c GENERATE AVG(b.stock_volume);
DUMP d;
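As a local sanity check of what this script computes, the same dataflow can be mirrored line for line in plain Python (the sample rows are made up; only the stock_symbol and stock_volume columns from the tutorial schema are used):

```python
# Made-up sample rows: (stock_symbol, stock_volume).
a = [("IBM", 100), ("GOOG", 500), ("IBM", 300)]

# b = FILTER a BY stock_symbol == 'IBM';
b = [row for row in a if row[0] == "IBM"]

# c = GROUP b ALL;
# d = FOREACH c GENERATE AVG(b.stock_volume);
volumes = [volume for _, volume in b]
d = sum(volumes) / len(volumes)

# DUMP d;
print(d)  # 200.0
```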

In Talend the same steps are performed by four predefined Pig components. No coding is necessary.

Upload File

Load the data using HCatLoader, which is set via the Load Function parameter, and specify the database and table.
Pig Load

Filter the rows for stocks of IBM.
Pig Filter

Calculate average
Pig Aggregate

Store the result.
Pig Store

Finally, we read the resulting file and display its contents with a tLogRow component.

This is how the complete job looks.
Job Overview

After we run the job the following result is displayed.
Job Overview

This tutorial was created using Talend Big Data Sandbox with Hortonworks 2.0 and Talend Big Data Open Studio version 5.6.1.


Frank is director of XSed. XSed is a value added reseller for Talend Open Data Solutions. In accordance with the open source philosophy we focus on collaboration and open communication.

Website: http://www.xsed.nl
