Hortonworks Tutorial ‘How-to-process-data-with-apache-pig’ in Talend

In the first part we showed how to run a simple data-processing scenario with Pig on HDFS and the Hortonworks sandbox, using the Talend Big Data solution.
In this tutorial we explore a slightly more complex data-processing task, using a Pig script and Tez for faster processing.

The complete tutorial from Hortonworks can be found here:

Downloading example data
For this tutorial, we are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.

The data file we are using comes from the site www.seanlahman.com. You can download the data as a zipped CSV archive: lahman591-csv.zip
Once you have the file, unzip it into a directory. We will be uploading just the Batting.csv file.
Copy the file to the Talend Sandbox for processing.
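
The unzip step can also be scripted. The following is a minimal Python sketch, assuming the archive is named lahman591-csv.zip as above; the helper name extract_member and the target directory are hypothetical and not part of the Talend job.

```python
import zipfile
from pathlib import Path


def extract_member(archive_path, member_name, target_dir):
    """Extract a single file from a zip archive and return its path.

    The file name is matched case-insensitively, because the archive
    may spell it Batting.csv or batting.csv.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if Path(name).name.lower() == member_name.lower():
                # ZipFile.extract returns the path of the extracted file.
                return Path(archive.extract(name, path=target))
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")
```

A call such as extract_member("lahman591-csv.zip", "Batting.csv", "data") would leave only the batting file in the (hypothetical) data directory, ready to be copied to the sandbox.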


Tez, Hindi for “speed”, provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.
It generalizes the MapReduce paradigm into a more powerful framework by making it possible to execute a complex DAG (directed acyclic graph) of tasks as a single job. This lets projects in the Apache Hadoop ecosystem, such as Apache Hive, Apache Pig and Cascading, meet requirements for human-interactive response times and extreme throughput at petabyte scale (MapReduce has clearly been a key driver in achieving this).
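
To make "a DAG of tasks for a single job" a little more concrete, here is a small Python sketch that models the steps of this tutorial as a dependency graph and derives a valid execution order with the standard library's graphlib module. The task names are illustrative labels only, and this is not how Tez represents or schedules work internally.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it needs.
# The names are illustrative labels for the steps of this tutorial.
PIPELINE = {
    "load": set(),
    "filter": {"load"},
    "project": {"filter"},
    "group_by_year": {"project"},
    "max_runs": {"group_by_year"},
    # The join consumes two upstream results, which is what makes this
    # a graph rather than a simple chain.
    "join": {"max_runs", "project"},
    "store": {"join"},
}


def execution_order(graph):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE)
print(order)
```

The point of the sketch is only that the whole flow is described once, as a graph, and every task runs after the tasks it depends on.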

Loading the sample data into HDFS
First we make a connection to the Hadoop file system and do a cleanup of the files so we can rerun the scenario.

HDFS Connection

Enter the Hadoop distribution and version for the components. Make sure to use the same settings for all components.

HCatalog Database

The tHDFSPut component copies the file from the local directory to the HDFS test directory.
We are using the HDFS connection set up in the first subjob.
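
For readers who want to see what the cleanup and upload amount to outside Talend, the sketch below builds comparable hdfs dfs command lines in Python. It assumes the hdfs client is available on the PATH; the paths in the usage note are placeholders, and the sketch is an analogy to the subjob, not what the Talend components do internally.

```python
import subprocess


def cleanup_command(hdfs_dir):
    """Command that removes an HDFS directory; -f ignores a missing path."""
    return ["hdfs", "dfs", "-rm", "-r", "-f", hdfs_dir]


def mkdir_command(hdfs_dir):
    """Command that creates an HDFS directory, including parents."""
    return ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]


def put_command(local_file, hdfs_dir):
    """Command that copies a local file into HDFS; -f overwrites."""
    return ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        runner(command, check=True)
```

With a placeholder directory such as /user/talend/test, passing cleanup_command, mkdir_command and put_command for Batting.csv to run_all, in that order, gives a rerunnable upload: the cleanup makes the later steps start from an empty directory each time.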

HCatalog Database

Apache Pig™
Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in the sense that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility, Pig can invoke code written in languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

Processing data
We will be using Pig components to find the player with the highest number of runs for each year.

The basic steps will be:
Step 1: Loading the data
Step 2: Filter columns
Step 3: Find max runs grouped by year
Step 4: Join to retrieve player(s)
Step 5: Store the result

In a Pig script, the following commands would be executed:

-- Load the CSV; without a schema, fields are addressed by position ($0, $1, ...)
batting = LOAD 'Batting.csv' USING PigStorage(',');
-- Keep rows whose second field (the year) is greater than 0
raw_runs = FILTER batting BY $1 > 0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
-- Join the yearly maximum back to the player rows that reach it
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;

In Talend the same steps are performed by six predefined Pig components. No coding is necessary.
Below you can see the Talend subjob that executes the same logic:
Pig Job

Step 1: Load Batting.csv
The first thing we need to do is load the data. We use the tPigLoad component for this.
We use PigStorage as Load Function and we pass it a comma as the field separator.
You can choose a different load function from the dropdown list in the component. Other options include, for example, TextLoader, HCatLoader and HbaseLoader.
Using the mode radio buttons you can easily select where you want the job to execute; in our case, Tez.
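
To make the role of the field separator concrete, here is a Python sketch of what a comma-delimited, schema-less load produces: every line becomes a tuple of untyped fields that are addressed by position. The sample rows are invented, and only positions 0, 1 and 8 follow the Pig script above. The sketch does not try to reproduce how PigStorage handles types or quoting.

```python
def load_delimited(lines, separator=","):
    """Split each text line into a tuple of string fields.

    Fields are untyped and addressed by position, much like $0, $1, ...
    in a Pig relation loaded without a schema.
    """
    return [tuple(line.rstrip("\r\n").split(separator)) for line in lines]


# Invented sample rows: only positions 0 (player), 1 (year) and 8 (runs)
# follow the tutorial; the other columns are filler.
SAMPLE_LINES = [
    "playerID,yearID,c2,c3,c4,c5,c6,c7,R\n",
    "player_a,1901,1,x,x,10,10,40,12\n",
    "player_b,1901,1,x,x,12,12,50,0\n",
]

rows = load_delimited(SAMPLE_LINES)
player_id, year, runs = rows[1][0], rows[1][1], rows[1][8]
print(player_id, year, runs)
```

Note that the header line is loaded like any other row, which is why a later filter step is needed to get rid of it.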

Upload File

After that we are ready to analyse the data with Apache Pig™.

Step 2: Filter the columns
Use tPigMap to select the columns needed for the analysis. We set a filter on the output to eliminate any rows with a run total of zero.
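
The effect of this step can be sketched in Python as a projection plus a row filter. The exact filter expression configured in tPigMap is not shown in the text, so the condition below (numeric year, run total above zero) is an assumption, as are the helper names and the sample rows.

```python
def to_int(text):
    """Return text as an int, or None when it is empty or not numeric."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def select_runs(rows):
    """Keep (player_id, year, runs) for rows with a positive run total.

    Rows that are too short, or whose year or runs field is not numeric
    (the header row, for instance), are dropped as well.
    """
    selected = []
    for row in rows:
        if len(row) < 9:
            continue
        year, runs = to_int(row[1]), to_int(row[8])
        if year is None or runs is None or runs <= 0:
            continue
        selected.append((row[0], year, runs))
    return selected


# Invented sample rows in the positional layout used by the Pig script.
SAMPLE_ROWS = [
    ("playerID", "yearID", "", "", "", "", "", "", "R"),
    ("player_a", "1901", "", "", "", "", "", "", "12"),
    ("player_b", "1901", "", "", "", "", "", "", "0"),
    ("player_c", "1902", "", "", "", "", "", "", ""),
]

print(select_runs(SAMPLE_ROWS))
```

Converting the fields to integers at this point also keeps the later maximum numeric rather than alphabetical.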

Pig Mapping

Step 3: Select records with highest number of runs per year

Using the tPigAggregate component we can easily group the incoming records by a column (year) and apply a function to the runs column, in this case the maximum, to get the highest number of runs per year.
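
The grouping and aggregation can be illustrated with a few lines of Python. This is a sketch of the logic only (group by year, keep the maximum), with invented records; the function name is hypothetical.

```python
def max_runs_per_year(records):
    """Map each year to the highest run total seen for that year."""
    best = {}
    for _player_id, year, runs in records:
        if year not in best or runs > best[year]:
            best[year] = runs
    return best


# Invented (player_id, year, runs) records.
SAMPLE_RECORDS = [
    ("player_a", 1901, 12),
    ("player_b", 1901, 30),
    ("player_c", 1902, 25),
    ("player_d", 1902, 25),
]

print(max_runs_per_year(SAMPLE_RECORDS))
```

The result holds one value per year and no player information, which is why a join back to the player rows follows.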

Pig Aggregate

Step 4: Join Player info

With the tPigJoin component we can easily look up data in other files for reference. The resulting output is filtered using a tPigMap, so that only the required columns are written to the result file.
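
As an illustration of the join logic, the Python sketch below matches the yearly maximum back to the player records on the composite key (year, runs), which mirrors the join keys in the Pig script. The records and names are invented, and the sketch says nothing about how tPigJoin is implemented.

```python
def top_scorers(records, best_by_year):
    """Return (year, player_id, runs) for records that reach the yearly maximum.

    This is a simple hash join: the (year, runs) pairs of the maxima
    form the lookup set, and every player record is probed against it.
    """
    keys = set(best_by_year.items())
    matches = [
        (year, player_id, runs)
        for player_id, year, runs in records
        if (year, runs) in keys
    ]
    return sorted(matches)


# Invented sample data: 1902 has two players with the same maximum.
SAMPLE_RECORDS = [
    ("player_a", 1901, 12),
    ("player_b", 1901, 30),
    ("player_c", 1902, 25),
    ("player_d", 1902, 25),
]
BEST_BY_YEAR = {1901: 30, 1902: 25}

for row in top_scorers(SAMPLE_RECORDS, BEST_BY_YEAR):
    print(row)
```

Because the join is on the value rather than on a single "best" row, a year in which several players share the highest run total yields several output rows.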

Pig Store

Step 5: Store the result

Upload File

This is how the complete job looks.
Job Overview

After we run the job the following result is displayed.
Result in File browser

As you can see in the excerpt from the result file below, some years have two entries, because multiple players scored the same highest number of runs.
Job Overview

This tutorial was created using the Talend Big Data Sandbox with Hortonworks 2.1 and Talend Data Fabric version 6.0.1.


Frank is director of XSed. XSed is a value added reseller for Talend Open Data Solutions. In accordance with the open source philosophy we focus on collaboration and open communication.

Website: http://www.xsed.nl
