In the first part we showed how to run a simple data-processing scenario with Pig on HDFS and the Hortonworks sandbox using the Talend Big Data solution.
In this tutorial we explore a slightly more complex data-processing task using a Pig script and Tez for faster processing.
The complete tutorial from Hortonworks can be found here:
Downloading example data
For this tutorial, we are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
The data file we are using comes from the site www.seanlahman.com. You can download the data file as a zipped CSV: lahman591-csv.zip
Once you have the file, unzip it into a directory. We will be uploading just the Batting.csv file.
Copy the file to the Talend Sandbox for processing.
Tez, Hindi for "speed", provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.
It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job, so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (MapReduce itself has clearly been a key driver in getting Hadoop this far).
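The DAG idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the Tez API: each vertex is a processing stage, each stage declares which upstream stages it depends on, and the whole job runs in one topologically ordered pass instead of fixed map/reduce phases.

```python
# Toy illustration of DAG-style execution (NOT the Tez API):
# each vertex names its upstream dependencies, and stages run
# in topological order within a single job.

def run_dag(vertices, deps):
    """vertices: {name: callable}, deps: {name: [upstream names]}."""
    done, order = set(), []
    def visit(v):
        if v in done:
            return
        for up in deps.get(v, []):
            visit(up)          # run dependencies first
        done.add(v)
        order.append(v)
        vertices[v]()          # then run this stage
    for v in vertices:
        visit(v)
    return order

# Stage names mirror the Pig job built later in this tutorial.
log = []
stages = {
    "load":   lambda: log.append("load"),
    "filter": lambda: log.append("filter"),
    "group":  lambda: log.append("group"),
    "join":   lambda: log.append("join"),
}
deps = {"filter": ["load"], "group": ["filter"], "join": ["group", "filter"]}
print(run_dag(stages, deps))
```

Every stage runs exactly once, after its dependencies, which is what lets a DAG engine skip the intermediate materialization a chain of separate MapReduce jobs would need.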
Loading the sample data into HDFS
First we make a connection to the Hadoop file system and do a cleanup of the files so we can rerun the scenario.
Enter the Hadoop distribution and version for the components. Make sure to use the same settings for all components.
The tHDFSPut component copies the file from the local directory to the HDFS test directory.
We are using the HDFS connection set up in the first subjob.
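Outside Talend, these two subjobs correspond to plain `hdfs dfs` shell commands. The sketch below builds and prints those commands in Python; the paths `/tmp/Batting.csv` and `/user/talend/test` are assumptions for illustration, and the `upload` helper (which actually shells out) needs a reachable cluster, so it is defined but not called here.

```python
import subprocess

# Assumed paths for illustration -- adjust to your sandbox layout.
LOCAL_FILE = "/tmp/Batting.csv"
HDFS_DIR = "/user/talend/test"

def cleanup_cmd(hdfs_dir):
    # Remove the target directory so the scenario can be rerun;
    # -f keeps the command from failing when the dir doesn't exist yet.
    return ["hdfs", "dfs", "-rm", "-r", "-f", hdfs_dir]

def put_cmd(local_file, hdfs_dir):
    # Copy the local CSV into the HDFS test directory.
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]

def upload(local_file, hdfs_dir):
    # Mirrors the two Talend subjobs: cleanup, then put.
    subprocess.run(cleanup_cmd(hdfs_dir), check=True)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(put_cmd(local_file, hdfs_dir), check=True)

print(" ".join(cleanup_cmd(HDFS_DIR)))
print(" ".join(put_cmd(LOCAL_FILE, HDFS_DIR)))
```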
Pig is a high-level scripting language used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility, Pig can invoke code in many languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
We will be using Pig components to calculate the player with the highest runs for each year.
The basic steps will be:
Step 1: Loading the data
Step 2: Filter columns
Step 3: Find max runs grouped by year
Step 4: Join to retrieve player(s)
Step 5: Store the result
In Pig Latin, the following commands will be executed:
batting = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
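To make the dataflow concrete, here is the same filter/group/join logic over a few hand-made rows in plain Python. This is a sketch of the logic only (the run totals are illustrative, and the rows are trimmed to just the three columns the script touches); the real job runs the Pig script above on HDFS.

```python
from collections import defaultdict

# Hand-made sample rows, trimmed to (playerID, year, runs) --
# the Pig script's $0, $1 and $8. Values are illustrative.
rows = [
    ("aardsda01", 2004, 5),
    ("bondsba01", 2004, 129),
    ("bondsba01", 2001, 129),
    ("suzukic01", 2001, 127),
]

# FILTER batting BY $1 > 0 / FOREACH ... GENERATE playerID, year, runs
runs = [(p, y, r) for (p, y, r) in rows if y > 0]

# GROUP runs BY (year), then MAX(runs.runs) per group
max_runs = defaultdict(int)
for _, year, r in runs:
    max_runs[year] = max(max_runs[year], r)

# JOIN max_runs BY (year, max_runs), runs BY (year, runs):
# keep only the rows whose run total equals the year's maximum.
join_data = [(y, p, r) for (p, y, r) in runs if max_runs[y] == r]

print(sorted(join_data))
```

Note that, as in the Pig version, the join can return more than one player per year if several players tie for the maximum.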
Step 1: Load Batting.csv
The first thing we need to do is load the data. We use the tPigLoad component for this.
We use PigStorage as Load Function and we pass it a comma as the field separator.
You can choose different load functions from the dropdown list in the component; other options include TextLoader, HCatLoader and HBaseLoader.
Using the mode radio buttons you can easily select where you want the job to execute; in our case, Tez.
After that we are ready to analyse the data with Apache Pig™.
Step 2: Filter the columns
Use tPigMap to filter the columns needed for the analysis. We set a filter on the output to eliminate any rows with a run total of zero.
Step 3: Select records with the highest number of runs per year
Using the tPigAggregate component we can easily group records on the incoming column (year) and apply a function to the runs column to keep the highest number per year.
Step 4: Join player info
With the tPigJoin component we can easily look up data in other files for reference. The resulting output is filtered using a tPigMap so we only write the required columns to the result file.
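In plain Python, the lookup tPigJoin performs amounts to a dictionary join against the player table from the Lahman data (Master.csv, playerID to first and last name). The table below is a hand-made two-entry stand-in, not real file contents:

```python
# Simplified stand-in for a lookup table built from Master.csv:
# playerID -> (first name, last name). Entries are illustrative.
players = {
    "bondsba01": ("Barry", "Bonds"),
    "suzukic01": ("Ichiro", "Suzuki"),
}

# Output of the max-runs step: (year, playerID, runs)
join_data = [(2001, "bondsba01", 129), (2004, "bondsba01", 129)]

# tPigJoin-style lookup, then a tPigMap-style projection keeping
# only the columns wanted in the result file.
result = [
    (year, *players.get(pid, ("unknown", "unknown")), runs)
    for (year, pid, runs) in join_data
]
print(result)
```

Using `dict.get` with a default mimics a left outer join: rows whose playerID is missing from the lookup file still appear in the result.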
Step 5: Store the result
This tutorial was created using the Talend Big Data Sandbox with Hortonworks 2.1 and Talend Data Fabric version 6.0.1.