tPigMap - 6.1

Talend Components Reference Guide

tPigMap properties

Component family

Big Data / Pig

 

Function

tPigMap is fine-tuned for transforming and routing data in a Pig process. It provides a graphical interface that enables sophisticated configuration of multiple data flows.

Purpose

tPigMap transforms and routes data from single or multiple sources to single or multiple destinations.

 Basic settings

Mapping links display as

Auto: the default setting, which displays the mapping links as curves.

Curves: the mapping links are displayed as curves.

Lines: the mapping links are displayed as straight lines. This last option slightly improves performance.

 

Map editor

It allows you to define the tPigMap routing and transformation properties.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Possible uses range from a simple reorganization of fields to the most complex Jobs of data multiplexing or demultiplexing transformation, concatenation, inversion, filtering/splitting and more, in a Pig process.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

Using tPigMap requires at least basic knowledge of Java and Pig Latin in order to fully exploit its functionalities.

This component is a junction step and, for this reason, can be neither a start nor an end component in the Job.

Optional map settings

On the input side:

Lookup properties

Value

Join Model

Inner Join;

Left Outer Join;

Right Outer Join;

Full Outer Join.

If you do not display this settings panel to select an option, the default join model is Left Outer Join. These options join two or more flows based on common field values.

When more than one lookup table needs joining, the main input flow is first joined with the first lookup flow, the result is then joined with the second lookup flow, and so on in the same manner until the last lookup flow is joined.

Join Optimization

None;

Replicated;

Skewed;

Merge.

If you do not display this settings panel to select an option, the default join optimization is None. These options are used to perform more efficient join operations. For example, if you are using the parallelism of multiple reduce tasks, the Skewed join can be used to counteract the load imbalance problem if the data to be processed is sufficiently skewed.

Each of these options is subject to the constraints explained in Apache's documentation about Pig Latin.

Custom Partitioner

Enter the Hadoop partitioner you need to use to control the partitioning of the keys of the intermediate map-outputs. For example, enter, in double quotation marks,

org.apache.pig.test.utils.SimpleCustomPartitioner

to use the partitioner SimpleCustomPartitioner.

For further information about the code of this SimpleCustomPartitioner, see Apache's documentation about Pig Latin. The jar file of this partitioner must have been registered in the Register jar table in the Advanced settings view of the tPigLoad component linked with the tPigMap component to be used.

Increase Parallelism

Enter the number of reduce tasks. For further information about the parallel features, see Apache's documentation about Pig Latin.
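The Join Model options and the chaining of several lookup flows described above can be pictured with a minimal Python sketch. It only models the join semantics on small in-memory rows; it is not the Pig Latin code that tPigMap generates, and the function name, column names and sample rows are all hypothetical.

```python
def join(main, lookup, key, model="left_outer"):
    """Join two lists of dicts on `key` using one of the four join models."""
    main_cols = [c for c in main[0] if c != key]
    lookup_cols = [c for c in lookup[0] if c != key]
    rows, matched_lookup = [], set()
    for m in main:
        hits = [(i, l) for i, l in enumerate(lookup) if l[key] == m[key]]
        for i, l in hits:
            matched_lookup.add(i)
            rows.append({**m, **l})
        # Outer joins on the main side keep unmatched main rows, padded with None.
        if not hits and model in ("left_outer", "full_outer"):
            rows.append({**m, **{c: None for c in lookup_cols}})
    # Outer joins on the lookup side keep unmatched lookup rows, padded with None.
    if model in ("right_outer", "full_outer"):
        for i, l in enumerate(lookup):
            if i not in matched_lookup:
                rows.append({**{c: None for c in main_cols}, **l})
    return rows


# Hypothetical rows, used only to show the behavior of each model.
main = [{"street": "A", "traffic": "jam"}, {"street": "B", "traffic": "normal"}]
lookup = [{"street": "B", "event": "rain"}, {"street": "C", "event": "road work"}]
lookup2 = [{"street": "A", "weather": "fog"}]

for model in ("inner", "left_outer", "right_outer", "full_outer"):
    print(model, len(join(main, lookup, "street", model)))

# Several lookups: the result of the first join becomes the main flow of the next.
chained = join(join(main, lookup, "street"), lookup2, "street")
print(chained)
```

In this sketch, only the row for street B survives an inner join, while the default Left Outer Join keeps every main row and leaves the lookup columns empty where no match exists.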

On the output side:

Output properties

Value

Catch Output Reject

True;

False.

This option, once activated, allows you to catch the records rejected by a filter you can define in the appropriate area.

Catch Lookup Inner Join Reject

True;

False.

This option, once activated, allows you to catch the records rejected by the inner join operation performed on the input flows.
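As a rough illustration of Catch Lookup Inner Join Reject, the following Python sketch splits the main flow into rows that find a match in the lookup flow and rows that do not. It is a model of the idea under simplifying assumptions, not the component's implementation; the function name and the sample rows are hypothetical.

```python
def inner_join_with_reject(main, lookup, key):
    """Return (joined rows, main rows rejected by the inner join)."""
    lookup_by_key = {}
    for row in lookup:
        lookup_by_key.setdefault(row[key], []).append(row)

    joined, rejected = [], []
    for m in main:
        matches = lookup_by_key.get(m[key], [])
        if matches:
            joined.extend({**m, **l} for l in matches)
        else:
            # No lookup row shares this key: the inner join rejects the record.
            rejected.append(m)
    return joined, rejected


# Hypothetical rows, used only for illustration.
main = [{"street": "A", "traffic": "jam"}, {"street": "B", "traffic": "normal"}]
lookup = [{"street": "B", "event": "rain"}]

joined, rejected = inner_join_with_reject(main, lookup, "street")
print("joined:", joined)
print("rejected:", rejected)
```

An output flow with this option set to true would receive the rows collected in `rejected` here.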

Scenario: Joining data about road conditions in a Pig process

The Job in this scenario uses two tPigLoad components to read data about the traffic conditions and the related events on given roads from a given Hadoop distribution, joins and filters the data using tPigMap, and writes the results into that Hadoop distribution using two tPigStoreResult components.

The Hadoop distribution to be used stores the data about the traffic situation, such as normal or jam, and the data about the traffic-related events, such as road work, rain or even no event. In this example, the data to be used reads as follows:

  1. The traffic situation data stored in the directory /user/ychen/tpigmap/date&traffic:

    2013-01-11 00:27:53;Bayshore Freeway;jam
    2013-02-28 07:01:18;Carpinteria Avenue;jam
    2013-01-26 11:27:59;Bayshore Freeway;normal
    2013-03-07 20:48:51;South Highway;jam
    2013-02-07 07:40:10;Lindbergh Blvd;normal
    2013-01-22 17:13:55;Pacific Hwy S;normal
    2013-03-17 23:12:26;Carpinteria Avenue;normal
    2013-01-15 08:06:53;San Diego Freeway;jam
    2013-03-19 15:18:28;Monroe Street;jam
    2013-01-20 05:53:12;Newbury Road;normal

  2. The event data stored in the directory /user/ychen/tpigmap/date&event:

    2013-01-11 00:27:53;Bayshore Freeway;road work
    2013-02-28 07:01:18;Carpinteria Avenue;rain
    2013-01-26 11:27:59;Bayshore Freeway;road work
    2013-03-07 20:48:51;South Highway;no event
    2013-02-07 07:40:10;Lindbergh Blvd;second-hand market
    2013-01-22 17:13:55;Pacific Hwy S;no event
    2013-03-17 23:12:26;Carpinteria Avenue;no event
    2013-01-15 08:06:53;San Diego Freeway;second-hand market
    2013-03-19 15:18:28;Monroe Street;road work
    2013-01-20 05:53:12;Newbury Road;no event

For any given moment shown in the timestamps in the data, one row is logged to reflect the traffic situation and another row to reflect the traffic-related event. You need to join the data into one table in order to easily detect how the events on a given road are impacting the traffic.
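The joined table this scenario aims for can be previewed with a short Python sketch that uses the first three rows of each sample file. It assumes that the two flows are matched on the date and street columns, which the sample data suggests but this section does not state explicitly; the helper name `parse` is hypothetical.

```python
traffic_lines = [
    "2013-01-11 00:27:53;Bayshore Freeway;jam",
    "2013-02-28 07:01:18;Carpinteria Avenue;jam",
    "2013-01-26 11:27:59;Bayshore Freeway;normal",
]
event_lines = [
    "2013-01-11 00:27:53;Bayshore Freeway;road work",
    "2013-02-28 07:01:18;Carpinteria Avenue;rain",
    "2013-01-26 11:27:59;Bayshore Freeway;road work",
]


def parse(lines, third_column):
    """Split semicolon-separated lines into dicts with three named columns."""
    return [dict(zip(("date", "street", third_column), line.split(";")))
            for line in lines]


traffic = parse(traffic_lines, "traffic")
# Assumption: (date, street) identifies the matching event row.
events = {(e["date"], e["street"]): e["event"] for e in parse(event_lines, "event")}

# Keep every traffic row and attach the event when one exists (left outer join).
joined_table = [{**t, "event": events.get((t["date"], t["street"]))} for t in traffic]

for row in joined_table:
    print(";".join(str(row[c]) for c in ("date", "street", "traffic", "event")))
```

Each printed line carries the traffic situation and the event side by side, which is the shape of the records the Job writes at the end of this scenario.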

Note

The data used in this example is a sample with limited size.

To replicate this scenario, ensure that the Studio to be used has the appropriate rights to read and write data in that Hadoop distribution and then proceed as follows:

Linking the components

  1. In the Integration perspective of Talend Studio, create an empty Job, named pigweather for example, from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see the Talend Studio User Guide.

  2. Drop two tPigLoad components, one tPigMap component and two tPigStoreResult components onto the workspace.

    The components can be labeled if need be. In this scenario, we label the two tPigLoad components as traffic and event, respectively, which load the traffic data and the related event data. Then we label the two tPigStoreResult components as normal and jam, respectively, which write the corresponding results to the Hadoop distribution to be used. For further information about how to label a component, see the Talend Studio User Guide.

  3. Right-click the tPigLoad component labeled traffic to connect it to tPigMap using the Row > Pig combine link from the contextual menu.

  4. Repeat this operation to link the tPigLoad component labeled event to tPigMap, too. As this is the second link created, it automatically becomes the lookup link.

  5. Use the Row > Pig combine link again to connect tPigMap to each of the two tPigStoreResult components.

    You need to name these links in the dialog box that pops up once you select the link type from the contextual menu. In this scenario, we name the link to tPigStoreResult labeled normal as out1 and the link to tPigStoreResult labeled jam as reject.

Configuring tPigLoad

Loading the traffic data

  1. Double-click the tPigLoad labeled traffic to open its Component view.

  2. Click the button next to Edit schema to open the schema editor.

  3. Click the button three times to add three rows and in the Column column, rename them as date, street and traffic, respectively.

  4. Click OK to validate these changes.

  5. In the Mode area, select the Map/Reduce option, as we need the Studio to connect to a remote Hadoop distribution.

  6. In the Distribution list and the Version field, select the Hadoop distribution to be used. In this example, it is Hortonworks Data Platform V1.0.0.

  7. In the Load function list, select the PigStorage function to read the source data, as the data is a structured file in human-readable UTF-8 format.

  8. In the NameNode URI and the JobTracker host fields, enter the locations of the master node and the Job tracker service of the Hadoop distribution to be used, respectively.

  9. In the Input file URI field, enter the directory where the data about the traffic situation is stored. As explained earlier, the directory in this example is /user/ychen/tpigmap/date&traffic.

  10. In the Field separator field, enter ; which is the separator used by the source data.

Loading the event data

  1. Double-click the tPigLoad labeled event to open its Component view.

  2. Click the button next to Edit schema to open the schema editor.

  3. Click the button three times to add three rows and in the Column column, rename them as date, street and event, respectively.

  4. Click OK to validate these changes.

  5. In the Mode area, select Map/Reduce.

    As you have configured the connection to the given Hadoop distribution in that first tPigLoad component, traffic, this event component reuses that connection and therefore, the corresponding options in the Distribution and the Version lists have been automatically selected.

  6. In the Load function field, select the PigStorage function to read the source data.

  7. In the Input file URI field, enter the directory where the event data is stored. As explained earlier, the directory in this example is "/user/ychen/tpigmap/date&event".

Configuring tPigMap

  • Double-click tPigMap to open its Map Editor view.

Creating the output schema

  1. On the input side (left side) of the Map Editor, each of the two tables represents one of the input flows, the upper one for the main flow and the lower one for the lookup flow.

    On the output side (right side), the two tables represent the output flows that you named as out1 and reject earlier.

    From the main flow table, drop its three columns onto each of the output flow tables.

  2. From the lookup flow, drop the event column onto each of the output flow tables.

    Then, in the Schema editor view, you can see that the schemas of both sides have been completed. You can also click each table to display its schema in this view.

Setting the mapping conditions

  1. On the lookup flow table, click the button to open the setting panel in this table.

  2. In the Join Model row, select Left Outer Join to ensure that every record of the main flow is included in this join.

  3. On the out1 output flow table, click the button to display the editing field for the filter expression.

  4. Enter

    'normal' == row1.traffic

    This allows tPigMap to output only the traffic records reading normal in the out1 flow.

  5. On the reject output flow table, click the button to open the setting panel.

  6. In the Catch Output Reject row, select true to output the traffic records reading jam in the reject flow.

  7. Click Apply, then click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
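The effect of the filter on out1 combined with Catch Output Reject on reject can be sketched in Python as follows. This models the routing only and is not what tPigMap generates; the rows are four joined records derived from the sample data above, and the function name `route` is hypothetical.

```python
# (date, street, traffic, event) rows as they would look after the join.
joined_rows = [
    ("2013-01-11 00:27:53", "Bayshore Freeway", "jam", "road work"),
    ("2013-01-26 11:27:59", "Bayshore Freeway", "normal", "road work"),
    ("2013-03-07 20:48:51", "South Highway", "jam", "no event"),
    ("2013-01-20 05:53:12", "Newbury Road", "normal", "no event"),
]


def route(rows, keep):
    """Send rows that pass the filter to out1 and all other rows to reject."""
    out1, reject = [], []
    for row in rows:
        (out1 if keep(row) else reject).append(row)
    return out1, reject


# Same condition as the filter expression: the traffic column reads normal.
out1, reject = route(joined_rows, lambda row: row[2] == "normal")

print("out1:", out1)
print("reject:", reject)
```

Because the only other traffic value in the sample data is jam, the reject flow ends up holding exactly the jam records.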

Configuring tPigStoreResult

  1. Double-click the tPigStoreResult labeled normal to open its Component view.

  2. In the Result file field, enter the directory you need to write the result in. In this scenario, it is /user/ychen/tpigmap/traffic_normal, which receives the records reading normal.

  3. Select the Remove result directory if exists check box.

  4. In the Store function list, select PigStorage to write the records in human-readable UTF-8 format.

  5. In the Field separator field, enter ;.

  6. Repeat the same operations to configure the tPigStoreResult labeled jam, but set the directory, in the Result file field, as /user/ychen/tpigmap/traffic_jam.

Note

If either of the two tPigStoreResult components does not retrieve its schema from tPigMap, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component. Once done, the warning icon disappears.

Executing the Job

Press F6 to run this Job.

Once done, verify the results in the Hadoop distribution used.

From the traffic_jam records, you can analyze which events often coincide with a traffic jam and, from the traffic_normal records, how the smooth traffic situation is maintained.

If you need to obtain more details about the Job, it is recommended to use the web console of the JobTracker provided by the Hadoop distribution you are using.

In JobHistory, you can easily find the execution status of your Pig Job because the name of the Job is automatically created by concatenating the name of the project that contains the Job, the name and version of the Job itself and the label of the first tPigLoad component used in it. The naming convention of a Pig Job in JobHistory is ProjectName_JobNameVersion_FirstComponentName.
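The naming convention can be illustrated with a small Python helper. The function name, the project name MYPROJECT and the version string 0.1 are hypothetical, and the exact way the version is rendered in JobHistory is an assumption; only the concatenation pattern comes from the convention stated above.

```python
def pig_job_history_name(project, job, version, first_component):
    """Build a name following ProjectName_JobNameVersion_FirstComponentName."""
    return f"{project}_{job}{version}_{first_component}"


# Hypothetical project name and version; job name and label are from this scenario.
print(pig_job_history_name("MYPROJECT", "pigweather", "0.1", "traffic"))
```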