Scenario: aggregating data from two relations using COGROUP

Pig

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data Platform
Talend Open Studio for Big Data
Talend Big Data
task
Data Governance > Third-party systems > Processing components (Integration) > Pig components
Data Quality and Preparation > Third-party systems > Processing components (Integration) > Pig components
Design and Development > Third-party systems > Processing components (Integration) > Pig components
EnrichPlatform
Talend Studio

This scenario applies only to a Talend solution with Big Data.

For more technologies supported by Talend, see Talend components.

In this scenario, a four-component Job is designed to aggregate two relations on top of a given Hadoop cluster.

The two relations used in this scenario consist of the following sample data:
  1. Alice,turtle,17
    Alice,goldfish,17
    Alice,cat,17
    Bob,dog,18
    Bob,cat,18
    John,dog,19
    Mary,goldfish,16
    Bill,dog,20
    This relation is composed of three columns that read owner, pet and age (of the owners).
  2. Cindy,Alice
    Mark,Alice
    Paul,Bob
    Paul,Jane
    John,Mary
    William,Bill
    This relation provides a list of students' names alongside their friends, of which some are pet owners displayed in the first relation. Therefore, the schema of this relation contains two columns: student and friend.

Before replicating this scenario, you need to write the sample data into the HDFS system of the Hadoop cluster to be used. To do this, you can use tHDFSOutput. For further information about this component, see tHDFSOutput.

The data used in this scenario is inspired by the examples that Pig's documentation uses to explain the GROUP and the GOGROUP operators. For related information, please see Apache's documentation for Pig.