Aggregating data from two relations using COGROUP - 7.2

Pig

author
EnrichVersion
7.2
task

For more technologies supported by Talend, see Talend components.

In this scenario, a four-component Job is designed to aggregate two relations on top of a given Hadoop cluster.

The two relations used in this scenario consist of the following sample data:
  1. Alice,turtle,17
    Alice,goldfish,17
    Alice,cat,17
    Bob,dog,18
    Bob,cat,18
    John,dog,19
    Mary,goldfish,16
    Bill,dog,20
    This relation is composed of three columns that read owner, pet and age (of the owners).
  2. Cindy,Alice
    Mark,Alice
    Paul,Bob
    Paul,Jane
    John,Mary
    William,Bill
    This relation provides a list of students' names alongside their friends, of which some are pet owners displayed in the first relation. Therefore, the schema of this relation contains two columns: student and friend.

Before replicating this scenario, you need to write the sample data into the HDFS system of the Hadoop cluster to be used. To do this, you can use tHDFSOutput.

The data used in this scenario is inspired by the examples that Pig's documentation uses to explain the GROUP and the GOGROUP operators. For related information, please see Apache's documentation for Pig.