Loading the input data and removing duplicates

Pig

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Open Studio for Big Data
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tPigLoad to open its Basic settings view.
  2. Click the [...] button next to Edit schema to open the [Schema] dialog box.
  3. Click the [+] button three times to add three columns that match the data structure of the input file: Name (string), Country (string) and Age (integer). Then click OK to save the settings and close the dialog box.
  4. Click Local in the Mode area.
  5. Fill in the Input file URI field with the full path to the input file.
  6. Select PigStorage from the Load function list, and leave the rest of the settings as they are.
  7. Double-click tPigDistinct to open its Basic settings view, and click Sync columns to make sure that the input schema structure is correctly propagated from the preceding component.
    This component will remove any duplicates from the data flow.
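
To make the effect of these two components concrete, the following is a minimal Python sketch of the same logic: read delimited records into the three-column schema (Name, Country, Age), then drop exact duplicate rows. It is an illustration only, not the code the Studio generates. The semicolon separator, the sample records, and the names Person, load_rows and distinct are all assumptions made for this example.

```python
import csv
from typing import Iterable, List, NamedTuple


class Person(NamedTuple):
    """One record, mirroring the schema defined in step 3."""
    name: str
    country: str
    age: int


def load_rows(lines: Iterable[str], delimiter: str = ";") -> List[Person]:
    """Parse delimited text lines into typed records.

    The ';' default is an assumption for this sketch; use whatever
    field separator your input file actually has.
    """
    rows = []
    for fields in csv.reader(lines, delimiter=delimiter):
        if not fields:  # skip blank lines
            continue
        name, country, age = fields
        rows.append(Person(name, country, int(age)))
    return rows


def distinct(rows: Iterable[Person]) -> List[Person]:
    """Keep one copy of each record that is identical in every column."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


if __name__ == "__main__":
    # Hypothetical sample data; the second "Alice" line is a duplicate.
    sample = ["Alice;France;30", "Bob;USA;25", "Alice;France;30"]
    for person in distinct(load_rows(sample)):
        print(person)
```

Note that this sketch keeps the first occurrence of each record in input order, which is a convenience of the example. A distinct operation in Pig only compares whole records, so two rows that differ in any single column are both kept, and the order of the output is not guaranteed.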