Reading the owner-pet sample data

Procedure

  1. Double-click the main tPigLoad component to open its Component view.
  2. Click the [...] button next to Edit schema to open the schema editor and click the [+] button three times to add three rows.
  3. In the Column column, rename the new rows to owner, pet, and age, respectively, and in the Type column of the age row, select Integer.
  4. Click OK to validate these changes and accept the schema propagation when prompted by the pop-up dialog box.
  5. In the Mode area, select Map/Reduce so that the Job runs on the remote Hadoop cluster.
  6. In the Distribution and the Version lists, select the Hadoop distribution you are using. In this example, HortonWorks Data Platform V2.1.0 (Baikal) is selected.
  7. In the Load function list, select PigStorage. The parameters to be set for this function then appear.
  8. In the NameNode URI and the Resource manager fields, enter the locations of those services, respectively. If you are using WebHDFS, the location should be webhdfs://masternode:portnumber; if WebHDFS is secured with SSL, the scheme should be swebhdfs and you need to add a tLibraryLoad component to the Job to load the library required by secured WebHDFS.
  9. Select the Set Resourcemanager scheduler address check box and enter the URI of this service in the field that is displayed. This allows you to use the scheduler service defined in your Hadoop cluster. If this service is not defined in your cluster, you can ignore this step.
  10. In the User name field, enter the name of a user with the appropriate rights to access data in the cluster. In this example, it is hdfs.
  11. In the Input file URI field, enter the path pointing to the relation you need to read data from. As explained previously, the relation to be read here is the one containing the owner and pet sample data.
  12. In the Field separator field, enter the separator of the data to be read. In this example, it is a semicolon (;). For reference, these settings map onto the Pig Latin sketch shown after this procedure.
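
These component settings correspond roughly to a plain Pig Latin LOAD statement, which can be helpful when checking the configuration. The sketch below is a minimal illustration, not generated by the Studio: the input path /user/hdfs/pets/owner_pet_data and the relation name owner_pet are hypothetical placeholders, while PigStorage, the semicolon separator, and the owner/pet/age schema come from the steps above.

  -- Load the semicolon-separated data with the three-column schema
  -- defined in the schema editor. The path below is a placeholder for
  -- the URI entered in the Input file URI field.
  owner_pet = LOAD '/user/hdfs/pets/owner_pet_data'
              USING PigStorage(';')
              AS (owner:chararray, pet:chararray, age:int);

  -- Print the relation to verify that the data is read as expected.
  DUMP owner_pet;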