Configuring tHDFSInput

Deduplication

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Data Services Platform
Talend ESB
Talend Open Studio for Big Data
Talend Big Data
Talend Open Studio for ESB
Talend Big Data Platform
Talend Real-Time Big Data Platform
Talend Open Studio for Data Integration
Talend Open Studio for MDM
Talend Data Management Platform
Talend Data Integration
Talend MDM Platform
Talend Data Fabric
task
Data Quality and Preparation > Third-party systems > Data Quality components > Deduplication components
Design and Development > Third-party systems > Data Quality components > Deduplication components
Data Governance > Third-party systems > Data Quality components > Deduplication components
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tHDFSInput to open its Component view.
  2. Click the [...] button next to Edit schema to verify that the schema received in the earlier steps is properly defined.
    If you are creating this Job from scratch, click the [+] button to add these schema columns manually. If the schema has already been defined in Repository, select the Repository option from the Schema list in the Basic settings view to reuse it instead. For further information about how to define a schema in Repository, see the chapter describing metadata management in the Talend Studio User Guide or the chapter describing the Hadoop cluster node in Repository of the Getting Started Guide.
  3. If you change the schema, click OK to validate the changes, then accept the propagation when the pop-up dialog box prompts you.
  4. In the Folder/File field, enter the path to the source file you need the Job to read, or browse to it.
    If this file is not yet in the HDFS system to be used, you must first place it there, for example, by using tFileInputDelimited and tHDFSOutput in a Standard Job.
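If you prefer to stage the source file outside Talend Studio, the following is a minimal sketch of one possible alternative to the tFileInputDelimited and tHDFSOutput approach. It assumes that the standard Hadoop hdfs command-line client is installed and configured for the target cluster on the machine that runs the script. The function names hdfs_put_commands and upload are hypothetical helpers introduced only for this illustration, and the paths in the usage note are placeholders.

```python
import subprocess


def hdfs_put_commands(local_path, hdfs_dir):
    """Build the `hdfs dfs` commands that stage a local file in HDFS.

    Hypothetical helper for illustration only. It returns the commands
    without running them, so they can be reviewed first.
    """
    return [
        # Create the target directory (and any missing parents) in HDFS.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local file into that directory; -f overwrites an existing copy.
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],
    ]


def upload(local_path, hdfs_dir):
    """Run the staging commands; assumes the `hdfs` client is on the PATH."""
    for cmd in hdfs_put_commands(local_path, hdfs_dir):
        # check=True raises CalledProcessError if a command fails.
        subprocess.run(cmd, check=True)
```

For example, calling upload("/tmp/customers.csv", "/user/talend/dedup/in") with these placeholder paths would copy the file into that HDFS directory. You could then enter the resulting HDFS file path in the Folder/File field.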