Editing the converted Job - 7.1

Talend Data Fabric Getting Started Guide

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Data Fabric
Update the components where necessary to finalize a data transformation process that runs in the Spark framework.

Before you begin

  • Ensure that the client machine on which the Talend Jobs are executed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.

    The Cloudera CDH V5.5 cluster used in this use case integrates Spark by default.

  • The administrator of the cluster has given read/write rights and permissions to the username to be used for the access to the related data and directories in HDFS.
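
The hosts-file prerequisite above can be checked before running the Job. The following is a minimal sketch and not part of Talend: the helper names parse_hosts and missing_hosts are hypothetical, the second host name (datanode1.example) is invented for illustration, and the sample content reuses the example mapping entry with its address left elided.

```python
def parse_hosts(text):
    """Return a dict mapping each host name in hosts-file text to its address."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Host names are case-insensitive; keep the first entry found.
            mapping.setdefault(name.lower(), address)
    return mapping


def missing_hosts(text, required):
    """Return the required host names that have no entry in the hosts-file text."""
    known = parse_hosts(text)
    return [name for name in required if name.lower() not in known]


# Sample content mirroring the example above (the address is elided on purpose).
SAMPLE = """
127.0.0.1 localhost
192.168.x.x talend-cdh550.weave.local  # Hadoop Namenode
"""

# datanode1.example is a hypothetical node name used only to show a missing entry.
print(missing_hosts(SAMPLE, ["talend-cdh550.weave.local", "datanode1.example"]))
```

On a real client machine, the text would be read from the hosts file itself, typically /etc/hosts on Linux or C:\Windows\System32\drivers\etc\hosts on Windows.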

Procedure

  1. In the Repository, double-click the aggregate_movie_director_spark_batch Job to open it in the workspace.

    A tHDFSConfiguration component has been added automatically and inherits the configuration for the connection to HDFS from the original MapReduce Job.

    The icons displayed on some components indicate that these components, which are used in the original Job, do not exist in the current Job framework, Spark Batch. In this example, they are tHDFSInput and tHDFSOutput.

  2. Click tHDFSInput to select it, then click OK to close the Warning window that pops up.
  3. Press Delete on your keyboard to remove tHDFSInput.
  4. In the Job workspace, type tFileInputDelimited and select this component from the list that appears.

    tFileInputDelimited is added to the workspace.

  5. Do the same to replace tHDFSOutput with tFileOutputDelimited.
  6. In the Repository, expand the Hadoop cluster node under the Metadata node, then the my_cdh connection node and its child node, to display the movies schema metadata node that you set up under the HDFS folder, as explained in Preparing file metadata.
  7. Drop this schema metadata node onto the new tFileInputDelimited component in the workspace of the Job.
  8. Right-click this tFileInputDelimited component, then from the contextual menu, select Row > Main and click tMap to connect the component to tMap.
  9. Right-click tMap, then from the contextual menu, select Row > out1 and click the new tFileOutputDelimited to connect tMap to this component.
  10. Double-click the new tFileOutputDelimited component to open its Component view.
  11. In the Folder field, enter or browse to the directory in which to write the result. In this scenario, it is /user/ychen/output_data/spark_batch/out, which receives the records that contain the names of the movie directors.
  12. Select the Merge result to single file check box in order to merge the part- files typically generated by Spark into one single file.

    The Merge file path field is displayed.

  13. In the Merge file path field, enter or browse to the file into which you want the part- files to merge.

    In this scenario, this file is /user/ychen/output_data/spark_batch/out/merged.

  14. Double-click the other tFileOutputDelimited component, the one that receives the reject link from tMap, to open its Component view.
  15. In the Folder field, set the directory to /user/ychen/output_data/spark_batch/reject.
  16. In the Run view, click the Spark Configuration tab to verify that the Hadoop/Spark connection metadata has been properly inherited from the original Job.

    You always need to use this Spark Configuration tab to define the connection to a given Hadoop/Spark distribution for the whole Spark Batch Job. This connection is effective on a per-Job basis.

  17. If you are not sure that the Spark cluster is able to resolve the hostname of the machine where the Job is executed, select the Define the driver hostname or IP address check box and in the field that is displayed, enter the IP address of this machine.

    If you leave this check box clear, the Spark cluster looks for the Spark driver at 127.0.0.1, that is to say, on a machine within the cluster itself.

  18. Press F6 to run the Job.
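
To illustrate what merging the part- files into a single file amounts to, the following minimal sketch concatenates part- files on the local file system. It is an analogy under stated assumptions, not how the tFileOutputDelimited component is implemented: the function name merge_part_files and the two sample files are hypothetical, and a temporary local directory stands in for the HDFS output folder.

```python
import shutil
import tempfile
from pathlib import Path


def merge_part_files(folder, merged_path):
    """Concatenate every part-* file in folder, in name order, into merged_path."""
    parts = sorted(p for p in Path(folder).glob("part-*") if p.is_file())
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Append the raw bytes of this part file to the merged file.
                shutil.copyfileobj(src, out)
    return len(parts)


# Small local demonstration with two made-up part files.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "out"
    out_dir.mkdir()
    (out_dir / "part-00000").write_text("row1\n")
    (out_dir / "part-00001").write_text("row2\n")
    merged = out_dir / "merged"
    count = merge_part_files(out_dir, merged)
    merged_text = merged.read_text()
    print(count, repr(merged_text))
```

As in the scenario, the merged file sits in the same folder as the part- files; its name does not start with part-, so it is not picked up as an input.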

Results

The Run view is automatically opened in the lower part of the Studio and shows the execution progress of this Job.

Once the execution is complete, you can check, for example in the web console of your HDFS system, that the output has been written in HDFS.
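
As an alternative to the web console, the output folder can be listed through the WebHDFS REST API, if it is enabled on the cluster. The following is a minimal sketch under several assumptions: the helper names liststatus_url and file_names are hypothetical, port 50070 is the usual NameNode HTTP port in Hadoop 2 but may differ on your cluster, the user name ychen is inferred from the output path, and the sample response is hand-written in the LISTSTATUS shape rather than captured from a cluster.

```python
import json
from urllib.parse import quote, urlencode


def liststatus_url(namenode, port, hdfs_path, user):
    """Build a WebHDFS LISTSTATUS URL for an HDFS directory."""
    query = urlencode({"op": "LISTSTATUS", "user.name": user})
    return f"http://{namenode}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def file_names(liststatus_json):
    """Extract the entry names from a WebHDFS LISTSTATUS JSON response."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [entry["pathSuffix"] for entry in statuses]


# Host and folder come from the scenario; port and user are assumptions.
url = liststatus_url("talend-cdh550.weave.local", 50070,
                     "/user/ychen/output_data/spark_batch/out", "ychen")
print(url)

# Hand-written sample in the LISTSTATUS response shape; not real cluster output.
SAMPLE_RESPONSE = '{"FileStatuses": {"FileStatus": [{"pathSuffix": "merged", "type": "FILE"}]}}'
print(file_names(SAMPLE_RESPONSE))
```

Sending an HTTP GET request to the printed URL from a machine that can reach the NameNode would return a JSON document of this shape, which file_names can then reduce to a list of entry names.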