How to set up data lineage with Cloudera Navigator - 6.5

Talend Big Data Studio User Guide

Talend MapReduce Jobs and Spark Jobs support Cloudera Navigator.

If you are using Cloudera V5.5+ to run your Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how it was generated by a MapReduce or Spark Job, including the components used in the Job and the schema changes between those components.

For example, assume that you have designed the following MapReduce Job and you want to generate lineage information about it:

You need to proceed as follows:

  1. Click Run to open its view and then click the Hadoop configuration tab. For a Spark Job, use the Spark configuration tab instead.

  2. From the Distribution list, select Cloudera and from the Version list, select Cloudera 5.5.

    Then the Use Cloudera Navigator check box is displayed.

    Select this check box and then set the following parameters:

    • Username and Password: these are the credentials you use to connect to your Cloudera Navigator.

    • Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.

    • Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

      Because this option forces Cloudera Navigator to generate lineages for all of its available entities, such as HDFS files and directories, Hive queries, or Pig scripts, it slows down the Job and is not recommended in a production environment.

    • Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

      Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

      This feature is meant to make testing your Job easier and is not recommended for a production cluster.
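
The options above are set in the Studio UI, but it can help to see them side by side as plain data. The following is a minimal sketch in Python. The class name, field names, function names, and example URLs are all hypothetical and are not part of any Talend or Cloudera API. It only restates the guidance given for each option.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NavigatorSettings:
    """Hypothetical mirror of the Studio options; not a Talend API."""
    username: str
    password: str
    navigator_url: str           # "Cloudera Navigator URL"
    metadata_url: str            # "Cloudera Navigator Metadata URL"
    autocommit: bool = False     # "Activate the autocommit option"
    kill_on_failure: bool = False          # "Kill the job if Cloudera Navigator fails"
    disable_ssl_validation: bool = False   # "Disable SSL validation"


def production_warnings(settings: NavigatorSettings) -> List[str]:
    """Return the options that the guide advises against in production."""
    warnings = []
    if settings.autocommit:
        warnings.append("autocommit regenerates lineage for all entities and slows the Job")
    if settings.disable_ssl_validation:
        warnings.append("SSL validation is disabled; intended for testing only")
    return warnings


def on_navigator_failure(settings: NavigatorSettings) -> str:
    """Describe what happens to the Job when the Navigator connection fails."""
    return "stop" if settings.kill_on_failure else "continue"


# Placeholder values for illustration only.
test_settings = NavigatorSettings(
    username="navigator_user",
    password="navigator_password",
    navigator_url="http://navigator.example.com:7187",
    metadata_url="http://navigator.example.com:7187",
    autocommit=True,
    disable_ssl_validation=True,
)
```

With these placeholder values, `production_warnings(test_settings)` returns two warnings and `on_navigator_failure(test_settings)` returns `"continue"`, because the kill option is left clear.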

The connection to Cloudera Navigator is now set up. The next time you run this Job, the lineage is automatically generated in Cloudera Navigator.

Note that you still need to configure the other parameters in the Hadoop configuration tab in order to successfully run the Job. For further information, see the example for a MapReduce Job in the Getting Started Guide of the Studio.

When the execution of the Job is done, perform a search in Cloudera Navigator for the data written by this Job and see the lineage of this data in Cloudera Navigator.

If you compare this lineage graph with the Job in the Studio, you can see that every component is presented in this graph and you can expand the icon of each component to read the schema it uses.
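
The search step described above is normally done in the Cloudera Navigator web interface. If you prefer to script it, the sketch below builds a search URL for the Navigator metadata REST interface. It assumes that your Navigator version exposes an `entities` search endpoint under `/api/<version>/` that accepts a `query` parameter, and that `originalName` is a searchable property. Check both against the API documentation of your Navigator release. The host, port, API version, and file name are placeholders, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_entity_search_url(metadata_url: str, api_version: str, query: str) -> str:
    """Build a search URL for Navigator entities (assumed endpoint layout)."""
    base = metadata_url.rstrip("/")
    # urlencode percent-encodes the query so it is safe to place in a URL.
    return f"{base}/api/{api_version}/entities/?{urlencode({'query': query})}"


# Placeholder values: replace with your Navigator Metadata URL, the API
# version of your release, and the name of the data written by the Job.
search_url = build_entity_search_url(
    "http://navigator.example.com:7187",
    "v9",
    "originalName:out.csv",
)
print(search_url)
```

You would then send this URL with the same username and password that you entered in the Studio, using the HTTP client of your choice.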