Defining data lineage with Cloudera Navigator - Cloud

If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how that data flow was generated by a Job.

This lineage includes the components used in this Job and the schema changes between the components.

This type of Job is available only if you have subscribed to any Talend product with Big Data or to Talend Data Fabric.

Procedure

In the configuration view of the Run tab (the Hadoop configuration view for a MapReduce Job, or the Spark configuration view for a Spark Batch Job), select the Use Cloudera Navigator check box.

With this option activated, you need to set the following parameters:

Username and Password: enter the credentials you use to connect to your Cloudera Navigator.
Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.
Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.
Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

Because this option forces Cloudera Navigator to generate the lineage of all its available entities, such as HDFS files and directories, Hive queries, or Pig scripts, it slows the Job and is not recommended in a production environment.
Kill the Job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

Otherwise, leave it clear to allow your Job to continue to run.
Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

This feature is meant to facilitate testing your Job and is not recommended for use in a production cluster.
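
The parameters above can be summarized in a small sketch. It is not part of the Studio or of any Cloudera API: the NavigatorSettings class, the production_warnings function, and all host names, port numbers, and credentials are hypothetical placeholders. The sketch only illustrates which options the text above advises against in a production environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NavigatorSettings:
    """Hypothetical container mirroring the fields and check boxes described above."""
    username: str
    password: str
    navigator_url: str
    metadata_url: str
    autocommit: bool = False              # "Activate the autocommit option"
    kill_job_on_failure: bool = False     # "Kill the Job if Cloudera Navigator fails"
    disable_ssl_validation: bool = False  # "Disable SSL validation"


def production_warnings(settings: NavigatorSettings) -> List[str]:
    """Return one warning per option that is not recommended in production."""
    warnings = []
    if settings.autocommit:
        warnings.append("autocommit generates lineage for all entities and slows the Job")
    if settings.disable_ssl_validation:
        warnings.append("SSL validation is disabled; intended for testing only")
    return warnings


# Placeholder values (assumptions for illustration only).
settings = NavigatorSettings(
    username="navigator_user",
    password="change-me",
    navigator_url="http://navigator.example.com:7187/",
    metadata_url="http://navigator.example.com:7187/api/",
    autocommit=True,
    disable_ssl_validation=True,
)

for message in production_warnings(settings):
    print("WARNING:", message)
```

With both test-oriented options selected, as in this example, the sketch reports two warnings. With the default values it reports none.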

Results

When you run this Job, the lineage will be automatically generated in Cloudera Navigator.

When the execution of the Job is done, perform a search in Cloudera Navigator for the data written by this Job and see the lineage of this data in Cloudera Navigator.
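
If you prefer to script the search rather than use the Cloudera Navigator web interface, the following sketch builds a search URL. It assumes that your Navigator Metadata Server exposes a REST entities search endpoint under /api/v<N>/ that accepts a query parameter. The host, port, API version, and entity name are placeholders, and build_entity_search_url is a hypothetical helper, so check the endpoint against the API documentation of your Navigator version before relying on it. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode


def build_entity_search_url(base_url: str, api_version: int, query: str) -> str:
    """Build a search URL for an assumed /api/v<N>/entities endpoint."""
    base = base_url.rstrip("/")
    # urlencode percent-encodes reserved characters such as ':' in the query.
    params = urlencode({"query": query})
    return f"{base}/api/v{api_version}/entities?{params}"


# Placeholder host, API version, and entity name (assumptions).
url = build_entity_search_url(
    "http://navigator.example.com:7187",
    9,
    "originalName:customers_out",
)
print(url)
```

The resulting URL could then be requested with the Username and Password you configured for the Job, and the returned entities inspected for the data written by the Job.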