If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how that data flow was generated by a Job.
This lineage includes the components used in this Job and the schema changes between the components.
With this option activated, you need to set the following parameters:
Username and Password: these are the credentials you use to connect to your Cloudera Navigator.
Cloudera Navigator URL: enter the location of the Cloudera Navigator to connect to.
Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.
Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.
Because this option forces Cloudera Navigator to generate lineages for all of its available entities, such as HDFS files and directories, Hive queries, or Pig scripts, it slows the Job and is not recommended for a production environment.
Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.
Otherwise, leave it clear to allow your Job to continue to run.
Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.
This feature is meant to make testing your Job easier and is not recommended for use in a production cluster.
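The parameters above can be summarized in a minimal sketch. This is not how the Studio stores or applies these settings; the class name NavigatorSettings, its field names, and the two helper functions are hypothetical and only illustrate which options the text advises against in production and how the "Kill the job if Cloudera Navigator fails" choice affects the Job.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NavigatorSettings:
    """Hypothetical container mirroring the parameters described above."""
    username: str
    password: str
    navigator_url: str
    metadata_url: str
    autocommit: bool = False
    kill_job_on_failure: bool = False
    disable_ssl_validation: bool = False


def production_warnings(settings: NavigatorSettings) -> List[str]:
    """Return one warning per option that is not recommended in production."""
    warnings = []
    if settings.autocommit:
        warnings.append(
            "autocommit generates lineages for all available entities and slows the Job"
        )
    if settings.disable_ssl_validation:
        warnings.append("SSL validation is disabled; intended for testing only")
    return warnings


def on_navigator_failure(settings: NavigatorSettings) -> str:
    """Return what the Job does when the connection to Cloudera Navigator fails."""
    return "stop" if settings.kill_job_on_failure else "continue"
```

For example, a settings object with autocommit selected and the other check boxes left clear yields a single warning, and the Job continues to run if the connection to Cloudera Navigator fails.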
When you run this Job, the lineage is automatically generated in Cloudera Navigator.
Once the Job has finished, search Cloudera Navigator for the data written by this Job to view the lineage of that data.
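The search can also be prepared outside the Cloudera Navigator interface. The sketch below only builds a search URL and sends no request. It assumes that your Navigator version exposes an entity search endpoint named entities under a versioned API base URL and that HDFS entities can be matched on a fileSystemPath field; check both against the API documentation of your installation. The host, port, API version, and output path are hypothetical.

```python
from urllib.parse import urlencode


def build_entity_search_url(api_base_url: str, file_path: str, limit: int = 10) -> str:
    """Build a search URL for the entity that corresponds to an HDFS path.

    api_base_url is assumed to already contain the API version,
    for example http://navigator.example.com:7187/api/v9 (hypothetical).
    """
    # Assumed field:value query syntax; the field name is an assumption.
    query = 'fileSystemPath:"{}"'.format(file_path)
    params = urlencode({"query": query, "limit": limit})
    return "{}/entities?{}".format(api_base_url.rstrip("/"), params)


if __name__ == "__main__":
    # Hypothetical Navigator location and Job output directory.
    print(build_entity_search_url(
        "http://navigator.example.com:7187/api/v9", "/user/talend/out"
    ))
```

The resulting URL would be requested with the same username and password that the Job uses to connect to Cloudera Navigator.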