
Setting up data lineage with Cloudera Navigator

Talend Spark Jobs support Cloudera Navigator.

If you are using Cloudera V5.5+ to run your Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how that data flow was generated by a Spark Job, including the components used in the Job and the schema changes between the components.

If you are using CDP Private Cloud Base or CDP Public Cloud to run your Jobs, it is recommended to use Apache Atlas. If you are using a CDP dynamic distribution, Apache Atlas is used rather than Cloudera Navigator. For more information, see Setting up data lineage with Atlas.

For example, assume that you have designed the following Job and you want to generate lineage information about it:

Spark Jobs running with MapReduce.

Procedure

  1. Click Run to open its view and then click the Hadoop configuration tab (for a Spark Job, use the Spark configuration tab instead).
  2. From the Distribution list, select Cloudera and from the Version list, select Cloudera 5.5. The Use Cloudera Navigator check box is then displayed. Select this check box.

    With this option activated, you need to set the following parameters:

    • Username and Password: these are the credentials you use to connect to your Cloudera Navigator.

    • Cloudera Navigator URL: enter the location of the Cloudera Navigator to connect to.

    • Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

      Because this option forces Cloudera Navigator to generate the lineages of all its available entities, such as HDFS files and directories, Hive queries or Pig scripts, it slows down the Job and is not recommended for production environments.

    • Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails. Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

      This feature is meant to make it easier to test your Job and is not recommended for use in a production cluster.
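
The recommendations attached to these parameters can be summarized as a small pre-flight check. The following Python snippet is only an illustrative sketch and is not part of Talend Studio or of any Cloudera library: the NavigatorSettings class, the review_settings function and the example host names are hypothetical names introduced here. It assumes you keep a copy of the values you enter in the Spark configuration tab and want to flag the options that the procedure advises against in production.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class NavigatorSettings:
    """Hypothetical container mirroring the fields described in the procedure."""
    username: str
    password: str
    navigator_url: str
    metadata_url: str
    autocommit: bool = False
    kill_job_on_failure: bool = False
    disable_ssl_validation: bool = False


def review_settings(settings: NavigatorSettings, production: bool) -> List[str]:
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []

    # Both locations are expected to be plain http(s) URLs.
    for label, url in (
        ("Cloudera Navigator URL", settings.navigator_url),
        ("Cloudera Navigator Metadata URL", settings.metadata_url),
    ):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            warnings.append(f"{label} does not look like an http(s) URL: {url!r}")

    if not settings.username or not settings.password:
        warnings.append("Username and Password must both be set")

    # The two options below are described as unsuitable for production.
    if production and settings.autocommit:
        warnings.append("Autocommit slows the Job; not recommended in production")
    if production and settings.disable_ssl_validation:
        warnings.append("Disabling SSL validation is not recommended in production")

    return warnings


if __name__ == "__main__":
    # Placeholder values; replace them with those of your own cluster.
    example = NavigatorSettings(
        username="navigator_user",
        password="secret",
        navigator_url="http://navigator.example.com:7187",
        metadata_url="http://navigator.example.com:7187/api/v9/",
        autocommit=True,
    )
    for message in review_settings(example, production=True):
        print("WARNING:", message)
```

Whether Kill the job if Cloudera Navigator fails should be selected depends on how critical the lineage is to you, so the sketch deliberately does not flag it either way.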

Results

The connection to Cloudera Navigator is now set up. Each time you run this Job, the lineage is automatically generated in Cloudera Navigator.

Note that you still need to configure the other parameters in the Spark configuration tab in order to successfully run the Job.

When the execution of the Job is done, perform a search in Cloudera Navigator for the data written by this Job and see the lineage of this data in Cloudera Navigator.

If you compare this lineage graph with the Job in Talend Studio, you can see that every component is represented in the graph, and you can expand the icon of each component to read the schema it uses.

Lineage graph in Cloudera Navigator.
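
The search for the data written by the Job can also be scripted instead of being done in the Navigator user interface. The following Python snippet is a hedged sketch that only builds the search URL and does not contact any server. It assumes that the Navigator Metadata URL ends with an API path such as /api/v9/, that an entities resource accepting a query parameter is available under it, and that originalName is a searchable field; check these assumptions against the Cloudera Navigator API documentation for your version. The build_entity_search_url function, the host name and the file name are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_entity_search_url(metadata_url: str, query: str) -> str:
    """Build a search URL under the Navigator Metadata URL (assumed layout)."""
    # Ensure a trailing slash so urljoin appends to the path instead of replacing it.
    base = metadata_url if metadata_url.endswith("/") else metadata_url + "/"
    # urlencode percent-encodes the query string (for example ':' becomes '%3A').
    return urljoin(base, "entities") + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Placeholder host, API version and file name; adapt them to your cluster.
    url = build_entity_search_url(
        "http://navigator.example.com:7187/api/v9",
        "originalName:out.csv",
    )
    print(url)
```

The resulting URL can then be requested with any HTTP client, using the same Username and Password that you entered in the Spark configuration tab.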

Cloudera Navigator relies on a Cloudera SDK library to provide its functionality, so the Navigator version must be compatible with the version of this SDK library. The version of your Cloudera Navigator is determined by the Cloudera Manager installed with your Cloudera distribution, and the compatible SDK is used automatically based on the version of your Navigator.

However, not every Cloudera Navigator version has a compatible SDK version. For more details about the Cloudera SDK versions and their compatible Navigator versions, see the Cloudera documentation about Cloudera Navigator SDK Version Compatibility.

For information about Cloudera Navigator versions supported by Talend Studio, see Supported Cloudera Navigator versions for Talend Jobs.
