Setting up data lineage with Atlas - Cloud - 8.0

Talend Studio User Guide

Available in: Big Data, Big Data Platform, Cloud Big Data, Cloud Big Data Platform, Cloud Data Fabric, Data Fabric, Real-Time Big Data Platform

Support for Apache Atlas has been added to Talend Spark Jobs.

If you are using Hortonworks Data Platform V2.4 onwards to run your Jobs and Apache Atlas is installed in your Hortonworks cluster, you can use Atlas to trace the lineage of a given data flow and discover how that data was generated by a Spark Job, including the components used in the Job and the schema changes between those components. If you are using CDP Private Cloud Base or CDP Public Cloud to run your Jobs and Apache Atlas is installed in your cluster, you can use Atlas in the same way.

If you are using a CDP dynamic distribution, the Use Atlas check box replaces the Use Cloudera Navigator check box when you have installed the 8.0.1-R2023-06 Talend Studio monthly release, or a later release delivered by Talend.

Talend Studio supports the following Atlas versions, depending on the Hortonworks Data Platform version you are using:
  • Hortonworks Data Platform V2.4: Atlas 0.5 only.
  • Hortonworks Data Platform V2.5: Atlas 0.7 only.
  • Hortonworks Data Platform V3.14: Atlas 1.1 only.

For example, assume that you have designed the following Spark Batch Job and you want to generate lineage information about it in Atlas:

(Image: the Spark Batch Job opened in the design workspace.)

In this Job, tRowGenerator is used to generate the input data, tMap and tSortRow are used to process it, and the other components output the data in different formats.


  1. Click Run to open its view, then click the Spark configuration tab.
  2. From the Distribution list and the Version list, select your Hortonworks distribution. The Use Atlas check box is then displayed.

    Select the Use Atlas check box. With this option activated, you need to set the following parameters:

    • Atlas URL: enter the location of the Atlas instance to connect to. It is often http://name_of_your_atlas_node:port.

    • In the Username and Password fields, enter the authentication information for access to Atlas.

    • Set Atlas configuration folder: if your Atlas cluster uses custom properties, such as SSL or a read timeout, select this check box and, in the field that is displayed, enter the path to a directory on your local machine, then place your Atlas configuration file in this directory. This way, your Job can use these custom properties.

      You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.

    • Die on error: select this check box to stop the Job execution when an Atlas-related issue occurs, such as a connection failure to Atlas. Otherwise, leave it clear so that your Job continues to run.
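
As an illustration of the configuration file mentioned above, an Atlas client configuration file (atlas-application.properties) covering SSL and read-timeout settings might look like the following. The values shown are placeholders; the actual properties and values must come from your cluster administrator:

```properties
# atlas-application.properties (example values only; obtain the real file
# from your cluster administrator)
atlas.enableTLS=true
truststore.file=/path/to/truststore.jks
atlas.client.connectTimeoutMSecs=60000
atlas.client.readTimeoutMSecs=60000
```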


The connection to Atlas is now set up. The next time you run this Job, the lineage information will be automatically generated in Atlas.

Note that you still need to configure the other parameters in the Spark configuration tab in order to successfully run the Job. For further information, see Creating Spark Batch Jobs.

Once the Job execution is complete, search in Atlas for the lineage information written by this Job and read the lineage there.
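
Besides the Atlas web UI, the lineage can also be retrieved programmatically through the Atlas v2 REST API. The sketch below is illustrative, not part of Talend Studio: the host, port, and credentials are placeholders, and the helper names (`lineage_path`, `edge_pairs`) are hypothetical. It assumes a reachable Atlas instance with basic authentication.

```python
import base64
import json
import urllib.request

# Placeholders: replace with the values from your Spark configuration tab.
ATLAS_URL = "http://name_of_your_atlas_node:21000"
USER, PASSWORD = "admin", "admin"

def atlas_get(path):
    """Issue an authenticated GET request against the Atlas REST API."""
    req = urllib.request.Request(ATLAS_URL + path)
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def lineage_path(guid, direction="BOTH", depth=3):
    """Build the Atlas v2 lineage endpoint path for an entity GUID."""
    return f"/api/atlas/v2/lineage/{guid}?direction={direction}&depth={depth}"

def edge_pairs(lineage):
    """Flatten a lineage response's relations into (fromGuid, toGuid) pairs."""
    return [(r["fromEntityId"], r["toEntityId"])
            for r in lineage.get("relations", [])]

# Example usage (requires a running Atlas instance):
# result = atlas_get("/api/atlas/v2/search/basic?query=my_spark_job&limit=10")
# guid = result["entities"][0]["guid"]
# print(edge_pairs(atlas_get(lineage_path(guid))))
```

The `relations` list in a lineage response links entity GUIDs from inputs to outputs, which is what the lineage graph in the Atlas UI renders visually.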