Defining data lineage with Atlas

If you are using Hortonworks Data Platform V2.4 onwards to run your MapReduce or Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks cluster, you can use Atlas to trace the lineage of a given data flow and discover how this data was generated by a Job.

This lineage includes the components used in the Job and the schema changes between these components.

This type of Job is available only if you have subscribed to any Talend product with Big Data or to Talend Data Fabric.

If you are using Hortonworks Data Platform V2.4, the Studio supports Atlas 0.5 only; if you are using Hortonworks Data Platform V2.5, the Studio supports Atlas 0.7 only.

Procedure

Open the configuration view of the Run tab (the Hadoop configuration view for a MapReduce Job, or the Spark configuration view for a Spark Batch Job) and select the Use Atlas check box.

With this option activated, you need to set the following parameters:

Atlas URL: enter the location of the Atlas instance to connect to. It is often http://name_of_your_atlas_node:port
Username and Password: enter the credentials used to access Atlas.
Set Atlas configuration folder: if your Atlas cluster uses custom properties such as SSL or read timeout, select this check box, enter a directory on your local machine in the field that is displayed, and then place the atlas-application.properties file of your Atlas in this directory. This enables your Job to use these custom properties.

You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.
Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as a failure to connect to Atlas.

Otherwise, leave it clear to allow your Job to continue to run.
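
The parameters above can be sanity-checked before you run the Job. The following is a minimal Python sketch, not part of the Studio: the function name check_atlas_settings and the example host are hypothetical. It only verifies that the Atlas URL has the http://name_of_your_atlas_node:port shape and that the configuration folder, if you set one, contains the atlas-application.properties file.

```python
from pathlib import Path
from urllib.parse import urlparse

# File name the Studio expects in the Atlas configuration folder.
PROPERTIES_FILE = "atlas-application.properties"


def check_atlas_settings(atlas_url, config_folder=None):
    """Return a list of problems found; an empty list means the settings look plausible.

    Hypothetical helper: it checks only the shape of the values and
    does not contact Atlas.
    """
    problems = []
    parsed = urlparse(atlas_url)

    if parsed.scheme not in ("http", "https"):
        problems.append("Atlas URL should start with http:// or https://")
    if not parsed.hostname:
        problems.append("Atlas URL has no host name")

    # urlparse raises ValueError when the port is not a valid number.
    try:
        port = parsed.port
    except ValueError:
        port = None
    if port is None:
        problems.append("Atlas URL has no valid port")

    # The folder is optional: it is needed only for custom properties.
    if config_folder is not None:
        if not (Path(config_folder) / PROPERTIES_FILE).is_file():
            problems.append(f"{PROPERTIES_FILE} not found in {config_folder}")

    return problems


if __name__ == "__main__":
    # Example host and port are placeholders, not defaults from the Studio.
    for problem in check_atlas_settings("http://atlas-node.example.com:21000"):
        print(problem)
```

An empty result does not guarantee that the connection will succeed; it only rules out the most common typing mistakes before the Job starts.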

Results

When you run this Job, the lineage is automatically generated in Atlas.

Once the Job has finished executing, search in Atlas for the lineage information written by this Job and read the lineage there.
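
If you prefer to script the search rather than use the Atlas web interface, a request URL can be built as in the Python sketch below. It assumes that your Atlas version exposes the legacy full-text discovery endpoint under /api/atlas/discovery/search/fulltext; check this against the REST documentation of your Atlas release. The function name, host and search text are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint path (legacy Atlas discovery API); verify it for your Atlas version.
FULLTEXT_SEARCH_PATH = "/api/atlas/discovery/search/fulltext"


def fulltext_search_url(atlas_url, text):
    """Build the URL of a full-text search request against Atlas.

    atlas_url is the value entered in the Atlas URL field;
    text is what to search for, for example the name of the Job.
    """
    base = atlas_url.rstrip("/")  # avoid a double slash before the path
    return f"{base}{FULLTEXT_SEARCH_PATH}?{urlencode({'query': text})}"


if __name__ == "__main__":
    # Placeholder host and Job name.
    print(fulltext_search_url("http://atlas-node.example.com:21000", "my_job"))
```

Send the resulting URL with any HTTP client, using the same username and password that you entered in the Studio.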

In Atlas, the lineage written by a Job consists of two types of entities:

the Job itself
the components in the Job that use data schemas, such as tRowGenerator or tSortRow. Connection or configuration components such as tHDFSConfiguration are not taken into account because they do not use schemas.