How to set up data lineage with Atlas (Technical Preview) - 6.2

Talend Real-time Big Data Platform Studio User Guide

Support for Apache Atlas has been added to Talend MapReduce Jobs and Spark Jobs as a technical preview.

If you run your Jobs on Hortonworks Data Platform V2.4 and Apache Atlas is installed in your Hortonworks cluster, you can use Atlas to trace the lineage of a given data flow and discover how this data was generated by a MapReduce or Spark Job, including the components used in the Job and the schema changes between the components.

For example, assume that you have designed the following Spark Batch Job and you want to generate lineage information about it in Atlas:

In this Job, tHDFSConfiguration (labelled c55_docker_01_HDFS) defines the connection to HDFS, tRowGenerator generates the input data, tSortRow and tReplicate process the data, and the other components output the data in different formats.

You need to proceed as follows:

  1. Click Run to open its view and then click the Spark configuration tab. For a MapReduce Job, use the Hadoop configuration tab instead.

  2. From the Distribution list, select Hortonworks and from the Version list, select Hortonworks Data Platform V2.4.0.

    The Use Atlas check box is then displayed.

    Select this check box, then set the following parameters:

    • Atlas URL: enter the location of the Atlas instance to connect to. It usually takes the form http://name_of_your_atlas_node:port

    • Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas.

      Otherwise, leave it clear to allow your Job to continue to run.
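The two parameters above can be illustrated with a minimal Python sketch. It is not part of the Studio or of any Talend or Atlas API: the function names validate_atlas_url and on_atlas_error are hypothetical, and the host and port in the usage note are example values only. The sketch checks that a value follows the http://name_of_your_atlas_node:port form, and mimics the choice that Die on error represents, that is, stopping on an Atlas-related issue or continuing.

```python
from typing import Tuple
from urllib.parse import urlparse


def validate_atlas_url(url: str) -> Tuple[str, int]:
    """Return (host, port) if url has the form http://host:port.

    Hypothetical helper: raises ValueError when the scheme, the host
    or a numeric port is missing.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme in {url!r}")
    # parsed.port itself raises ValueError if the port is not a number.
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected http://host:port, got {url!r}")
    return parsed.hostname, parsed.port


def on_atlas_error(error: Exception, die_on_error: bool) -> bool:
    """Mimic the Die on error choice (assumed behaviour, for illustration).

    With die_on_error set, the error is re-raised so the run stops;
    otherwise the issue is reported and True is returned to continue.
    """
    if die_on_error:
        raise RuntimeError(f"Atlas-related issue, stopping: {error}") from error
    print(f"Atlas-related issue ignored: {error}")
    return True
```

For instance, validate_atlas_url("http://atlas-node.example.com:21000") returns the host and port as a pair, while the literal placeholder http://name_of_your_atlas_node:port is rejected because its port is not a number.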

The authentication information used by the Job is also used to access Atlas.

The connection to Atlas is now set up. When you run this Job, the lineage is automatically generated in Atlas.

Note that you still need to configure the other parameters in the Spark configuration tab to run the Job successfully. For further information, see the example for a Spark Batch Job in the Getting Started Guide of the Studio, or any scenario using Spark Batch Jobs in the Talend Components Reference Guide.

Once the Job has finished running, search Atlas for the lineage information written by this Job and read the lineage there.

Reading the lineage

In Atlas, the lineage written by a Job consists of two types of entities:

  • the Job itself

  • the components in the Job that use data schemas, such as tRowGenerator or tSortRow. Connection or configuration components such as tHDFSConfiguration are not taken into account, since these components do not use schemas.

The example Job therefore generates 10 entities, one for the Job and nine for the components, and automatically adds three different tags to these entities:

  • Talend for all the entities generated by the Job

  • TalendComponent for all the component entities

  • TalendJob for all the Job entities.
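The entity and tag rules above can be sketched in a few lines of Python. This is only a model of the rules as described here, not the way the Studio or Atlas represents entities: the Component and Entity classes, the lineage_entities function and the Job name example_job are all hypothetical. Only the components named in this section are used, so the sketch produces fewer entities than the full example Job.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Component:
    name: str
    uses_schema: bool  # False for connection or configuration components


@dataclass
class Entity:
    name: str
    tags: Set[str]


def lineage_entities(job_name: str, components: List[Component]) -> List[Entity]:
    """Apply the rules described above (illustrative model only)."""
    # One entity for the Job itself, tagged Talend and TalendJob.
    entities = [Entity(job_name, {"Talend", "TalendJob"})]
    for component in components:
        # Components that do not use schemas produce no entity.
        if component.uses_schema:
            entities.append(Entity(component.name, {"Talend", "TalendComponent"}))
    return entities


# Hypothetical Job name; only the components named in this section are listed.
example = lineage_entities(
    "example_job",
    [
        Component("tHDFSConfiguration", uses_schema=False),
        Component("tRowGenerator", uses_schema=True),
        Component("tSortRow", uses_schema=True),
        Component("tReplicate", uses_schema=True),
    ],
)
component_entities = [e.name for e in example if "TalendComponent" in e.tags]
```

Here component_entities lists the three schema-using components and leaves out tHDFSConfiguration. Applied to all nine schema-using components of the example Job, the same rules give the 10 entities mentioned above.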

You can directly click one of these tags in Atlas to display the corresponding entities. For example, the following entities are displayed when you click TalendComponent:

Then you can click any of them to see the lineage information the corresponding component contains. The following image shows how the data flow is handled after being generated by the tRowGenerator component: