Steps to profile an HDFS file - 6.3

Talend Real-time Big Data Platform Studio User Guide

From the Profiling perspective of the studio, you can generate a column analysis with simple statistics indicators on an HDFS file via a Hive connection.

Creating a profiling analysis on an HDFS file involves the following steps:

  1. Create a connection to a Hadoop cluster.

  2. Create a connection to a Hive server.

    This step is optional, because you are prompted to create the Hive connection while you create the connection to the HDFS file.

  3. Create a connection to an HDFS file.

    This step guides you through creating a Hive external table, which leaves the data in the file but creates a table definition in the Hive metastore. The studio can then run SQL queries on the data in the file via the Hive connection.

  4. Create a column analysis with simple indicators on the Hive table.

    You can then modify the analysis settings and add other indicators as needed. You can also create further analyses on this HDFS file later by reusing the same Hive table.
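
To make steps 3 and 4 more concrete, the following Python sketch builds the kind of HiveQL involved: a statement that defines an external table over a directory in HDFS, and a query that computes a few simple statistics (row count, null count, distinct count) on one column. It is only an illustration, not the SQL the studio actually generates. The table name, column names, HDFS directory, and the helper functions external_table_ddl and simple_statistics_query are all hypothetical, and the file is assumed to be comma-delimited text.

```python
# Illustrative sketch only. All names below (table, columns, HDFS directory)
# are hypothetical placeholders, not values produced by the studio.


def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a HiveQL statement defining an external table over files in hdfs_dir.

    An external table only registers a definition in the Hive metastore;
    the data stays where it is in HDFS.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


def simple_statistics_query(table, column):
    """Build a query returning row, null, and distinct counts for one column."""
    return (
        f"SELECT COUNT(*) AS row_count,\n"
        f"       SUM(CASE WHEN `{column}` IS NULL THEN 1 ELSE 0 END) AS null_count,\n"
        f"       COUNT(DISTINCT `{column}`) AS distinct_count\n"
        f"FROM `{table}`"
    )


# Assumed example: a comma-delimited customer file stored under /user/demo/customers.
ddl = external_table_ddl(
    "customers_ext",
    [("id", "INT"), ("name", "STRING"), ("email", "STRING")],
    "/user/demo/customers",
)
query = simple_statistics_query("customers_ext", "email")

print(ddl)
print()
print(query)
```

Because the table is external, dropping it in Hive removes only the metastore definition and leaves the HDFS file untouched, which is why the same file can back several analyses.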