Profiling an HDFS file - 7.1

Talend Data Management Platform Studio User Guide

Author: Talend Documentation Team
Version: 7.1
Product: Talend Data Management Platform
Task: Design and Development
Platform: Talend Studio

From the Profiling perspective of Talend Studio, you can generate a column analysis with simple statistics indicators on an HDFS file via a Hive connection.

Procedure

Creating a profiling analysis on an HDFS file involves the following steps:

  1. Create a connection to a Hadoop cluster.
  2. Create a connection to a Hive server.
    This step is optional: you are prompted to create the Hive connection while you create the connection to the HDFS file.
  3. Create a connection to an HDFS file.
    This step guides you through creating a Hive external table, which leaves the data in the file but adds a table definition to the Hive metastore. The studio can then run SQL queries on the data in the file via the Hive connection.
  4. Create a column analysis with simple indicators on the Hive table.
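
To make step 3 more concrete, the following minimal sketch builds the kind of HiveQL statement that defines an external table over delimited text data in HDFS. It is an illustration only, not the statement the studio wizard actually generates; the function name, the table name, the column list and the HDFS directory are all hypothetical.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text data.

    table     -- hypothetical Hive table name
    columns   -- list of (column_name, hive_type) pairs (assumed schema)
    hdfs_dir  -- HDFS directory holding the file; Hive's LOCATION clause
                 points at a directory, not at a single file
    delimiter -- field separator used in the file (assumed to be a comma)
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


if __name__ == "__main__":
    # Hypothetical example: a customers file stored under /user/talend/customers
    print(external_table_ddl(
        "customers",
        [("id", "INT"), ("name", "STRING")],
        "/user/talend/customers",
    ))
```

Because the table is declared EXTERNAL, dropping it later removes only the definition in the metastore; the data in HDFS is left in place.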

What to do next

You can then modify the analysis settings and add other indicators as needed. You can also create further analyses on this HDFS file later by reusing the same Hive table.
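
Because the table definition stays in the Hive metastore, any later analysis can query the same table. As a rough sketch of what a simple-statistics style query over one column might look like (an assumption for illustration, not the SQL the studio actually issues), the hypothetical helper below builds a HiveQL query returning a row count, a null count and a distinct count; the table and column names are placeholders.

```python
def simple_statistics_query(table, column):
    """Build a HiveQL query computing basic counts for one column.

    COUNT(*) counts all rows, COUNT(col) counts only non-null values,
    so their difference is the number of nulls.
    """
    col = f"`{column}`"
    return (
        "SELECT COUNT(*) AS row_count, "
        f"COUNT(*) - COUNT({col}) AS null_count, "
        f"COUNT(DISTINCT {col}) AS distinct_count "
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    # The same (hypothetical) table can be reused for different columns.
    for column in ("id", "name"):
        print(simple_statistics_query("customers", column))
```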