Profiling an HDFS file

From the Profiling perspective of Talend Studio, you can generate a column analysis with simple statistics indicators on an HDFS file via a Hive connection.

Procedure

Creating a profiling analysis on an HDFS file involves the following steps:

  1. Create a connection to a Hadoop cluster.
  2. Create a connection to a Hive server.
    This step is optional: you are prompted to create the Hive connection while you create the connection to the HDFS file.
  3. Create a connection to an HDFS file.
    This step guides you through creating a Hive external table. The data stays in the HDFS file, and only the table definition is stored in the Hive metastore, which lets Talend Studio run SQL queries on the file data through the Hive connection.
  4. Create a column analysis with simple indicators on the Hive table.
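
To make steps 3 and 4 more concrete, the following Python sketch builds the kind of HiveQL statements involved: an external table definition over a directory in HDFS, and a query that computes a few simple statistics on one column. It is an illustration only, not the statements Talend Studio generates. The table name, column names, HDFS path, delimiter, and both helper functions are hypothetical, and the chosen statistics (row count, null count, distinct count) are assumed examples of simple indicators.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement (hypothetical helper).

    columns is a list of (name, hive_type) pairs. hdfs_dir is the HDFS
    directory that holds the delimited text file; Hive LOCATION refers to
    a directory, and the data itself is left where it is.
    """
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'"
    )


def simple_stats_query(table, column):
    """Build a HiveQL query for a few simple statistics on one column."""
    return (
        f"SELECT COUNT(*) AS row_count, "
        f"SUM(CASE WHEN `{column}` IS NULL THEN 1 ELSE 0 END) AS null_count, "
        f"COUNT(DISTINCT `{column}`) AS distinct_count "
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    # Assumed example values; replace with your own schema and path.
    print(external_table_ddl(
        "customers",
        [("id", "INT"), ("name", "STRING")],
        "/user/talend/customers",
    ))
    print(simple_stats_query("customers", "name"))
```

Because the table is external, dropping it in Hive removes only the metastore definition and leaves the file in HDFS untouched.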

What to do next

You can then modify the analysis settings and add other indicators as needed. You can also create other analyses on this HDFS file later by reusing the same Hive table.
