Profiling an HDFS file - Cloud - 8.0

Talend Studio User Guide

Version: Cloud, 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-02-29
Available in: Big Data Platform, Cloud API Services Platform, Cloud Big Data Platform, Cloud Data Fabric, Cloud Data Management Platform, Data Fabric, Data Management Platform, Data Services Platform, MDM Platform, Real-Time Big Data Platform

From the Profiling perspective of Talend Studio, you can generate a column analysis with simple statistics indicators on an HDFS file via a Hive connection.

Procedure

Creating a profiling analysis on an HDFS file involves the following steps:

  1. Create a connection to a Hadoop cluster.
  2. Create a connection to a Hive server.
    This step is optional, because you are prompted to create the Hive connection while you create the connection to the HDFS file.
  3. Create a connection to an HDFS file.
    This step guides you through creating a Hive external table, which leaves the data in the file but adds a table definition to the Hive metastore. Talend Studio can then run SQL queries on the data in the file through the Hive connection.
  4. Create a column analysis with simple indicators on the Hive table.
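
To make steps 3 and 4 more concrete, the following Python sketch builds the kind of HiveQL involved: an external table definition over delimited data in HDFS, and a query that computes a few simple statistics (row count, null count, distinct count) for one column. This is an illustration only, not the SQL that Talend Studio generates. The function names, the table name customers_ext, the column names, and the path /user/talend/customers are all hypothetical, and the sketch assumes comma-delimited text data.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a HiveQL statement that exposes delimited HDFS data as an external table.

    Assumption: the data is delimited text. In Hive, LOCATION names the HDFS
    directory that holds the data, and dropping an external table removes only
    the metastore definition, not the underlying data.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


def simple_stats_query(table, column):
    """Build a HiveQL query for row count, null count and distinct count of one column."""
    col = f"`{column}`"
    return (
        f"SELECT COUNT(*) AS row_count,\n"
        f"       SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS null_count,\n"
        f"       COUNT(DISTINCT {col}) AS distinct_count\n"
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    # Hypothetical table, columns and HDFS path, used only for illustration.
    print(external_table_ddl(
        "customers_ext",
        [("id", "INT"), ("name", "STRING")],
        "/user/talend/customers",
    ))
    print()
    print(simple_stats_query("customers_ext", "name"))
```

The generated statements could be run through any Hive client to check that the table definition matches the file layout before the analysis is created in the Profiling perspective.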

What to do next

You can then modify the analysis settings and add other indicators as needed. You can also create further analyses on this HDFS file later by reusing the same Hive table.