Scenario: Working with Hive on an Amazon EMR cluster

EnrichVersion
Cloud
6.4
EnrichProdName
Talend Big Data
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data Platform
Talend Open Studio for Big Data
task
Data Governance > Third-party systems > Database components > Hive components
Design and Development > Third-party systems > Amazon services (Integration) > Amazon EMR components
Data Governance > Third-party systems > Amazon services (Integration) > Amazon EMR components
Design and Development > Designing Jobs > Hadoop distributions > Amazon EMR
Design and Development > Third-party systems > Database components > Hive components
Data Quality and Preparation > Third-party systems > Database components > Hive components
Data Quality and Preparation > Third-party systems > Amazon services (Integration) > Amazon EMR components
EnrichPlatform
Talend Studio

This article shows how to work with Hive on an Amazon EMR cluster.

This example uses Talend Real-Time Big Data Platform 6.1. In addition, it uses the following Amazon services:

  • Amazon EC2
  • Amazon EMR

    For more information about how to launch an Amazon EMR cluster from the Talend Studio, see Amazon EMR - Getting Started.

Create Hive connection metadata

This section shows how to define reusable metadata for connections to the Hive infrastructure hosted on your Amazon EMR cluster.

Before you begin

We assume that you have already launched an Amazon EMR 4.0.0 cluster and configured the cluster metadata in the Talend Repository.

Procedure

  1. From the Repository, right-click your cluster metadata and click Create Hive.
  2. Verify that the Login field contains hadoop and that the Server field contains the DNS name of your master node.
  3. Set the Port to 10000.
  4. Click Check to verify the connection to Hive.
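The same connection parameters can also be used outside the Studio, through HiveServer2's JDBC interface. As a minimal sketch, the wizard's Server and Port fields map onto a jdbc:hive2 URL as follows (the master node DNS name below is a hypothetical placeholder):

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL from the wizard's Server/Port fields."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# "ec2-xx-xx-xx-xx.compute-1.amazonaws.com" stands in for your master node DNS.
url = hive_jdbc_url("ec2-xx-xx-xx-xx.compute-1.amazonaws.com")
print(url)
```

Port 10000 is the HiveServer2 default, which is why the wizard expects that value.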

Create a Hive table

Before you begin

We assume that a file named CustomersData has already been written to HDFS; this file will be converted into a Hive table.

In the following example we use the Hive table creation wizard.

Procedure

  1. Switch to the Profiling perspective.
  2. From the DQ Repository, right-click your HDFS connection metadata and click Create Hive Table.
  3. In the browser, select the folder containing the file to be converted to a Hive table.
  4. Wait until the creation status changes to Success. Click Next.
  5. Update the table Name and Schema, as needed.
    In the current example, the table will be named CustomersTable and the existing Hive connection will be used.
  6. Click Finish to create the Hive table.
    Your table is created and appears in the DQ Repository under Metadata > DBConnections > HiveConnection > default.
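Under the hood, the wizard generates and runs a HiveQL CREATE TABLE statement against the cluster. A rough, simplified sketch of that kind of DDL is shown below; the column list and HDFS path are hypothetical, since the actual schema is read from the CustomersData file:

```python
def create_external_table_ddl(table, columns, location):
    """Sketch of an external-table DDL similar to what the wizard generates.

    `columns` is a list of (name, hive_type) pairs; `location` is the HDFS
    folder selected in the wizard.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return f"CREATE EXTERNAL TABLE {table} ({cols}) LOCATION '{location}'"

# Hypothetical schema and path for the CustomersData example.
ddl = create_external_table_ddl(
    "CustomersTable",
    [("id", "INT"), ("name", "STRING")],
    "/user/hadoop/CustomersData",
)
print(ddl)
```

An EXTERNAL table keeps the data files in place in HDFS, so dropping the table later does not delete the underlying CustomersData file.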

Running a Hive Table analysis

About this task

You can leverage your cluster's computation capabilities to run analyses on your Hive table.

Procedure

In the Profiling perspective, right-click a Hive table, then select the analysis you want to run on it.

Each analysis is sent to your cluster as a HiveQL query and runs as a MapReduce job.

The results are displayed in Talend Studio as charts or tables.
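For example, a basic column analysis boils down to HiveQL aggregations of this kind; the exact queries depend on the indicators you select, and the table and column names below are illustrative:

```python
def row_and_distinct_counts(table, column):
    """HiveQL for two common profiling indicators: row count and distinct count."""
    return (f"SELECT COUNT(*) AS row_count, "
            f"COUNT(DISTINCT {column}) AS distinct_count "
            f"FROM {table}")

# Hypothetical column name for the CustomersTable example.
print(row_and_distinct_counts("CustomersTable", "customer_id"))
```

Because the query runs on the cluster, only the small aggregated result travels back to the Studio, not the table data itself.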

For more information about other ways to work with tables, see the article Work with Amazon Relational Database Service (RDS).