Translating the scenario into Jobs - 6.4

Talend Big Data Platform Studio User Guide

This section describes how to set up connection metadata to be used in the example Jobs, and how to create, configure, and execute the Jobs to get the expected result of this scenario.

Setting up connection metadata to be used in the Jobs

In this scenario, an HDFS connection and an HCatalog connection are used repeatedly across different Jobs. To simplify component configuration, we can centralize those connections under a Hadoop cluster connection in the Repository view for easy reuse.

Setting up a Hadoop cluster connection

  1. Right-click Hadoop cluster under the Metadata node in the Repository tree view, and select Create Hadoop cluster from the contextual menu to open the connection setup wizard. Give the cluster connection a name, Hadoop_Sandbox in this example, and click Next.

  2. Configure the Hadoop cluster connection:

    • Select a Hadoop distribution and its version.

    • Specify the NameNode URI and the Resource Manager. In this example, we use the host name sandbox, which is assumed to be mapped to the IP address assigned to the Sandbox virtual machine, for both the NameNode and the Resource Manager, together with their default ports, 8020 and 50300 respectively. The Hadoop properties these fields map to are sketched after these steps.

    • Specify a user name for Hadoop authentication, sandbox in this example.

  3. Click Finish. The Hadoop cluster connection appears under the Hadoop Cluster node in the Repository view.
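
The wizard fields above correspond to a handful of standard Hadoop client properties. The following Java sketch shows that mapping, using the sandbox host name and the default ports from this example; it is an assumption-based illustration of what the fields represent, not the code that Talend Studio generates.

    import org.apache.hadoop.conf.Configuration;

    public class HadoopSandboxConf {

        // Build a client-side Configuration matching the wizard fields used above.
        // The host name and ports follow this example; the property keys are the
        // standard Hadoop ones, assumed to be what the wizard fields map to.
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://sandbox:8020");           // NameNode URI
            conf.set("yarn.resourcemanager.address", "sandbox:50300"); // Resource Manager
            return conf;
        }
    }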

Setting up an HDFS connection

  1. Right-click the Hadoop cluster connection you just created, and select Create HDFS from the contextual menu to open the connection setup wizard. Give the HDFS connection a name, HDFS_Sandbox in this example, and click Next.

  2. Customize the HDFS connection settings if needed and check the connection. As the example Jobs work with all the suggested settings, simply click Check to verify the connection.

  3. Click Finish. The HDFS connection appears under your Hadoop cluster connection.
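
Clicking Check runs a connectivity test against HDFS. A rough programmatic equivalent is sketched below, using the Hadoop FileSystem API with the sandbox host, port, and user name from this example; it is an illustration under those assumptions, not Studio's own check.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionCheck {

        public static void main(String[] args) throws Exception {
            // Connect to the sandbox NameNode as the "sandbox" user (example values).
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://sandbox:8020"), new Configuration(), "sandbox");
            // Listing the root directory is enough to prove the connection works.
            System.out.println("HDFS root exists: " + fs.exists(new Path("/")));
            fs.close();
        }
    }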

Setting up an HCatalog connection

  1. Right-click the Hadoop cluster connection you just created, and select Create HCatalog from the contextual menu to open the connection setup wizard. Give the HCatalog connection a name, HCatalog_Sandbox in this example, and click Next.

  2. Enter the name of the database you will use in the Database field, talend in this example, and click Check to verify the connection.

  3. Click Finish. The HCatalog connection appears under your Hadoop cluster connection.
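
Talend Studio typically accesses HCatalog through the WebHCat (Templeton) REST service. The sketch below checks that the talend database is reachable over that service; the port 50111 and the endpoint path are the usual WebHCat defaults and are assumptions here, not values taken from the wizard.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HCatalogConnectionCheck {

        public static void main(String[] args) throws Exception {
            // WebHCat DDL resource for the "talend" database, queried as user "sandbox".
            // Host, port, and user name are the example/sandbox defaults (assumed).
            URL url = new URL(
                    "http://sandbox:50111/templeton/v1/ddl/database/talend?user.name=sandbox");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            System.out.println("HTTP status: " + conn.getResponseCode());
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                in.lines().forEach(System.out::println); // JSON description of the database
            }
            conn.disconnect();
        }
    }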

Now these centralized metadata items can be used to set up connection details in different components and Jobs. Note that these connections do not have table schemas defined along with them; we will create generic schemas separately later on when configuring the example Jobs.

For more information on centralizing Big Data specific metadata in the Repository, see Managing metadata for Talend Big Data. For more information on centralizing other types of metadata, see Managing Metadata for data integration.

Creating the example Jobs

In this section, we will create six Jobs that will implement the ApacheWebLog example of the demo Job.

Create the first Job

Follow these steps to create the first Job, which will set up an HCatalog database to manage the access log file to be analyzed:

  1. In the Repository tree view, expand the Job Designs node, right-click Standard Jobs, and select Create folder to create a new folder to group the Jobs that you will create.

    Right-click the folder you just created, and select Create job to create your first Job. Name it A_HCatalog_Create to identify its role and execution order among the example Jobs. You can also provide a short description for your Job, which will appear as a tooltip when you move your mouse over the Job.

  2. Drop a tHDFSDelete and two tHCatalogOperation components from the Palette onto the design workspace.

  3. Connect the three components using Trigger > On Subjob Ok connections. The HDFS subjob will remove any previous results of this demo example to prevent possible errors during Job execution, and the two HCatalog subjobs will create an HCatalog database and then set up an HCatalog table and a partition in that table, respectively. The equivalent operations are sketched after these steps.

  4. Label these components to better identify their functionality.
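
Conceptually, the three subjobs perform an HDFS cleanup followed by two HiveQL DDL operations. The sketch below outlines them; the result directory, table name, columns, and partition key are purely illustrative placeholders, since the actual values are defined later when the components are configured.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class A_HCatalog_Create_Sketch {

        public static void main(String[] args) throws Exception {
            // Subjob 1 (tHDFSDelete): remove any previous results of the example.
            // "/user/sandbox/weblog_results" is a hypothetical path for illustration.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://sandbox:8020"), new Configuration(), "sandbox");
            fs.delete(new Path("/user/sandbox/weblog_results"), true);
            fs.close();

            // Subjobs 2 and 3 (tHCatalogOperation): DDL the components correspond to.
            // The table name, columns, and partition key are illustrative only.
            String createDatabase = "CREATE DATABASE IF NOT EXISTS talend";
            String createTable =
                    "CREATE TABLE talend.weblog (host STRING, code STRING, request STRING) "
                  + "PARTITIONED BY (logday STRING)";
            String addPartition =
                    "ALTER TABLE talend.weblog ADD PARTITION (logday='20160101')";
            System.out.println(createDatabase + "\n" + createTable + "\n" + addPartition);
        }
    }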

Create the second Job

Follow these steps to create the second Job, which will upload the access log file to the HCatalog:

  1. Create a new Job and name it B_HCatalog_Load to identify its role and execution order among the example Jobs.

  2. From the Palette, drop a tApacheLogInput, a tFilterRow, a tHCatalogOutput, and a tLogRow component onto the design workspace.

  3. Connect the tApacheLogInput component to the tFilterRow component using a Row > Main connection, and then connect the tFilterRow component to the tHCatalogOutput component using a Row > Filter connection. This data flow will load the log file to be analyzed into the HCatalog database, with any records carrying the error code "301" removed.

  4. Connect the tFilterRow component to the tLogRow component using a Row > Reject connection. This flow will print the records with the error code "301" on the console. The row routing performed by these two connections is sketched after these steps.

  5. Label these components to better identify their functionality.
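
The following plain Java sketch illustrates how the filter splits the rows: records whose code is "301" go to the reject flow and are printed to the console, while the rest continue to the main flow that feeds tHCatalogOutput. It runs on a few inline sample records and only mirrors the routing logic, not Talend's generated code.

    import java.util.Arrays;
    import java.util.List;

    public class FilterRow301Sketch {

        public static void main(String[] args) {
            // Hypothetical, simplified access-log records: "host code request".
            List<String> records = Arrays.asList(
                    "192.168.1.10 200 GET /index.html",
                    "192.168.1.11 301 GET /old-page.html",
                    "192.168.1.12 404 GET /missing.html");

            for (String record : records) {
                String code = record.split(" ")[1];
                if ("301".equals(code)) {
                    // Reject flow: printed on the console (tLogRow).
                    System.out.println("REJECT: " + record);
                } else {
                    // Main (Filter) flow: would be written to HCatalog (tHCatalogOutput).
                    System.out.println("LOAD:   " + record);
                }
            }
        }
    }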

Create the third Job

Follow these steps to create the third Job, which will display the content of the uploaded file:

  1. Create a new Job and name it C_HCatalog_Read to identify its role and execution order among the example Jobs.

  2. Drop a tHCatalogInput component and a tLogRow component from the Palette onto the design workspace, and link them using a Row > Main connection.

  3. Label the components to better identify their functionality.
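
What the tHCatalogInput and tLogRow pair does here, read the table and print its rows on the console, could also be done over HiveServer2 with JDBC, as in the sketch below. The JDBC URL, port 10000, and the weblog table name are assumptions for illustration, and the Hive JDBC driver must be on the classpath; the Job itself reads through HCatalog.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class C_HCatalog_Read_Sketch {

        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL for the "talend" database (assumed sandbox defaults).
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://sandbox:10000/talend", "sandbox", "");
                 Statement stmt = conn.createStatement();
                 // "weblog" is the hypothetical table name used in these sketches.
                 ResultSet rs = stmt.executeQuery("SELECT * FROM weblog")) {
                int columns = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= columns; i++) {
                        row.append(rs.getString(i)).append(i < columns ? "|" : "");
                    }
                    System.out.println(row); // tLogRow-style console output
                }
            }
        }
    }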

Create the fourth Job

Follow these steps to create the fourth Job, which will analyze the uploaded log file to get the code occurrences in successful calls to the website:

  1. Create a new Job and name it D_Pig_Count_Codes to identify its role and execution order among the example Jobs.

  2. Drop the following components from the Palette to the design workspace:

    • a tPigLoad, to load the data to be analyzed,

    • a tPigFilterRow, to remove records with the '404' error from the input flow,

    • a tPigFilterColumns, to select the columns you want to include in the result data,

    • a tPigAggregate, to count the number of visits to the website,

    • a tPigSort, to sort the result data, and

    • a tPigStoreResult, to save the result to HDFS.

  3. Connect these components using Row > Pig Combine connections to form a Pig chain, and label them to better identify their functionality.
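
The Pig chain above amounts to: filter out '404' records, keep only the code column, count the occurrences of each code, sort, and store the result. The sketch below reproduces that logic locally in plain Java over a few sample records; it only illustrates the data flow, not the Pig Latin that the components generate.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class D_Pig_Count_Codes_Sketch {

        public static void main(String[] args) {
            // Hypothetical, simplified access-log records: "host code request".
            List<String> records = Arrays.asList(
                    "192.168.1.10 200 GET /index.html",
                    "192.168.1.11 404 GET /missing.html",
                    "192.168.1.12 200 GET /products.html",
                    "192.168.1.13 304 GET /index.html");

            Map<String, Integer> counts = new TreeMap<>();   // sorted by code (tPigSort)
            for (String record : records) {
                String code = record.split(" ")[1];          // tPigFilterColumns: keep the code
                if ("404".equals(code)) {
                    continue;                                // tPigFilterRow: drop 404 errors
                }
                counts.merge(code, 1, Integer::sum);         // tPigAggregate: count per code
            }
            // tPigStoreResult: here we simply print instead of writing to HDFS.
            counts.forEach((code, count) -> System.out.println(code + ";" + count));
        }
    }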