
Creating a Big Data Batch Job with an HDFS connection

After you have created a Hadoop Cluster and a Structure, design a Big Data Batch Job that includes the tHDFSConfiguration, tHMapInput, and tLogRow components.

Procedure

  1. Open the Integration perspective and navigate to Repository > Job Designs.
  2. Right-click Big Data Batch and select Create Big Data Batch Job.
  3. Enter the necessary details to create the Job.
  4. Drag the Hadoop Cluster metadata you created into the Job Design and select the tHDFSConfiguration component.
  5. Add a tHMapInput and a tLogRow and connect them using a Row > Main connection.
    1. When prompted for the output name, enter Output.
  6. Double-click the tLogRow and define its schema:
    1. Click the […] button next to Edit schema.
    2. In the Output (Input) section, click the + to add three new columns and name them firstName, lastName, and age.
    3. Click the button to copy the columns to tLogRow_1 (Output).
  7. Click the tHMapInput and open the Basic Settings tab.
    1. Select the Define a storage configuration component check box and select the tHDFSConfiguration component as the chosen storage.
    2. Specify the input file in the Input field.
    3. Click the […] button next to Configure Component and select the structure you created earlier.
    4. Select CSV in the Input Representation drop-down list.
    5. Click Next and add the input file in the Sample File field, then click Run to check the number of records found.
    6. Click Finish.
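The finished Job reads a delimited file from HDFS, maps each record to the three-column schema (firstName, lastName, age), and prints the rows via tLogRow. As a rough illustration only — not the code Talend generates — here is a minimal Python sketch of the same parse-and-log logic, using a hypothetical inline sample in place of the HDFS input file:

```python
import csv
import io

# Stand-in for the HDFS input file; columns match the schema
# defined in the tLogRow step (firstName, lastName, age).
SAMPLE_CSV = """firstName;lastName;age
Ada;Lovelace;36
Alan;Turing;41
"""

def log_rows(csv_text, delimiter=";"):
    """Parse CSV text and print each record as a dict —
    a rough analogue of tHMapInput feeding tLogRow."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = []
    for row in reader:
        row["age"] = int(row["age"])  # cast the age column to an integer
        print(row)
        rows.append(row)
    return rows

rows = log_rows(SAMPLE_CSV)
```

The delimiter and sample values here are assumptions for illustration; in the actual Job they come from the Structure you selected in Configure Component and the file named in the Input field.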
