Scenario: Filtering and aggregating table columns directly on the DBMS - 6.3

Talend Open Studio for Big Data Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Open Studio for Big Data
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

The following scenario creates a Job that opens a connection to a MySQL database and:

  • populates a database table with the input data,

  • creates the output table for the filtered data,

  • instantiates the schema from a database table in part (for column filtering),

  • filters two columns in the same table to get only the data that meets two filtering conditions,

  • collects data from the filtered column(s), grouped by specific value(s) and writes aggregated data in a target database table.

Adding and linking the components

  1. Drop the following components from the Palette onto the design workspace: tMysqlConnection, tFixedFlowInput, tMysqlOutput, tCreateTable, tCombinedSQLInput, tCombinedSQLFilter, tCombinedSQLAggregate, tCombinedSQLOutput, tMysqlCommit, tMysqlInput and tLogRow.

  2. Connect tMysqlConnection to tFixedFlowInput using a Trigger > On Subjob Ok link

  3. Do the same to connect tFixedFlowInput to tCreateTable, tCreateTable to tCombinedSQLInput, tCombinedSQLInput to tMysqlCommit, and tMysqlCommit to tMysqlInput.

  4. Connect tFixedFlowInput and tMysqlOutput using a Row > Main link.

  5. Connect tCombinedSQLInput to tCombinedSQLFilter using a Row > Combine link.

  6. Do the same to connect tCombinedSQLFilter to tCombinedSQLAggregate, and tCombinedSQLAggregate to tCombinedSQLOutput

  7. Connect tMysqlInput and tLogRow using a Row > Main link.

Configuring the components

Opening a MySQL connection

  1. Launch MySQL Workbench and start a local connection on port 3306.

  2. Create a new schema and name it test.

  3. Back in the design workspace, select tMysqlConnection and click the Component tab to define its basic settings.

  4. In the Basic settings view, set the database connection details manually or select Repository from the Property Type list and select your DB connection if it has already been defined and stored in the Metadata area of the Repository tree view.

    For more information on centralizing DB connection details in the Repository, see Talend Studio User Guide.

Populating the database table with input data

  1. In the design workspace, select tFixedFlowInput and click the Component tab to define its basic settings

  2. In the Basic settings view, in the Number of rows field, enter 500.

  3. In this scenario, the source database table has seven columns: id, first_name, last_name, city, state, date_of_birth, and salary

    Click the [...] button next to Edit schema to define the following data structure.

  4. Click the floppy disk icon to save the schema as a generic schema for later reuse.

  5. In the Select folder window, select default and click OK.

  6. Choose a name for your generic schema and click Finish.

  7. Click OK.

  8. The first column of the Values table automatically reflects the data structure you entered previously.

  9. In the Values table, enter a value for each column.

  10. In the design workspace, select tMysqlOutput and click the Component tab to define its basic settings.

    The output schema will automatically be the same as the previous component, in this case tFixedFlowInput.

Creating the target database table

  1. In the design workspace, select tCreateTable and click the Component tab to define its basic settings.

  2. Click the [...] button next to Edit schema to define the following data structure.

    The schema you enter at this step must reflect the the differents aggregation operations you want to perform on the input data.

Extracting and filtering data

  1. In the design workspace, select tCombinedSQLInput and click the Component tab to access the configuration panel.