tMap - 6.1

Talend Open Studio for Big Data Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Open Studio for Big Data
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tMap is an advanced component, which integrates itself as plugin to Talend Studio.

Purpose

tMap transforms and routes data from single or multiple sources to single or multiple destinations.

tMap properties

Component family

Processing

 

Basic settings

Map editor

It allows you to define the tMap routing and transformation properties.

Note

If you do not want to handle execution errors, you can click the Property Settings button at the top of the input area and select the Die on error check box (selected by default) in the [Property Settings] dialog box. It will kill the Job if there is an error.

 

Mapping links display as

Auto: the default setting is curves links

Curves: the mapping display as curves

Lines: the mapping displays as straight lines. This last option allows to slightly enhance performance.

  Temp data directory path Enter the path where you want to store the temporary data generated for lookup loading. For more information on this folder, see Talend Studio User Guide.

 

Preview

The preview is an instant shot of the Mapper data. It becomes available when Mapper properties have been filled in with data. The preview synchronization takes effect only after saving changes.

Advanced settings Max buffer size (nb of rows) Type in the size of physical memory, in number of rows, you want to allocate to processed data.
  Ignore trailing zeros for BigDecimal Select this check box to ignore trailing zeros for BigDecimal data.
 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Possible uses are from a simple reorganization of fields to the most complex Jobs of data multiplexing or demultiplexing transformation, concatenation, inversion, filtering and more...

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

The use of tMap supposes minimum Java knowledge in order to fully exploit its functionalities.

This component is a junction step, and for this reason cannot be a start nor end component in the Job.

Scenario 1: Mapping data using a filter and a simple explicit join

The Job described below aims at reading data from a csv file with its schema stored in the Repository, looking up at a reference file, the schema of which is also stored in the Repository, then extracting data from these two files based on a defined filter to an output file and reject files.

Linking the components

  1. Drop two tFileInputDelimited components, tMap and three tFileOutputDelimited components onto the design workspace.

  2. Rename the two tFileInputDelimited components as Cars and Owners, either by double-clicking the label in the design workspace or via the View tab of the Component view.

  3. Connect the two input components to tMap using Row > Main connections and label the connections as Cars_data and Owners_data respectively.

  4. Connect tMap to the three output components using Row > New Output (Main) connections and name the output connections as Insured, Reject_NoInsur and Reject_OwnerID respectively.

Configuring the components

  1. Double-click the tFileInputDelimited component labelled Cars to display its Basic settings view.

  2. Select Repository from the Property type list and select the component's schema, cars in this scenario, from the [Repository Content] dialog box. The rest fields are automatically filled.

  3. Double-click the component labelled Owners and repeat the setting operation. Select the appropriate metadata entry, owners in this scenario.

    Note

    In this scenario, the input schemas are stored in the Metadata node of the Repository tree view for easy retrieval. For further information regarding metadata creation in the Repository, see Talend Studio User Guide.

  4. Double-click the tMap component to open the Map Editor.

    Note that the input area is already filled with the defined input tables and that the top table is the main input table, and the respective row connection labels are displayed on the top bar of the table.

  5. Create a join between the two tables on the ID_Owner column by simply dropping the ID_Owner column from the Cars_data table onto the ID_Owner column in the Owners_data table.

  6. Define this join as an inner join by clicking the tMap settings button, clicking in the Value field for Join Model, clicking the small button that appears in the field, and selecting Inner Join from the [Options] dialog box.

  7. Drag all the columns of the Cars_data table to the Insured table.

  8. Drag the ID_Owner, Registration, and ID_Reseller columns of the Cars_data table and the Name column of the Owners_data table to the Reject_NoInsur table.

  9. Drag all the columns of the Cars_data table to the Reject_OwnerID table.

    For more information regarding data mapping, see Talend Studio User Guide.

  10. Click the plus arrow button at the top of the Insured table to add a filter row.

    Drag the ID_Insurance column of the Owners_data table to the filter condition area and enter the formula meaning 'not undefined': Owners_data.ID_Insurance != null.

    With this filter, the Insured table will gather all the records that include an insurance ID.

  11. Click the tMap settings button at the top of the Reject_NoInsur table and set Catch output reject to true to define the table as a standard reject output flow to gather the records that do not include an insurance ID.

  12. Click the tMap settings button at the top of the Reject_OwnerID table and set Catch lookup inner join reject to true so that this output table will gather the records from the Cars_data flow with missing or unmatched owner IDs.

    Click OK to validate the mappings and close the Map Editor.

  13. Double-click each of the output components, one after the other, to define their properties. If you want a new file to be created, browse to the destination output folder, and type in a file name including the extension.

    Select the Include header check box to reuse the column labels from the schema as header row in the output file.

Executing the Job

  1. Press Ctrl + S to save your Job.

  2. Press F6 to run the Job.

    The output files are created, which contain the relevant data as defined.

Scenario 2: Mapping data using inner join rejections

This scenario, based on scenario 1, adds one input file containing details about resellers and extra fields in the main output table. Two filters on inner joins are added to gather specific rejections.

Linking the components

  1. Drop a tFileInputDelimited component and a tFileOutputDelimited component to the design workspace, and label the components as Resellers and No_Reseller_ID respectively.

  2. Connect it to the Mapper using a Row > Main connection, and label the connection as Resellers_data.

  3. Connect the tMap component to the new tFileOutputDelimited component by using the Row connection named Reject_ResellerID.

Configuring the components

  1. Double-click the Resellers component to display its Basic settings view.

  2. Select Repository from the Property type list and select the component's schema, resellers in this scenario, from the [Repository Content] dialog box. The rest fields are automatically filled.

    Note

    In this scenario, the input schemas are stored in the Metadata node of the Repository tree view for easy retrieval. For further information regarding metadata creation in the Repository, see Talend Studio User Guide.

  3. Double-click the tMap component to open the Map Editor.

    Note that the schema of the new input component is already added in the Input area.

  4. Create a join between the main input flow and the new input flow by dropping the ID_Reseller column of the Cars_data table to the ID_Reseller column of the Resellers_data table.

  5. Click the tMap settings button at the top of the Resellers_data table and set Join Model to Inner Join.

  6. Drag all the columns except ID_Reseller of the Resellers_data table to the main output table, Insured.

    Note

    When two inner joins are defined, you either need to define two different inner join reject tables to differentiate the two rejections or, if there is only one inner join reject output, both inner join rejections will be stored in the same output.

  7. Click the [+] button at the top of the output area to add a new output table, and name this new output table Reject_ResellerID.

  8. Drag all the columns of the Cars_data table to the Reject_ResellerID table.

  9. Click the tMap settings button and select Catch lookup inner join reject to true to define this new output table as an inner join reject output.

    If the defined inner join cannot be established, the information about the relevant cars will be gathered through this output flow.

  10. Now apply filters on the two Inner Join reject outputs, in order for to distinguish the two types of rejection.

    In the first Inner Join output table, Reject_OwnerID, click the plus arrow button to add a filter line and fill it with the following formula to gather only owner ID related rejection: Owners_data.ID_Owner==null

  11. In the second Inner Join output table, Reject_ResellerID, repeat the same operation using the following formula: Resellers_data.ID_Reseller==null

    Click OK to validate the map settings and close the Mapper Editor.

  12. Double-click the No_Reseller_ID component to display its Basic settings view.

    Specify the output file path and select the Include Header check box, and leave the other parameters as they are.

  13. To demonstrate the work of the Mapper, in this example, remove reseller IDs 5 and 8 from the input file Resellers.csv.

Executing the Job

  1. Press Ctrl + S to save your Job.

  2. Press F6 to run the Job.

    The four output files are all created in the specified folder, containing information as defined. The output file No_Reseller_ID.csv contains the cars information related to reseller IDs 5 and 8, which are missing in the input file Resellers.csv.

Scenario 3: Cascading join mapping

As third advanced use scenario, based on the scenario 2, add a new Input table containing Insurance details for example.

Set up an Inner Join between two lookup input tables (Owners and Insurance) in the Mapper to create a cascade lookup and hence retrieve Insurance details via the Owners table data.

Scenario 4: Advanced mapping using filters, explicit joins and rejections

This scenario introduces a Job that allows you to find BMW owners who have two to six children (inclusive), for sales promotion purpose for example.

Linking the components

  1. Drop three tFileInputDelimited components, a tMap component, and two tFileOutputDelimited components from the Palette onto the design workspace, and label them to best describe their functions.

  2. Connect the input components to the tMap using Row > Main connections.

    Pay attention to the file you connect first as it will automatically be set as Main flow, and all the other connections will be Lookup flows. In this example, the connection for the input component Owners is the Main flow.

Configuring the components

  1. Define the properties of each input components in the respective Basic settings view. Define the properties of Owners.

  2. Select Repository from the Property type list and select the component's schema, owners in this scenario, from the [Repository Content] dialog box. The rest fields are automatically filled.

    Note

    In this scenario, the input schemas are stored in the Metadata node of the Repository tree view for easy retrieval. For further information regarding metadata creation in the Repository, see Talend Studio User Guide.

    In the same way, set the properties of the other input components: Cars and Resellers. These two Lookup flows will fill in secondary (lookup) tables in the input area of the Map Editor.

  3. Then double-click the tMap component to launch the Map Editor and define the mappings and filters.

    Set an explicit join between the Main flow Owner and the Lookup flow Cars by dropping the ID_Owner column of the Owners table to the ID_Owner column of the Cars table.

    The explicit join is displayed along with a hash key.

  4. In the Expr. Key field of the Make column, type in a filter. In this use case, simply type in "BMW" as the search is focused on the owners of this particular make.

  5. Implement a cascading join between the two lookup tables Cars and Resellers on the ID_Reseller column in order to retrieve resellers information.

  6. As you want to reject the null values into a separate table and exclude them from the standard output, click the tMap settings button and set Join Model to Inner Join in each of the Lookup tables.

  7. In the tMap settings, you can set Match Model to Unique match, First match, or All matches. In this use case, the All matches option is selected. Thus if several matches are found in the Inner Join, rows matching the explicit join as well as the filter, all of them will be added to the output flow (either in rejection or the regular output).

    Note

    The Unique match option functions as a Last match. The First match and All matches options function as named.

  8. On the output area of the Map Editor, click the plus button to add two tables, one for the full matches and the other for the rejections.

  9. Drag all the columns of the Owners table, the Registration, Make and Color columns of the Cars table, and the ID_Reseller and Name_Reseller columns of the Resellers table to the main output table.

  10. Drag all the columns of the Owners table to the reject output table.

  11. Click the Filter button at the top of the main output table to display the Filter expression area.

    Type in a filter statement to narrow down the number of rows loaded in the main output flow. In this use case, the statement reads: Owners.Children_Nr >= 2 && Owners.Children_Nr <= 6.

  12. In the reject output table, click the tMap settings button and set the reject types.

    Set Catch output reject to true to collect data about BMW car owners who have less than two or more than six children.

    Set Catch lookup inner join reject to true to collect data about owners of other car makes and owners for whom the reseller information is not found.

    Click OK to validate the mappings and close the Map Editor.

    On the design workspace, right-click the tMap and pull the respective output link to the relevant output components.

  13. Define the properties of the output components in their respective Basic settings view.

    In this use case, simple specify the output file paths and select the Include Header check box, and leave the other parameters as they are.

Executing the Job

  1. Press Ctrl + S to save your Job.

  2. Press F6 to run it.

    The main output file contains the information related to BMW owners who have two to six children, and the reject output file contains the information about the rest of the car owners.

Scenario 5: Advanced mapping with filters and different rejections

This scenario is a modified version of the preceding scenario. It describes a Job that applies filters to limit the search to BMW and Mercedes owners who have two to six children and divides unmatched data into different reject output flows.

Linking the components

  1. Take the same Job as in Scenario 4: Advanced mapping using filters, explicit joins and rejections.

  2. Drop a new tFileOutputDelimited component from the Palette on the design workspace, and name it Rejects_BMW_Mercedes to present its functionality.

  3. Connect the tMap component to the new output component using a Row connection and label the connection according to the functionality of the output component.

    This connection label will appear as the name of the new output table in the Map Editor.

  4. Relabel the existing output connections and output components to reflect their functionality.

    The existing output tables in the Map Editor will be automatically renamed according to the connection labels. In this example, relabel the existing output connections BMW_Mercedes_withChildren and Owners_Other_Makes respectively.

Configuring the components

  1. Double-click the tMap component to launch the Map Editor to change the mappings and the filters.

    Note that the output area contains a new, empty output table named Rejects_BMW_Mercedes. You can adjust the position of the table by selecting it and clicking the Up or Down arrow button at the top of the output area.

  2. Remove the Expr. key filter ("BMW") from the Cars table in the input area.

  3. Click the Filters button to display the Filter field, and type in a new filter to limit the search to BMW or Mercedes car makes. The statement reads as follows: Cars.Make.equals("BMW") || Cars.Make.equals("Mercedes")

  4. Select all the columns of the main output table and drop them down to the new output table.

    Alternatively, you can also drag the corresponding columns from the relevant input tables to the new output table.

  5. Click the tMap settings button at the top of the new output table and set Catch output reject to true to collect data about BMW and Mercedes owners who have less than two or more than six children.

  6. In the Owners_Other_Makes table, set Catch lookup inner join reject to true to collect data about owners of other car makes and owners for whom the reseller information is not found.

  7. Click OK to validate the mappings and close the Map Editor.

  8. Define the properties of the output components in their respective Basic settings view.

    In this use case, simple specify the output file paths and select the Include Header check box, and leave the other parameters as they are.