tFirstnameMatch - 6.3

Talend Components Reference Guide


Warning

This component is available in the Palette of the Studio only if you have subscribed to one of the Talend Platform products.

Function

tFirstnameMatch compares the first name column from the input flow with first names in an embedded reference index and outputs the matching first names.

This index has first names for about 162 countries, and it has more than 1000 reference first names for some countries. For further information, see About the reference index embedded in tFirstnameMatch.

Purpose

Helps ensure the quality of first names by checking them against a reference index in order to standardize data.

About the reference index embedded in tFirstnameMatch

tFirstnameMatch checks first names against an index file embedded in the component itself. The component searches the index file according to the input gender and input country columns you specify in the component settings. When you do not use gender and country as a search basis, first names are searched across the whole index, regardless of country.

The index file has reference first names for about 162 countries. Some of the countries listed in the index have more than 1000 reference first names. Such countries include USA, GBR, AUS, IRL, CAN, FRA, NZL, CHE and NLD. For example, the index file has more than 8000 American first names, more than 4000 British first names, more than 2000 Australian first names and so on.

Other countries have fewer than 1000 reference first names stored in the index file. For such countries, it is advisable not to select a country column, so that the input first name is checked against the reference first names of all countries in the index file.
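
The search behavior described above can be sketched as a lookup with optional country and gender filters. This is only an illustration of the idea, not the component's implementation: the `TOY_INDEX` entries and the `find_first_name` function are hypothetical, and the real index format is not documented here.

```python
# Hypothetical toy index: (first name, ISO 3166-1 alpha-3 country, gender).
# These entries are invented for illustration; they are not taken from
# the index embedded in tFirstnameMatch.
TOY_INDEX = [
    ("John", "USA", "M"),
    ("Mary", "USA", "F"),
    ("Jean", "FRA", "M"),
    ("Jean", "USA", "F"),
    ("Marie", "FRA", "F"),
]


def find_first_name(name, country=None, gender=None, index=TOY_INDEX):
    """Return the reference entries whose first name equals `name`.

    When `country` or `gender` is None, that criterion is ignored, so the
    search runs across all countries or both genders.
    """
    wanted = name.strip().lower()
    return [
        entry
        for entry in index
        if entry[0].lower() == wanted
        and (country is None or entry[1] == country)
        and (gender is None or entry[2] == gender)
    ]


# Narrowed search: country and gender are both used as search criteria.
narrowed = find_first_name("Jean", country="FRA", gender="M")

# Unrestricted search: the name is checked against every entry in the index.
unrestricted = find_first_name("Jean")
```

In this sketch, leaving out the country widens the candidate set, which is the reason given above for not selecting a country column when a country has few reference names.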

tFirstnameMatch properties

Component family

Data Quality

 

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

One read-only column, FIRSTNAMEMATCH, is added to the output schema automatically.

Built-in: The schema is created and stored locally for this component only. Related topic: see Talend Studio User Guide.

Repository: The schema already exists and is stored in the Repository, so it can be reused in various projects and Job designs. Related topic: see Talend Studio User Guide.

First Names

Select the column that contains first names.

 

Use Gender

Optional parameter: select this check box and then from the list, select the column that contains the gender. This will optimize system performance and give more precise results.

Expected genders are M (masculine) and F (feminine).

Use Country

Optional parameter: select this check box and then, from the list, select the column that contains the ISO 3166-1 alpha-3 country codes. This will optimize system performance and give more precise results.

Fuzzy Search

Select this check box if you want to get the best match possible, including approximate matches.

Advanced settings

tStatCatcher Statistics

Select this check box to gather the processing metadata at the Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component cannot be used as a start component; it requires an input component and an output component.

Limitation/prerequisite

The index used to standardize the first names is embedded in this component. For the time being, it can handle only Latin names.
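
To illustrate the difference between an exact lookup and the Fuzzy Search option listed in the properties above, here is a minimal sketch. The algorithm the component actually uses is not described in this guide, so the sketch substitutes Python's standard `difflib.get_close_matches`. The reference list, the `match_first_name` function, the 0.8 cutoff and the `None` result for unmatched names are all assumptions made for the example.

```python
import difflib

# Hypothetical reference names, invented for this example.
REFERENCE_NAMES = ["Michael", "Mary", "John", "Marie"]


def match_first_name(name, fuzzy=False, cutoff=0.8, reference=REFERENCE_NAMES):
    """Return the matching reference name, or None when nothing matches.

    Exact (case-insensitive) matches are always tried first. Approximate
    matching is attempted only when `fuzzy` is True.
    """
    by_lower = {ref.lower(): ref for ref in reference}
    key = name.strip().lower()
    if key in by_lower:
        return by_lower[key]
    if not fuzzy:
        return None
    # Keep only the single best candidate at or above the similarity cutoff.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


exact_only = match_first_name("Micheal")              # misspelled, no exact match
with_fuzzy = match_first_name("Micheal", fuzzy=True)  # closest reference name
```

With the fuzzy flag off, the misspelled input finds nothing; with it on, the closest reference name is returned, provided it is similar enough to pass the cutoff.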

Scenario: Matching first names with a reference index

This scenario describes a four-component Job that matches the name column of an input flow against the reference index.

The output of this first name match is displayed in the FIRSTNAMEMATCH output column along with all other columns defined in the input schema of the tFirstnameMatch component.

Dropping the components and linking them together

To drop and link the components of interest, proceed as follows:

  1. Drop the following components from the Palette to the design workspace: tFixedFlowInput, tFilterColumns, tFirstnameMatch and tLogRow.

  2. Connect the first three components using Row > Main links.

  3. Connect tFirstnameMatch to tLogRow using a Row > Output link.

Configuring the input data

To configure the input data, perform the following operations:

  1. Double-click tFixedFlowInput to display the Basic settings view and define the component properties.

  2. From the Schema list, set the schema type to Built-In and click the three-dot button next to Edit Schema. A dialog box displays.

  3. Click the plus button to add as many lines as needed for the input schema you want to create from internal variables.

    In this example, the input data flow is made of several columns including one for first names (name), two for country codes (iso2 and iso3) and one for gender (gender).

  4. Click OK to close the dialog box.

    The defined columns display in the Mode area of the component basic settings view.

  5. In the Mode area, select the Use Inline Content (delimited file) option to display the corresponding view.

  6. Set the row and field separators in the corresponding fields. The data you type in the Content area must use these separators.

  7. In the Content area, type in the data for the input flow according to the schema you defined earlier.
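
As an illustration of how inline delimited content maps onto the schema, the following sketch parses a small block of text into rows. The separators (`;` and a line break), the sample rows and the `parse_inline_content` function are assumptions made for the example. The scenario says only that the schema includes the name, iso2, iso3 and gender columns among others, so the four-column layout is also an assumption.

```python
# Assumed separators; use whichever ones you set in the component.
FIELD_SEPARATOR = ";"
ROW_SEPARATOR = "\n"

# Assumed column order, taken from the columns named in this scenario.
COLUMNS = ["name", "iso2", "iso3", "gender"]

# Invented sample rows, not data from the original scenario.
CONTENT = "John;US;USA;M\nMarie;FR;FRA;F\nMicheal;GB;GBR;M"


def parse_inline_content(content, columns=COLUMNS):
    """Split delimited text into one dict per row, keyed by column name."""
    rows = []
    for line in content.split(ROW_SEPARATOR):
        if not line:
            continue  # skip empty lines, such as a trailing row separator
        values = line.split(FIELD_SEPARATOR)
        if len(values) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(values)}: {line!r}"
            )
        rows.append(dict(zip(columns, values)))
    return rows


rows = parse_inline_content(CONTENT)
```

Each row must supply one value per schema column, in schema order, which is why the sketch rejects lines with the wrong number of fields.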

Configuring the process of matching data

To configure the matching process, select the data columns of interest and then match them using tFirstnameMatch.

  1. Click the tFilterColumns component to display its Basic settings view and define the component properties.

    The tFilterColumns component enables you to build the output schema based on the column names of the input schema.

  2. Click the three-dot button next to Edit schema to display a dialog box where you can define the output schema.

  3. Select the name and gender columns from the input schema and move them to the output schema.

  4. Click OK to validate your changes and close the dialog box.

  5. Click tFirstnameMatch to display the Basic settings view and define the component properties.

  6. If required, click the three-dot button next to Edit schema to view the input and output schemas, and then click OK to close the dialog box.

    Note

    The output schema of this component is the same as the input schema plus one fixed column: FIRSTNAMEMATCH.
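
The column handling in the steps above can be sketched as follows: keep the name and gender columns, then append FIRSTNAMEMATCH to each row. The functions, the `toy_lookup` helper and its reference names are hypothetical stand-ins. The value the component writes when no match is found is not stated in this guide, so `None` is an assumption.

```python
MATCH_COLUMN = "FIRSTNAMEMATCH"


def filter_columns(row, keep=("name", "gender")):
    """Keep only the selected columns, as tFilterColumns does in this scenario."""
    return {column: row[column] for column in keep}


def add_match_column(row, lookup):
    """Return the input row plus one extra column holding the match result."""
    out = dict(row)
    out[MATCH_COLUMN] = lookup(row["name"])
    return out


def toy_lookup(name):
    """Hypothetical stand-in for the embedded index (exact match only)."""
    reference = {"john": "John", "marie": "Marie"}
    return reference.get(name.strip().lower())  # None when there is no match


# Invented input row using the columns named in this scenario.
input_row = {"name": "john", "iso2": "US", "iso3": "USA", "gender": "M"}
output_row = add_match_column(filter_columns(input_row), toy_lookup)
```

The output row carries every column of the filtered input schema, followed by the single added column.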