Data quality system routines - 6.5

Talend Open Studio for MDM User Guide

EnrichVersion
6.5
EnrichProdName
Talend Open Studio for MDM
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Several types of TDQ system routines exists in Talend Studio. The sections below describe these routines.

DataQuality routines

DataQuality routines are portions of code that parse and clean your data. You can then use them in the settings of Data Quality components in a Talend Job.

To access these routines:

  • In the Repository tree view of your Studio, expand Code > Routines > system and then double-click DataQuality.

The table below gives a description of these routines:

Routine

Description

Syntax

getTitle

Returns the extracted title based on a list of provided titles.

DataQuality.getTitle

getNameWithoutTitle

Returns the string without the title.

DataQuality.getNameWithoutTitle

getNameWithoutSuffix

Returns the string without the suffix.

DataQuality.getNameWithoutSuffix

getLastName

Returns only LastName value from the string.

DataQuality.getLastName

getCleansedLastName

Returns the string without the title and suffix.

DataQuality.getCleansedLastName

isValidName

Returns true/false if the string is a valid/unvalid formed name respectively: [Initial] Firstname [Initial] Lastname.

DataQuality.isValidName

DqStringHandling routines

DqStringHandling routines are portions of code that perform certain quality tasks on arrays of characters. You can then use them in the settings of Data Quality components in a Talend Job.

To access these routines:

  • In the Repository tree view of your Studio, expand Code > Routines > system and then double-click DqStringHandling.

The table below gives a description of these routines:

Routine

Description

Syntax

containsOnlyInitials

Returns true if the given string contains only initials such as "A", "A.", "A.I." or "A.I.D.S".

DqStringHandling.containsOnlyInitials

makeSafe

Returns an empty string when the given string is null.

DqStringHandling.makeSafe

safeTrim

Returns the trimmed string or the empty string when the string is null.

DqStringHandling.safeTrim

safeConcat (String str1, String str2, char separator)

Returns the concatenation of the trimmed strings. The separator character is used when none of the given strings is empty or null.

DqStringHandling.safeConcat

safeConcat (char separator, String... strings)

Returns the concatenation of the trimmed strings. The separator character is used when none of the given strings is empty or null.

DqStringHandling.safeConcat

validAscii

Validates the ascii format or not.

DqStringHandling.validAscii

Use case: handling strings using DqStringHandling routines

In this use case, a three-component Job uses several DqStringHandling routines concurrently to handle several given strings.

In this Job, the three components are:

  • tFixedFlowInput: this component provides the strings to be handled by the DqStringHandling routines you select.

  • tMap: this component transforms the given strings using the DqStringHandling routines you select.

  • tLogRow: this component displays the execution result of this Job.

To replicate this use case, proceed as follows:

Dropping and linking the components together

  1. Drop tFixedFlowInput, tMap and tLogRow from the Palette to the design workspace.

  2. Right-click tFixedFlowInput to open its contextual menu, select the Row > Main link then click tMap to connect these two components.

  3. Do the same to connect tMap to tLogRow using the Row > *New Output* (Main) link and name this link in the dialog box that pops up when you click tLogRow to create this link. In this use case, name it out.

Configuring the first component

  1. Double-click tFixedFlowInput to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button four times to add four rows.

  4. Rename these four rows as, for example, Entity_string, str1, str2 and str3 and make sure that their type isString.

    If a type is not String, click it and select String from the drop-down list that appears accordingly.

  5. Click OK to validate these changes and thus close this schema editor.

  6. In the Mode area of the Component view, select Use Inline Content (delimited file) to display the Content field.

  7. In the Content field, type in four strings corresponding to the number of rows you have defined in the component schema and separate them using the field separator that is displayed in the Field Separator field. By default, the field separator is ; and you can customize it if need be; the strings entered in this use case are U.S.A. (a space is typed in front of the U letter which will be trimmed by the routine to be used), Global, Business, Services.

    Thus the strings all together entered here read: U.S.A.;Global;Business;Services.

Configuring the tMap component

  1. Double-click tMap to open the Map editor.

  2. On the right side of the lower part of the editor, click the [+] button to add 8 rows and rename them with: Entity_string, contains_only_initials, safe_trim, contains_only_initials_after_trim, str1, str2, str3, safe_concat.

    These rows correspond to the original input data or the data to be handled by the DqStringHandling routines.

  3. In this table, click, in the Type column, the contains_only_initials and the contains_only_initials_after_trim rows respectively and select Boolean from the drop-down list for each of them.

    On the upper part of this editor, these new rows have been added to the output table on the right side automatically.

  4. From the row1 input table on the left side to the Expression column of the output table on the right side, drop the Entity_string, the str1, the str2 and the str3 rows respectively on the corresponding rows on the output side.

  5. In the Expression column, select the contains_only_initials row.

  6. Press Ctrl+space to open the auto-completion list.

  7. From this list, select DqStringHandling.containOnlyInitials.

  8. Continue to enter the row of data to be handled by this selected routine between the parentheses to replace the default one. In this use case, this row is row1.Entity_string.

    Thus the entered expression reads as the following: Handling.containsOnlyInitials(row1.Entity_string).

  9. Do the same to complete the rest of the rows in the Expression column using the corresponding DqStringHandling routines to each row. Then these row expressions read as follows:

    • DqStringHandling.safeTrim(row1.Entity_string) for the safe_trim row;

    • DqStringHandling.containsOnlyInitials; (DqStringHandling.safeTrim(row1.Entity_string) ) for the contains_only_initials_after_trim row;

    • DqStringHandling.safeConcat(' ', row1.Entity_string , row1.str1, row1.str2 , row1.str3) for the safe_concat row.

    Note

    As is presented in this use case, different DqStringHandling routines can be combined in one expression to produce an advanced result.

  10. Click OK to validate these changes and accept the propagation prompted by the dialog box that pops up.

Finalizing and executing the Job

  1. Double-click tLogRow to open its Component view.

  2. Select Vertical (each row is a key/value list).

  3. Press F6 to run this Job.

    The execution result is displayed in the console of the Run view.

From the result, you can see that the contains_only_initials row outputs false because there is a space in front of U.S.A. while the contains_only_initials_after_trim presents true because this space is trimmed; the safe_trim row presents U.S.A. with the space trimmed and the safe_concat row outputs U.S.A. Global Business Services.

DQTechnical routines

DQTechnical routines are portions of code that perform advanced String manipulation. You can then use them in the settings of Data Quality components in a Talend Job.

To access these routines:

  • In the Repository tree view of your Studio, expand Code > Routines > DQTechnical and then double-click DqStringHandling.

The table below gives a description of these routines:

Routine

Description

Syntax

extractTitle

Returns the extracted title based on a list of provided titles.

DQTechnical.extractTitle

removeTitle

Returns the string without the title.

DQTechnical.removeTitle

extractLastName

Returns the string without the last name.

DQTechnical.extractLastName

removeNameSuffix

Returns the string without the suffix.

DQTechnical.removeNameSuffix

isNameStringValid

Returns true or false if the string is a valid formed name: [Initial] Firstname [Initial] Lastname.

DQTechnical.isNameStringValid