tStewardshipTaskOutput - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component is available in the Palette of Talend Studio but you will only be able to use it on the condition that you have subscribed to the relevant Talend Platform product.

tStewardshipTaskOutput properties

Component family

Talend MDM

 

Function

tStewardshipTaskOutput writes data, in the form of tasks, in the Talend Data Stewardship Console database and thus makes it possible to list these tasks in the data stewardship console. An authorized steward can then intervene to do the composite matching on the listed data or to insure that data is consistent and complete.

Note

In order to better understand the purpose of this component, check the Talend Data Stewardship Console User Guide.

Purpose

This component create tasks in the Talend Data Stewardship Console database and lists these tasks in the stewardship console.

Basic settings

Schema and Edit Schema

A schema is a row description, it defines the number of fields that will be processed and passed on to the next component. The schema is either built-in or remote in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-in: You create the schema and store it locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

Url

Enter the appropriate URL to access the Talend Data Stewardship Console application.

For more information about the URL settings, see How to set the URL to access Talend Data Stewardship Console.

 

Username and Password

Type in the user authentication data for the stewardship console database.

To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

 

Task name

Type in a name for the task you want to list in the Talend Data Stewardship Console.

 

Type

Select the type of the tasks you want to write:

Resolution:data resolution tasks represent the results of the data matching processes done on data across heterogeneous sources.

Data: data integrity tasks are the results of the data integrity processes done on data.

For further information on task types and task management, see Talend Data Stewardship Console User Guide.

 

Created by

Type in the name of the task creator.

Note

The task creators correspond to the users of Talend MDM Web User Interface. For further information, see Talend MDM Web User Interface User Guide.

 

Owner

Type in the name of the task owner.

Note

The task owners correspond to the users of Talend MDM Web User Interface. For further information, see Talend MDM Web User Interface User Guide.

 

Star

Type in a number, 0 through 5, that you want to assign to the tasks as a numerical rating, in the form of stars, to highlight importance.

 

Tag

Type in the name of the tag category you want to associate with the tasks you want to write.

Warning

The tag categories must have been created in the stewardship console beforehand. For further information about how to create tag categories, see Talend Data Stewardship Console User Guide.

Note

Only resolution task

Looping column

Select a column in the input schema on which to base the loop. Whenever the looping column value changes, the component will close the previous element (task) and open a new one (new task).

Note

The looping column is typically the group id generated by the tMatchGroup component. For further information, see tMatchGroup.

 

Source/Target selector

Select a column in the input schema that will decide if the task records defined according to the looping column will be a target record or a source record.

 

Source

Select a column in the input schema.

Note

Only resolution task

Score

Select the matching score column in the input schema.

Note

Only resolution task

Weights

Select the column that defines the matching distance for each column in the input schema.

 

Extra info

If required, use the plus button to add one or more rows for any extra information you want to add to any of the source records.

In the Title column, enter the information key.

In the Message column, enter the information you want to add. In the Column column, click in the added row and select the source column to which you want to add the extra information.

The data steward will be able to see this added information any time he/she places the pointer on the source record column in the Talend Data Stewardship Console. This information will help him/her making a more informed decision when resolving the task.

 

Record column

Use the plus button to add as many rows as needed and then click in each of the rows and select the columns in the input schema that will form the target record.

 

Max tasks per commit

Define the maximum number of the tasks per commit.

Advanced settings

tStatCatcher Statistics

Select this check box to gather the processing metadata at the Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

NB_LINE: the number of rows processed. This is an After variable and it returns an integer.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Use this component to write data records held in tasks. This component must have an input flow.

If a Job has too many tasks to be handled with the Talend Data Stewardship Console application, you are recommended to increase the timeout values before executing the Job.

You can customize two timeout values as needed:

  • -Dtaskload_connect_timeout which specifies the timeout value for connecting to the Talend Data Stewardship Console application;

  • -Dtaskload_read_timeout which specifies the timeout value for reading the Talend Data Stewardship Console application.

By default, their values are both 50,000 in milliseconds.

To increase the timeout values, do the following:

  1. In the Run view, select the Advanced settings tab.

  2. In the JVM Settings area of the tab view, select the Use specific JVM arguments check box to activate the Argument table.

  3. Next to the Argument table, click the New... button to open the [Set the VM Argument] dialog box.

  4. In the dialog box, enter the timeout value in milliseconds. For example, -Dtaskload_connect_timeout=60000.

  5. Click OK to close the dialog box.

  6. Repeat the steps above to set another timeout value in milliseconds. For example, -Dtaskload_read_timeout=60000.

    For further information about how to apply the JVM argument for all of the Job executions, see Talend Studio User Guide.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

 

How to set the URL to access Talend Data Stewardship Console

When using the components to interact with the Talend Data Stewardship Console, you need to set the URL correctly to access the application:

  • To interact with the Talend Data Stewardship Console application using the SOAP services, the URL should be in the format of <protocol>://<host>:<port>/<context>/services/TDSCWS?wsdl.

  • To write tasks into the Talend Data Stewardship Console application, the URL should be in the format of <protocol>://<host>:<port>/<context>/services/dsctaskloader.

Note that the parameter <context> in the URL is different depending on whether Talend Data Stewardship Console is installed standalone (standalone installation) or installed together with the MDM server (embedded installation).

For more information about how to install Talend Data Stewardship Console as a standalone application, see Talend Data Stewardship Console User Guide.

For more information about how to install the MDM server, see Talend Installation Guide.

Below are the default parameters of the URL to access the Talend Data Stewardship Console application in the two installation modes:

Default settings

Standalone installation

Embedded installation

protocol

httphttp

host

localhostlocalhost

port

80808180

context

/org.talend.datastewardship/talendmdm

SOAP service URL

http://localhost:8080/org.talend.datastewardship/services/TDSCWS?wsdlhttp://localhost:8180/talendmdm/services/TDSCWS?wsdl

Task loader URL

http://localhost:8080/org.talend.datastewardship/services/dsctaskloaderhttp://localhost:8180/talendmdm/services/dsctaskloader

Scenario: Writing data records in the stewardship console database

This scenario describes a five-component Job that generates data records in the form of tasks and loads them into the stewardship console database.

These tasks will need later the intervention of an authorized data steward to merge, compare and resolve the data records that are held in these tasks. For further information, see Talend Data Stewardship Console User Guide.

In this scenario:

  • A tFixedFlowInput component generates input data flow that has five columns: Source, Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc.

  • A tMatchGroup data quality component carries out matching operations on data across the heterogeneous sources defined in the input Source column. This component groups the output columns by a blocking value to optimize the matching operation and compare only the records that have the same blocking value, the Source column in this scenario. For more information on grouping output columns and using blocking values, see tMatchGroup.

  • A tMap component filters the input flow into unique data records and data records that have matching distances.

  • The unique data records are displayed on the Run console via the tLogRow component. All other data records that have a matching distance are sent to the Talend Data Stewardship Console database through the tStewardshipTaskOutput component and are displayed in the stewardship console. An authorized data steward can then intervene to merge the data records with matching distances.

For detail information about related scenarios, see Scenario 1: Generating functional keys in the output flow and Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

  • Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup, tMap, tStewardshipTaskOutput and tLogRow.

  • Connect the first three components together using the Main link.

  • Double-click tFixedFlowInput to display the Basic settings view and define the component properties as described in Scenario 1: Generating functional keys in the output flow.

    The tFixedFlowInput component generates an input data flow that has five columns: Source, Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc.

  • Double-click the tMatchGroup component to display the Basic Settings view and define the component properties.

  • Click Sync columns to retrieve the schema from the preceding component.

  • If required, click the Edit schema button to view the input and output schema and do any modifications in the output schema.

Note

In the output schema of this component, there are four output standard columns that are read-only. For more information, see tMatchGroup properties.

  • In the Key definition table, click the [+] button to add to the list the columns on which you want to do the matching operation, FirstName and LastName in this scenario.

  • Click in the first and second cells of the Matching type column and select from the list the method(s) to be used for the matching operation, Jaro-Winkler in this example.

  • Click in the first and second cells of the Confidence Weight column and set the numerical weights for each of the columns used as key attributes.

  • Click the [+] button below the Blocking Definition table to add a line in the table then click in the line and select from the list the column you want to use as a blocking value, Source in this example.

    Using a blocking value reduces the number of pairs of records that needs to be examined. The input data is partitioned into exhaustive blocks based on the data source. This will decrease the number of pairs to compare, as comparison is restricted to record pairs within each block.

  • Double-click the tMap component to open the Map Editor.

The input area to the left is already filled with the input schema coming from the previous component in the Job design.

  • Click the [+] button in the upper right corner of the output area to add as many output tables as needed, two in this example uniques and groups. The first table will group the unique data records and the second will group all the records that have matching distances to the master record in each group.

  • Drop the input columns to fill in the first output schema. For further information regarding data mapping, see Talend Studio User Guide.

    All the columns will be automatically filled in the Schema Editor in the below half of the Map Editor.

  • Click in the upper right corner of the first output table to add a condition to filter the data in the first output table: row2.GRP_SIZE == 1.

  • Drop the input columns to fill in the second output schema and add the following filter: row2.GRP_SIZE > 1 || !row2.MASTER.

  • In the Schema Editor of the second output table, click the [+] button to add two extra columns: weight and istarget. The first to measure the matching distance and the second to decide if the record will be a target record or a source record.

  • Click Ok to close the Map Editor.

  • In the design workspace, right-click tMap and select the uniques link and drop it on the tLogRow component. Do the same to connect tMap to tStewardshipTaskOutput with the groups link.

  • Double-click the tStewardshipTaskOutput component to display its Basic settings view and define the component properties.

  • In the Schema list, select Built-In and click the [...] button next to Edit schema to open a dialog box.

The data is collected from the columns defined in the groups output table in the tMap component.

  • Click OK to close the dialog box and proceed to the next step.

  • In the Url field, enter the URL for connecting to the stewardship console database.

  • In the Username and Password fields, enter your login and password to connect to the MDM server.

  • In the Task name field, enter a functional name for the task you want to list in Talend Data Stewardship Console.

  • From the Type list, select the type of the tasks you want to write in the stewardship console: Resolution or Data. In this example, only resolution tasks are to be written.

    For further information on task type, see Talend Data Stewardship Console User Guide.

  • In the Created by field, enter between inverted commas the name of the task creator, Administrator in this example. The task creator corresponds to the users of Talend MDM Web User Interface. For further information, see Talend MDM Web User Interface User Guide.

  • In the Owner field, enter between inverted commas the name of the task owner, the user to whom the task is assigned, Administrator in this example.

Note

Task can be assigned to a specific user either from the Basic settings view of the tStewardshipTaskOutput component, or directly from the stewardship console by an administrator. For further information, see tStewardshipTaskOutput.

  • In the Star field, enter between inverted commas the number of stars, 0 through 5, you want to assign to the task in the stewardship console to highlight importance.

  • In the Tags field, enter between inverted commas the name of the tag category associated with the tasks you want to read, not used in this example.

    For further information, see Talend Data Stewardship Console User Guide.

  • From the Looping column list, select a column in the input schema on which to base the loop, GID in this Example.

  • From the Source/Target selector list, select the column that will decide if the record will be a target record or a source record.

  • From the Source list, select a source column in the input schema.

  • From the Score list, select the matching score column in the input schema.

  • From the Weights list, select the column that defines the matching distance for the input columns.

  • In the Extra info table, click the button to add one or several rows that you can use to add extra information to one or several record in the created task.

Note

You can click the button to add all your schema in one go without having to add it row by row.

  • In the Title column, enter between inverted commas the role of the person who adds the information.

  • In the Info column, enter between inverted commas the extra information you want to attach to the selected column.

  • Click in the Scope column row and select from the list the record to which you want to add the extra information, PostalCode in this example.

    This will append a red mark to the PostalCode column when we open the relevant task in Talend Data Stewardship Console

When the data steward place the pointer on this mark, the attached information will display. Such information may help the steward in resolving the data record.

  • In the Record Column table, click the button to add the rows you want to show in each of the tasks to create in Talend Data Stewardship Console.

  • Click in each of the rows and select the column you want show in each of the created tasks. In this example, each task must have four columns: Firstname, Lastname, PostalCode and DOB.

Note

You can click the button to add all your input schema in one go without having to add it row by row.

  • Double-click the tLogRow component to display its Basic settings view and define the component properties.

  • Save your Job and press F6 to execute it.

The Run console displays the four columns from the input flow.

The identifier for each group (task) is listed in the GID column next to the corresponding record. The number of records in each of the tasks is listed in the GRP_SIZE column and computed only on the master record. The MASTER column indicates with "true" that the corresponding record is a master record. The SCORE column lists the calculated distance between the input record and the master record according to the Jargo-Winkler matching algorithm.

All other input records that have a matching distance are listed in Talend Data Stewardship Console waiting for a data steward to merge, compare and resolve the data records.