tDataStewardshipTaskOutput - 6.3

Talend Components Reference Guide


tDataStewardshipTaskOutput connects to the Talend Data Stewardship server and loads data into campaigns in the form of tasks. The tasks must have the same schema as the one defined in the campaign.

An authorized campaign participant can then intervene on the tasks and use the capabilities provided by Talend Data Stewardship to resolve the tasks.

For further information about Talend Data Stewardship, see the Talend Data Stewardship documentation on Talend Help Center (https://help.talend.com).

tDataStewardshipTaskOutput properties in Standard Jobs

Basic settings

These properties are used to configure the tDataStewardshipTaskOutput component which runs in the Standard Job framework.

The Standard tDataStewardshipTaskOutput component belongs to the Talend Data Stewardship family.

Property Type

Either Built-in or Repository.

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from the previous component in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

Once you select a campaign from the Find a campaign list, tDataStewardshipTaskOutput becomes aware of the campaign schema and it creates identical schema columns in its basic settings.

If you choose to write the tasks in a Merging campaign, the following four fields are automatically added to the schema:

  • GID: holds the group identifier.

    This identifier is used by tDataStewardshipTaskOutput to group records in tasks. All source records that should be grouped in a single task must have the same GID.

  • MASTER: indicates if the record is a master record or a source.

    Two cases to consider:

    • If no source record is set as master for a given task, Talend Data Stewardship determines which attributes of matched records to use to create the master record according to the survivorship rules you define when creating the campaign.

    • If more than one source record is set as master for a given task, Talend Data Stewardship takes the first source record set as master to be the master record. The best practice is therefore to have at most one master record per task.

  • SOURCE: name of the source for the record, if any.

  • SCORE: lists the calculated distance between the input and the master records according to the matching algorithm.

If one of the above names is already present in the schema, the TDS_ prefix is added to the field name.
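The rules above for the Merging fields can be illustrated with a minimal Java sketch. It is not the component's implementation: the SourceRecord class, the method names and the sample values are all hypothetical and only model the behavior described in this section (grouping by GID, taking the first record flagged as master, and adding the TDS_ prefix on a name clash).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; this is not the component's implementation. */
final class MergingFieldsSketch {

    /** Hypothetical record shape that models only the four technical fields. */
    static final class SourceRecord {
        final String gid;      // GID: group identifier
        final boolean master;  // MASTER: true for a master record
        final String source;   // SOURCE: name of the source, may be null
        final double score;    // SCORE: distance to the master record

        SourceRecord(String gid, boolean master, String source, double score) {
            this.gid = gid;
            this.master = master;
            this.source = source;
            this.score = score;
        }
    }

    /** All records that share a GID end up in the same task. */
    static Map<String, List<SourceRecord>> groupByGid(List<SourceRecord> records) {
        Map<String, List<SourceRecord>> tasks = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            tasks.computeIfAbsent(r.gid, k -> new ArrayList<>()).add(r);
        }
        return tasks;
    }

    /**
     * Returns the first record flagged as master. An empty result stands for
     * "no master provided", where the server applies the survivorship rules.
     */
    static Optional<SourceRecord> pickMaster(List<SourceRecord> task) {
        return task.stream().filter(r -> r.master).findFirst();
    }

    /** Prefixes a technical field with TDS_ when its name is already in the schema. */
    static String technicalFieldName(String name, List<String> schemaColumns) {
        return schemaColumns.contains(name) ? "TDS_" + name : name;
    }

    public static void main(String[] args) {
        // Made-up sample records: group g1 has two records flagged as master.
        List<SourceRecord> records = Arrays.asList(
            new SourceRecord("g1", false, "CRM", 0.92),
            new SourceRecord("g1", true, "ERP", 1.0),
            new SourceRecord("g1", true, "WEB", 0.88),
            new SourceRecord("g2", false, "CRM", 0.75));

        groupByGid(records).forEach((gid, task) -> System.out.println(
            gid + ": " + task.size() + " record(s), master source = "
                + pickMaster(task).map(r -> r.source).orElse("none (survivorship rules)")));

        System.out.println(technicalFieldName("SCORE", Arrays.asList("Id", "SCORE")));
    }
}
```

In this sketch, group g1 keeps the ERP record as master because it is the first one flagged, which is why supplying at most one master per task avoids surprises.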

URL

Enter the address to access the Talend Data Stewardship server suffixed with /data-stewardship/, for example http://localhost:8990/data-stewardship/.
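When the server address comes from a context parameter or another external source, it can help to normalize it so that it always ends with /data-stewardship/. The following Java helper is a hypothetical sketch of such a normalization; the class and method names are not part of any Talend API.

```java
/** Illustrative sketch only; the helper name is hypothetical. */
final class StewardshipUrlSketch {

    static final String SUFFIX = "/data-stewardship/";

    /** Returns the address with exactly one trailing /data-stewardship/ suffix. */
    static String withStewardshipSuffix(String base) {
        String trimmed = base.trim();
        if (trimmed.endsWith(SUFFIX)) {
            return trimmed;
        }
        // Drop a single trailing slash before deciding what to append.
        String noSlash = trimmed.endsWith("/")
            ? trimmed.substring(0, trimmed.length() - 1)
            : trimmed;
        if (noSlash.endsWith("/data-stewardship")) {
            return noSlash + "/";
        }
        return noSlash + SUFFIX;
    }

    public static void main(String[] args) {
        System.out.println(withStewardshipSuffix("http://localhost:8990"));
        System.out.println(withStewardshipSuffix("http://localhost:8990/data-stewardship"));
    }
}
```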

Username and Password

Enter the authentication information to the Talend Data Stewardship server.

To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

Campaign

Click Find a campaign to open a list of all the campaigns available on the server, and select the campaign in which to write the tasks.

Label

A read-only field which shows the campaign name once the campaign is selected.

Type

A read-only list which shows the campaign type out of the predefined types once the campaign is selected.

Override enforcement of data model

Select this check box if you want to process data on the Talend Data Stewardship server even if the schema type is not valid. In this case, no input validation of the schema is performed. This check box is selected by default for RESOLUTION campaigns. However, it should be selected for all campaign types to guarantee smooth processing of data.

State and Assignee

State: Select from the list the state of the tasks you want to create.

Assignee: Select the campaign participant to whom you want to assign the tasks you create. Otherwise, select No Assignee to create the tasks without assigning them to anybody.

You can also select Custom and set custom expressions in the fields which are displayed.

Priority, Choice and Tags

Priority (optional): Select any of the task priorities. Otherwise, select Custom and set a custom expression in the field which is displayed.

If no level is selected, Medium is used by default.

Choice (optional, available only when an ARBITRATION campaign is selected): Select any of the choice options set on the records while defining the campaign in the web application. The default value is No Choice, which leaves the choice to data stewards in the web application. However, setting an arbitration choice in the Job can help the steward by pre-selecting the choice that is presumed most relevant when the tasks are created.

Tags (optional): Enter the tag to associate with the tasks you want to create.

You can use the tag(s) to filter the tasks you want to load into the campaign.

Comments

(Optional): Select one or several schema columns and enter the comment you want to add to the tasks you want to create.

The campaign participant can see the comment whenever they place the pointer on the source record column in Talend Data Stewardship. This information can help them make a more informed decision when resolving the task.

Advanced settings

Max tasks per commit

Set the number of tasks you want to have in each commit.

Do not change the default value unless you are facing performance issues. Increasing the commit size can improve performance, but setting too high a value could cause Job failures.
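The effect of this setting can be pictured as splitting the incoming rows into fixed-size batches, each sent in one commit. The Java sketch below is only an illustration of that batching idea under this assumption; it does not reproduce how the component actually talks to the server, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the component's commit logic. */
final class CommitBatchSketch {

    /** Splits the tasks into consecutive batches of at most maxTasksPerCommit items. */
    static <T> List<List<T>> partition(List<T> tasks, int maxTasksPerCommit) {
        if (maxTasksPerCommit <= 0) {
            throw new IllegalArgumentException("maxTasksPerCommit must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < tasks.size(); start += maxTasksPerCommit) {
            int end = Math.min(start + maxTasksPerCommit, tasks.size());
            batches.add(new ArrayList<>(tasks.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> taskIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        // With a batch size of 3, seven tasks need three commits.
        for (List<Integer> batch : partition(taskIds, 3)) {
            System.out.println("commit " + batch);
        }
    }
}
```

A larger batch size means fewer round trips but bigger payloads per commit, which is the trade-off behind the recommendation to keep the default value.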

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Group ID

This field is available only for the Merging campaigns.

It identifies the records to be grouped into the same task.

Master Indicator

This field is available only for the Merging campaigns.

It indicates, with true or false, whether the record is a golden (master) record or a source record, respectively.

Source

This field is available only for the Merging campaigns.

It provides the name of the source record, if any.

Score

This field is available only for the Merging campaigns.

It provides the matching score of the source record, if any.

Global variables

NB_LINE

The number of messages processed. This is an After variable and it returns an integer.

NB_REJECT

The number of rows rejected. This is an After variable and it returns an integer.

NB_SUCCESS

The number of rows successfully processed. This is an After variable and it returns an integer.

ERROR_MESSAGE

The error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.
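In a Talend Job, After variables are typically read from the Job's globalMap with a key of the form componentName_VARIABLE, for example in a tJava component placed after the subjob. The Java sketch below mimics that lookup with a plain map so that it can run outside the Studio; the component instance name tDataStewardshipTaskOutput_1 and the sample values are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; a plain map stands in for the Job's globalMap. */
final class GlobalVariablesSketch {

    /** Builds a one-line summary from the After variables of the given component. */
    static String summarize(Map<String, Object> globalMap, String componentName) {
        Integer nbLine = (Integer) globalMap.get(componentName + "_NB_LINE");
        Integer nbSuccess = (Integer) globalMap.get(componentName + "_NB_SUCCESS");
        Integer nbReject = (Integer) globalMap.get(componentName + "_NB_REJECT");
        String error = (String) globalMap.get(componentName + "_ERROR_MESSAGE");
        return "processed=" + orZero(nbLine)
            + ", success=" + orZero(nbSuccess)
            + ", rejected=" + orZero(nbReject)
            + ", error=" + (error == null ? "none" : error);
    }

    /** A variable that was never set is reported as zero. */
    static int orZero(Integer value) {
        return value == null ? 0 : value;
    }

    public static void main(String[] args) {
        // Stand-in for the values a Job run would store; the numbers are made up.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tDataStewardshipTaskOutput_1_NB_LINE", 13);
        globalMap.put("tDataStewardshipTaskOutput_1_NB_SUCCESS", 12);
        globalMap.put("tDataStewardshipTaskOutput_1_NB_REJECT", 1);
        System.out.println(summarize(globalMap, "tDataStewardshipTaskOutput_1"));
    }
}
```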

Usage

Usage Rule

This component is usually used as an end component of a Job or Subjob and it always needs an input link.

Scenario 1: Writing tasks in Arbitration and Resolution campaigns

This Job loads tasks in two different campaigns defined on the Talend Data Stewardship server according to the criteria you define in the basic settings of the tDataStewardshipTaskOutput components.

The data records in these tasks have some problems. But once they are on the server, an authorized campaign participant can intervene and resolve the tasks.

Creating a Job to write stewardship tasks on the server

Create a Job which connects to the Talend Data Stewardship server and writes data records in the form of tasks in different campaigns.

Prerequisites:

  • The campaigns in which you want to write the tasks are already defined on the Talend Data Stewardship server and their schemas are well defined.

  • The tasks you want to write must have the same schema as the one defined in the campaigns.

  • You have been assigned in Talend Administration Center the Campaign Owner role which grants you access to the campaigns on the server.

  1. In the design workspace, start typing tDataStewardshipTaskOutput and select this component from the list that opens. Repeat the operation to add another tDataStewardshipTaskOutput component to the workspace.

  2. Do the same to add two tFileInputDelimited components to the workspace.

  3. Link the tFileInputDelimited components to the tDataStewardshipTaskOutput components using the Row > Main links.

  4. Link the tDataStewardshipTaskOutput components together using the Trigger > OnSubjobOk link.

Reading tasks and sending the fields to the next component

Configure the tFileInputDelimited components to read tasks from input files:

  • the first holds the records of the candidates of a beta testing program, to be written in an Arbitration campaign, namely Beta Candidates.

  • the second holds the records of an enterprise product line to be written in a Resolution campaign, namely Product Catalog.

  1. Double-click each of the tFileInputDelimited components to open its Basic settings view.

  2. Click the [...] buttons next to Edit schema to open dialog boxes where you can define the schemas which correspond to the input file structures.

    Add the following groups of columns in the tFileInputDelimited components:

    • Arbitration tasks: Id, First_name, Last_name, Gender, Age, Occupation, Company, Address, City, State, Zip, Phone and Email.

    • Resolution tasks: Id, Name, Material, Size, Price, Quantity, Family and Packaging.

  3. Click OK in each of the dialog boxes and accept the propagation of the changes when prompted.

    Each of the tDataStewardshipTaskOutput components in the Job inherits the schema from the corresponding tFileInputDelimited.

  4. Set the row and field separators in the corresponding fields and the header and footer, if any.
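To make the relationship between the schema and the input file concrete, the following Java sketch parses one delimited line into the Resolution columns listed above. It is not what tFileInputDelimited generates; the semicolon separator and the sample row are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the code generated by tFileInputDelimited. */
final class DelimitedSchemaSketch {

    /** Column names of the Resolution tasks, in file order. */
    static final List<String> RESOLUTION_COLUMNS = Arrays.asList(
        "Id", "Name", "Material", "Size", "Price", "Quantity", "Family", "Packaging");

    /** Maps each value of a delimited line to the schema column at the same position. */
    static Map<String, String> parseLine(String line, String fieldSeparator, List<String> columns) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] values = line.split(Pattern.quote(fieldSeparator), -1);
        if (values.length != columns.size()) {
            throw new IllegalArgumentException(
                "expected " + columns.size() + " fields but found " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            row.put(columns.get(i), values[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up sample row; real data comes from the input file of the Job.
        String line = "1;Widget;Steel;M;9.99;10;Tools;Box";
        System.out.println(parseLine(line, ";", RESOLUTION_COLUMNS));
    }
}
```

A line with a missing or extra field is rejected here, which mirrors the requirement that the tasks have the same schema as the campaign.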

Writing tasks in stewardship campaigns

Configure the tDataStewardshipTaskOutput components to load tasks in the Beta Candidates and Product Catalog campaigns which are already defined on the Talend Data Stewardship server and which have the same schema as the data in the input files.

  1. Double-click the first tDataStewardshipTaskOutput component to open its Basic settings view.

  2. In the URL field, enter the address of the Talend Data Stewardship server suffixed with /data-stewardship/, for example http://localhost:8990/data-stewardship/.

    In this example, all connection information is defined as context parameters and centralized in the Studio repository. For further information about context parameters, see Talend Studio User Guide.

  3. Enter your login information to the server in the Username and Password fields.

    To enter your password, click the [...] button next to the Password field, enter your password between double quotes in the dialog box that opens and click OK.

  4. Click Find a campaign to open a dialog box which lists the campaigns on the server that you own or to which you have access rights.

  5. Click the column header to sort the list alphabetically for text columns and chronologically for the date column. Select the campaign in which to write the arbitration tasks, Beta Candidates for the first component, and click OK.

    The Campaign, Label and Type fields are automatically filled in with the campaign metadata.

    The schema of the selected campaign is retrieved from the server and is read-only. You can click Edit Schema to display it.

  6. Select the Override enforcement of data model check box to load the new tasks into the campaign even if their schema type does not match what has been defined on the Talend Data Stewardship server.

  7. Set the metadata of the tasks you want to write in the Arbitration campaign as follows:

    • From the State list, select New to assign the New status to the tasks you write.

    • From the Assignee list, select the campaign participant to whom you want to assign the new tasks in this example. Otherwise, select No Assignee to write the tasks in the campaign pending assignment to a participant.

    • From the Priority list, select High as the priority level you want to assign to the tasks.

    • From the Choice list, select No Choice to write the tasks pending a choice.

      Data stewards should then select the relevant choice from the web application.

  8. In the Tags field, enter the tag or tags you want to associate with the tasks. Use a comma to separate multiple tags.

    You can use the tag(s) to filter the tasks you want to load into the campaign.

  9. Add columns to the Comments table and enter a comment for the Company and Occupation columns.

    The campaign participant can see the comment whenever they place the pointer on the column in Talend Data Stewardship. This information can help them make a more informed decision when resolving the task.

  10. Click Advanced settings to open the corresponding view and set the number of tasks you want to have in each commit in the Max tasks per commit field.

  11. Double-click the other tDataStewardshipTaskOutput component and follow the same steps to decide the metadata of the tasks to write in the Resolution campaign.

    This Job writes the resolution tasks in the Product Catalog campaign, does not assign them to any participant, does not set a priority level and does not define any tags on them.
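Two of the rules used in the steps above lend themselves to a short illustration: multiple tags are separated by commas, and Medium is used when no priority level is selected. The Java sketch below models both rules with hypothetical helper names and made-up tag values; it is not code from the component.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; helper names and tag values are hypothetical. */
final class TaskMetadataSketch {

    /** Splits a comma-separated tag field into trimmed, non-empty tags. */
    static List<String> parseTags(String tagField) {
        List<String> tags = new ArrayList<>();
        if (tagField == null) {
            return tags;
        }
        for (String part : tagField.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Falls back to Medium when no priority level is selected. */
    static String effectivePriority(String selected) {
        return (selected == null || selected.trim().isEmpty()) ? "Medium" : selected.trim();
    }

    public static void main(String[] args) {
        // Arbitration tasks in this scenario: High priority and two made-up tags.
        System.out.println(effectivePriority("High") + " " + parseTags("beta, candidates"));
        // Resolution tasks in this scenario: no priority and no tags are set.
        System.out.println(effectivePriority(null) + " " + parseTags(""));
    }
}
```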

Executing the Job to write tasks in the stewardship campaigns

Once you set up the Job and finalize the configuration of the components, you can execute it to write the tasks into the campaigns defined on the Talend Data Stewardship server and verify the execution results.

  • Press F6 to save and execute the Job.

    Data records from the input files are written in the form of tasks in the selected campaigns on the server.

    The arbitration tasks are already assigned to a specific data steward as defined in the component properties, while the resolution tasks are awaiting assignment.

    Authorized data stewards can now access these campaigns and resolve the listed tasks.

Scenario 2: Writing tasks in a Merging campaign

This Job loads tasks into a Merging campaign defined on the Talend Data Stewardship server according to the criteria you define in the basic settings of the tDataStewardshipTaskOutput component. The data records in these tasks have duplicates. But once they are on the server, authorized campaign participants can intervene and merge the records.

In this Job:

  • The tFileInputDelimited component reads the customer data.

  • The tMatchGroup component compares data using matching and blocking methods and groups similar records that are potential duplicates.

  • The tMap component maps the group identifier, GID, generated by tMatchGroup to TDS_GID.

    When the input data has a column which holds the names of the data sources, tMap can also map the input column to TDS_SOURCE.

  • The tDataStewardshipTaskOutput component writes the data in the CRM Data Deduplication campaign on the Talend Data Stewardship server.
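The mapping performed by tMap in this Job can be pictured with the following Java sketch, which renames the GID column to TDS_GID and, when a source column exists, copies it to TDS_SOURCE. The row representation, the method name and the Origin column are assumptions for illustration; in the Studio the mapping is defined graphically in the tMap editor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; in the Job this mapping is drawn in the tMap editor. */
final class GidMappingSketch {

    /**
     * Copies the business columns, exposes GID as TDS_GID and, when the given
     * source column exists in the row, also exposes its value as TDS_SOURCE.
     */
    static Map<String, Object> toCampaignRow(Map<String, Object> row, String sourceColumn) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            if (!"GID".equals(entry.getKey())) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        out.put("TDS_GID", row.get("GID"));
        if (sourceColumn != null && row.containsKey(sourceColumn)) {
            out.put("TDS_SOURCE", row.get(sourceColumn));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up row; "Origin" is a hypothetical column holding the data source name.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("Id", 1);
        row.put("First_name", "Ann");
        row.put("Origin", "CRM");
        row.put("GID", "g1");
        System.out.println(toCampaignRow(row, "Origin"));
    }
}
```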

Creating a Job to write stewardship tasks in a Merging campaign

Create a Job which connects to the Talend Data Stewardship server and writes data records in the form of tasks into a Merging campaign.

Prerequisites:

  • The campaign in which you want to write the tasks is already defined on the Talend Data Stewardship server and its schema is well defined.

  • The tasks you want to write must have the same schema as the one defined in the campaign.

  • You have been assigned in Talend Administration Center the Campaign Owner role which grants you access to the campaigns on the server.

  1. In the design workspace, start typing tDataStewardshipTaskOutput and select this component from the list that opens.

  2. Do the same to add tFileInputDelimited, tMatchGroup and tMap components to the workspace.

  3. In the Advanced settings of tMatchGroup, select the Separate output check box to have different output flows for unique, match and suspect records.

    This enables you to exclude unique records when loading data to Talend Data Stewardship and decide to load either match or suspect records or both.

  4. Link tMatchGroup to tMap using the Matches link, and link the other components together using the Row > Main link.

Reading merging tasks and sending the fields to the next component

Configure the tFileInputDelimited component to read tasks from the input file which holds customer duplicate records.

  1. Double-click tFileInputDelimited to open its Basic settings view.

  2. Click the [...] button next to Edit schema to open a dialog box where you can define the schema which corresponds to the input file structure.

    Add the following columns in the tFileInputDelimited component: Id, First_name, Last_name, Gender, Age, Occupation, Company, Address, City, State, Zip, Phone and Email.

  3. Click OK in the dialog box and accept the propagation of the changes when prompted.

  4. Set the row and field separators in the corresponding fields and the header and footer, if any.

Creating the match rule to group similar records

Configure the tMatchGroup component to group potential duplicates together based on matching algorithms.

  1. Double-click tMatchGroup to open the configuration wizard where you can define the match rule.

  2. In the Key Definition table, define what match algorithms to use and on what columns. Similarly, in the Blocking Selection table, select what column to use as a blocking value in order to reduce the number of pairs that need to be examined.

    For further information, see tMatchGroup.

  3. Click the Chart button to display the matching results in the wizard and then click OK.

  4. In the component properties, click Advanced settings and make sure the Sort output data by GID check box is selected.

    Note

    If this option is not enabled, potential duplicates could be grouped in different tasks when loaded to Talend Data Stewardship.

  5. Double-click tMap to open its editor.