Migration Methodology - From Data Stewardship Console To Talend Data Stewardship

Author: Michael Gainhao
Version: 6.4
Products: Talend Data Fabric, Talend MDM Platform
Task: Installation and Upgrade
Platform: Talend Data Stewardship, Talend MDM Web UI

This article provides guidance on how to migrate tasks and data from Data Stewardship Console (DSC) to the new Talend Data Stewardship (TDS).

You will need to have in-depth knowledge of the following to complete the steps described below:

  • Talend Data Integration
  • Data Stewardship Console
  • Talend Data Stewardship

Prerequisites

It is imperative that both products are installed and configured correctly. The following components should be up and running:

Data Stewardship Console (DSC)

Talend Data Stewardship (TDS)

  • Talend Data Stewardship
  • Talend Administration Center (User Management)
  • Apache Kafka with Zookeeper
  • MongoDB
  • Talend Dictionary Service

The Data Stewardship Console (DSC) should have tasks and data already loaded for the migration.

User Management

User management in Data Stewardship Console (DSC) and Talend Data Stewardship (TDS) differs significantly:

                   DSC (in MDM)                          TDS
User Management    MDM application                       Talend Administration Center
Roles              N/A                                   Data Steward, Campaign Owner
Login Format       Simple user ID (e.g. administrator)   Email format (e.g. downer@company.com)

Because of these format differences, user migration can be performed either manually (when there are only a few users) or programmatically (when there are many).

Manual Creation Of Users

This approach consists of creating the users manually in Talend Administration Center. The user attributes and role(s) can be set up as shown in the screenshot below.

Programmatic Creation Of Users

We can also develop a process that reads the users from Data Stewardship Console and creates them through the MetaServlet API (createUser command) of Talend Administration Center.

MetaServlet Command: createUser

----------------------------------------------------------
 Command: createUser
----------------------------------------------------------
Description             : create a new user.
- 'count_policy' : [NAMED|CONCURRENT] NAMED for 'Named users', CONCURRENT for 'Concurrent users', default value is CONCURRENT
Requires authentication : true
Since                   : 3.2
Sample                  : 
{
  "actionName": "createUser",
  "authPass": "admin",
  "authUser": "admin@company.com",
  "count_policy": "NAMED",
  "dataPrep": true,
  "dataPrepRole": [
    "Administrator",
    "Dataset Manager",
    "Data Preparator"
  ],
  "tds": true,
  "tdsRole": [
    "Data Steward",
    "Campaign Owner"
  ],
  "userFirstName": "john",
  "userGitLogin": "jsmith",
  "userGitPassword": "7ob5iT3c",
  "userGroup": [],
  "userLastName": "smith",
  "userLogin": "john.smith@company.com",
  "userPassword": "kkE432",
  "userRole": [
    "Administrator",
    "Designer"
  ],
  "userSvnLogin": "jsmith",
  "userSvnPassword": "gvZ543uc",
  "userType": "DI"
}
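
As a minimal sketch of the programmatic route, the following Python builds a createUser request for one DSC user. The JSON keys are taken from the sample above, but the subset of keys kept here is an assumption, as are the helper name `build_create_user_request`, the base URL, and the email domain. Passing the Base64-encoded JSON as the query string reflects how MetaServlet is commonly invoked over HTTP; check this against the documentation for your Talend Administration Center version. The sketch only builds the URL and sends nothing.

```python
import base64
import json


def build_create_user_request(dsc_login, first_name, last_name, password,
                              admin_user, admin_pass,
                              email_domain="company.com",
                              base_url="http://localhost:8080/org.talend.administrator"):
    """Build a MetaServlet createUser URL for one DSC user (nothing is sent)."""
    payload = {
        "actionName": "createUser",
        "authUser": admin_user,
        "authPass": admin_pass,
        # TDS logins use an email format, unlike the simple DSC user ID.
        "userLogin": f"{dsc_login}@{email_domain}",
        "userFirstName": first_name,
        "userLastName": last_name,
        "userPassword": password,
        "userType": "DI",
        "tds": True,
        "tdsRole": ["Data Steward"],
    }
    # Assumption: MetaServlet takes the Base64-encoded JSON as the query string.
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return payload, f"{base_url}/metaServlet?{encoded}"


# Values reused from the sample above; the admin credentials are placeholders.
payload, url = build_create_user_request(
    "jsmith", "john", "smith", "kkE432", "admin@company.com", "admin")
```

A migration job could call such a helper once per DSC user and send each resulting URL with an HTTP GET.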

Data Structures

In the Data Stewardship Console, data structures (also known as schemas) are managed in Jobs at the component level, or write level (tStewardshipTaskOutput).

In DSC, data structures do not have constraints or specific policies on data types or access rights.

In Talend Data Stewardship (TDS), schema management is done through the data models you can create within the application:

A schema defines the (flat) structure of a record. The supported field types are:

  • Text
  • Integer
  • Decimal
  • Date “yyyy-MM-dd” (UTC) or number of days since epoch (int)
  • Time “HH:mm:ss” or number of seconds since beginning of the day (int)
  • Timestamp “yyyy-MM-dd HH:mm:ss” (UTC) or number of milliseconds since epoch (long)
  • Boolean
  • Data Quality Semantic types

A field can be mandatory (optional by default).

A field can have constraints depending on its type (e.g. min/max values for numeric fields, enumerations for text fields).

A schema can be shared among multiple campaigns in TDS.
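
The Date, Time, and Timestamp types listed above each accept a text form or a numeric form. The sketch below shows how the two forms relate, using only the formats stated in the list. The helper names are hypothetical, and which form a given job or component expects should be checked in your environment.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_epoch_days(text):
    """'yyyy-MM-dd' -> number of days since 1970-01-01."""
    return (date.fromisoformat(text) - date(1970, 1, 1)).days


def time_to_seconds(text):
    """'HH:mm:ss' -> number of seconds since the beginning of the day."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timestamp_to_epoch_millis(text):
    """'yyyy-MM-dd HH:mm:ss' (UTC) -> number of milliseconds since epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```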

How To Migrate Schemas?

The best approach is to create the data model from scratch in the new Talend Data Stewardship application, for the reasons described below.

We can customize each attribute with:

  • A technical name (used in jobs)
  • A business name (displayed in the Web UI)
  • A description
  • A data type

We can also define data quality rules:

  • A list of values
  • Options based on data types
  • Patterns
  • Mandatory

None of these options exist in Data Stewardship Console.

Campaigns

A campaign is a new concept introduced in Talend Data Stewardship (TDS), and it is required.

In Talend Data Stewardship, a campaign:

  • Is owned by one or several Campaign Owner(s)
  • Is attached to a schema
  • Is defined by
    • A type
    • A list of participants: stewards who can participate in the campaign, grouped by roles
    • Permissions on the schema attributes for each role
    • A workflow to orchestrate the campaign (state machine), in which each step is assigned to one or more roles

It is important to create the necessary campaigns before migrating any data. The campaigns can be created by Campaign Owners.

Only two types of tasks exist in Data Stewardship Console (DSC):

  • Resolution
  • Data

In Talend Data Stewardship (TDS), these two task types correspond, respectively, to:

  • Merging Campaign
  • Resolution Campaign

Merging Campaign

A Merging Campaign task in TDS is the equivalent of a Resolution task in DSC:

  • A task consists of a set of source records and a golden record
  • The goal is to build the golden record for the task

Resolution Campaign

A Resolution Campaign task in TDS is the equivalent of a Data task in DSC:

  • A task consists of a single record
  • The goal is to correct the record's field values

Metadata Migration

To migrate Data Stewardship Console tasks, we need to understand the metadata of a task:

  • Tag
  • Star/Priority
  • Status
  • Task Owner

Tags

Tags are used to manage categories and to identify a group of tasks based on a business need.

Tags cannot be retrieved at the component level; the tag is only a parameter of the tStewardshipTaskInput component. There is no out-of-the-box DSC component that reads only the tags.

To retrieve all tags of a specific task, we can develop a data integration job which reads the tag information from the DSC database. From the tasks table, we can get the tag keys for each task (as shown below):

Using these keys, we can retrieve tag names from the tags table:

In the migration job shown below, we can read directly from the database and retrieve all the tags for each task:

Running the above job produces results similar to those shown below, with the task_id, tags, and labels. This information can then be migrated to the Talend Data Stewardship application.
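
The join between tag keys and tag names can be sketched in Python as follows. The row layouts, tag keys, and labels here are hypothetical, because the real DSC table and column names are only visible in the screenshots. The sketch works on in-memory rows rather than a database connection.

```python
from collections import defaultdict

# Hypothetical rows: the real DSC table and column names may differ.
task_tag_rows = [("task-1", 10), ("task-1", 11), ("task-2", 10)]
tag_rows = [(10, "customers"), (11, "urgent")]


def collect_tag_labels(task_tag_rows, tag_rows):
    """Return {task_id: [tag label, ...]} by joining tag keys to tag names."""
    labels_by_key = dict(tag_rows)
    labels_by_task = defaultdict(list)
    for task_id, tag_key in task_tag_rows:
        # Skip keys with no matching row in the tags table.
        if tag_key in labels_by_key:
            labels_by_task[task_id].append(labels_by_key[tag_key])
    return dict(labels_by_task)


tags_per_task = collect_tag_labels(task_tag_rows, tag_rows)
```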

Stars (also known as Priority)

Stars are used to define the importance or priority of a task. They are displayed as shown below in the DSC user interface.

The star rank can be retrieved from DSC using the tStewardshipTaskInput component.

Stars define the priority of a DSC task. The priority levels in the new Talend Data Stewardship (TDS) are shown below. For this migration, we need to map the stars to these levels.

In order to map the Star rank to the Priority level, the following table is used:

Data Stewardship Console Talend Data Stewardship

Why is 0 stars equivalent to Medium? Because each is the default value in its system.

Status

In Data Stewardship Console (DSC), a task can be:

  • New : New task and no action done on it
  • Pending : Task modified but not approved
  • Locked : Task that cannot be modified but is not yet approved
  • Resolved : Task is approved and ready to be processed

The status can be retrieved using the tStewardshipTaskInput component:

In Talend Data Stewardship (TDS), there are three statuses:

  • New
  • Resolved
  • To Validate : Only available when a campaign uses a validation workflow.

In order to map these statuses, the following table is used:

Data Stewardship Console    Talend Data Stewardship
New                         New
Locked                      New
Pending                     New
Resolved                    Resolved
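
In code, the status mapping table above becomes a simple lookup. This is a minimal sketch that assumes the DSC status arrives as one of the four labels listed earlier; the component may expose a different encoding, in which case the keys need adjusting. The names `DSC_TO_TDS_STATUS` and `map_status` are hypothetical.

```python
# Mapping taken from the table above.
DSC_TO_TDS_STATUS = {
    "New": "New",
    "Locked": "New",
    "Pending": "New",
    "Resolved": "Resolved",
}


def map_status(dsc_status):
    """Translate a DSC task status into its TDS equivalent."""
    try:
        return DSC_TO_TDS_STATUS[dsc_status]
    except KeyError:
        # Fail loudly rather than migrate a task with an unknown status.
        raise ValueError(f"Unknown DSC status: {dsc_status!r}") from None
```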

Task Owner

In Data Stewardship Console (DSC), a task is assigned to only one user.

The ownership of a task can be retrieved from the tStewardshipTaskInput component:

The TASK_OWNER column can be used, through a join table, to look up the user's new Talend Data Stewardship login. Alternatively, you can leave all tasks unassigned so that users can pick their own tasks.

Example of a join table:

Data Stewardship Console    Talend Data Stewardship
administrator               jsmith@company.com
rbrown                      rbrown@company.com
lcaufield                   lcaufield@company.com

In the sample migration job, we chose the unassigned scenario because our users had not been created beforehand.
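
Both options, the join-table lookup and leaving tasks unassigned, can be sketched in one helper. The join table reuses the example rows above. Representing "unassigned" as `None` is an assumption, as are the names `OWNER_JOIN_TABLE` and `map_owner`.

```python
# Example join table from above: DSC login -> TDS login.
OWNER_JOIN_TABLE = {
    "administrator": "jsmith@company.com",
    "rbrown": "rbrown@company.com",
    "lcaufield": "lcaufield@company.com",
}


def map_owner(dsc_owner, join_table=OWNER_JOIN_TABLE, assign=True):
    """Return the TDS login for a DSC TASK_OWNER value, or None for unassigned."""
    if not assign or not dsc_owner:
        return None
    # Owners missing from the join table also fall back to unassigned.
    return join_table.get(dsc_owner)
```

Calling `map_owner(owner, assign=False)` for every task reproduces the unassigned scenario.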

Migration Jobs

The zip file attached below contains four jobs.

The first two jobs, in the DSC folder, are meant only as a demo:

  • DataTasksSample : It loads sample data into data tasks (DSC)
  • ResolutionTasksSample : It loads sample data into resolution tasks (DSC)

The two other jobs can be used as a starting point for your migration requirements.

  • MigrateDataTasks : Takes Data tasks from DSC and pushes them to the new TDS as Resolution tasks
  • MigrateResolutionTasks : Migrates Resolution tasks from DSC to Merging tasks in TDS

We will focus on the last two jobs. The first step in both jobs is to retrieve the tags for each task. As we saw previously, a task can have multiple tags, and they cannot be retrieved directly, so we need to process them beforehand.

The second step is dedicated to retrieving task data using the tStewardshipTaskInput component. Then, we join the results with tags and get two outputs: task data, task metadata.

We need to split them because metadata will be used to configure the third step. Getting task metadata is different from retrieving the actual data for the task.

Resolution tasks need two additional things:

  • Deduplication, because a Resolution task spans several rows. A Data task has only one row per task, so no deduplication is needed there.
  • GID generation, because the task ID is a UUID and cannot be used: the GID must be an integer.
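
These two requirements can be sketched as follows. The article does not say what counts as a duplicate or how the sample job generates the GID. This sketch assumes that exact-row equality defines a duplicate and uses a sequential counter per distinct task ID. The column names, UUIDs, and the helper name `assign_gids` are hypothetical.

```python
def assign_gids(rows, start=1):
    """Deduplicate rows and give each distinct task ID an integer GID."""
    gid_by_task = {}
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate row, already emitted
        seen.add(key)
        task_id = row["task_id"]
        if task_id not in gid_by_task:
            # Next integer in sequence for a task ID seen for the first time.
            gid_by_task[task_id] = start + len(gid_by_task)
        result.append({**row, "GID": gid_by_task[task_id]})
    return result


# Hypothetical source records: two tasks, one exact duplicate row.
rows = [
    {"task_id": "3f2b8c1e-0000-4000-8000-000000000001", "name": "J. Smith"},
    {"task_id": "3f2b8c1e-0000-4000-8000-000000000001", "name": "John Smith"},
    {"task_id": "3f2b8c1e-0000-4000-8000-000000000001", "name": "John Smith"},
    {"task_id": "3f2b8c1e-0000-4000-8000-000000000002", "name": "R. Brown"},
]
migrated = assign_gids(rows)
```

All source records of the same DSC task share one GID, so they stay grouped after migration.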

The third step sends the tasks to the new Talend Data Stewardship. For that, we use the metadata and iterate over each task, which makes the configuration available as global variables.

The new tDataStewardshipTaskOutput component then reads these values from the tFlowToIterate component.

To adjust the job to your needs, you have to change the following three things:

  • Schema : The columns in the test jobs will differ from your use case, because your actual data will be different.
  • tDataStewardshipTaskOutput : It is important to select the campaign you have defined.
  • Database Components : Adjust all the database components to match your requirements. In the example provided, the database components are MySQL, but you may be using a different database.

Jobs Migration

To adapt existing jobs that write data to DSC so that they write to TDS instead, the following changes are recommended:

Previous version (DSC) - Writing to DSC :

New version (TDS) - Writing to TDS:

We added a tMap to map old columns to the new structure.

You can then adjust the tDataStewardshipTaskOutput component settings to match your requirements.

Sample jobs: DataStewardship-Sample.zip TDS-Migration.zip