Scenario: Tracking changes using Slowly Changing Dimensions (type 0 through type 3) - 6.1

Talend Open Studio for Big Data Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Open Studio for Big Data
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This five-component Java scenario describes a Job that tracks changes in four of the columns in a source delimited file, writes changes and the history of changes in an SCD table, and displays error information on the Run console.

The source delimited file contains various personal details including firstname, lastname, address, city, company, age, and status. An id column helps ensuring the unicity of the data.

We want any change in the marital status to overwrite the existing old status record. This type of change is equivalent to an SCD Type 1.

We want to insert a new record in the dimensional table with a separate key each time a person changes his/her company. This type of change is equivalent to an SCD Type 2.

We want to track only the previous city and previous address of a person. This type of change is equivalent to an SCD Type 3.

To realize this kind of scenario, it is better to divide it into three main steps: defining the main flow of the Job, setting up the SCD editor, and finally creating the relevant SCD table in the database.

Defining the main flow of the Job

  1. Drop the following components from the Palette onto the design workspace: a tMysqlConnection, a tFileInputDelimited, a tMysqlSCD, a tMysqlCommit, and two tLogRow components.

  2. Connect the tFileInputDelimited, the first tLogRow, and the tMysqlSCD using the Row Main link. This is the main flow of your Job.

  3. Connect the tMysqlConnection to the tFileInputDelimited and tMysqlSCD to tMysqlCommit using the OnComponntOk trigger.

  4. Connect the tMysqlSCD to the second tLogRow using the Row Rejects link. Two columns, errorCode and errorMessage, are added to the schema. This connection collects error information.

Configuring the DB connection and the input component

  1. In the design workspace, double-click tMysqlConnection to display its Basic settings view and set the database connection details. The tMysqlConnection component should be used to avoid setting several times the same DB connection when multiple DB components are used.

    Note

    If you have already stored the connection details locally in the Repository, drop the needed metadata item to the design workspace and the database connection detail will automatically display in the relevant fields. For more information about Metadata, see Talend Studio User Guide.

    In this scenario, we want to connect to the SCD table where changes in the source delimited file will be tracked down.

  2. In the design workspace, double-click tFileInputDelimited to display its Basic settings view.

  3. Click the three-dot button next to the File Name field to select the path to the source delimited file, dataset.csv in this scenario, that contains the personal details.

  4. Define the row and field separators used in the source file.

    Note

    The File Name, Row separator, and Field separators are mandatory.

  5. If needed, set Header, Footer, and Limit.

    In this scenario, set Header to 1. Footer and limit for the number of processed rows are not set.

  6. Click Edit schema to describe the data structure of the source delimited file.

    In this scenario, the source schema is made of eight columns: id, firstName, lastName, address, city, company, age, and status.

  7. Define the basic settings for the first tLogRow in order to view the content of the source file with varying attributes in cells of a table on the console before being processed through the SCD component.

Configuring tMysqlSCD and tMysqlCommit

  1. In the design workspace, click the tMysqlSCD and select the Component tab to define its basic settings.

  2. In the Basic settings view, select the Use an existing connection check box to reuse the connection details defined on the tMysqlConnection properties.

  3. In the Table field, enter the table name to be used to track changes.

  4. If needed, click Sync columns to retrieve the output data structure from the tFileInputDelimited.

  5. In the design workspace, double-click tMysqlCommit to define its basic settings.

  6. Select the relevant connection on the Component list if more than one connection exists.

  7. Define the basic settings of the second tLogRow in order to view reject information in cells of a table.

Setting up the SCD editor

  1. Double-click the tMysqlSCD component in the design workspace or click the three-dot button next to the SCD Editor in the component's Basic settings view to open the SCD editor and build the data flow for the SCD outputs.

    All the columns from the preceding component are displayed in the Unused panel of the SCD editor. All the other panels in the SCD editor are empty.

  2. From the Unused list, drop the id column to the Source keys panel to use it as the key to ensure the unicity of the incoming data.

  3. In the Surrogate keys panel, enter a name for the surrogate key in the Name field, SK1 in this scenario.

  4. From the Creation list, select the method to be used for the surrogate key generation, Auto-increment in this scenario.

  5. From the Unused list, drop the firstname and lastname columns to the Type 0 panel, changes in these two columns do not interest us.

  6. Drop the status column to the Type 1 panel. The new value will overwrite the old value.

  7. Drop the company column to the Type 2 panel. Each time a person changes his/her company, a new record will be inserted in the dimensional table with a separate key.

    In the Versioning area:

    - Define the start and end columns of your SCD table that will hold the start and end date values. The end date is null for current records until a change is detected. Then the end date gets filled in and a new record is added with no end date.

    In this scenario, we select Fixed Year Value for the end column and fill in a fictive year to avoid having a null value in the end date field.

    - Select the version check box to hold the version number of the record.

    - Select the active check box to spot the column that will hold the True or False status. True for the current active record and False for the modified record.

  8. Drop the address and city columns to the Type 3 panel to track only the information about the previous value of the address and city.

    For more information about SCD types, see SCD management methodologies.

  9. Click OK to validate your configuration and close the SCD editor.

Creating the SCD table

  1. Click Edit schema to view the input and output data structures.

    The SCD output schema should include the SCD-specific columns defined in the SCD editor to hold standard log information.

    Note

    If you adjust any of the input schema definitions, you need to check, and reconfigure if necessary, the output flow definitions in the SCD editor to ensure that the output data structure is properly updated.

  2. In the Basic settings view of the tMysqlSCD component, select Create table if not exists from the Action on table list to avoid creating and defining the SCD table manually.

Job execution

Save your Job and press F6 to execute it.

The console shows the content of the input delimited file, and your SCD table is created in your database, containing the initial dataset.

Janet gets divorced and moves to Adelanto at 355 Golf Rd. She works at Greenwood.

Adam gets married and moves to Belmont at 2505 Alisson ct. He works at Scoop.

Martin gets a new job at Phillips and Brothers.

Update the delimited file with the above information and press F6 to run your Job.

The console shows the updated personal information and the rejected data, and the SCD table shows the history of valid changes made to the input file along with the status and version number. Because the name of Martin's new company exceeds the length of the column company defined in the schema, this change is directed to the reject flow instead of being logged in the SCD table.