tOracleCDC - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tOracleCDC extracts source system data that has changed since the last extraction and transports it to another/other system(s).

Purpose

tOracleCDC addresses data extraction and transportation needs. It only extracts the changes done to the source operational data and makes them available to the target system(s) using database CDC views.

tOracleCDC properties

Component family

Databases/Oracle

 

Basic settings

Property of the CDC connection

Either Built-in or Repository.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

Use an existing connection

Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.

Note

When a Job contains the parent Job and the child Job, if you need to share an existing connection between the two levels, for example, to share the connection created by the parent Job with the child Job, you have to:

  1. In the parent level, register the database connection to be shared in the Basic settings view of the connection component which creates that very database connection.

  2. In the child level, use a dedicated connection component to read that registered database connection.

For an example about how to share a database connection across Job levels, see Talend Studio User Guide.

 

Connection Type

Select a connection type from the drop-down list:

  • Oracle SID: Select this connection type to uniquely identify a particular database on a system.

  • Oracle Service Name: Select this connection type to use the TNS alias that you give when you connect to the remote database.

  • Oracle OCI: Select this connection type to use Oracle Call Interface with a set of C-language software APIs that provide an interface to the Oracle database.

  • Oracle CUSTOM: Select this connection type to access a clustered database.

 

DB Version

Select the version of the Oracle database being used.

 

Host

Specify the host name or IP address of the Oracle database.

This field appears only when Oracle SID or Oracle Service Name is selected from the Connection Type drop-down list.

 

Port

Specify the listening port number of the Oracle database.

This field appears only when Oracle SID or Oracle Service Name is selected from the Connection Type drop-down list.

 

Database

Specify the Oracle database name.

This field appears only when Oracle SID or Oracle Service Name is selected from the Connection Type drop-down list.

 

Local Service Name

Specify the local service name of the Oracle database.

This field appears only when Oracle OCI is selected from the Connection Type drop-down list.

 

URL

Specify the URL of the remote Oracle database.

This field appears only when Oracle CUSTOM is selected from the Connection Type drop-down list.

 

Oracle schema

Specify the name of the Oracle schema.

 

Username and Password

Specify the user authentication data of the Oracle database.

To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

 

Schema using CDC and Edit schema

A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The schema is set by default according to the CDC mode being used.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

CDC mode

Select a CDC mode to be used:

  • Trigger: the CDC trigger mode.

  • Log: the CDC Redo/Archive log mode. This mode is available only when the DB Version is Oracle 11-6 or a previous version.

  • XStream: the CDC XStream mode. This mode is available only when the Connection Type is Oracle OCI and the DB Version is Oracle 12-7.

For more information about CDC modes, see Talend Studio User Guide.

 

Generate LCR type

Select a type for the LCR to be generated, either LCR Object or LCR XML Document.

This list appears only when the CDC mode is XStream.

 

Table using CDC

Specify the source table from which changes made to data are to be captured.

 

Set data table schema

Select this check box if you want to specify the schema of the CDC table.

This check box appears only when the CDC mode is Trigger or Log.

 

Subscriber

Specify the name of the application that will use the change table if you use the Trigger mode to capture changes done on data.

This field appears only when the CDC mode is Trigger.

 

Events to catch

Insert: Select this check box to catch the data inserted in the change table since the last extraction.

Update: Select this check box to catch the data updated in the change table since the last extraction.

Delete: Select this check box to catch the data deleted in the change table since the last extraction.

These check boxes appear only when the CDC mode is Trigger or Log.

 

Limit

Specify the maximum number of consumed rows a subscriber can recover from the change table, per execution.

This field appears only when the CDC mode is Trigger.

 

Outbound server name

Specify the XStream outbound server name.

This field appears only when the CDC mode is XStream.

 

Keep listening

Select this check box to keep monitoring the outbound server for data changes. The Job will not stop until you manually stop it when you need to.

This check box appears only when the CDC mode is XStream.

Advanced settings

Keep data in CDC Table

Select this check box to keep the changes made available to one or more target systems, even after they have been consulted.

This field appears only when the CDC mode is Trigger or Log.

 

Use cursor

Select this check box to specify the number of rows you want to work with at any given time. This option optimises performance.

This field appears only when the CDC mode is Trigger or Log.

 

Trim all the String/Char columns

Select this check box to delete any spaces found at the beginning or end of fields, in all of the columns containing character strings.

This field appears only when the CDC mode is Trigger or Log.

 

Trim column

Remove leading and trailing whitespace from defined columns.

Note

Clear Trim all the String/Char columns to enable Trim columns in the table.

This table appears only when the CDC mode is Trigger or Log.

 

Acknowledge interval(seconds)

Specify the value for the OCI_ATTR_XSTREAM_ACK_INTERVAL attribute for the outbound server. The default value is 30 seconds.

For more infomrmation about this attribute, see http://docs.oracle.com/cd/E18283_01/server.112/e16545/xstrm_oci_intro.htm#CEGDICBB.

This field appears only when the CDC mode is XStream.

 

Idle timeout(seconds)

Specify the value for the OCI_ATTR_XSTREAM_IDLE_TIMEOUT attribute for the outbound server. The default value is 1 second.

For more information about this attribute, see http://docs.oracle.com/cd/E18283_01/server.112/e16545/xstrm_oci_intro.htm#CEGDICBB.

This field appears only when the CDC mode is XStream.

 

tStatCatcher Statistics

Select this check box to collect the log data at component level.

 

Enable parallel execution

Select this check box to perform high-speed data processing, by treating multiple data flows simultaneously. Note that this feature depends on the database or the application ability to handle multiple inserts in parallel as well as the number of CPU affected. In the Number of parallel executions field, either:

  • Enter the number of parallel executions desired.

  • Press Ctrl + Space and select the appropriate context variable from the list. For further information, see Talend Studio User Guide.

Note that when parallel execution is enabled, it is not possible to use global variables to retrieve return values in a subjob.

Warning

  • The Action on table field is not available with the parallelization function. Therefore, you must use a tCreateTable component if you want to create a table.

  • When parallel execution is enabled, it is not possible to use global variables to retrieve return values in a subjob.

Global Variables 

NB_LINE: the number of rows processed. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is used as a start component. It requires an output component and row Main link.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Scenario 1: Populating a data warehouse

The following Java scenario creates a three-component Job that populates a data warehouse. A tOracleInput component reads your client data stored in the CLIENT table. A tMap component allows you to modify this data and the modifications are transmitted to the Leadfact table in the database through a tOracleOutput component.

  1. Drop the following components from the Palette onto the design workspace: tOracleInput, tMap, and tOracleOutput.

  2. Connect the three components using Row Main links. Click the link between tMap and tOracleOutput and name it leadfact.

  3. Double-click tOracleInput to display its Basic settings view and define its properties.

  4. Set Property Type to Repository and then select the connection to the TALEND database that holds the information about your clients. The connection details will display automatically in the corresponding fields.

    Note

    If you have not stored the DB connection details in the Metadata entry in the Repository, select Built-in in the property type list and set the connection details manually.

  5. Set Schema to Repository and click the three-dot button to select the schema of the CLIENT table stored in the Metadata folder.

    Related topics: see Talend Studio User Guide.

  6. In the Table Name field, enter the name of the table holding the information you want to modify, CLIENT in this example.

  7. Click Guess Query to retrieve all data from your table.

  8. Double-click the tMap component to open the Map Editor. Notice that the input area to the left is already filled with the metadata of the input component.

  9. Drag the fields in the input zone to the fields in the leadfact table in the output zone. For more information regarding data mapping, see Talend Studio User Guide.

  10. Click OK to validate the operation.

  11. In the design workspace, double-click tOracleOutput to display its Basic settings view and define its properties.

  12. Set Property Type to Repository and then select the cdc_data connection to the data warehouse. The connection details will display automatically in the corresponding fields

    Note

    If you have not stored the data warehouse connection details in the Metadata folder in the Repository tree view, select Built-in in the property type list and set the connection details manually.

    Related topics: see Talend Studio User Guide.

  13. In the Table Name field, enter the name of the table you want to populate with modified data, LEADFACT in this example.

  14. Save your job and press F6 to create and populate the LEADFACT table in the data warehouse.

Scenario 2: Retrieving modified data using Oracle CDC Redo log mode

Warning

Before modifying data in the LEADFACT table and retrieving these modifications using the tOracleCDC component, you must activate the Redo log mode in the Oracle database that holds the table to monitor. For more information on how to activate the Redo log mode, see Talend Studio User Guide.

This scenario is based on the preceding one. It continuously populates and modifies the data stored in the LEADFACT table, and retrieves and saves, every night, these modifications in a dedicated table using the CDC function. These modifications could be then extracted by the various concerned departments.

Step 1: Configuring CDC

Before being able to retrieve modified data from the LEADFACT table, you must:

  1. first set up the DB connection dedicated to CDC,

  2. second, set up a DB connection to the source data and identify the table to catch,

  3. finally, set the connection between the CDC and the data.

    To do that:

  4. In the Repository tree view and under Metadata, create a connection to your database dedicated to CDC, cdc_publisher in this scenario.

    Note

    Ensure that the DB connection for CDC is on the same server with the source data to which changes are to be captured.

  5. In the Repository tree view and under Metadata, create a connection to the source data warehouse and identify the table to catch, LEADFACT in this scenario.

  6. Right-click the connection to the orcle database and select Retrieve schema from the drop-down menu to retrieve the schema of the table to catch.

  7. Right-click CDC Foundation of the cdc-data connection and select Create CDC in the drop-down menu.

    The [Create Change Data Capture] dialog box displays

  8. Click the three-dot button next to the Set link Connection field and select the connection that corresponds to CDC, cdc_publisher in this example.

  9. In the Options area, select the Log mode check box.

  10. Click Create Subscriber. The [Create Subscriber and Execute SQL Script] dialog box displays.

  11. Click Execute and then Close to close the dialog box.

  12. Click Finish to validate the creation of the subscriber table.

    In the CDC Foundation folder, the relevant subscriber table displays.

    You must specify which table the subscriber wants to subscribe to and then

  13. Right-click the LEADFACT schema of the cdc-data connection and select Add CDC in the drop-down list. The [Create Subscriber and Execute SQL Script] dialog box displays.

    Note

    For Oracle databases and in Talend Studio version 3.2 or subsequent versions, the CDC system creates an alias for the source table(s) monitored. This avoids problems relating to identifier length upon creation of the table of changes and its associated view. For CDC systems which are already in place, the table names are maintained.

  14. Click Execute and then Close to validate the subscription.

    In the CDC Foundation folder, the two created tables display and the schema node of the catched table is marked with a green CDC symbol.

Step 2: Modifying the data

Modify the data of your clients in the LEADFACT table, for example, convert all customer names to upper case.

  1. Double-click the tMap component and enter row1.LASTNAME.toUpperCase()in front of the CLASTNAME column to convert all customer names to upper case.

  2. Click Ok.

  3. Double-click the tOracleOutput component.

  4. In the Action on table field, select None.

  5. In the Action on data field, select Insert or update to insert or update table data.

  6. Save your Job and press F6 to execute it.

To view all changes done on data, right-click the LEADFACT table and select View All Changes to open the relevant dialog box.

Step 3: Extracting change data

After setting up the CDC environment, you can now design a job using the Oracle CDC component to incrementally extract the change data from the LEADFACT table. To do that:

  1. From the Palette, drop the OracleCDC and tLogRow components to the design workspace.

  2. Link the two components using a Row Main link.

  3. Double-click tOracleCDC to open its Basic settings view and define its properties.

  4. Set Property Type to Repository and then select the schema corresponding to your Oracle DB table, cdc_publisher in this scenario. The connection details will display automatically in the corresponding fields

    Note

    If you have not stored the data warehouse connection details in the Metadata folder in the Repository tree view, select Built-in in the property type list and set the connection details manually.

  5. In the Schema using CDC field, select Repository and then select the schema of the LEADFACT table stored in the Metadata folder.

  6. In the Table using CDC field, enter the name of the table captured by the CDC, in this scenario Leadfact.

  7. In the Events to catch field, select the check boxes corresponding to the type of the modified data the subscriber will extract. In this scenario, select the three check boxes for the three subscribers.

  8. Double-click tLogRow to display its Basic settings view and define its properties.

  9. Click the Sync columns button to retrieve the schema from the preceding component.

  10. Click Edit schema to open the schema dialog box.

  11. In the TALEND_CDC_CREATION_DATE line of the Date Pattern column, enter between brackets the desired date format: "yyyy-MM-dd".

  12. Save your Job and press F6 to execute it.

In the Redo log mode, changes done on data are indicated in the following way: modifications are equal to first, an "update and delete" operation (UO), and then to an "update and insert" operation (UN). Thus, client data displays twice:

- First, data is deleted (UO).

-Second, data is inserted (UN).

Once these modifications are extracted, they are no more available in the modified table. To verify their extraction, right-click the LEADFACT table catched by the CDC and then select Views All Changes. The extracted changes do not display anymore.

For another CDC scenario using the Trigger mode, see Scenario 2: Retrieving modified data using CDC.