Preparation versions can be used in data integration or Big Data Jobs in Talend Studio.
In Talend Studio, the
tDataprepRun component allows you to reuse a preparation, or any of
its versions, and apply it to data with the same model.
You can still use a preparation in its current state, but using a
specific version ensures that your Jobs always apply the same state of the
preparation, even while the preparation is still being edited, which makes the
results more consistent.
The following example illustrates a Job that applies an existing preparation version
to a Salesforce input and writes the result to a Redshift database.
This preparation was made on a dataset containing basic customer information such as
names, phone numbers and email addresses. A few steps have been applied to remove
formatting errors in the name entries, and to delete invalid values from the phone
numbers.
Two versions have been created during the preparation: one after the first two steps, and
another one after the third step.
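The relationship between steps and versions described above can be pictured with a small model. The sketch below is only an illustration, not the Talend Data Preparation API: the class PreparationSketch, its methods, and the sample string steps are all hypothetical. It assumes that a version simply freezes the steps that exist at the moment it is created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustrative model only: this is NOT the Talend Data Preparation API.
final class PreparationSketch {
    // Ordered steps of the preparation; each step transforms one text value.
    private final List<UnaryOperator<String>> steps = new ArrayList<>();
    // versionCuts.get(i) is the number of steps frozen in version i + 1.
    private final List<Integer> versionCuts = new ArrayList<>();

    void addStep(UnaryOperator<String> step) {
        steps.add(step);
    }

    // Freezes the steps added so far and returns the new 1-based version number.
    int createVersion() {
        versionCuts.add(steps.size());
        return versionCuts.size();
    }

    // Applies only the steps that existed when the given version was created.
    String applyVersion(int version, String value) {
        return applyFirst(versionCuts.get(version - 1), value);
    }

    // Applies every step, including steps added after the last version.
    String applyCurrentState(String value) {
        return applyFirst(steps.size(), value);
    }

    private String applyFirst(int count, String value) {
        String result = value;
        for (int i = 0; i < count; i++) {
            result = steps.get(i).apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        PreparationSketch prep = new PreparationSketch();
        prep.addStep(String::trim);                        // hypothetical step 1
        prep.addStep(s -> s.replace("_", " "));            // hypothetical step 2
        int v1 = prep.createVersion();                     // version 1: two steps
        prep.addStep(s -> s.toUpperCase(Locale.ROOT));     // hypothetical step 3
        int v2 = prep.createVersion();                     // version 2: three steps
        prep.addStep(s -> s.replace(" ", "-"));            // later edit, not versioned

        String raw = "  john_doe ";
        System.out.println(prep.applyVersion(v1, raw));    // prints "john doe"
        System.out.println(prep.applyVersion(v2, raw));    // prints "JOHN DOE"
        System.out.println(prep.applyCurrentState(raw));   // prints "JOHN-DOE"
    }
}
```

In this model, the output of version 1 stays the same however many steps are added afterwards, whereas the current state changes with every new step.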
Before you begin
- You have created a preparation with at least one version in Talend Data Preparation. In this example, the existing preparation is called contacts cleansing.
- The data imported from Salesforce must have the same schema as the dataset
originally used to create the preparation.
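The schema requirement above can be checked mechanically before a Job is built. The following is a minimal sketch in plain Java, not part of Talend Studio: the class SchemaCheckSketch, the column names, and the type names are hypothetical. It assumes that two schemas match when they contain the same column names with the same types, and it ignores column order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper only: not part of Talend Studio or its components.
final class SchemaCheckSketch {
    // Lists the differences between two schemas, each given as column name -> type name.
    // An empty result means the schemas match (column order is not compared).
    static List<String> diff(Map<String, String> expected, Map<String, String> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> column : expected.entrySet()) {
            String actualType = actual.get(column.getKey());
            if (actualType == null) {
                problems.add("missing column: " + column.getKey());
            } else if (!actualType.equals(column.getValue())) {
                problems.add("type mismatch on " + column.getKey()
                        + ": expected " + column.getValue() + ", got " + actualType);
            }
        }
        for (String name : actual.keySet()) {
            if (!expected.containsKey(name)) {
                problems.add("unexpected column: " + name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical schema of the dataset used to create the preparation.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("name", "String");
        expected.put("phone", "String");
        expected.put("email", "String");

        // Hypothetical schema of the incoming data.
        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("name", "String");
        actual.put("phone", "Integer");
        actual.put("country", "String");

        // Prints one line per difference found between the two schemas.
        diff(expected, actual).forEach(System.out::println);
    }
}
```

Returning a list of differences rather than a single boolean makes it easier to see which columns need to be fixed on the input side.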
Procedure
-
In Talend Studio,
create a new Standard or Spark Job.
-
In the design workspace of Talend Studio, add a
tSalesforceInput, a tDataprepRun, and
a tRedshiftOutput component, and connect them in that order using two links.
-
Select the tSalesforceInput component and click the
Component tab to define its basic settings.
Make sure that the schema of the tSalesforceInput
component matches the schema expected by the
tDataprepRun component.
-
Select the tDataprepRun component and click the
Component tab to define its basic settings.
-
Enter your Talend Data Preparation connection information.
-
Click Choose an existing preparation to display a list
of the preparations available in Talend Data Preparation.
-
Select the checkbox in front of contacts cleansing, which
contains the preparation version that you want to apply, and click
OK.
-
Click Choose a version to select from the list of
available versions for your preparation. In this case, select version
1.
By default, the Job uses the current state of the
selected preparation. In collaborative work, using the current state instead
of a fixed version means that someone may have changed the preparation
without your knowledge, so you cannot know exactly what your Job will
output. This is why it is safer to use a version in your Jobs.
-
Click Fetch Schema to retrieve the schema of
contacts cleansing.
-
Select the tRedshiftOutput component and click the
Component tab to define its basic settings.
-
Save your Job and press F6 to run it.
Results
All the preparation steps included in the version of the preparation have been
applied to your data, directly in the flow of your Job.