Translating the scenario into a Job - 6.2

Talend Real-time Big Data Platform Studio User Guide

English (United States)
Talend Real-Time Big Data Platform
Talend Studio
Data Quality and Preparation
Design and Development

In order to implement this scenario, let's break down the Job into four steps:

  1. Creation of the Job, configuration of the input file parameters, and reading of the input file,

  2. Mapping of data and transformations,

  3. Definition of the reference file parameters, relevant mapping using the tMap component, and selection of inner join mode,

  4. Redirection of the output into a MySQL table.

Step 1: Job creation, input definition, file reading

Launch Talend Studio, and create a local project or import the demo project if you are launching Talend Studio for the first time. For more information, see Working with projects.

The main window of Talend Studio is divided into several areas:

  • On the left-hand side: the Repository tree view that holds Jobs, Business Models, Metadata, shared Code, Documentation and so on.

  • In the center: the Editor (main Design area)

  • At the bottom: Component and Job tabs

  • On the right-hand side: the Palette of business or technical components depending on the software tool you are using within Talend Studio.

To the left of the Studio, the Repository tree view that gives an access to:

To create the Job, right-click Job Designs in the Repository tree view and select Create Job.

In the dialog box displaying then, only the first field (Name) is required. Type in California1 and click Finish.

An empty Job then opens on the main window and the Palette of technical components (by default, to the right of the Studio) comes up showing a dozen of component families such as: Databases, Files, Internet, Data Quality and so on, hundreds of components are already available.

To read the file California_Clients, let's use the tFileInputDelimited component. This component can be found in the File/Input group of the Palette. Click this component then click to the left of the design workspace to place it on the design area.

Let's define now the reading properties for this component: File path, column delimiter, encoding... To do so, let's use the Metadata Manager. This tool offers numerous wizards that will help us to configure parameters and allow us to store these properties for a one-click re-use in all future Jobs we may need.

As our input file is a delimited flat file, let's select File Delimited on the right-click list of the Metadata folder in the Repository tree view. Then select Create file delimited. A wizard dedicated to delimited file thus displays:

  • At Step 1, only the Name field is required: simply type in California_clients and go to the next Step.

  • At Step 2, select the input file (California_Clients.csv) via the Browse... button. Immediately an extract of the file shows on the Preview, at the bottom of the screen so that you can check its content. Click Next.

  • At Step 3, we will define the file parameters: file encoding, line and column delimiters... As our input file is pretty standard, most default values are fine. The first line of our file is a header containing column names. To retrieve automatically these names, click Set heading row as column names then click Refresh Preview. And click Next to the last step.

  • At Step 4, each column of the file is to be set. The wizard includes algorithms which guess types and length of the column based on the file first data rows. The suggested data description (called schema in Talend Studio) can be modified at any time. In this particular scenario, they can be used as is.

There you go, the California_clients metadata is complete!

We can now use it in our input component. Select the tFileInputDelimited you had dropped on the design workspace earlier, and select the Component view at the bottom of the window.

Select the vertical tab Basic settings. In this tab, you'll find all technical properties required to let the component work. Rather than setting each one of these properties, let's use the Metadata entry we just defined.

Select Repository as Property type in the list. A new field shows: Repository, click "..." button and select the relevant Metadata entry on the list: California_clients. You can notice now that all parameters get automatically filled out.

At this stage, we will terminate our flow by simply sending the data read from this input file onto the standard output (StdOut).

To do so, add a tLogRow component (from the Logs & Errors group).

To link both components, right-click the input component and select Row/Main. Then click the output component: tLogRow.

This Job is now ready to be executed. To run it, select the Run tab on the bottom panel.

Enable the statistics by selecting the Statistics check box in the Advanced Settings vertical tab of the Run view, then run the Job by clicking Run in the Basic Run tab.

The content of the input file display thus onto the console.

Step 2: Mapping and transformations

We will now enrich our Job to include on-the-fly transformations. To implement these transformation, we need to add a tMap component to our Job. This component is multiple and can handle:

  • multiple inputs and outputs

  • search for reference (simple, cartesian product, first, last match...)

  • join (inner, outer)

  • transformations

  • rejections

  • and more...

Remove the link that binds together the job's two components via a right-click the link, then Delete option. Then place the tMap of the Processing component group in between before linking the input component to the tMap as we did it previously.

Eventually to link the tMap to the standard output, right-click the tMap component, select Row/*New Output* (Main) and click the tLogRow component. Type in out1 in the dialog box to implement the link. Logically, a message box shows up (for the back-propagation of schemas), ignore it by clicking on No.

Now, double-click the tMap to access its interface.

To the left, you can see the schema (description) of your input file (row1). To the right, your output is for the time being still empty (out1).

Drop the Firstname and Lastname columns to the right, onto the Name column as shown on the screen below. Then drop the other columns Address and City to their respective line.

Then carry out the following transformations on each column:

  • Change the Expression of the Name column to row1.Firstname + " " + row1.LastName. Concatenate the Firstname column with the Lastname column following strictly this syntax (in Java), in order for the columns to display together in one column.

  • Change the Expression of the Address column to row1.Address.toUpperCase() which will thus change the address case to upper case.

Then remove the Lastname column from the out1 table and increase the length of the remaining columns. To do so, go to the Schema Editor located at the bottom of the tMap editor and proceed as follows:

  1. Select the column to be removed from the schema, and click the cross icon.

  2. Select the column of which you need increase the length size.

  3. Type in the length size you intend in the length column. In this example, change the length of every remaining column to 40.


As the first name and the last name of a client is concatenated, it is necessary to increase the length of the name column in order to match the full name size.

No transformation is made onto the City column. Click OK to validate the changes and close the Map editor interface.

If you run your Job at this stage (via the Run view as we did it before), you'll notice the changes that you defined are implemented.

For example, the addresses are displayed in upper case and the first names and last names are gathered together in the same column.

Step 3: Reference file definition, re-mapping, inner join mode selection

Define the Metadata corresponding to the LosAngelesandOrangeCounties.txt file just the way we did it previously for California_clients file, using the wizard.

At Step1 of the wizard, name this metadata entry: LA_Orange_cities.

Then drop this newly created metadata to the top of the design area to create automatically a reading component pointing to this metadata.

Then link this component to the tMap component.

Double-click again on the tMap component to open its interface. Note that the reference input table (row2) corresponding to the LA and Orange county file, shows to the left of the window, right under your main input (row1).

Now let's define the join between the main flow and the reference flow. In this use case, the join is pretty basic to define as the City column is present in both files and the data match perfectly. But even though this was not the case, we could have carried out operations directly at this level to establish a link among the data (padding, case change...)

To implement the join, drop the City column from your first input table onto the City column of your reference table. A violet link then displays, to materialize this join.

Now, we are able to use the County column from the reference table in the output table (out1).

Eventually, click the OK button to validate your changes, and run the new Job.

The following output should display on the console.