Creating a Job script to consolidate a name list from CSV files using tMap - 6.4

Talend Big Data Platform Studio User Guide

This example shows how to write a Job script to define a simple Job that will consolidate two CSV files into one based on a given column. The Job will contain the following components:

  • two tFileInputDelimited components to read the source CSV files: one file contains a list of people, and the other a list of family names.

  • a tMap component to consolidate the input data to get a list of names based on the id column.

  • a tFileOutputDelimited component to write the consolidated list into another CSV file.

The contents of the input files used in the example are shown below.

Content of the main input file list_people.csv:

id;name;age;city
1;James;75;Richmond
2;Bill;18;Austin
3;George;76;Topeka
4;Herbert;48;Raleigh
5;Woodrow;54;Annapolis
6;Ulysses;56;Frankfort
7;Ulysses;47;Bismarck
8;Woodrow;37;Montgomery
9;Theodore;44;Salem
10;Dwight;38;Little Rock
11;Herbert;26;Salt Lake City
12;Zachary;41;Harrisburg
13;Harry;31;Atlanta
14;Millard;34;Annapolis
15;Ronald;28;Jackson
16;Rutherford;18;Carson City
17;Richard;49;Salt Lake City
18;Woodrow;52;Providence
19;Millard;79;Raleigh
20;James;75;Sacramento

Content of the lookup file list_families.csv:

id;family
1;Cleveland
2;Roosevelt
3;Carter
4;Jefferson
5;Hayes
6;van Buren
7;Kennedy
8;Adams
9;McKinley
10;Buchanan
11;Jefferson
12;Adams
13;Grant
14;Pierce
15;McKinley
16;Clinton
17;Roosevelt
18;Adams
19;Fillmore
20;Fillmore

To create such a Job script, follow the procedures below.
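
At a high level, the Job script built in this example is a sequence of four addComponent {} blocks followed by three addConnection {} blocks, as outlined below. The procedures that follow fill in each block step by step; in this outline the comment lines are descriptive placeholders only, not part of the script.

    addComponent {
    	// tFileInputDelimited_1: reads list_people.csv (main flow)
    }
    addComponent {
    	// tFileInputDelimited_2: reads list_families.csv (lookup flow)
    }
    addComponent {
    	// tMap_1: joins the two flows on the id column and builds full_name
    }
    addComponent {
    	// tFileOutputDelimited_1: writes consolidated_list.csv
    }
    addConnection {
    	// row1: main row connection, tFileInputDelimited_1 to tMap_1
    }
    addConnection {
    	// row2: lookup row connection, tFileInputDelimited_2 to tMap_1
    }
    addConnection {
    	// out: main row connection, tMap_1 to tFileOutputDelimited_1
    }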

Creating a Job script in the Studio

Follow the steps below to create a new Job script in your Talend Studio and define necessary properties of the Job.

  1. In the Repository tree view, expand the Code node, right-click Job Scripts, and select Create JobScript from the context menu to open the [Create JobScript] dialog box.

  2. Enter a name for your Job script in the Name field, consolidateList in this example. You can leave the other fields as default.

  3. Click Finish to close the dialog box and open a blank Job script page in the design workspace.

  4. If needed, complete Job properties including:

    • Job type

    • Framework

    • Default context

    • Context type

    • Context parameters

    As this example is meant to create a Job script for a standard Job without any context variables, simply skip this step and let the Studio complete the properties with the default settings when creating your Job from this Job script.

Defining the input components

You can usually start writing a Job script by adding the components it needs. Follow the steps below to add and configure the two input components.

  1. In the Job script page, enter an addComponent {} function to start adding a component.

  2. In the addComponent {} function, enter the setComponentDefinition {} function and its parameters to define the component.

    Shown below are the minimal parameters required to define the type, the unique name, and the position of the first tFileInputDelimited component in this example.

    addComponent {
    	setComponentDefinition {
    		TYPE: "tFileInputDelimited",
    		NAME: "tFileInputDelimited_1",
    		POSITION: 192, 256
    	}
    }
    
  3. Next to the setComponentDefinition {} function, enter the setSettings {} function to specify the path to the source file, the number of header rows to skip, and optionally the label of the component displayed in the design workspace. In this example:

    	setSettings {
    		FILENAME : "\"E:/Talend/Data/Input/list_people.csv\"",
    		HEADER : "1",
    		LABEL : "people"
    	}

    Warning

    Be sure to escape metacharacters, such as the double quotation marks around string values, with a backslash (\).

  4. Next to the setSettings {} function, type in an addSchema {} function and column definition parameters to define the component schema. For each column, just specify the name and type and leave the other column properties at their defaults.

    In this example, the main source file has four columns, all of type String: id, name, age, and city.

    	addSchema {
    		NAME: "tFileInputDelimited_1",
    		CONNECTOR: "FLOW"
    		addColumn {
    			NAME: "id",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "name",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "age",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "city",
    			TYPE: "id_String"
    		}
    	}
  5. Use the steps above to add the second tFileInputDelimited component, which will read a lookup input file that contains only two columns, both of type String: id and family.

    addComponent {
    	setComponentDefinition {
    		TYPE: "tFileInputDelimited",
    		NAME: "tFileInputDelimited_2",
    		POSITION: 192, 128
    	}
    	setSettings {
    		FILENAME : "\"E:/Talend/Data/Input/list_families.csv\"",
    		HEADER : "1",
    		LABEL : "families"
    	}
    	addSchema {
    		NAME: "tFileInputDelimited_2",
    		CONNECTOR: "FLOW"
    		addColumn {
    			NAME: "id",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "family",
    			TYPE: "id_String"
    		}
    	}
    }

Defining the processing component

Follow the steps below to define a tMap component to:

  • perform automatic type conversion between input and output to prevent compilation errors at Job execution

  • combine the first name and last name of each person

  1. Enter the following functions and parameters to add the component.

    addComponent {
    	setComponentDefinition {
    		TYPE: "tMap",
    		NAME: "tMap_1",
    		POSITION: 480, 256
    	}
    }
  2. Next to the setComponentDefinition {} function, enter the setSettings {} function to define the mapper settings. In this example, enable the automatic type conversion feature of the component and leave the other settings as default.

    In this example, the data type of the id and age columns is String in the input schemas while it's Integer in the output schema.

    	setSettings {
    		ENABLE_AUTO_CONVERT_TYPE : "true"
    	}
  3. Next to the setSettings {} function, enter an addSchema {} function to define the data structure expected by the next component.

    In this example, the output flow is named out and it contains four columns:

    • id, type Integer, two characters long

    • full_name, type String

    • age, type Integer, two characters long

    • city, type String

    	addSchema {
    		NAME: "out",
    		CONNECTOR: "FLOW",
    		LABEL: "out"
    		addColumn {
    			NAME: "id",
    			TYPE: "id_Integer",
    			LENGTH: 2
    		}
    		addColumn {
    			NAME: "full_name",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "age",
    			TYPE: "id_Integer",
    			LENGTH: 2
    		}
    		addColumn {
    			NAME: "city",
    			TYPE: "id_String"
    		}
    	}
  4. Next to the addSchema {} function, enter the addMapperData {} function to define the mapping data, which includes input, output, and var tables, joins, and mappings.

  5. In the addMapperData {} function, enter an addInputTable {} function to define the input table for the main input flow. Note that the column definitions must be the same as those for the first tFileInputDelimited component.

    	addMapperData {
    		addInputTable {
    			NAME: "row1"
    			addColumn {
    				NAME: "id",
    				TYPE: "id_String"
    			}
    			addColumn {
    				NAME: "name",
    				TYPE: "id_String"
    			}
    			addColumn {
    				NAME: "age",
    				TYPE: "id_String"
    			}
    			addColumn {
    				NAME: "city",
    				TYPE: "id_String"
    			}
    		}
    	}
  6. In the addMapperData {} function, enter another addInputTable {} function to define the input table for the lookup flow. Note that the column definitions must be the same as those for the second tFileInputDelimited component.

  7. In the definition for the id column, enter the parameter EXPRESSION: "row1.id" to set up a join between the two input tables on the id column.

    Note that this example defines a Left Outer Join. To define an Inner Join, add the ISINNERJOIN: true parameter in the addInputTable {} function, as sketched after the snippet below.

    		addInputTable {
    			NAME: "row2"
    			addColumn {
    				NAME: "id",
    				TYPE: "id_String"
    				EXPRESSION: "row1.id"
    			}
    			addColumn {
    				NAME: "family",
    				TYPE: "id_String"
    			}
    		}
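
    For comparison, the sketch below shows how the same lookup table might be declared with an Inner Join instead, so that only people whose id has a match in the lookup file are kept. The placement of the ISINNERJOIN parameter next to the table NAME is an assumption made for illustration; the other parameters are unchanged.

    		addInputTable {
    			NAME: "row2",
    			ISINNERJOIN: true
    			addColumn {
    				NAME: "id",
    				TYPE: "id_String",
    				EXPRESSION: "row1.id"
    			}
    			addColumn {
    				NAME: "family",
    				TYPE: "id_String"
    			}
    		}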
  8. In the addMapperData {} function, enter an addOutputTable {} function and define the only output table in this example. Note that the column definitions must be the same as those defined in the schema settings.

    Note that the ID parameter is required, but it needs a value only when the output table uses a Repository schema.

  9. Create mappings between the input and output columns by adding the EXPRESSION parameter to each output column.

    Note that the full_name column is a combination of the name column of the main input flow and the family column of the lookup flow, with a space in between.

    		addOutputTable {
    			ID: "",
    			NAME: "out"
    			addColumn {
    				NAME: "id",
    				TYPE: "id_Integer",
    				EXPRESSION: "row1.id"
    			}
    			addColumn {
    				NAME: "full_name",
    				TYPE: "id_String",
    				EXPRESSION: "row1.name  + \" \" + row2.family"
    			}
    			addColumn {
    				NAME: "age",
    				TYPE: "id_Integer",
    				EXPRESSION: "row1.age"
    			}
    			addColumn {
    				NAME: "city",
    				TYPE: "id_String",
    				EXPRESSION: "row1.city "
    			}
    		}

    Warning

    Be sure to escape metacharacters, such as the double quotation marks around string values, with a backslash (\).
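
To recap, the functions defined in the steps above all nest inside the single addComponent {} block for tMap_1, in the order outlined below. This is a structural summary only: the full parameter bodies are those shown in the snippets above, and the comment lines are descriptive placeholders, not part of the script.

    addComponent {
    	setComponentDefinition {
    		TYPE: "tMap",
    		NAME: "tMap_1",
    		POSITION: 480, 256
    	}
    	setSettings {
    		ENABLE_AUTO_CONVERT_TYPE : "true"
    	}
    	addSchema {
    		NAME: "out",
    		CONNECTOR: "FLOW",
    		LABEL: "out"
    		// addColumn definitions for id, full_name, age, and city
    	}
    	addMapperData {
    		addInputTable {
    			NAME: "row1"
    			// addColumn definitions for id, name, age, and city
    		}
    		addInputTable {
    			NAME: "row2"
    			// addColumn definitions for id (with EXPRESSION: "row1.id") and family
    		}
    		addOutputTable {
    			ID: "",
    			NAME: "out"
    			// addColumn definitions with the output EXPRESSION mappings
    		}
    	}
    }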

Defining the output component

Follow the steps below to add and define the output component in this example, tFileOutputDelimited.

  1. Next to the tMap component definition, enter another addComponent {} function, and inside this function enter the setComponentDefinition {} function and its minimal required parameters to add the output component.

    addComponent {
    	setComponentDefinition {
    		TYPE: "tFileOutputDelimited",
    		NAME: "tFileOutputDelimited_1",
    		POSITION: 736, 256
    	}
    }
  2. Next to the setComponentDefinition {} function, enter the setSettings {} function and parameters required to define the component properties:

    • The path to the output file

    • Whether to include the header in the output file

    • The encoding type of the output file

    • Whether to throw an error and stop Job execution if the specified output file already exists

    • The label of the component displayed in the generated Job on the design workspace

    	setSettings {
    		FILENAME : "\"E:/Talend/data/output/consolidated_list.csv\"",
    		INCLUDEHEADER : "true",
    		ENCODING : "\"UTF-8\"",
    		ENCODING:ENCODING_TYPE : "UTF-8",
    		FILE_EXIST_EXCEPTION : "false",
    		LABEL : "consolidated_list"
    	}

    Warning

    Be sure to escape metacharacters, such as the double quotation marks around string values, with a backslash (\).

  3. Next to the setSettings {} function, enter an addSchema {} function to define the schema of the output component. Note that the column definitions of this schema must be the same as those of the output schema defined in the preceding component, tMap.

    	addSchema {
    		NAME: "tFileOutputDelimited_1",
    		CONNECTOR: "FLOW",
    		LABEL: "out"
    		addColumn {
    			NAME: "id",
    			TYPE: "id_Integer",
    			LENGTH: 2
    		}
    		addColumn {
    			NAME: "full_name",
    			TYPE: "id_String"
    		}
    		addColumn {
    			NAME: "age",
    			TYPE: "id_Integer",
    			LENGTH: 2
    		}
    		addColumn {
    			NAME: "city",
    			TYPE: "id_String"
    		}
    	}

Defining the connections linking the components

Follow the steps below to add and define the connections that link the components in the Job.

  1. Next to the output component definition, enter an addConnection {} function to add a connection.

  2. In the addConnection {} function, enter the parameters required to define the connection properties:

    • Connection type

    • Connection name

    • Connection style

    • Source component

    • Target component

    The example below defines a main row connection between the first tFileInputDelimited component and the tMap component.

    Warning

    The name of this connection must match the main input table name defined in the tMap settings.

    addConnection {
    	TYPE: "FLOW",
    	NAME: "row1",
    	LINESTYLE: 0,
    	SOURCE: "tFileInputDelimited_1",
    	TARGET: "tMap_1"
    }
  3. Enter another addConnection {} function and define a lookup row connection between the second tFileInputDelimited component and the tMap component.

    Warning

    The name of this connection must match the lookup input table name defined in the tMap settings.

    addConnection {
    	TYPE: "FLOW",
    	NAME: "row2",
    	LINESTYLE: 8,
    	SOURCE: "tFileInputDelimited_2",
    	TARGET: "tMap_1"
    }
  4. Enter another addConnection {} function and define a main row connection between the tMap component and the tFileOutputDelimited component.

    Warning

    The name of this connection must match the corresponding output schema and output table names defined in the tMap.

    addConnection {
    	TYPE: "FLOW",
    	NAME: "out",
    	LINESTYLE: 0,
    	SOURCE: "tMap_1",
    	TARGET: "tFileOutputDelimited_1"
    }

Generating and executing the Job

Now that you have set up all the components and connections required for this example, you can generate your Job from the Job script and execute it in your Studio.

  1. Save your Job script, then right-click the Job script in the Repository tree view, and select Generate Job from the contextual menu.

  2. If needed, double-click the generated Job in the Repository tree view and check the Job design in the Designer tab view.

  3. Press F6 or click the Run button on the Run console to execute the Job.

    The output file consolidated_list.csv should read as follows:

    id;full_name;age;city
    1;James Cleveland;75;Richmond
    2;Bill Roosevelt;18;Austin
    3;George Carter;76;Topeka
    4;Herbert Jefferson;48;Raleigh
    5;Woodrow Hayes;54;Annapolis
    6;Ulysses van Buren;56;Frankfort
    7;Ulysses Kennedy;47;Bismarck
    8;Woodrow Adams;37;Montgomery
    9;Theodore McKinley;44;Salem
    10;Dwight Buchanan;38;Little Rock
    11;Herbert Jefferson;26;Salt Lake City
    12;Zachary Adams;41;Harrisburg
    13;Harry Grant;31;Atlanta
    14;Millard Pierce;34;Annapolis
    15;Ronald McKinley;28;Jackson
    16;Rutherford Clinton;18;Carson City
    17;Richard Roosevelt;49;Salt Lake City
    18;Woodrow Adams;52;Providence
    19;Millard Fillmore;79;Raleigh
    20;James Fillmore;75;Sacramento