tDataprepRun - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tDataprepRun fetches a preparation made using Talend Data Preparation and applies it to a set of data.

Purpose

This component allows you to integrate and apply a preparation made using Talend Data Preparation in a standard Data Integration Job.

Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:

  • Standard: see tDataprepRun properties.

    The component in this framework is available when you have subscribed to one of the Talend Platform products or Talend Data Fabric.

  • Spark Batch: see tDataprepRun properties in Spark Batch Jobs.

    The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.

tDataprepRun properties

Component Family

Talend Data Preparation

 

Basic settings

URL

Type the URL to the Talend Data Preparation web application, between double quotes.

 

Username

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

 

Password

Click the [...] button and type your user password for the Talend Data Preparation web application, between double quotes.

 

Preparation

To complete the Preparation field, click one of the following:

  • Choose an existing preparation to select from a list of the preparations that were previously created in Talend Data Preparation.

  • Or create a new one to create a new preparation based on your input data.

 

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the Preparation field.

 

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the Preparation field.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Advanced settings

Limit Preview

Specify the number of rows to which you want to limit the preview.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, when the component has such a check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
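
For example, the ERROR_MESSAGE variable can be read from the globalMap in a Java expression or in a tJava component placed after tDataprepRun. In the following minimal sketch, tDataprepRun_1 is an assumed component label; adjust it to the name actually displayed in your Job.

// Code field of a tJava component placed after tDataprepRun.
// "tDataprepRun_1" is an assumed component label; use the one shown in your Job.
String errorMessage = (String) globalMap.get("tDataprepRun_1_ERROR_MESSAGE");
if (errorMessage != null) {
    System.err.println("tDataprepRun reported: " + errorMessage);
}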

Usage

This component is an intermediary step. It requires an input flow as well as an output.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the schema needs to be fetched again.

  • If a context variable was used in the URL of the dataset, you cannot use the button to edit the preparation directly in Talend Data Preparation.

Scenario: Preparing data from a database in a Talend Job

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration Job. In other words, you can operationalize the process of applying a preparation to any input data that has the same model as the dataset the preparation was built on.

The following scenario creates a simple Job that:

  • retrieves a table that holds some employee-related data from a MySQL database,

  • applies an existing preparation on this data,

  • outputs the prepared data into an Excel file.

This assumes that a preparation has been created beforehand, on a dataset with the same schema as your input data for the Job. In this case, the existing preparation is called datapreprun_scenario. This simple preparation puts the employees' last names into upper case and isolates the employees with a salary greater than $1,500.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty Standard Job from the Job Designs node in the Repository tree view.

  2. Drop the following components from the Palette onto the design workspace: tMysqlInput, tDataprepRun and tFileOutputExcel.

  3. Connect the three components using Row > Main links.

Configuring the components

Retrieving the data from the database

  1. In the design workspace, select tMysqlInput and click the Component tab to define its basic settings.

  2. In the Property Type list, select Built-in to set the database connection details manually.

  3. In the DB Version list, select the version of MySQL you are using, MySQL 5 in this example.

  4. In the Host, Port, Database, Username and Password fields, enter the MySQL connection details and the user authentication data for the database, between double quotes.

  5. In the Table Name field, type the name of the table to be read, between double quotes.

  6. In the Query field, enter your database query between double quotes. In this example, the query is select * from employees, which retrieves all of the information from the employees table in the test database. (A plain JDBC equivalent of this query is shown after this procedure.)

  7. Click Guess schema to automatically retrieve the schema from the database table or enter the schema manually by clicking the [...] button next to Edit schema.

    Make sure that the schema of the tMysqlInput component matches the schema expected by the tDataprepRun component. In other words, the input schema must be the same as the dataset upon which the preparation was made in the first place.
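
The tMysqlInput component generates and runs this query inside the Job. Purely as an illustration of what the procedure above configures, the following standalone JDBC sketch runs the same select * from employees query and prints the column names, so that you can check them against the schema expected by tDataprepRun. The connection URL, the test database name and the credentials are placeholders, and the sketch assumes the MySQL JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class EmployeesQueryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; reuse the values entered in tMysqlInput.
        String url = "jdbc:mysql://localhost:3306/test";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from employees")) {
            ResultSetMetaData meta = rs.getMetaData();
            // Print the column names: they must match the schema the preparation was made on.
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                System.out.println(meta.getColumnName(i));
            }
        }
    }
}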

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun and click the Component tab to define its basic settings.

  2. In the URL field, type the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation. (A quick reachability check of this URL is sketched after this procedure.)

  3. In the Username and Password fields, enter your Talend Data Preparation connection information, between double quotes.

  4. Click Choose an existing preparation to display a list of the preparations available in Talend Data Preparation, and select datapreprun_scenario.

    This scenario assumes that a preparation with a compatible schema has been created beforehand.

  5. Click Fetch Schema to retrieve the schema of the preparation.

    The output schema of the tDataprepRun component now reflects the changes made with each preparation step. The schema takes into account columns that were added or removed for example.
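
As a quick sanity check of the URL entered in step 2, the following sketch simply sends an HTTP request to the Talend Data Preparation web application from plain Java. The host name dataprep.example.com is a placeholder; 9999 is the default port mentioned above.

import java.net.HttpURLConnection;
import java.net.URL;

public class DataprepUrlCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host; 9999 is the default Talend Data Preparation port.
        URL url = new URL("http://dataprep.example.com:9999");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}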

Outputting the preparation into an Excel file

  1. In the design workspace, select tFileOutputExcel and click the Component tab to define its basic settings.

  2. In the File Name field, enter the location where you want to save the result of the preparation.

  3. Click Sync columns to retrieve the new output schema, inherited from the tDataprepRun component.

Saving and executing the Job

  1. Save your Job and press F6 to execute it.

  2. You can now open the Excel file containing the result of the preparation applied on your data from the MySQL database.

tDataprepRun properties in Spark Batch Jobs

Component Family

Talend Data Preparation

 

Basic settings

URL

Type the URL to the Talend Data Preparation web application, between double quotes.

 

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

 

Password

Click the [...] button and type your user password for the Talend Data Preparation web application, between double quotes.

 

Preparation

To complete the Preparation field, click Choose an existing preparation to select from a list of the preparations that were previously created in Talend Data Preparation.

 

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the Preparation field.

 

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the Preparation field.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that appears.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, when the component has such a check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow as well as an output.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the schema needs to be fetched again.

  • If a context variable was used in the URL of the dataset, you cannot use the button to edit the preparation directly in Talend Data Preparation.

  • The Make as header and Delete row functions, as well as any modification of a single cell, are ignored by the tDataprepRun component. These functions only affect a single row or cell and are thus not compatible with a Big Data context. In the list of existing preparations to choose from, a warning is displayed next to preparations that include incompatible actions.

Scenario: Applying a preparation to a data sample

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a Big Data Job. In other words, you can operationalize the process of applying a preparation to any input data that has the same model as the dataset the preparation was built on.

The following scenario creates a simple Job that:

  • reads a small sample of customer data,

  • applies an existing preparation on this data,

  • shows the result of the execution in the console.

This assumes that a preparation has been created beforehand, on a dataset with the same schema as your input data for the Job. In this case, the existing preparation is called datapreprun_spark. This simple preparation puts the customers' last names into upper case and applies a filter to isolate the customers from California, Texas and Florida, as illustrated in the sketch that follows the sample data below.

The sample data reads as follows:

James;Butt;California
Daniel;Fox;Connecticut
Donna;Coleman;Alabama
Thomas;Webb;Illinois
William;Wells;Florida
Ann;Bradley;California
Sean;Wagner;Florida
Elizabeth;Hall;Minnesota
Kenneth;Jacobs;Florida
Kathleen;Crawford;Texas
Antonio;Reynolds;California
Pamela;Bailey;Texas
Patricia;Knight;Texas
Todd;Lane;New Jersey
Dorothy;Patterson;Virginia

Note that the sample data is created for demonstration purposes only.
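
The preparation itself runs in Talend Data Preparation; the following standalone sketch only mimics its two steps (upper-casing the last name and keeping the customers from California, Texas and Florida) on a few of the sample rows, to make the expected output concrete.

import java.util.Arrays;
import java.util.List;

public class PreparationPreview {
    public static void main(String[] args) {
        // A few of the sample rows above; the full list behaves the same way.
        List<String> rows = Arrays.asList(
            "James;Butt;California",
            "Daniel;Fox;Connecticut",
            "William;Wells;Florida",
            "Kathleen;Crawford;Texas");
        List<String> keptStates = Arrays.asList("California", "Texas", "Florida");
        for (String row : rows) {
            String[] fields = row.split(";");
            // Step 1 of the preparation: put the last name into upper case.
            fields[1] = fields[1].toUpperCase();
            // Step 2: keep only the customers from California, Texas and Florida.
            if (keptStates.contains(fields[2])) {
                System.out.println(String.join(";", fields));
            }
        }
    }
}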

Prerequisite: ensure that the Spark cluster has been properly installed and is running.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty Spark Batch Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Batch Job, see Talend Big Data Getting Started Guide.

  2. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataprepRun and tLogRow.

  3. Connect the three components using Row > Main links.

Setting up Spark connection

  1. Click Run to open its view and then click the Spark Configuration tab to display its view for configuring the Spark connection.

  2. Select the type of the Spark cluster you need to connect to.

    • Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.

      Note that this local machine is the machine on which the Job is actually run. The Local mode is the default mode; you need to clear its check box to display the drop-down list from which you can select the other modes.

    • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

    • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

  3. If you are using the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, on the condition that you have created this connection in the Repository. Then the Studio will reuse that set of connection information for this Job.

    For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node of the Talend Studio User Guide.

  4. Select the version of the Hadoop distribution you are using along with Spark.

    • If you select Microsoft HD Insight 3.4, you need to configure the connections to the Livy service, the HD Insight service and the Windows Azure Storage service of that cluster in the areas that are displayed. A demonstration video about how to configure a connection to a Microsoft HD Insight cluster is available at the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.

      The configuration of Livy is not presented in this video. The Hostname of Livy uses the following syntax: your_spark_cluster_name.azurehdinsight.net. For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.

    • If you select Amazon EMR, see the article Amazon EMR - Getting Started on Talend Help Center (https://help.talend.com) to learn how to configure the connection.

      It is recommended to install your Talend Jobserver in the EMR cluster. For further information about this Jobserver, see Talend Installation Guide.

    If you cannot find the distribution corresponding to yours from this drop-down list, this means the distribution you want to connect to is not officially supported by Talend. In this situation, you can select Custom, then select the Spark version of the cluster to be connected and click the button to display the dialog box in which you can alternatively:

    1. Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.

      Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.

  5. Configure the connection information to the principal services of the cluster to be used.

    If you are using the Yarn client mode, you need to enter the addresses of the following services in their corresponding fields (if you leave the check box of a service clear, then at runtime, the configuration of this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter the directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing a Hadoop cluster running with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your user name to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

      • If this cluster is a MapR cluster of the version 4.0.1 or later, you can set the MapR ticket authentication configuration in addition or as an alternative by following the explanation in Connecting to a security-enabled MapR.

        Keep in mind that this configuration generates a new MapR security ticket for the username defined in the Job in each execution. If you need to reuse an existing ticket issued for the same username, leave both the Force MapR ticket authentication check box and the Use Kerberos authentication check box clear, and then MapR should be able to automatically find that ticket on the fly.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.

      Since the Job needs to upload jar files to the HDFS of the cluster to be used, you must ensure that this user name is the same as the one you have put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.

    If you are using the Standalone mode, you need to set the following parameters:

    • In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.

  6. If you need to run the current Job on Windows, it is recommended to specify where the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave this check box clear; the Studio generates one by itself and automatically uses it for this Job.

  7. If the Spark cluster cannot recognize the machine in which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

    Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.

  8. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the streaming computation such as the metadata and the generated RDDs of this computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing.

  9. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory:

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

      This feature is available to the Yarn client mode only.

  10. In the Spark "scratch" directory field, enter the directory in which the Studio stores, in the local system, the temporary files such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  11. In the Yarn client mode, you can enable the Spark application logs of this Job to be persistent in the file system. To do this, select the Enable Spark event logging check box.

    The parameters relevant to Spark logs are displayed:

    • Spark event logs directory: enter the directory in which Spark events are logged. This is actually the spark.eventLog.dir property.

    • Spark history server address: enter the location of the history server. This is actually the spark.yarn.historyServer.address property.

    • Compress Spark event logs: if need be, select this check box to compress the logs. This is actually the spark.eventLog.compress property.

    Since the administrator of your cluster could have defined these properties in the cluster configuration files, it is recommended to contact the administrator for the exact values.

  12. In the Advanced properties table, add any Spark properties you need in order to override their default counterparts used by the Studio. Some of these properties are shown in the SparkConf sketch after this procedure.

    The advanced properties required by different Hadoop distributions and their values are listed below:

    • Hortonworks Data Platform V2.4:

      • spark.yarn.am.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      • spark.driver.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it on all the Jobs that are designed in the same Studio.

    • MapR V5.1 and V5.2 when the cluster is used with the HBase or the MapRDB components:

      • spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and add, if missing, the classpath to HBase to ensure that the Job to be used can find the required classes and packages in the cluster.

        For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster and then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.

        For a step-by-step explanation about how to add this parameter, see the documentation HBase/MapR-DB Job cannot successfully run with MapR 5.1 or 5.2 on Talend Help Center.

    For further information about the valid Spark properties, see Spark documentation at https://spark.apache.org/docs/latest/configuration.

  13. If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your distribution to trace your Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Username and Password: these are the credentials you use to connect to your Cloudera Navigator.

    • Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.

    • Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

      Since this option actually forces Cloudera Navigator to generate lineages of all its available entities such as HDFS files and directories, Hive queries or Pig scripts, it is not recommended for the production environment because it will slow the Job.

    • Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

      Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

      This feature is meant to facilitate the test of your Job but is not recommended to be used in a production cluster.

  14. If you are using Hortonworks Data Platform V2.4.0 onwards and you have installed Atlas in your cluster, you can select the Use Atlas check box to enable Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Atlas URL: enter the location of the Atlas to be connected to. It is often http://name_of_your_atlas_node:port

    • In the Username field and the Password field, enter the authentication information for access to Atlas.

    • Set Atlas configuration folder: if your Atlas cluster contains custom properties such as SSL or read timeout, select this check box, and in the displayed field, enter a directory on your local machine, then place the atlas-application.properties file of your Atlas in this directory. This way, your Job is enabled to use these custom properties.

      You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.

    • Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas.

      Otherwise, leave it clear to allow your Job to continue to run.

    If you are using Hortonworks Data Platform V2.4, the Studio supports Atlas 0.5 only; if you are using Hortonworks Data Platform V2.5, the Studio supports Atlas 0.7 only.
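
The Studio builds the actual Spark configuration from the fields described above. Purely as an illustration, the following sketch shows how a few of the properties named in steps 9, 11 and 12 would look if set directly on a SparkConf object; every value is a placeholder and the sketch assumes spark-core is on the classpath.

import org.apache.spark.SparkConf;

public class SparkConfSketch {
    public static void main(String[] args) {
        // Placeholder values; in a real Job the Studio sets these from the Spark Configuration tab.
        SparkConf conf = new SparkConf()
            .setAppName("tDataprepRun_sample")
            // Set executor memory (step 9): off-heap memory per executor, in MB.
            .set("spark.yarn.executor.memoryOverhead", "512")
            // Enable Spark event logging (step 11).
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")
            .set("spark.yarn.historyServer.address", "historyserver:18080")
            .set("spark.eventLog.compress", "true")
            // Advanced properties for Hortonworks Data Platform V2.4 (step 12).
            .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")
            .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.4.0.0-169");
        System.out.println(conf.toDebugString());
    }
}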

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that the user name must be the same as the one you entered in the Spark configuration tab. (A Hadoop client sketch of this connection follows this procedure.)
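
tHDFSConfiguration handles this connection for the Job. As a quick way to verify the NameNode URI and user name outside the Studio, the following sketch opens an HDFS connection with the Hadoop client API. The URI hdfs://namenode:8020 and the user hdfs_user are placeholders, and hadoop-client is assumed to be on the classpath.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and user name; they must match the values
        // entered in tHDFSConfiguration and in the Spark Configuration tab.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hdfs_user")) {
            System.out.println("Connected to: " + fs.getUri());
        }
    }
}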

Configuring the input data and the preparation

Loading the sample data

  1. In the design workspace, select the tFixedFlowInput component and click the Component tab to define its basic settings.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button to add the schema columns matching the three fields of the sample data.