tMongoDBInput - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tMongoDBInput retrieves documents from a MongoDB collection by supplying a query document that specifies the fields the desired documents must match.

Purpose

This component allows you to retrieve records from a collection in the MongoDB database and transfer them to the following component for display or storage.

Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:

  • Standard: see tMongoDBInput Properties.

    The component in this framework is available when you are using one of the Talend solutions with Big Data.

  • Spark Batch: see tMongoDBInput properties in Spark Batch Jobs.

    The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.

  • Spark Streaming: see tMongoDBInput properties in Spark Streaming Jobs.

    In this type of Job, tMongoDBInput is used to provide lookup data, when the lookup data fits into the amount of memory allocated for the execution of the Job. It is executed once to read the data from MongoDB and store it in memory so that the micro-batches from the main flow can access it easily. If the lookup data is too large to fit in memory, it is recommended to use tMongoDBLookupInput instead, as it loads only the data matching the lookup join key.

    The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data Fabric.

tMongoDBInput Properties

Component family

Big Data / MongoDB

 

Basic settings

Use existing connection

Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.

 

DB Version

List of the database versions.

Available when the Use existing connection check box is not selected.

 

Use replica set address

Select this check box to show the Replica address table.

In the Replica address table, you can define multiple MongoDB database servers for failover.

Available when the Use existing connection check box is not selected.

 

Server and Port

IP address and listening port of the database server.

Available when neither the Use existing connection check box nor the Use replica set address check box is selected.

 

Database

Name of the database.

 

Use SSL connection

Select this check box to enable the SSL or TLS encrypted connection.

Then you need to use the tSetKeystore component in the same Job to specify the encryption information.

For further information about tSetKeystore, see tSetKeystore.

Note that the SSL connection is available only for MongoDB version 2.4 and later.

 

Set read preference

Select this check box and from the Read preference drop-down list that is displayed, select the member to which you need to direct the read operations.

If you leave this check box clear, the Job uses the default read preference, that is, it reads from the primary member of the replica set.

For further information, see MongoDB's documentation about Replication and its Read preferences.

 

Required authentication

Select this check box to enable the database authentication.

The Authentication mechanism drop-down list becomes available. Among the listed mechanisms, NEGOTIATE is recommended if you are not using Kerberos, because it automatically selects the authentication mechanism best adapted to the MongoDB version you are using.

For details about the other mechanisms in this list, see MongoDB Authentication from the MongoDB documentation.

 

Set Authentication database

If the username to be used to connect to MongoDB has been created in a specific Authentication database of MongoDB, select this check box to enter the name of this Authentication database in the Authentication database field that is displayed.

For further information about the MongoDB Authentication database, see User Authentication database.

Username and Password

DB user authentication data.

To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

Available when the Required authentication check box is selected.

If the security system you have selected from the Authentication mechanism drop-down list is Kerberos, you need to fill in the User principal, Realm and KDC server fields instead of the Username and Password fields.

 

Collection

Name of the collection in the MongoDB database.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

If a column in the database is a JSON document and you need to read the entire document, put an asterisk (*) in the DB column column, without quotation marks around it.

 

Query

Specify the query condition. This field is available only when you have selected Find query from the Query type drop-down list.

For example, type in "{id:4}" to retrieve the record whose id is 4 from the collection specified in the Collection field.

Note

Unlike the query statements required in the MongoDB client software, the query here refers only to the content inside find(): for example, the query {id:4} here corresponds to the MongoDB client query db.blog.find({id:4}).
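
For reference, the same condition expressed with the MongoDB Java driver outside Talend might look like the following sketch; the connection string and the blog collection are assumptions, not values taken from your Job.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class FindQuerySketch {
    public static void main(String[] args) {
        // Assumes a MongoDB server listening on the default local port.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> blog =
                    client.getDatabase("talend").getCollection("blog");
            // Equivalent of typing "{id:4}" in the Query field: only the content
            // of find() is supplied, the driver wraps it in the find command.
            for (Document doc : blog.find(Filters.eq("id", 4))) {
                System.out.println(doc.toJson());
            }
        }
    }
}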

 

Aggregation stages

Create a MongoDB aggregation pipeline by adding the stages you want the documents to pass through so as to obtain aggregated results from these documents. This table is available only when you have selected Aggregation pipeline query from the Query type drop-down list.

Only one stage is allowed per row in this Aggregation stages table and the stages are executed one by one in the order you place them in this table.

For example, if you want to aggregate documents about your customers using the $match and the $group stages, you need to add two rows to this Aggregation stages table and define the two stages as follows:

"{$match : {status : 'A'}}"
"{$group : {_id : '$cust_id', total : {$sum : '$amount'}}}"

In this aggregation, the customer documents with status A are selected; then among the selected customers, those using the same customer id are grouped and the values from the amount fields of the same customer are summed up.

For a full list of the stages you can use and their related operators, see Aggregation pipeline operators.

For further information about MongoDB aggregation pipeline, see Aggregation pipeline.
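
To illustrate what these two stages do, here is a minimal sketch of the same pipeline written with the MongoDB Java driver outside Talend; the connection string and the orders collection name are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;

public class AggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("talend").getCollection("orders");
            // Stage 1 ($match): keep only the documents whose status is 'A'.
            // Stage 2 ($group): group them by cust_id and sum their amount fields.
            for (Document result : orders.aggregate(Arrays.asList(
                    Aggregates.match(Filters.eq("status", "A")),
                    Aggregates.group("$cust_id",
                            Accumulators.sum("total", "$amount"))))) {
                System.out.println(result.toJson());
            }
        }
    }
}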

 

Mapping

Each column of the schema defined for this component represents a field of the documents to be read. In this table, you need to specify the parent nodes of these fields, if any.

For example, in the following document:

{
  _id: ObjectId("5099803df3f4948bd2f98391"),
  person: { first: "Joe", last: "Walker" }
}

The first and last fields have person as their parent node, but the _id field has no parent node. Once completed, this Mapping table should therefore read as follows:

Column     Parent node path
_id
first       "person"
last        "person"
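
As an illustration of how the parent node path relates to the document structure, the sketch below reads the three fields from the sample document above using the driver's org.bson.Document API outside Talend; the ObjectId is simplified to a plain string for the sketch.

import org.bson.Document;

public class MappingSketch {
    public static void main(String[] args) {
        // The ObjectId of the sample document is simplified to a plain string here.
        Document doc = Document.parse(
                "{ \"_id\": \"5099803df3f4948bd2f98391\","
              + "  \"person\": { \"first\": \"Joe\", \"last\": \"Walker\" } }");
        // _id sits at the root of the document: its Parent node path stays empty.
        String id = doc.getString("_id");
        // first and last sit under the person node: their Parent node path is "person".
        Document person = doc.get("person", Document.class);
        String first = person.getString("first");
        String last = person.getString("last");
        System.out.println(id + " | " + first + " | " + last);
    }
}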
 

Sort by

Specify the column and choose the order for the sort operation.

This field is available only when you have selected Find query from the Query type drop-down list.

 

Limit

Type in the maximum number of records to be retrieved.

This field is available only when you have selected Find query from the Query type drop-down list.

Advanced settings

tStatCatcher Statistics

Select this check box to collect the log data at the component level.

 

No query timeout

Select this check box to prevent MongoDB servers from closing idle cursors after 10 minutes of inactivity. In this situation, an idle cursor stays open until either its results are exhausted or you close it manually using the cursor.close() method.

A cursor in MongoDB is a pointer to the result set of a query. By default, that is, when this check box is clear, a MongoDB server automatically closes idle cursors after a given period of inactivity to avoid excess memory use. For further information about MongoDB cursors, see https://docs.mongodb.org/manual/core/cursors/.
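
At the driver level, this option presumably corresponds to MongoDB's noCursorTimeout flag. A minimal sketch with the MongoDB Java driver, assuming a local server and the blog collection used elsewhere on this page:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

public class NoTimeoutCursorSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> blog =
                    client.getDatabase("talend").getCollection("blog");
            // noCursorTimeout(true) keeps the server-side cursor open beyond the usual
            // idle limit; the try-with-resources block closes it explicitly once done.
            try (MongoCursor<Document> cursor =
                         blog.find().noCursorTimeout(true).iterator()) {
                while (cursor.hasNext()) {
                    System.out.println(cursor.next().toJson());
                }
            }
        }
    }
}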

 

Enable external sort

The aggregation pipeline stages have a memory limit of 100 megabytes per stage, and a stage that exceeds this limit produces an error. When handling large datasets, select this check box to allow the stages to use temporary files on disk (an external sort) instead of failing when they exceed this limit.

For further information about this external sort, see Large sort operation with external sort.
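
In MongoDB terms, this external sort presumably corresponds to the allowDiskUse option of the aggregation command. A minimal driver-level sketch, assuming a local server and an orders collection:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import java.util.Arrays;

public class ExternalSortSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("talend").getCollection("orders");
            // allowDiskUse(true) lets a $sort stage spill to temporary files instead
            // of failing when it exceeds the in-memory limit for a pipeline stage.
            for (Document result : orders
                    .aggregate(Arrays.asList(Aggregates.sort(Sorts.ascending("cust_id"))))
                    .allowDiskUse(true)) {
                System.out.println(result.toJson());
            }
        }
    }
}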

Global Variables

NB_LINE: the number of rows read by an input component or transferred to an output component. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
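
For example, these variables could be read in a tJava component placed after this one; the instance name tMongoDBInput_1 below is only an assumption and must match the actual label of the component in your Job.

// To paste into a tJava component triggered after this one; globalMap is provided
// by the generated Job code, and the instance name tMongoDBInput_1 is an assumption
// that must match the actual label of your component.
Integer nbLine = (Integer) globalMap.get("tMongoDBInput_1_NB_LINE");
String errorMessage = (String) globalMap.get("tMongoDBInput_1_ERROR_MESSAGE");
System.out.println("Rows read: " + nbLine);
if (errorMessage != null) {
    System.out.println("Last error: " + errorMessage);
}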

Usage

As a start component, tMongoDBInput allows you to retrieve records from a collection in the MongoDB database and transfer them to the following component for display or storage.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Scenario: Retrieving data from a collection by advanced queries

In this scenario, advanced MongoDB queries are used to retrieve the post by the author Anderson.

There are such posts in the collection blog of the MongoDB database talend:

To insert data into the database, see Scenario 1: Creating a collection and writing data to it.

Linking the components

  1. Drop tMongoDBConnection, tMongoDBClose, tMongoDBInput and tLogRow onto the workspace.

  2. Link tMongoDBConnection to tMongoDBInput using the OnSubjobOk trigger.

  3. Link tMongoDBInput to tMongoDBClose using the OnSubjobOk trigger.

  4. Link tMongoDBInput to tLogRow using a Row > Main connection.

Configuring the components

  1. Double-click tMongoDBConnection to open its Basic settings view.

  2. From the DB Version list, select the MongoDB version you are using.

  3. In the Server and Port fields, enter the connection details.

  4. In the Database field, enter the name of the MongoDB database.

  5. Double-click tMongoDBInput to open its Basic settings view.

  6. Select the Use existing connection option.

  7. In the Collection field, enter the name of the collection, namely blog.

  8. Click the [...] button next to Edit schema to open the schema editor.

  9. Click the [+] button to add five columns, namely id of the Integer type, and author, title, keywords and contents of the String type.

  10. Click OK to close the editor.

  11. The columns now appear in the left part of the Mapping area.

  12. For columns author, title, keywords and contents, enter their parent node post so that the data can be retrieved from the correct positions.

  13. In the Query box, enter the advanced query statement to retrieve the posts whose author is Anderson:

    "{post.author : 'Anderson'}"

    This statement requires that the sub-node of post, the node author, should have the value "Anderson".

  14. Double-click tLogRow to open its Basic settings view.

    Select Table (print values in cells of a table) for a better display of the results.

Executing the Job

  1. Press Ctrl+S to save the Job.

  2. Press F6 to run the Job.

    As shown above, the post by Anderson is retrieved.

tMongoDBInput properties in Spark Batch Jobs

Component family

Databases/MongoDB

 

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the properties are stored.

 

MongoDB configuration

Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

If a column in the database is a JSON document and you need to read the entire document, put an asterisk (*) in the DB column column, without quotation marks around it.

 

Collection

Enter the name of the collection to be used.

A MongoDB collection is the equivalent of an RDBMS table and contains documents.

If the collection to be used is not sharded, it is recommended to add the mongo.input.split_size property to the Advanced Hadoop MongoDB properties table. This parameter determines how the collection is going to be partitioned and read by the Spark executors. The number of partitions of the input collection can be calculated using the following formula:

Number of partitions = Collection size in MB / mongo.input.split_size

Without this property, Spark uses the default value, 8 MB, for the partition size.

For example:

mongo.input.split_size   1

In this example, Spark dispatches 1 MB to each Spark executor in order to read the non-sharded collection in parallel. If the collection size is 10 MB, 10 executors are employed.

 

Set read preference

Select this check box and from the Read preference drop-down list that is displayed, select the member to which you need to direct the read operations.

If you leave this check box clear, the Job uses the default read preference, that is, it reads from the primary member of the replica set.

For further information, see MongoDB's documentation about Replication and its Read preferences.

 

Query

Specify the query statement to select documents from the collection specified in the Collection field. For example, type in "{'id':'4'}" to retrieve the record whose id is 4 from the collection.

The default query, {} within double quotation marks, provided with this component means selecting all of the documents. You can also apply a regular expression, for example {'filename':{'$regex':'REGEX_PATTERN'}}, to define the file names to be used.

Unlike the query statements required in the MongoDB client software, the query here refers only to the content inside find(): for example, the query {'filename':{'$regex':'REGEX_PATTERN'}} here is the equivalent of db.blog.find({filename:{$regex:REGEX_PATTERN}}) in the MongoDB client.
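
For reference, the same regular-expression filter could be built with the MongoDB Java driver outside Talend, as sketched below; the connection string, the blog collection and the REGEX_PATTERN placeholder are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class RegexQuerySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> blog =
                    client.getDatabase("talend").getCollection("blog");
            // Equivalent of entering "{'filename':{'$regex':'REGEX_PATTERN'}}" in the
            // Query field; REGEX_PATTERN is only a placeholder pattern.
            for (Document doc : blog.find(Filters.regex("filename", "REGEX_PATTERN"))) {
                System.out.println(doc.toJson());
            }
        }
    }
}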

 

Mapping

Each column of the schema defined for this component represents a field of the documents to be read. In this table, you need to specify the parent nodes of these fields, if any.

For example, in the following document:

{
  _id: ObjectId("5099803df3f4948bd2f98391"),
  person: { first: "Joe", last: "Walker" }
}

The first and last fields have person as their parent node, but the _id field has no parent node. Once completed, this Mapping table should therefore read as follows:

Column     Parent node path
_id
first       "person"
last        "person"
 

Limit

Enter the maximum number of records to be retrieved.

Advanced settings

Advanced Hadoop MongoDB properties

Add properties to define extra operations you need tMongoDBInput to perform when reading data.

The available properties are listed and explained in MongoDB Connector for Hadoop.

If the collection to be used is not sharded, it is recommended to add the mongo.input.split_size property to the Advanced Hadoop MongoDB properties table. This parameter determines how the collection is going to be partitioned and read by the Spark executors. The number of partitions of the input collection can be calculated using the following formula:

Number of partitions = Collection size in MB / mongo.input.split_size

Without this property, Spark uses the default value, 8 MB, for the partition size.

For example:

mongo.input.split_size   1

In this example, Spark dispatches 1 MB to each Spark executor in order to read the non-sharded collection in parallel. If the collection size is 10 MB, 10 executors are employed.

Usage in Spark Batch Jobs

This component is used as a start component and requires an output link.

This component should use a tMongoDBConfiguration component present in the same Job to connect to a MongoDB database. You need to drop a tMongoDBConfiguration component alongside this component and configure the Basic settings of this component to use tMongoDBConfiguration.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job requires its dependent jar files for execution, one and only one file-system-related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which these jar files are transferred.

This connection is effective on a per-Job basis.

Writing and reading data from MongoDB using a Spark Batch Job

In this scenario, you create a Spark Batch Job to write data about some movie directors into the MongoDB default database and then read the data from this database.

The sample data about movie directors reads as follows:

1;Gregg Araki	
2;P.J. Hogan 
3;Alan Rudolph 
4;Alex Proyas
5;Alex Sichel

This data contains the names of these directors and the ID numbers distributed to them.

Note that the sample data is created for demonstration purposes only.

Prerequisite: ensure that the Spark cluster and the MongoDB database to be used have been properly installed and are running.

To replicate this scenario, proceed as follows:

Linking the components

  1. In the Integration perspective of the Studio, create an empty Spark Batch Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Batch Job, see Talend Big Data Getting Started Guide.

  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tMongoDBConfiguration, tFixedFlowInput, tMongoDBOutput, tMongoDBInput and tLogRow.

    The tFixedFlowInput component is used to load the sample data into the data flow. In real-world practice, you can use other components, such as tFileInputDelimited, alone or together with a tMap, in place of tFixedFlowInput to design a more sophisticated process to prepare the data to be processed.

  3. Connect tFixedFlowInput to tMongoDBOutput using the Row > Main link.

  4. Connect tMongoDBInput to tLogRow using the Row > Main link.

  5. Connect tFixedFlowInput to tMongoDBInput using the Trigger > OnSubjobOk link.

  6. Leave tHDFSConfiguration and tMongoDBConfiguration alone without any connection.

Setting up Spark connection

  1. Click Run to open its view and then click the Spark Configuration tab to display its view for configuring the Spark connection.

    This view looks like the image below:

  2. Select the type of the Spark cluster you need to connect to.

    • Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.

      Note that this local machine is the machine on which the Job is actually run. Local is the default mode; you need to clear its check box to display the drop-down list and select one of the other modes.

    • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

    • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

  3. If you are using the Yarn client mode, the Property type list is displayed, allowing you to select an established Hadoop connection from the Repository, provided that you have created this connection in the Repository. The Studio then reuses that set of connection information for this Job.

    For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node of the Talend Studio User Guide.

  4. Select the version of the Hadoop distribution you are using along with Spark.

    • If you select Microsoft HD Insight 3.4, you need to configure the connections to the Livy service, the HD Insight service and the Windows Azure Storage service of that cluster in the areas that are displayed. A demonstration video about how to configure a connection to Microsoft HD Insight cluster is available in the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.

      The configuration of Livy is not presented in this video. The Hostname of Livy uses the following syntax: your_spark_cluster_name.azurehdinsight.net. For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.

    • If you select Amazon EMR, see the article Amazon EMR - Getting Started on Talend Help Center (https://help.talend.com) about how to configure the connection.

      It is recommended to install your Talend Jobserver in the EMR cluster. For further information about this Jobserver, see Talend Installation Guide.

    If you cannot find the distribution corresponding to yours in this drop-down list, the distribution you want to connect to is not officially supported by Talend. In this situation, you can select Custom, then select the Spark version of the cluster to be connected to and click the button to display the dialog box in which you can alternatively:

    1. Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.

      Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.

  5. Configure the connection information to the principal services of the cluster to be used.

    If you are using the Yarn client mode, you need to enter the addresses of the following services in their corresponding fields (if you leave the check box of a service clear, then at runtime the configuration of this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter this directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your user name to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

      • If this cluster is a MapR cluster of the version 4.0.1 or later, you can set the MapR ticket authentication configuration in addition or as an alternative by following the explanation in Connecting to a security-enabled MapR.

        Keep in mind that this configuration generates a new MapR security ticket for the username defined in the Job in each execution. If you need to reuse an existing ticket issued for the same username, leave both the Force MapR ticket authentication check box and the Use Kerberos authentication check box clear, and then MapR should be able to automatically find that ticket on the fly.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.

      Since the Job needs to upload jar files to the HDFS of the cluster to be used, you must ensure that this user name is the same as the one you have put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.

    If you are using the Standalone mode, you need to set the following parameters:

    • In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.

  6. If you need to run the current Job on Windows, it is recommended to specify where the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave this check box clear; the Studio generates one by itself and automatically uses it for this Job.

  7. If the Spark cluster cannot recognize the machine on which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine and thus find the Job and its driver.

    Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.

  8. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the streaming computation such as the metadata and the generated RDDs of this computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing.

  9. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory:

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

      This feature is available to the Yarn client mode only.

  10. In the Spark "scratch" directory field, enter the directory in which the Studio stores in the local system the temporary files such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  11. In the Yarn client mode, you can enable the Spark application logs of this Job to be persistent in the file system. To do this, select the Enable Spark event logging check box.

    The parameters relevant to Spark logs are displayed:

    • Spark event logs directory: enter the directory in which Spark events are logged. This is actually the spark.eventLog.dir property.

    • Spark history server address: enter the location of the history server. This is actually the spark.yarn.historyServer.address property.

    • Compress Spark event logs: if need be, select this check box to compress the logs. This is actually the spark.eventLog.compress property.

    Since the administrator of your cluster could have defined these properties in the cluster configuration files, it is recommended to contact the administrator for the exact values.

  12. In the Advanced properties table, add any Spark properties you need to use to override their default counterparts used by the Studio.

    The advanced properties required by different Hadoop distributions and their values are listed below:

    • Hortonworks Data Platform V2.4:

      • spark.yarn.am.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      • spark.driver.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it on all the Jobs that are designed in the same Studio.

    • MapR V5.1 and V5.2 when the cluster is used with the HBase or the MapRDB components:

      • spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and add, if missing, the classpath to HBase to ensure that the Job to be used can find the required classes and packages in the cluster.

        For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster and then add opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.

        For a step-by-step explanation about how to add this parameter, see the documentation HBase/MapR-DB Job cannot successfully run with MapR 5.1 or 5.2 on Talend Help Center.

    For further information about the valid Spark properties, see Spark documentation at https://spark.apache.org/docs/latest/configuration.

  13. If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your distribution to trace your Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Username and Password: the credentials you use to connect to your Cloudera Navigator.

    • Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.

    • Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

      Since this option actually forces Cloudera Navigator to generate lineages of all its available entities, such as HDFS files and directories, Hive queries or Pig scripts, it is not recommended for the production environment because it will slow down the Job.

    • Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

      Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

      This feature is meant to facilitate testing of your Job but is not recommended for use in a production cluster.

  14. If you are using Hortonworks Data Platform V2.4.0 onwards and you have installed Atlas in your cluster, you can select the Use Atlas check box to enable Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Atlas URL: enter the location of the Atlas to be connected to. It is often http://name_of_your_atlas_node:port

    • In the Username field and the Password field, enter the authentication information for access to Atlas.

    • Set Atlas configuration folder: if your Atlas cluster contains custom properties such as SSL or read timeout, select this check box, and in the displayed field, enter a directory on your local machine, then place the atlas-application.properties file of your Atlas in this directory. This way, your Job is enabled to use these custom properties.

      You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.

    • Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas.

      Otherwise, leave it clear to allow your Job to continue to run.

    If you are using Hortonworks Data Platform V2.4, the Studio supports Atlas 0.5 only; if you are using Hortonworks Data Platform V2.5, the Studio supports Atlas 0.7 only.

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that the user name must be the same as the one you have entered in the Spark configuration tab.

Configuring the connection to the MongoDB database to be used by Spark

  1. Double-click tMongoDBConfiguration to open its Component view.

  2. From the DB Version list, select the version of the MongoDB database to be used.

  3. In the Server field and the Port field, enter corresponding information of the MongoDB database.

  4. In the Database field, enter the name of the database. This database must already exist.

Loading the sample data

  1. Double-click the tFixedFlowInput component to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button to add the schema columns as shown in this image.

  4. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  5. In the Mode area, select the Use Inline Content radio button and paste the above-mentioned sample data about movie directors into the Content field that is displayed.

  6. In the Field separator field, enter a semicolon (;).

Writing data to MongoDB

  1. Double-click tMongoDBOutput to open its Component view.

  2. If this component does not have the same schema as the preceding component, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.

  3. In the Collection field, enter the name of the collection to which you need to write data. If this collection does not exist, it will be automatically created at runtime.

  4. From the Action on data list, select the operation to be performed on the data. In this example, select Insert, which creates documents in MongoDB whether or not they already exist and, in either case, generates a new technical ID for each new document.

  5. In the Mapping table, the id and the name columns have been automatically added. You need to define how the data from these two columns should be transformed into a hierarchical construct in MongoDB.

    In this example, enter, within double quotation marks, person in the Parent node path column for each row. This way, each director record is added to a node called person. If you leave this Parent node path column empty, these records are added to the root of each document.

Reading data from MongoDB

  1. Double-click tMongoDBInput to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button to add the schema columns for output as shown in this image.

    If you want to extract the technical ID of each document, add a column called _id to the schema. In this example, this column is added. These technical IDs were generated at random by MongoDB when the sample data was written to the database.

  4. In the Collection field, enter the name of the collection from which you need to read data. In this example, it is the director one used previously in tMongoDBOutput.

  5. In the Mapping table, the three output columns have been automatically added. You need to add the parent nodes they belong to in the MongoDB documents. In this example, enter, within double quotation marks, person in the Parent node path column for the id and the name columns and leave the _id column as is, meaning that the _id field is at the root of each document.

    The tMongoDBInput component parses the extracted documents according to this mapping and writes the data to the corresponding columns.

Executing the Job

Then you can run this Job.

The tLogRow component is used to present the execution result of the Job.

  1. Double-click the tLogRow component to open the Component view.

  2. Select the Table radio button to present the result in a table.

  3. Press F6 to run this Job.

Once done, in the console of the Run view, you can check the execution result.