In this scenario, you create a Spark Batch Job to write data about some movie directors into the MongoDB default database and then read the data from this database.
The sample data about movie directors reads as follows:
1;Gregg Araki
2;P.J. Hogan
3;Alan Rudolph
4;Alex Proyas
5;Alex Sichel
This data contains the names of these directors and the ID numbers assigned to them.
Note that the sample data is created for demonstration purposes only.
Prerequisite: ensure that the Spark cluster and the MongoDB database to be used have been properly installed and are running.
To replicate this scenario, proceed as follows:
In the Integration perspective of the Studio, create an empty Spark Batch Job from the Job Designs node in the Repository tree view.
For further information about how to create a Spark Batch Job, see Talend Big Data Getting Started Guide.
In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tMongoDBConfiguration, tFixedFlowInput, tMongoDBOutput, tMongoDBInput and tLogRow.
The tFixedFlowInput component is used to load the sample data into the data flow. In real-world practice, you can use other components such as tFileInputDelimited, alone or with a tMap, in place of tFixedFlowInput to design a more sophisticated process that prepares your data for processing.
Connect tFixedFlowInput to tMongoDBOutput using the Row > Main link.
Connect tMongoDBInput to tLogRow using the Row > Main link.
Connect tFixedFlowInput to tMongoDBInput using the Trigger > OnSubjobOk link.
Leave tHDFSConfiguration and tMongoDBConfiguration alone without any connection.
Click Run to open its view and then click the Spark Configuration tab to display its view for configuring the Spark connection.
This view looks like the image below:
Select the type of the Spark cluster you need to connect to.
Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.
Note that this local machine is the machine on which the Job is actually run. The Local mode is the default mode; you need to clear its check box to display the drop-down list from which you can select the other modes.
Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.
Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.
If you are using the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, on the condition that you have created this connection in the Repository. Then the Studio will reuse that set of connection information for this Job.
For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node of the Talend Studio User Guide.
Select the version of the Hadoop distribution you are using along with Spark.
If you select Microsoft HD Insight 3.4, you need to configure the connections to the Livy service, the HD Insight service and the Windows Azure Storage service of that cluster in the areas that are displayed. A demonstration video about how to configure a connection to a Microsoft HD Insight cluster is available at the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.
The configuration of Livy is not presented in this video. The Hostname of Livy uses the following syntax: your_spark_cluster_name.azurehdinsight.net. For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.
If you select Amazon EMR, see the article Amazon EMR - Getting Started on Talend Help Center (https://help.talend.com) for information about how to configure the connection.
It is recommended to install your Talend Jobserver in the EMR cluster. For further information about this Jobserver, see Talend Installation Guide.
If you cannot find the distribution corresponding to yours from this drop-down list, this means the distribution you want to connect to is not officially supported by Talend. In this situation, you can select Custom, then select the Spark version of the cluster to be connected and click the button to display the dialog box in which you can alternatively:
Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.
Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.
Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.
Configure the connection information to the principal services of the cluster to be used.
If you are using the Yarn client mode, you need to enter the addresses of the following services in their corresponding fields (if you leave the check box of a service clear, then at runtime, the configuration of this parameter in the Hadoop cluster to be used will be ignored):
In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.
Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.
Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.
Select the Set staging directory check box and enter this directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.
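As a rough illustration of where this value lives, the following sketch reads a Hadoop-style configuration file with Python's standard library and looks up the yarn.app.mapreduce.am.staging-dir property. The XML content and the /user value are made-up sample data, and the find_property helper is a hypothetical name introduced only for this example; read the real file from your own distribution.

```python
import xml.etree.ElementTree as ET

# Sample content in the usual Hadoop <configuration>/<property> layout.
# The value is illustrative only, not taken from any real cluster.
SAMPLE_MAPRED_SITE = """<configuration>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
</configuration>
"""

def find_property(xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

staging_dir = find_property(SAMPLE_MAPRED_SITE,
                            "yarn.app.mapreduce.am.staging-dir")
print(staging_dir)
```

The value found this way is what you would enter in the field displayed by the Set staging directory check box.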
If you are accessing a Hadoop cluster running with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your user name to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.
If this cluster is a MapR cluster of the version 4.0.1 or later, you can set the MapR ticket authentication configuration in addition or as an alternative by following the explanation in Connecting to a security-enabled MapR.
Keep in mind that this configuration generates a new MapR security ticket for the username defined in the Job in each execution. If you need to reuse an existing ticket issued for the same username, leave both the Force MapR ticket authentication check box and the Use Kerberos authentication check box clear, and then MapR should be able to automatically find that ticket on the fly.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.
Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.
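Before running a keytab-enabled Job, it can help to confirm that the executing user can actually read the keytab file. The sketch below shows one way to check this with Python's standard library; keytab_is_readable is a hypothetical helper, and the temporary file is only a stand-in for a real keytab.

```python
import os
import tempfile

def keytab_is_readable(path):
    """True if the path is an existing file the current OS user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)

# Demonstration with a temporary stand-in file (not a real keytab).
with tempfile.NamedTemporaryFile(suffix=".keytab", delete=False) as tmp:
    tmp.write(b"placeholder")
    demo_path = tmp.name

readable = keytab_is_readable(demo_path)
print(readable)

# Clean up the stand-in file.
os.remove(demo_path)
```

Run the check as the same user that launches the Job (user1 in the example above), since the result depends on that user's file permissions.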
The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.
Since the Job needs to upload jar files to HDFS of the cluster to be used, you must ensure that this user name is the same as the one you have put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.
If you are using the Standalone mode, you need to set the following parameters:
In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.
In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.
If you need to run the current Job on Windows, it is recommended to specify where the winutils.exe program to be used is stored.
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.
Otherwise, leave this check box clear; the Studio then generates one by itself and automatically uses it for this Job.
If the Spark cluster cannot recognize the machine on which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.
Note that in this situation, you also need to add the name and the IP address of this machine to its host file.
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the streaming computation such as the metadata and the generated RDDs of this computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing.
Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory:
Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.
Executor memory: enter the allocation size of memory to be used by each Spark executor.
Set executor memory: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.
Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.
Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.
Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.
Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.
Yarn resource allocation: select how you want Yarn to allocate resources among executors.
Auto: you let Yarn use its default number of executors. This number is 2.
Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.
Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.
This feature is available to the Yarn client mode only.
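The tuning options above correspond broadly to standard Spark properties. The following sketch is a minimal illustration of that correspondence, not a description of how the Studio generates its configuration: only spark.yarn.executor.memoryOverhead is named in this documentation, the other property names are the usual Spark ones, and the function names and sample values are assumptions made for the example.

```python
def tuning_properties(driver_memory, driver_cores, executor_memory,
                      memory_overhead_mb=None, executor_cores=None):
    """Build a dict of Spark properties from tuning values (assumed mapping)."""
    props = {
        "spark.driver.memory": driver_memory,
        "spark.driver.cores": str(driver_cores),
        "spark.executor.memory": executor_memory,
    }
    # Optional settings are added only when a value is supplied.
    if memory_overhead_mb is not None:
        props["spark.yarn.executor.memoryOverhead"] = str(memory_overhead_mb)
    if executor_cores is not None:
        props["spark.executor.cores"] = str(executor_cores)
    return props

def executor_allocation(mode, num=None, initial=None, minimum=None, maximum=None):
    """Translate an allocation mode (auto, fixed, dynamic) into properties."""
    if mode == "auto":
        return {}  # nothing is set, so the default number of executors applies
    if mode == "fixed":
        return {"spark.executor.instances": str(num)}
    if mode == "dynamic":
        if not (minimum <= initial <= maximum):
            raise ValueError("expected min <= initial <= max")
        return {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.initialExecutors": str(initial),
            "spark.dynamicAllocation.minExecutors": str(minimum),
            "spark.dynamicAllocation.maxExecutors": str(maximum),
        }
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical sample values, for illustration only.
props = tuning_properties("1g", 1, "2g", memory_overhead_mb=384, executor_cores=2)
props.update(executor_allocation("dynamic", initial=2, minimum=1, maximum=4))
for key in sorted(props):
    print(f"{key}={props[key]}")
```

The range check in the dynamic branch reflects the requirement that the initial number of executors lies between the lowest and the largest numbers you define.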
In the Spark "scratch" directory field, enter the directory in which the Studio stores in the local system the temporary files such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
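The way a drive-less path such as /tmp lands on the default disk can be reproduced with Python's pathlib, which follows Windows path rules. This is only an illustration of the path semantics, under the assumption that the default disk is C:; resolve_scratch_dir is a hypothetical helper and does not represent the Studio's internal logic.

```python
from pathlib import PureWindowsPath

def resolve_scratch_dir(field_value, default_drive="C:"):
    """Combine a default drive with the value entered in the field.

    A rooted path without a drive keeps the default drive; a value that
    carries its own drive letter replaces it.
    """
    return PureWindowsPath(default_drive, field_value).as_posix()

print(resolve_scratch_dir("/tmp"))
print(resolve_scratch_dir("D:/spark/tmp"))
```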
In the Yarn client mode, you can enable the Spark application logs of this Job to be persistent in the file system. To do this, select the Enable Spark event logging check box.
The parameters relevant to Spark logs are displayed:
Spark event logs directory: enter the directory in which Spark events are logged. This is actually the spark.eventLog.dir property.
Spark history server address: enter the location of the history server. This is actually the spark.yarn.historyServer.address property.
Compress Spark event logs: if need be, select this check box to compress the logs. This is actually the spark.eventLog.compress property.
Since the administrator of your cluster could have defined these properties in the cluster configuration files, it is recommended to contact the administrator for the exact values.
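To compare the values you enter with those defined on the cluster, it can be useful to see the three properties side by side in the key-value form used by Spark configuration files. The sketch below assembles them; the directory and the history server address are hypothetical placeholders, spark.eventLog.enabled is the standard Spark switch for event logging (not named in this documentation), and the helper names are introduced only for this example.

```python
def event_log_properties(log_dir, history_server, compress=False):
    """Collect the Spark properties related to event logging."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
        "spark.yarn.historyServer.address": history_server,
        "spark.eventLog.compress": "true" if compress else "false",
    }

def to_spark_defaults(props):
    """Render properties as 'key value' lines, one property per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(props.items()))

# Placeholder values: ask your cluster administrator for the real ones.
log_props = event_log_properties("hdfs://namenode:8020/spark-logs",
                                 "historyserver:18080",
                                 compress=True)
print(to_spark_defaults(log_props))
```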
In the Advanced properties table, add any Spark properties you need to use to override their default counterparts used by the Studio.
The advanced properties required by different Hadoop distributions and their values are listed below:
Hortonworks Data Platform V2.4:
In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it to all the Jobs that are designed in the same Studio.
MapR V5.1 and V5.2 when the cluster is used with the HBase or the MapRDB components:
spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and add, if missing, the classpath to HBase to ensure that the Job to be used can find the required classes and packages in the cluster.
For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster and then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.
For a step-by-step explanation about how to add this parameter, see the documentation HBase/MapR-DB Job cannot successfully run with MapR 5.1 or 5.2 on Talend Help Center.
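The classpath value described above can be assembled as in the following sketch, which appends the HBase paths only when they are missing and joins everything with commas. The existing classpath string is a hypothetical sample, the paths assume the usual /opt/mapr installation location, and add_hbase_classpath is a helper name introduced for this example.

```python
def add_hbase_classpath(cluster_classpath, hbase_version):
    """Append the MapR HBase library paths to a comma-separated classpath."""
    entries = [p.strip() for p in cluster_classpath.split(",") if p.strip()]
    extras = (
        f"/opt/mapr/hbase/hbase-{hbase_version}/lib/*",
        "/opt/mapr/lib/*",
    )
    for extra in extras:
        if extra not in entries:  # add only if missing
            entries.append(extra)
    return ",".join(entries)

# Hypothetical value copied from a cluster configuration.
existing = "$HADOOP_CONF_DIR,/opt/mapr/lib/*"
classpath = add_hbase_classpath(existing, "1.1.1")
print(classpath)
```

The result is the value to enter for spark.hadoop.yarn.application.classpath in the Advanced properties table.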
For further information about the valid Spark properties, see Spark documentation at https://spark.apache.org/docs/latest/configuration.
If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your distribution to trace your Job lineage to the component level, including the schema changes between components.
With this option activated, you need to set the following parameters:
Username and Password: these are the credentials you use to connect to your Cloudera Navigator.
Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.
Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.
Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.
Since this option actually forces Cloudera Navigator to generate lineages of all its available entities such as HDFS files and directories, Hive queries or Pig scripts, it is not recommended for the production environment because it will slow the Job.
Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.
Otherwise, leave it clear to allow your Job to continue to run.
Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.
This feature is meant to facilitate the test of your Job but is not recommended to be used in a production cluster.
If you are using Hortonworks Data Platform V2.4.0 onwards and you have installed Atlas in your cluster, you can select the Use Atlas check box to enable Job lineage to the component level, including the schema changes between components.
With this option activated, you need to set the following parameters:
Atlas URL: enter the location of the Atlas to be connected to. It is often http://name_of_your_atlas_node:port.
In the Username field and the Password field, enter the authentication information for access to Atlas.
Set Atlas configuration folder: if your Atlas cluster contains custom properties such as SSL or read timeout, select this check box, and in the displayed field, enter a directory on your local machine, then place the atlas-application.properties file of your Atlas in this directory. This way, your Job is enabled to use these custom properties.
You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.
Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas.
Otherwise, leave it clear to allow your Job to continue to run.
If you are using Hortonworks Data Platform V2.4, the Studio supports Atlas 0.5 only; if you are using Hortonworks Data Platform V2.5, the Studio supports Atlas 0.7 only.
Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.
Spark uses this component to connect to the HDFS system to which the jar files dependent on the Job are transferred.
In the Version area, select the Hadoop distribution you need to connect to and its version.
In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.
In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that the user name must be the same as the one you have put in the Spark configuration tab.
Double-click tMongoDBConfiguration to open its Component view.
From the DB Version list, select the version of the MongoDB database to be used.
In the Server field and the Port field, enter corresponding information of the MongoDB database.
In the Database field, enter the name of the database. This database must already exist.
Double-click the tFixedFlowInput component to open its Component view.
Click the [...] button next to Edit schema to open the schema editor.
Click the [+] button to add the schema columns as shown in this image.
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
In the Mode area, select the Use Inline Content radio button and paste the above-mentioned sample data about movie directors into the Content field that is displayed.
In the Field separator field, enter a semicolon (;).
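To make the effect of the field separator concrete, the sketch below splits the sample data into records with Python's standard library. It only mimics the parsing step and is not how the Studio implements it; treating id as an integer is an assumption, since the schema image is not reproduced here, and parse_rows is a hypothetical helper.

```python
SAMPLE_CONTENT = """1;Gregg Araki
2;P.J. Hogan
3;Alan Rudolph
4;Alex Proyas
5;Alex Sichel"""

def parse_rows(content, separator=";"):
    """Split each non-empty line into an id column and a name column."""
    rows = []
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        id_text, name = line.split(separator, 1)
        rows.append({"id": int(id_text), "name": name})
    return rows

rows = parse_rows(SAMPLE_CONTENT)
for row in rows:
    print(row["id"], row["name"])
```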
Double-click tMongoDBOutput to open its Component view.
If this component does not have the same schema as the preceding component, a warning icon appears. In this situation, click the Sync columns button to retrieve the schema from the preceding one; once done, the warning icon disappears.
In the Collection field, enter the name of the collection to which you need to write data. If this collection does not exist, it will be automatically created at runtime.
From the Action on data list, select the operation to be performed on the data. In this example, select Insert, which creates documents in MongoDB whether these documents already exist or not and in either case, generates a new technical ID for each of the new documents.
In the Mapping table, the id and the name columns have been automatically added. You need to define how the data from these two columns should be transformed into a hierarchical construct in MongoDB.
In this example, enter, within double quotation marks, person in the Parent node path column for each row. This way, each director record is added to a node called person. If you leave this Parent node path column empty, these records are added to the root of each document.
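The effect of the Parent node path column can be pictured with plain Python dictionaries standing in for MongoDB documents. This is a simplified sketch of the resulting document shape, assuming a single-level parent node; to_document is a hypothetical helper, and the technical ID that MongoDB generates on insert is not modelled.

```python
def to_document(row, mapping):
    """Place each column under its parent node, or at the root if none is set."""
    document = {}
    for column, parent in mapping.items():
        # An empty parent means the column is written to the document root.
        target = document.setdefault(parent, {}) if parent else document
        target[column] = row[column]
    return document

row = {"id": 1, "name": "Gregg Araki"}

# Both columns mapped to the "person" parent node, as in this scenario.
nested = to_document(row, {"id": "person", "name": "person"})
print(nested)

# Parent node path left empty: the columns go to the root of the document.
flat = to_document(row, {"id": "", "name": ""})
print(flat)
```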
Double-click tMongoDBInput to open its Component view.
Click the [...] button next to Edit schema to open the schema editor.
Click the [+] button to add the schema columns for output as shown in this image.
If you want to extract the technical ID of each document, add a column called _id to the schema. In this example, this column is added. These technical IDs were generated at random by MongoDB when the sample data was written to the database.
In the Collection field, enter the name of the collection from which you need to read data. In this example, it is director, the collection used previously in tMongoDBOutput.
In the Mapping table, the three output columns have been automatically added. You need to add the parent nodes they belong to in the MongoDB documents. In this example, enter, within double quotation marks, person in the Parent node path column for the id and the name columns and leave the _id column as is, meaning that the _id field is at the root of each document.
The tMongoDBInput component parses the extracted documents according to this mapping and writes the data in the corresponding columns.
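The reverse operation, reading nested documents back into flat rows, can be sketched in the same spirit. The document below is a hand-written stand-in with a placeholder _id value (real technical IDs are generated by MongoDB), and to_row is a hypothetical helper, not the component's actual implementation.

```python
def to_row(document, mapping):
    """Read each column from its parent node, or from the root if none is set."""
    row = {}
    for column, parent in mapping.items():
        source = document.get(parent, {}) if parent else document
        row[column] = source.get(column)  # None when the field is missing
    return row

# Stand-in for a document read from the collection.
document = {
    "_id": "placeholder-technical-id",
    "person": {"id": 1, "name": "Gregg Araki"},
}

# _id sits at the root; id and name sit under the "person" node.
mapping = {"_id": "", "id": "person", "name": "person"}
result = to_row(document, mapping)
print(result)
```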
Then you can run this Job.
The tLogRow component is used to present the execution result of the Job.
Double-click the tLogRow component to open the Component view.
Select the Table radio button to present the result in a table.
Press F6 to run this Job.
Once done, in the console of the Run view, you can check the execution result.
Note that you can manage the level of the execution information to be output in this console by selecting the log4jLevel check box in the Advanced settings tab and then selecting the level of the information you want to display.
For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.