Analyzing a Twitter flow in near real-time - 6.1

Talend Components Reference Guide

In this scenario, you create a Spark Streaming Job that analyzes, at the end of each 15-second interval, which hashtags Twitter users have used most over the previous 20 seconds when mentioning Paris in their Tweets.

An open source third-party program is used to receive Twitter streams and write them to a given Kafka topic, twitter_live for example; the Job you design in this scenario consumes the Tweets from that topic.

A row of Twitter raw data with hashtags reads like the example presented at https://dev.twitter.com/overview/api/entities-in-twitter-objects#hashtags.

Before replicating this scenario, you need to ensure that your Kafka system is up and running and that you have the proper rights and permissions to access the Kafka topic to be used. You also need a Twitter-streaming program to transfer Twitter streams into Kafka in near real-time. Talend does not provide this kind of program, but free programs created for this purpose are available in online communities such as GitHub.
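
The sketch below shows, for illustration only, what the producer side of such a program could look like, using the Kafka Java producer client. The tweetJson string is a placeholder for the raw Tweet JSON delivered by whichever Twitter-streaming library you use, and the broker address matches the localhost:9092 broker used later in this scenario.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TweetToKafka {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Placeholder Tweet: in a real program this comes from the Twitter stream.
                String tweetJson =
                        "{\"text\":\"Hello #Paris\",\"entities\":{\"hashtags\":[{\"text\":\"Paris\"}]}}";
                // Write each received Tweet as one record of the twitter_live topic.
                producer.send(new ProducerRecord<>("twitter_live", tweetJson));
            }
        }
    }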

To replicate this scenario, proceed as follows:

Linking the components

  1. In the Integration perspective of the Studio, create an empty Spark Streaming Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Streaming Job, see Talend Big Data Getting Started Guide.

  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tKafkaInput, tWindow, tExtractJSONFields, tMap, tAggregateRow, tTop and tLogRow.

  3. Connect tKafkaInput, tWindow, tExtractJSONFields and tMap using the Row > Main link.

  4. Connect tMap to tAggregateRow using the Row > Main link and name this connection in the dialog box that is displayed. For example, name it hashtag.

  5. Connect tAggregateRow, tTop and tLogRow using the Row > Main link.

  6. Leave the tHDFSConfiguration component as it is, without any connection.

Setting up Spark connection

  1. Click Run to open its view and then click the Spark Configuration tab to configure the Spark connection.

  2. Select the type of the Spark cluster you need to connect to.

    • Local: the Studio builds the Spark environment itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires a minimum of parameters to be set in this configuration view.

      Note that this local machine is the machine on which the Job is actually run. The Local mode is the default mode; you need to clear its check box to display the drop-down list from which you can select the other modes.

    • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

    • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

  3. If you are using the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, provided that you have created this connection in the Repository. The Studio then reuses that set of connection information for this Job.

    For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node of the Talend Studio User Guide.

  4. Select the version of the Hadoop distribution you are using along with Spark.

    If you cannot find the distribution corresponding to yours in this drop-down list, the distribution you want to connect to is not officially supported by Talend. In this situation, you can select Custom, then select the Spark version of the cluster to be connected to and click the button to display the dialog box in which you can alternatively:

    1. Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.

      Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.

  5. Configure the connection information to the principal services of the cluster to be used.

    If you are using the Yarn client mode, you need to enter the addresses of the following services in their corresponding fields (if you leave the check box of a service clear, then at runtime the configuration of this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter the staging directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • Select the Set memory check box to allocate proper memory volumes to the Map and the Reduce computations and the ApplicationMaster of YARN.

    • If you are accessing a Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your user name to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.

      Since the Job needs to upload jar files to the HDFS system of the cluster to be used, you must ensure that this user name is the same as the one you have put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.

    If you are using the Standalone mode, you need to set the following parameters:

    • In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.

  6. If you need to run the current Job on Windows, it is recommended to specify where the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave this check box clear; the Studio generates one itself and automatically uses it for this Job.

  7. If the Spark cluster cannot recognize the machine in which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine and thus to find the Job and its driver.

    Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.

  8. In the Batch size field, enter the time interval at the end of which the Job reviews the source data to identify changes and processes the new micro batches.

  9. If need be, select the Define a streaming timeout check box and, in the field that is displayed, enter the time frame at the end of which the streaming Job automatically stops running.

  10. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, you need to enter the directory in which Spark stores, in the file system of the cluster, the context data of the streaming computation such as the metadata and the generated RDDs of this computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing.

  11. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory:

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.

      • Auto: you leave Yarn to manage the allocation by itself.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

      This feature is available to the Yarn client mode only.

  12. In the Spark "scratch" directory field, enter the directory in which the Studio stores, in the local system, the temporary files such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  13. Add any Spark properties you need to use to override their default counterparts used by the Studio.

    For example, if you are using the Yarn client mode with a CDH distribution, you need to specify the Yarn classpath of your cluster for the Job. The property to be added is spark.hadoop.yarn.application.classpath. Please contact the administrator of your cluster to obtain related information.

    The following value from a Cloudera cluster is presented for demonstration purposes:

    /etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/*

    If you want the Spark application logs of this Job to be persistent in the file system, add the related properties to this Advanced properties table. For example, the properties to be set for the Yarn client mode are:

    • spark.yarn.historyServer.address

    • spark.eventLog.enabled

    • spark.eventLog.dir

    The value of the spark.eventLog.enabled property should be true; for the values of the other two properties, contact the administrator of the Spark cluster to be used.

    For further information about the valid Spark properties, see Spark documentation at https://spark.apache.org/docs/latest/configuration.
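
For illustration only, the sketch below shows roughly how some of the settings of this section (advanced properties, batch size and checkpointing) translate into the Spark Streaming API the generated Job relies on. The application name, memory size, event log directory and checkpoint directory are placeholders; the Studio generates and submits the actual code for you.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkConfigurationSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("twitter_hashtags")                          // placeholder name
                    .set("spark.executor.memory", "1g")                      // Executor memory (placeholder)
                    .set("spark.eventLog.enabled", "true")                   // persist application logs
                    .set("spark.eventLog.dir", "hdfs:///tmp/spark-events");  // placeholder directory

            // Batch size: one micro batch every 5 seconds, as used in this scenario.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Activate checkpointing: context data is stored in the cluster's file system.
            jssc.checkpoint("hdfs:///user/talend/checkpoints"); // placeholder directory

            jssc.start();
            jssc.awaitTermination();
        }
    }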

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files the Job depends on are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that this user name must be the same as the one you have entered in the Spark configuration tab.
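
For reference, the NameNode URI and Username fields correspond, at the Hadoop client level, to a connection of the following kind. This is an illustrative sketch only; the host name, port and user name are placeholders.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode URI field, for example hdfs://namenode-host:8020 (placeholder).
            URI nameNodeUri = URI.create("hdfs://namenode-host:8020");
            // Username field: must match the user name set in the Spark configuration tab.
            FileSystem fs = FileSystem.get(nameNodeUri, conf, "talend");
            System.out.println(fs.exists(new Path("/user/talend")));
            fs.close();
        }
    }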

Reading messages from a given Kafka topic

  1. Double-click tKafkaInput to open its Component view.

  2. In the Broker list field, enter the locations of the brokers of the Kafka cluster to be used, separating these locations with a comma (,). In this example, only one broker exists and its location is localhost:9092.

  3. From the Starting offset drop-down list, select the starting point from which the messages of a topic are consumed. In this scenario, select From latest, meaning to start from the latest message that has been consumed by the same consumer group and of which the offset has been committed.

  4. In the Topic name field, enter the name of the topic from which this Job consumes Twitter streams. In this scenario, the topic is twitter_live.

    This topic must exist in your Kafka system. For further information about how to create a Kafka topic, see the documentation from Apache Kafka or use the tKafkaCreateTopic component provided with the Studio. But note that tKafkaCreateTopic is not available to the Spark Jobs.

  5. Select the Set number of records per second to read from each Kafka partition check box. This limits the size of each micro batch to be sent for processing.
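
For illustration only, the settings above roughly correspond to the following use of the Spark 1.x direct Kafka stream (spark-streaming-kafka). Mapping From latest to auto.offset.reset=largest is an approximation of the component's behavior; the Studio generates the actual consuming code for you.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;

    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class KafkaInputSketch {
        static JavaPairInputDStream<String, String> tweets(JavaStreamingContext jssc) {
            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092"); // Broker list
            kafkaParams.put("auto.offset.reset", "largest");           // Starting offset: From latest

            Set<String> topics = Collections.singleton("twitter_live"); // Topic name
            return KafkaUtils.createDirectStream(jssc,
                    String.class, String.class,
                    StringDecoder.class, StringDecoder.class,
                    kafkaParams, topics);
        }
    }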

Configuring how frequently the Tweets are analyzed

  1. Double-click tWindow to open its Component view.

    This component is used to apply a Spark window on the input RDD so that this Job always analyzes the Tweets of the last 20 seconds at the end of each 15-second interval. Between every two applications of the window, this creates an overlap of one micro batch, that is, 5 seconds as defined in the Batch size field of the Spark Configuration tab.

  2. In the Window duration field, enter 20000, meaning 20 seconds.

  3. Select the Define the slide duration check box and in the field that is displayed, enter 15000, meaning 15 seconds.

The configuration of the window is then displayed above the icon of tWindow in the Job you are designing.
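
At the Spark Streaming API level, this configuration amounts to the following window call, shown here as a sketch only (stream stands for any JavaDStream carrying the Kafka messages).

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class WindowSketch {
        static JavaDStream<String> windowed(JavaDStream<String> stream) {
            // Window duration: 20000 ms; slide duration: 15000 ms.
            return stream.window(Durations.seconds(20), Durations.seconds(15));
        }
    }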

Extracting the hashtag field from the raw Tweet data

  1. Double-click tExtractJSONFields to open its Component view.

    As you can read from https://dev.twitter.com/overview/api/entities-in-twitter-objects#hashtags, the raw Tweet data uses the JSON format.

  2. Click Sync columns to retrieve the schema from its preceding component. This is actually the read-only schema of tKafkaInput, since tWindow does not impact the schema.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. Rename the single column of the output schema to hashtag. This column is used to carry the hashtag field extracted from the Tweet JSON data.

  5. Click OK to validate these changes.

  6. From the Read by list, select JsonPath.

  7. From the JSON field list, select the column of the input schema from which you need to extract fields. In this scenario, it is payload.

  8. In the Loop Jsonpath query field, enter the JSONPath query pointing to the element over which the extraction is looped. According to the JSON structure of a Tweet, as described in the Twitter documentation, enter $.entities.hashtags to loop over the hashtags entity.

  9. In the Mapping table, in which the hashtag column of the output schema has been filled in automatically, enter the element on which the extraction is performed. In this example, this is the text attribute of each hashtags entity. Therefore, enter "text" (within double quotation marks) in the Json query column.
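
The following standalone sketch illustrates what these two queries extract from a raw Tweet, using the Jayway json-path library: the loop query selects the hashtags array and the "text" mapping reads the text attribute of each element. The payload string is a truncated, hypothetical example of Tweet JSON.

    import java.util.List;

    import com.jayway.jsonpath.JsonPath;

    public class HashtagExtractionSketch {
        public static void main(String[] args) {
            // Hypothetical, truncated Tweet payload containing two hashtags.
            String payload = "{\"entities\":{\"hashtags\":"
                    + "[{\"text\":\"Paris\",\"indices\":[5,11]},"
                    + "{\"text\":\"Travel\",\"indices\":[12,19]}]}}";

            // Equivalent to looping over $.entities.hashtags and mapping the "text" attribute.
            List<String> hashtags = JsonPath.read(payload, "$.entities.hashtags[*].text");
            System.out.println(hashtags); // prints [Paris, Travel]
        }
    }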

Aligning each hashtag to lower case

  1. Double-click tMap to open its Map editor.

  2. In the table representing the output flow (on the right side), enter StringHandling.DOWNCASE(row2.hashtag) in the Expression column. This automatically creates the map between the hashtag column of the input schema and the hashtag column of the output schema.

    Note that row2 in this expression is the ID of the input link to tMap. It can be labeled differently in the Job you are designing.

  3. Click Apply to validate these changes and click OK to close this editor.
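
The StringHandling.DOWNCASE routine converts the hashtag to lower case; at the Spark level this amounts to a simple map operation, sketched below for illustration only.

    import org.apache.spark.streaming.api.java.JavaDStream;

    public class LowerCaseSketch {
        static JavaDStream<String> toLowerCase(JavaDStream<String> hashtags) {
            // Same effect as StringHandling.DOWNCASE applied to each hashtag.
            return hashtags.map(tag -> tag.toLowerCase());
        }
    }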

Counting the occurrences of each hashtag

  1. Double-click tAggregateRow to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. On the output side, click the [+] button two times to add two rows to the output schema table and rename these new schema columns to hashtag and count, respectively.

  4. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  5. In the Group by table, add one row by clicking the [+] button and select hashtag for both the Output column column and the Input column position column. This passes data from the hashtag column of the input schema to the hashtag column of the output schema.

  6. In the Operations table, add one row by clicking the [+] button.

  7. In the Output column column, select count, in the Function column, select count and in the Input column position column, select hashtag.
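
Counting the occurrences of each hashtag in this way corresponds, roughly, to the classic pair-and-reduce pattern sketched below (illustration only; the Studio generates the actual aggregation code).

    import scala.Tuple2;

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    public class HashtagCountSketch {
        static JavaPairDStream<String, Long> countByHashtag(JavaDStream<String> hashtags) {
            return hashtags
                    .mapToPair(tag -> new Tuple2<>(tag, 1L)) // group by hashtag
                    .reduceByKey((a, b) -> a + b);           // count per hashtag
        }
    }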

Selecting the 5 most used hashtags of each 20-second window

  1. Double-click tTop to open its Component view.

  2. In the Number of line selected field, enter the number of rows to be output to the next component, counting down from the first row of the data sorted by tTop. In this example, it is 5, meaning the 5 most used hashtags of each 20-second window.

  3. In the Criteria table, add one row by clicking the [+] button.

  4. In the Schema column column, select count, the column by which the data is sorted. In the sort num or alpha column, select num, which means the data to be sorted are numbers. In the Order asc or desc column, select desc to arrange the data in descending order.
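
Sorting by count in descending order and keeping the first 5 rows corresponds, for each micro batch, to something like the sketch below (illustration only).

    import java.util.List;

    import scala.Tuple2;

    import org.apache.spark.api.java.JavaPairRDD;

    public class TopHashtagsSketch {
        static List<Tuple2<Long, String>> top5(JavaPairRDD<String, Long> counts) {
            return counts
                    .mapToPair(t -> t.swap()) // key the pairs by count
                    .sortByKey(false)         // descending order
                    .take(5);                 // keep the 5 most used hashtags
        }
    }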

Executing the Job

Then you can run this Job.

The tLogRow component is used to present the execution result of the Job.

  1. Ensure that your Twitter streaming program is still running and keeps writing the received Tweets into the given topic.

  2. Press F6 to run this Job.

Leave the Job running for a while; then, in the console of the Run view, you can see that the Job lists the 5 most used hashtags in each batch of Tweets mentioning Paris. According to the configured micro batch size and Spark window, each of these Tweet batches contains the last 20 seconds' worth of Tweets received at the end of each 15-second interval.

Note that you can manage the level of the execution information output to this console by selecting the log4jLevel check box in the Advanced settings tab and then selecting the level of information you want to display.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.