Working with Amazon Kinesis and Big Data Streaming Jobs
This scenario applies only to Talend Real-Time Big Data Platform or Talend Data Fabric.
This example uses Talend Real-Time Big Data Platform v6.1. In addition, it uses these licensed products provided by Amazon: Amazon EC2, Amazon Kinesis, and Amazon EMR.
In this example, you will build the following Job to read data from and write data to an Amazon Kinesis stream and display the results in the console.
Launching an Amazon Kinesis Stream
- From the Amazon Web Services home page, navigate to the Amazon Kinesis service.
- Click Go to Kinesis Streams and click Create Stream.
- In the Stream Name field, enter a name for your Kinesis stream and provide the number of shards. For this example, one shard is enough.
- Click Create to return to the Kinesis stream list. Your new stream is ready to use when its status changes from CREATING to ACTIVE.
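The shard count matters because Kinesis routes each record to a shard by taking the MD5 hash of its partition key and matching the resulting 128-bit value against each shard's hash-key range. The sketch below illustrates that default routing, assuming evenly split hash-key ranges; the `shard_for_key` helper is illustrative, not part of any AWS SDK.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis does by default:
    the MD5 hash of the key, read as a 128-bit integer, falls into exactly
    one shard's contiguous hash-key range."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    # Clamp for keys that land in the remainder when 2**128 is not
    # evenly divisible by num_shards.
    return min(key_hash // range_size, num_shards - 1)

# With a single shard, every record lands in shard 0.
assert shard_for_key("any-key", 1) == 0
```

This is why one shard suffices here: all records go to the same shard regardless of partition key. More shards raise throughput but distribute records across them.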
Writing data to an Amazon Kinesis Stream
Before you begin
In this section, it is assumed that you have an Amazon EMR cluster up and running and that you have created the corresponding cluster connection metadata in the repository. It is also assumed that you have created an Amazon Kinesis stream.
Create a Big Data Streaming Job using the Spark framework.
- In this example, the data to be written to Amazon Kinesis is generated with a tRowGenerator component.
- The data must be serialized as byte arrays before being written to the Amazon Kinesis stream. Add a tWriteDelimitedFields component and connect it to the tRowGenerator component.
- Set the Output type to byte.
- To write the data to your Kinesis stream, add a tKinesisOutput component and connect the tWriteDelimitedFields component to it.
- Provide your Amazon credentials.
- To access your Kinesis stream, provide the Stream name and the corresponding Endpoint url. To find the right endpoint URL, refer to AWS Regions and Endpoints.
- Provide the number of shards, as specified when you created the Kinesis stream.
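Conceptually, the serialization step joins each row's fields with a delimiter and encodes the result as a byte array, which becomes the Kinesis record payload. A minimal sketch of that step, assuming a semicolon field separator; the `serialize_row` helper is illustrative, not a Talend API.

```python
def serialize_row(fields, delimiter=";"):
    """Join a row's fields with a delimiter and encode the result as a
    byte array, ready to be used as a Kinesis record payload."""
    return delimiter.join(str(f) for f in fields).encode("utf-8")

payload = serialize_row(["1", "Alice", "2016-01-01"])
# payload == b"1;Alice;2016-01-01"
```

Kinesis record payloads are opaque byte arrays, which is why the Job needs this serialization step between the row generator and the output component.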
Reading Data from an Amazon Kinesis Stream
- To read data from your Kinesis stream, add a tKinesisInput component and connect the tRowGenerator component to it with an InParallel trigger.
- In the Basic settings view of the tKinesisInput component, provide your Amazon credentials.
- Provide your Kinesis Stream name and the corresponding Endpoint url.
- Select the Explicitly set authentication parameters option and enter your Region, as mentioned in AWS Regions and Endpoints.
- Add a tReplicate component and connect it to the tKinesisInput component.
The purpose of the tReplicate component is to add a processing component to the Job; otherwise, the execution of the Job will fail. The tReplicate component allows the Job to compile without modifying the data.
- Add a tExtractDelimitedFields component and connect it to the tReplicate component.
The tExtractDelimitedFields component extracts the fields from the serialized messages generated by the tKinesisInput component.
- Add a tLogRow component to display the output in the console. In its Basic settings view, select Table (print values in cells of a table) to display the data as a table.
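The extraction step is the mirror image of the serialization performed on the write side: decode the record payload and split it on the delimiter. A minimal sketch, assuming the same semicolon separator used when writing; the `extract_fields` helper is illustrative, not a Talend API.

```python
def extract_fields(payload: bytes, delimiter=";"):
    """Decode a Kinesis record payload and split it back into its
    original fields, mirroring the tExtractDelimitedFields step."""
    return payload.decode("utf-8").split(delimiter)

# Round-trips the payload produced on the write side.
assert extract_fields(b"1;Alice;2016-01-01") == ["1", "Alice", "2016-01-01"]
```

The delimiter configured here must match the one used by the serialization component on the write side, or the fields will not split correctly.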
Configuring a Big Data Streaming Job using the Spark Streaming Framework
- Because your Job will run on Spark, add a tHDFSConfiguration component and configure it to use the HDFS connection metadata from the repository.
- In the Run view, click the Spark Configuration tab. In the Cluster Version panel, configure your Job to use your cluster connection metadata.
- Set the Batch size to 2000 ms.
- Because you will set some advanced properties, change the Property type to Built-In.
- In the Tuning panel, select the Set tuning properties option and configure the fields to match your cluster resources.
Run your Job. It can take a couple of minutes before data is displayed in the console.