Working with Amazon Kinesis and Big Data Streaming Jobs

EnrichVersion
Cloud
6.4
EnrichProdName
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Messaging components (Integration) > Kinesis components
Design and Development > Third-party systems > Messaging components (Integration) > Kinesis components
Data Quality and Preparation > Third-party systems > Messaging components (Integration) > Kinesis components
EnrichPlatform
Talend Studio

Working with Amazon Kinesis and Big Data Streaming Jobs

This scenario shows how to work with Amazon Kinesis and Big Data Streaming Jobs using the Spark Streaming framework.

This scenario applies only to Talend Real-Time Big Data Platform or Talend Data Fabric.

This example uses Talend Real-Time Big Data Platform v6.1. In addition, it uses these licensed products provided by Amazon: Amazon EC2, Amazon Kinesis, and Amazon EMR.

In this example, you will build the following Job to read data from and write data to an Amazon Kinesis stream and display the results in the Console.

Launching an Amazon Kinesis Stream

Procedure

  1. From the Amazon Web Services home page, navigate to Kinesis.
  2. Click Go to Kinesis Streams and click Create Stream.
  3. In the Stream Name field, enter the Kinesis stream name and provide the number of shards.

    For the current example, 1 shard is enough.

  4. Click Create. You are then taken to the Kinesis stream list. Your new stream is ready to use when its status changes from CREATING to ACTIVE.
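
If you prefer to script this step instead of using the console, the following sketch shows the equivalent calls with the AWS SDK for Python (boto3). The stream name and region used here are assumptions for this example; adjust them to match your setup.

    import boto3

    STREAM_NAME = "talend-kinesis-demo"   # assumed stream name
    REGION = "us-east-1"                  # assumed region

    kinesis = boto3.client("kinesis", region_name=REGION)

    # Create the stream with a single shard, as in this scenario.
    kinesis.create_stream(StreamName=STREAM_NAME, ShardCount=1)

    # Wait until the stream status changes from CREATING to ACTIVE.
    kinesis.get_waiter("stream_exists").wait(StreamName=STREAM_NAME)
    print("Stream is ACTIVE and ready to use.")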

Writing data to an Amazon Kinesis Stream

Before you begin

In this section, it is assumed that you have an Amazon EMR cluster up and running and that you have created the corresponding cluster connection metadata in the repository. It is also assumed that you have created an Amazon Kinesis stream.

Procedure

  1. Create a Big Data Streaming Job using the Spark framework.
  2. In this example, the data to be written to Amazon Kinesis is generated with a tRowGenerator component.
  3. The data must be serialized as byte arrays before being written to the Amazon Kinesis stream. Add a tWriteDelimitedFields component and connect it to the tRowGenerator component.
  4. In the tWriteDelimitedFields component, set the Output type to byte[].
  5. To write the data to your Kinesis stream, add a tKinesisOutput component and connect the tWriteDelimitedFields component to it.
  6. Provide your Amazon credentials.
  7. To access your Kinesis stream, provide the Stream name and the corresponding endpoint URL.

    To get the right endpoint URL, refer to AWS Regions and Endpoints.

  8. Provide the number of shards, as specified when you created the Kinesis stream.
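
Conceptually, this part of the Job serializes each row into a delimited string, converts it to a byte array, and puts it on the stream. The sketch below illustrates the same sequence with boto3; the sample rows, delimiter, and partition key are assumptions for this example, not what the Talend components generate.

    import boto3

    STREAM_NAME = "talend-kinesis-demo"   # assumed stream name
    REGION = "us-east-1"                  # assumed region
    kinesis = boto3.client("kinesis", region_name=REGION)

    # Sample rows comparable to what tRowGenerator might produce (illustrative only).
    rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]

    for row in rows:
        # Serialize the fields as a delimited string, then as a byte array,
        # mirroring what tWriteDelimitedFields does before tKinesisOutput.
        payload = ";".join(row).encode("utf-8")
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=payload,
            PartitionKey=row[0],  # any value; it only drives shard routing
        )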

Reading Data from an Amazon Kinesis Stream

Procedure

  1. To read data from your Kinesis stream, add a tKinesisInput component and connect the tRowGenerator component to it with an InParallel trigger.
  2. In the Basic settings view of the tKinesisInput component, provide your Amazon credentials.
  3. Provide your Kinesis Stream name and the corresponding endpoint URL.
  4. Select the Explicitly set authentication parameters option and enter your Region, as mentioned in AWS Regions and Endpoints.
  5. Add a tReplicate component and connect it with tKinesisInput.

    The tReplicate component is there only to give the Job a processing component; without one, the execution of the Job fails. tReplicate lets the Job compile without modifying the data.

  6. Add a tExtractDelimitedFields component and connect it to the tReplicate component.

    The tExtractDelimitedFields component extracts the fields from the serialized messages delivered by the tKinesisInput component.

  7. Add a tLogRow component to display the output in the console. In its Basic settings view, select Table (print values in cells of a table) to display the data in a table.
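
For reference, the read path can also be illustrated with boto3: fetch records from the shard and split the delimited byte arrays back into fields, as tKinesisInput and tExtractDelimitedFields do in the Job. This is a minimal sketch under the same assumed stream name, region, and delimiter as above.

    import boto3

    STREAM_NAME = "talend-kinesis-demo"   # assumed stream name
    REGION = "us-east-1"                  # assumed region
    kinesis = boto3.client("kinesis", region_name=REGION)

    # The example stream has a single shard, so read from the first one.
    stream = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]
    shard_id = stream["Shards"][0]["ShardId"]

    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
    )["ShardIterator"]

    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        # Deserialize the byte array back into delimited fields.
        fields = record["Data"].decode("utf-8").split(";")
        print(fields)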

Configuring a Big Data Streaming Job using the Spark Streaming Framework

Before running your Job, you need to configure it to use your Amazon EMR cluster.

Procedure

  1. Because your Job runs on Spark, add a tHDFSConfiguration component and configure it to use the HDFS connection metadata from the repository.
  2. In the Run view, click the Spark Configuration tab.
  3. In the Cluster Version panel, configure your Job to use your cluster connection metadata.
  4. Set the Batch size to 2000 ms.
  5. Because you will set some advanced properties, change the Property type to Built-In.
  6. In the Tuning panel, select the Set tuning properties option and configure the fields as follows.
  7. Run your Job.

    It can take a couple of minutes before data is displayed in the Console.
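
The Spark Configuration settings made in the Run view correspond roughly to the Spark Streaming setup sketched below in PySpark. The application name and tuning values are placeholders for illustration, since the actual values depend on your EMR cluster; only the 2-second batch interval comes from the scenario.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Illustrative tuning values; replace them with settings suited to your cluster.
    conf = (SparkConf()
            .setAppName("kinesis_streaming_job")   # assumed application name
            .set("spark.executor.memory", "2g")    # example tuning property
            .set("spark.executor.cores", "2"))     # example tuning property

    sc = SparkContext(conf=conf)

    # A 2-second batch interval matches the 2000 ms Batch size set in the Job.
    ssc = StreamingContext(sc, 2)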