Extracting the hashtag field from the raw Tweet data - Cloud

Kafka

Version

Cloud

8.0

Language

English

Product

Talend Big Data

Talend Big Data Platform

Talend Data Fabric

Talend Real-Time Big Data Platform

Module

Talend Studio

Last publication date

2024-02-29

Double-click tExtractJSONFields to open its Component view.

As described at https://dev.twitter.com/overview/api/entities-in-twitter-objects#hashtags, the raw Tweet data is in JSON format.
Click Sync columns to retrieve the schema from the preceding component. This is the read-only schema of tKafkaInput, because tWindow does not change the schema.
Click the [...] button next to Edit schema to open the schema editor.
Rename the single column of the output schema to hashtag. This column is used to carry the hashtag field extracted from the Tweet JSON data.
Click OK to validate these changes.
From the Read by list, select JsonPath.
From the JSON field list, select the column of the input schema from which you need to extract fields. In this scenario, it is payload.
In the Loop Jsonpath query field, enter the JSON path that points to the element over which the extraction loops. Following the Tweet JSON structure described in the Twitter documentation, enter $.entities.hashtags to loop over the hashtags entity.
The Mapping table is automatically filled in with the hashtag column of the output schema. In the Json query column, enter the element to extract. In this example, this is the text attribute of each hashtags entity, so enter text within double quotation marks.
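To illustrate what this configuration extracts, the following is a minimal sketch in plain Python, not the code that Talend Studio generates. It assumes a heavily trimmed, hypothetical Tweet payload and a hypothetical helper named extract_hashtags. It emulates looping over $.entities.hashtags and mapping the text attribute to the hashtag column, producing one output row per hashtag.

```python
import json

# Hypothetical, heavily trimmed Tweet payload; real Tweets carry many more fields.
SAMPLE_PAYLOAD = json.dumps({
    "text": "Streaming with #Kafka and #Talend",
    "entities": {
        "hashtags": [
            {"text": "Kafka", "indices": [15, 21]},
            {"text": "Talend", "indices": [26, 33]},
        ]
    },
})


def extract_hashtags(payload):
    """Emulate the loop over $.entities.hashtags and the "text" mapping."""
    tweet = json.loads(payload)
    # Loop Jsonpath query: $.entities.hashtags
    # (assumption of this sketch: missing keys simply yield no rows)
    loop_elements = tweet.get("entities", {}).get("hashtags", [])
    # Mapping table: output column "hashtag" <- Json query "text"
    return [{"hashtag": element.get("text")} for element in loop_elements]


rows = extract_hashtags(SAMPLE_PAYLOAD)
for row in rows:
    print(row)
```

Each dictionary in rows stands for one record of the output schema, so a Tweet with two hashtags yields two records, each carrying a single hashtag value.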