Extracting the hashtag field from the raw Tweet data - Cloud - 8.0

Kafka

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-29

Procedure

  1. Double-click tExtractJSONFields to open its Component view.
    As described at https://dev.twitter.com/overview/api/entities-in-twitter-objects#hashtags, the raw Tweet data is in JSON format.
  2. Click Sync columns to retrieve the schema from the preceding component. Because tWindow does not change the schema, this is the read-only schema of tKafkaInput.
  3. Click the [...] button next to Edit schema to open the schema editor.
  4. Rename the single column of the output schema to hashtag. This column is used to carry the hashtag field extracted from the Tweet JSON data.
  5. Click OK to validate these changes.
  6. From the Read by list, select JsonPath.
  7. From the JSON field list, select the column of the input schema from which you need to extract fields. In this scenario, it is payload.
  8. In the Loop Jsonpath query field, enter the JSON path pointing to the element over which the extraction loops. Following the JSON structure of a Tweet described in the Twitter documentation, enter $.entities.hashtags to loop over the hashtags entity.
  9. In the Mapping table, the hashtag column of the output schema is filled in automatically. In the Json query column, enter the element to extract. In this example, this is the text attribute of each hashtags entity, so enter text within double quotation marks.
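
To make the effect of the loop path and the Json query easier to picture, the following Python sketch imitates the extraction outside of Talend Studio. It is an illustration only, not how tExtractJSONFields is implemented: the sample payload is a hypothetical, heavily trimmed Tweet, and the function name extract_hashtags is invented for this example. It assumes that each element of entities.hashtags is an object with a text attribute, as in the configuration above.

```python
import json

# Hypothetical, heavily trimmed Tweet payload; real Tweets carry many more fields.
payload = """
{
  "text": "Streaming with #Kafka and #Talend",
  "entities": {
    "hashtags": [
      {"text": "Kafka", "indices": [15, 21]},
      {"text": "Talend", "indices": [26, 33]}
    ]
  }
}
"""


def extract_hashtags(raw_json):
    """Imitate the loop path $.entities.hashtags combined with the Json query "text"."""
    tweet = json.loads(raw_json)
    # Loop elements: every object in the entities.hashtags array.
    loop_elements = tweet.get("entities", {}).get("hashtags", [])
    # One output row per loop element, with a single column named hashtag.
    return [{"hashtag": element.get("text")} for element in loop_elements]


rows = extract_hashtags(payload)
for row in rows:
    print(row["hashtag"])
```

In this sketch, each element matched by the loop path produces one output row, which mirrors why the output schema needs only the single hashtag column. A payload with no hashtags yields no rows.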