Kafka and AVRO in a Job - 7.1

Kafka

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
In a Talend Job, the regular Kafka components and the Kafka components for AVRO handle AVRO data differently. This difference reflects the two approaches AVRO provides to serialize and deserialize data in the AVRO format.
  • The regular Kafka components read and write the JSON format only. Therefore, if your Kafka cluster produces or consumes AVRO data and the Kafka components for AVRO are not available for some reason, you must use an avro-tools library to convert your data between AVRO and JSON outside your Job.
    For example,
    java -jar C:\2_Prod\Avro\avro-tools-1.8.2.jar tojson out.avro
    This command converts the out.avro file to JSON. You can download the avro-tools-1.8.2.jar library used in this example from the MVN Repository.
    Or
    java -jar avro-tools-1.8.2.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
    This command converts the twitter.json file to twitter.avro using the schema from twitter.avsc.
  • The Kafka components for AVRO are available in the Spark framework only; they handle data directly in the AVRO format. If your Kafka cluster produces and consumes AVRO data, use tKafkaInputAvro to read data directly from Kafka and tWriteAvroFields to send AVRO data to tKafkaOutput.

    However, these components do not handle the AVRO data created by an avro-tools library, because the avro-tools libraries and the Kafka components for AVRO each use a different one of the two approaches AVRO provides.

AVRO provides the following two approaches to serialize and deserialize data in the AVRO format:
  1. AVRO files are generated with the AVRO schema embedded in each file (via org.apache.avro.file.{DataFileWriter/DataFileReader}). The avro-tools libraries use this approach.
  2. AVRO records are generated without embedding the schema in each record (via org.apache.avro.io.{BinaryEncoder/BinaryDecoder}). The Kafka components for AVRO use this approach.

    This approach is strongly recommended when AVRO-encoded messages are continuously written to a Kafka topic, because it avoids the overhead of re-embedding the AVRO schema in every message. This is a significant advantage over the first approach when you use Spark Streaming to read data from or write data to Kafka: records (messages) are usually small while the AVRO schema is relatively large, so embedding the schema in each message is not cost-effective.
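The size argument above can be illustrated with a small sketch. It hand-rolls the AVRO binary encoding of one record (zigzag variable-length longs, length-prefixed UTF-8 strings) and compares the result with the size of the schema text. This is only an illustration under stated assumptions: it does not use the AVRO library or the Talend components, and the Tweet schema, its field names, and the sample values are hypothetical.

```python
import json

# Hypothetical schema, used only for illustration.
SCHEMA = {
    "type": "record",
    "name": "Tweet",
    "namespace": "example",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}


def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer as an AVRO long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negative numbers to small codes
    out = bytearray()
    while z & ~0x7F:  # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as an AVRO string: byte length as a long, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(username: str, tweet: str, timestamp: int) -> bytes:
    """Encode one record: field values in schema order, with no schema attached."""
    return encode_string(username) + encode_string(tweet) + encode_long(timestamp)


# Size of the schema text versus size of one schema-less record.
schema_bytes = json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8")
record_bytes = encode_record("alice", "hello kafka", 1366150681)

print("schema bytes:", len(schema_bytes))
print("record bytes:", len(record_bytes))
```

In this sketch the schema text is several times larger than the encoded record, which is the overhead that would be repeated if the schema were embedded in every message.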

The outputs of the two approaches cannot be mixed in the same read-write process: data written with one approach must be read with the same approach.
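When it is unclear which approach produced a given payload, one possible check is to look for the 4-byte header that the AVRO object container file format places at the start of each file (the characters Obj followed by the byte 1). The following is a minimal sketch of that check. The function name is hypothetical, and the result is only a heuristic, since a schema-less record could by coincidence begin with the same bytes.

```python
# Header defined by the AVRO object container file format: 'O', 'b', 'j', 1.
AVRO_CONTAINER_MAGIC = b"Obj\x01"


def is_avro_container(payload: bytes) -> bool:
    """Return True if the payload starts like an AVRO object container file.

    Hypothetical helper for illustration; this is a heuristic, not a validator.
    """
    return payload[:4] == AVRO_CONTAINER_MAGIC


# A payload that begins with the container header (file approach).
print(is_avro_container(b"Obj\x01" + b"...rest of file..."))

# A schema-less record that begins with a length-prefixed string (record approach).
print(is_avro_container(b"\x0aalice"))
```

A payload that passes this check was most likely written with the file-based approach and would need to be converted before the Kafka components for AVRO could read it.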