Which big data formats are supported
The file types (formats) available in Talend Big Data Studio depend on the component and on the target language of the generated code.
For example, tHDFSInput supports both the Text and Sequence file types, but not ORC. Some formats, such as ORC, were developed specifically to improve performance and, in Classic Data Integration Jobs, are available only with Hive.
Description
File format | Classic Data Integration Job (HDFS) | Classic Data Integration Job (Pig) | Classic Data Integration Job (Hive) | Map / Reduce Job | Spark Batch Job | Spark Streaming Job |
Text File | Yes | Yes | Yes | Option in HDFS components | Yes | Yes |
Sequence File | Yes | Yes | Yes | Option in HDFS components | Yes | Yes |
RC | No | Yes | Yes | No | No | No |
ORC (since HDP 2.0 only) | No | No | Yes | No | Yes | Yes |
Avro | No | Yes | Yes | Specific Avro components | Specific Avro components | Specific Avro components |
Parquet | No | Yes | Yes | Specific Parquet components | Specific Parquet components | Specific Parquet components |
JSON | Get/Put only | Custom Loader | No | Specific JSON components | Specific JSON components | Specific JSON components |
XML | No | Custom Loader | No | Specific XML components | Specific XML components | Specific XML components |
Impala Complex Types | Yes | Yes | Yes | Yes | Yes | Yes |
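For scripted checks, the table above can be transcribed into a small lookup structure. The following Python sketch is only an illustration: the names `JOB_TYPES`, `SUPPORT`, `support_for`, `is_supported`, and `job_types_supporting` are hypothetical and not part of any Talend product, and the values are copied from the table rather than queried from the Studio.

```python
# Column order of the table: the three Classic Data Integration Job
# component families first, then the other Job types.
JOB_TYPES = ("HDFS", "Pig", "Hive", "Map/Reduce", "Spark Batch", "Spark Streaming")

# One entry per table row; cell text is kept verbatim.
# The ORC row applies since HDP 2.0 only, as noted in the table.
SUPPORT = {
    "Text File": ("Yes", "Yes", "Yes", "Option in HDFS components", "Yes", "Yes"),
    "Sequence File": ("Yes", "Yes", "Yes", "Option in HDFS components", "Yes", "Yes"),
    "RC": ("No", "Yes", "Yes", "No", "No", "No"),
    "ORC": ("No", "No", "Yes", "No", "Yes", "Yes"),
    "Avro": ("No", "Yes", "Yes") + ("Specific Avro components",) * 3,
    "Parquet": ("No", "Yes", "Yes") + ("Specific Parquet components",) * 3,
    "JSON": ("Get/Put only", "Custom Loader", "No") + ("Specific JSON components",) * 3,
    "XML": ("No", "Custom Loader", "No") + ("Specific XML components",) * 3,
    "Impala Complex Types": ("Yes",) * 6,
}


def support_for(file_format, job_type):
    """Return the table cell for a file format and Job type."""
    return SUPPORT[file_format][JOB_TYPES.index(job_type)]


def is_supported(file_format, job_type):
    """Treat every cell other than "No" as some form of support.

    Cells such as "Get/Put only" or "Custom Loader" therefore count as
    supported, although they describe restricted or custom support.
    """
    return support_for(file_format, job_type) != "No"


def job_types_supporting(file_format):
    """List the Job types whose cell for this format is not "No"."""
    return [job for job in JOB_TYPES if is_supported(file_format, job)]


if __name__ == "__main__":
    print(job_types_supporting("ORC"))
    print(support_for("JSON", "HDFS"))
```

Under these assumptions, `job_types_supporting("ORC")` returns `['Hive', 'Spark Batch', 'Spark Streaming']`, and `support_for("JSON", "HDFS")` returns `'Get/Put only'`. Keeping the original cell text makes qualified entries visible instead of collapsing them into a plain yes or no.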
The Hadoop community developed each file format with specific features and benefits in mind.
To determine whether a given Talend component supports a file format, check the Component Reference Guide for your Talend Big Data product version; some formats may require a newer version of the product.
Environment
All supported Hadoop distributions.