How to handle mixed encoding in MapReduce Jobs

Talend Documentation Team
Talend Data Fabric
Talend Big Data Platform
Talend Big Data
Talend Real-Time Big Data Platform
Talend Studio


When the source data you need to read from HDFS uses mixed encoding, for example UTF-8 data with a delimiter encoded in ISO-8859-15, activate the advanced decoder feature of the tHDFSInput component so that the data is read correctly.
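To make the problem concrete, the following minimal sketch builds a mixed-encoding record and shows what a single-charset decoder does with it. It is plain Python, not Talend code; the sample field values and the names `FIELDS`, `DELIMITER_BYTES` and `try_decode` are assumptions made up for illustration.

```python
# Hypothetical sample record (not from the article): three UTF-8 fields
# joined by the letter þ encoded in ISO-8859-15, which is the single byte 0xFE.
DELIMITER_BYTES = "þ".encode("iso-8859-15")
FIELDS = ["café", "Zürich", "42"]
record = DELIMITER_BYTES.join(field.encode("utf-8") for field in FIELDS)


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The byte 0xFE never occurs in valid UTF-8, so a strict UTF-8 decoder
# rejects the whole record.
as_utf8 = try_decode(record, "utf-8")

# ISO-8859-15 maps every byte to a character, so decoding does not fail,
# but each multi-byte UTF-8 character is turned into two wrong characters.
as_latin9 = try_decode(record, "iso-8859-15")

print(as_utf8)
print(as_latin9.split("þ"))
```

Neither single encoding recovers the original fields, which is why the delimiter and the field content have to be treated separately.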


This feature is available in Talend MapReduce Jobs as of version 5.6.2 of the Talend solutions with Big Data.

Use the tHDFSInput component to read source data that contains mixed encoding.


In version 5.6.2 of Talend Studio, the Activate advanced decoder check box was added to the Advanced settings view of the MapReduce version of the tHDFSInput component to handle mixed encoding.

To use this feature properly, proceed as follows.


The procedure described in this article explains the Activate advanced decoder feature only. Activating this feature alone is not enough for your MapReduce Job to run successfully: you still need to design your Job properly and configure the connection to the Hadoop cluster to be used.

For further information about how to create a Talend MapReduce Job, see Designing MapReduce Jobs.

  1. Double-click tHDFSInput to open its Basic settings view.

  2. Select the Custom encoding check box and, from the Encoding drop-down list, select the option that corresponds to the delimiter encoding of the source data to be read, for example UTF-8.
  3. Click the Advanced settings tab to open its view and select the Activate advanced decoder check box.

This tHDFSInput component can now read data that uses mixed encoding.

For further information about the other parameters of tHDFSInput, see tHDFSInput.

Capabilities and limitations

This feature requires one of the two encodings in the mix to be UTF-8.

The following table lists the encoding combinations that this feature has been tested to handle successfully and that the default decoder of tHDFSInput fails to read.

Principal encoding   Delimiter encoding   Default decoder   Advanced decoder   Comment
UTF-8                ISO-8859-15          Fails             Succeeds           The delimiter used is the thorn letter (þ).
ISO-8859-15          UTF-8                Fails             Succeeds           The delimiter used is the thorn letter (þ).
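The two tested combinations can be reproduced outside Talend with a short sketch. The approach below (split the raw bytes on the encoded delimiter, then decode each field with the principal encoding) is an illustrative assumption, not a description of how the advanced decoder of tHDFSInput is implemented. The function name `split_mixed` and the sample values are hypothetical.

```python
def split_mixed(record, principal_encoding, delimiter, delimiter_encoding):
    """Split raw bytes on the encoded delimiter, then decode each field."""
    raw_delimiter = delimiter.encode(delimiter_encoding)
    return [part.decode(principal_encoding) for part in record.split(raw_delimiter)]


# Hypothetical sample values, chosen to contain non-ASCII characters.
VALUES = ["café", "€5"]

# First table row: UTF-8 fields, delimiter þ encoded in ISO-8859-15 (byte 0xFE).
row_1 = "þ".encode("iso-8859-15").join(v.encode("utf-8") for v in VALUES)
fields_1 = split_mixed(row_1, "utf-8", "þ", "iso-8859-15")

# Second table row: ISO-8859-15 fields, delimiter þ encoded in UTF-8 (bytes 0xC3 0xBE).
row_2 = "þ".encode("utf-8").join(v.encode("iso-8859-15") for v in VALUES)
fields_2 = split_mixed(row_2, "iso-8859-15", "þ", "utf-8")

print(fields_1)
print(fields_2)
```

Note that in the second combination a field could legitimately contain the byte pair 0xC3 0xBE, which this naive split would mistake for a delimiter. The sketch ignores that case.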