How to handle mixed encoding in MapReduce Jobs
This feature has been available in Talend MapReduce Jobs since version 5.6.2 of the Talend solutions with Big Data.
Use the tHDFSInput component to read source data containing mixed encoding.
In version 5.6.2 of Talend Studio, the Activate advanced decoder check box was added to the Advanced settings view of the MapReduce version of the tHDFSInput component to handle mixed encoding.
To properly use this feature, you need to proceed as follows.
The procedure described in this article explains the Activate advanced decoder feature only. Activating this feature alone does not make your MapReduce Job run successfully: you still need to properly design your Job and configure the connection to the Hadoop cluster to be used.
For further information about how to create a Talend MapReduce Job, see Designing MapReduce Jobs.
- Double-click tHDFSInput to open its Basic settings view.
- Select the Custom encoding check box, then from the Encoding drop-down list, select the option that corresponds to the delimiter encoding of the source data to be read, for example UTF-8.
- Click the Advanced settings tab to open its view and select the Activate advanced decoder check box.
This tHDFSInput component can now read data containing mixed encoding.
For further information about the other parameters of tHDFSInput, see tHDFSInput.
Capabilities and limitations
This feature requires one of the mixed encoding types to be UTF-8.
The following table presents the encoding combinations that this feature has been tested to handle successfully, and that the default decoder of tHDFSInput fails to handle.
|Principal encoding|Delimiter encoding|Default decoder|Advanced decoder|Comment|
|---|---|---|---|---|
|UTF-8|ISO-8859-15|Fails|Succeeds|The delimiter used is the thorn letter (þ).|
|ISO-8859-15|UTF-8|Fails|Succeeds|The delimiter used is the thorn letter (þ).|
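The problem these two combinations pose can be sketched in a few lines of Python. This is a minimal illustration of the concept, not Talend's actual decoder implementation: when fields and delimiter use different charsets, decoding the whole record with a single charset fails, while splitting on the raw delimiter bytes first and then decoding each field succeeds.

```python
# Minimal sketch of the mixed-encoding problem (NOT Talend's actual decoder).
fields = ["café", "naïve", "42"]

# Row 1: principal encoding UTF-8, delimiter thorn (þ) in ISO-8859-15.
delim_latin = "þ".encode("iso-8859-15")  # b'\xfe'
record1 = delim_latin.join(f.encode("utf-8") for f in fields)

try:
    record1.decode("utf-8")  # 0xFE can never occur in valid UTF-8
except UnicodeDecodeError:
    print("single-charset decode fails on record 1")

# Split on the raw delimiter bytes, then decode each field with the
# principal encoding.
print([p.decode("utf-8") for p in record1.split(delim_latin)])

# Row 2: principal encoding ISO-8859-15, delimiter thorn (þ) in UTF-8.
delim_utf8 = "þ".encode("utf-8")  # b'\xc3\xbe'
record2 = delim_utf8.join(f.encode("iso-8859-15") for f in fields)
print([p.decode("iso-8859-15") for p in record2.split(delim_utf8)])
```

This also suggests why one of the encodings plausibly has to be UTF-8: byte values such as 0xFE never appear inside valid UTF-8 text, so the delimiter can be located in the raw byte stream without ambiguity.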