Working with JSON Embedded Arrays and Objects with Talend

Edward Ost
Products: Talend Data Fabric, Talend MDM Platform, Talend Data Integration, Talend Big Data, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend ESB, Talend Data Management Platform, Talend Data Services Platform

Working with JSON files that have embedded arrays or child objects can be tricky with Talend Data Integration or Talend Big Data solutions, because these tools are primarily designed for relational row/column style schemas, whereas JSON is oriented toward rich, nested documents. This example walks through how to use Java Maps and Lists, via the Jackson and MongoDB classes in Talend Studio, to handle the richer JSON document structure. It uses only a minimal amount of Java code.

When ingesting JSON documents, Talend can only loop over a single array, so even a modest document with one or two embedded arrays can be problematic. The approach shown here provides a workaround for such edge cases, where the document structure is simple but still contains a few arrays. For more complex structures, TDM (Talend Data Mapper) should be the first choice.

A sample job and the sample data file are attached in the Downloads tab.


This example uses Talend Studio with Big Data. In the Big Data Studio, the Jackson and MongoDB libraries are already included in the set of provided jars.

If you are using a Studio without Big Data, you need to find and download these jar files yourself.

Ingesting JSON Files

In the example below, customerType is an Object and opportunities is an embedded array.

{    "_id" : "551db6b46896ea0079f28a33",
     "customerId" : "c1000",
     "customerType" :{
        "code" : "var",
        "description" : "value added reseller"
     "createdDate" : "2015-04-02T21:37:56.058Z",
     "modifiedDate" : "2015-04-02T21:37:56.058Z",
     "activeFlag" : 1,    "opportunities" : [{
            "name" : "Acme",
            "value" : "7"
     {      "name" : "Coyote Rescue",
            "value" : "1"

When reading this with tFileInputJson, we can map the document to a flat schema. If we set a JsonPath loop query, we can loop over an array such as opportunities, but this results in multiple rows per JSON document, which is not always desirable. In the snapshot below, notice that a single input document produces two output records.
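
For reference, one plausible way to configure this looping read (a sketch only; the exact field labels, and whether the column queries are written relative to the loop element, vary by Studio version):

    Read By         : JsonPath
    Loop Json query : "$.opportunities[*]"
    Mapping         : name  -> "name"
                      value -> "value"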

An alternate approach is to embed the complex arrays or objects as strings in the flat data set, and then parse them into richer structures only where necessary. In the screenshot below we are using JsonPath rather than XPath. In this case we do not need the looping option, since we retrieve opportunities as a single string.
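
A plausible configuration for this string-embedding approach (again a sketch) is to loop on the document root and query each field absolutely, declaring customerType and opportunities as String columns in the schema:

    Read By         : JsonPath
    Loop Json query : "$"
    Mapping         : customerId    -> "$.customerId"
                      customerType  -> "$.customerType"
                      opportunities -> "$.opportunities"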

Note that because of the gap between XML and JSON, there is no easy mechanism for using an XPath query to retrieve an entire array, so a similar approach does not work with the XPath option. While the XPath approach does not work for an array, it does work for an Object.

Parsing Embedded JSON Fields

The next challenge is parsing the embedded JSON string. We could use tExtractJSONField for this, but that would not give us any capabilities beyond what we already had with tFileInputJSON at the ingest stage. Moreover, while tExtractJSONField works for embedded objects, it does not work for arrays. We do not want to create multiple records for the arrays; we just want a richer Java List object within our existing record.

For this we can use the Jackson JSON parsing library together with MongoDB classes. Jackson provides a very easy-to-use parser, and MongoDB provides utility classes for anonymous data binding to generic map structures. Since the MongoDB classes use JSON as their default serialization format, these complex objects serialize correctly, with no additional work, when passed to tFileOutputJSON.

The Jackson and MongoDB libraries need to be loaded via tLibraryLoad; they are already in the set of provided jars. For Talend 6.2.1 with Big Data, mongo-java-driver-3.2.1.jar can be used. For Jackson there are two jars: jackson-core-asl-1.9.13.jar and jackson-mapper-asl-1.9.13.jar.
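
One plausible layout (a sketch; depending on your Studio version you may instead list all jars in a single component) is one tLibraryLoad per jar in the pre-job:

    tLibraryLoad_1 : mongo-java-driver-3.2.1.jar
    tLibraryLoad_2 : jackson-core-asl-1.9.13.jar
    tLibraryLoad_3 : jackson-mapper-asl-1.9.13.jar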

With these libraries loaded, we can instantiate a Jackson ObjectMapper using tSetGlobalVar as part of the pre-job. The ObjectMapper can be re-used, so we only want to instantiate one instance for the whole job.
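
A minimal sketch of the tSetGlobalVar entry, assuming the Jackson 1.x (org.codehaus) packages that ship in the jars above; the key name "mapper" is our own choice, not fixed:

    Key   : "mapper"
    Value : new org.codehaus.jackson.map.ObjectMapper()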

tJavaFlex can then be used to parse each embedded JSON string into a rich object using the Jackson mapper, with the code below in its main code section.

// bind the embedded JSON array string to a MongoDB list (a java.util.List)
row17.opportunities = mapper.readValue(row9.opportunities, BasicDBList.class);
// bind the embedded JSON object string to a MongoDB object (a java.util.Map)
row17.customerType = mapper.readValue(row9.customerType, BasicDBObject.class);

In the screenshot below, notice that the input and output schemas of the tJavaFlex have changed. customerType was ingested as a String and is now an Object (realized by BasicDBObject). Likewise, opportunities was ingested as a String and is now a List (realized by BasicDBList).
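
In text form, the schema change looks roughly like this (row names match the code above; the other columns pass through unchanged):

    row9 (input)                    row17 (output)
    customerType  : String    ->    customerType  : Object
    opportunities : String    ->    opportunities : List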

It is worth noting that we created a local Jackson mapper variable in the start code section, so that we do not need to look it up in the globalMap for every row.
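
A minimal sketch of that start code, assuming the "mapper" global variable key chosen above (the cast is needed because globalMap stores plain Objects):

    // look up the shared ObjectMapper once, before the row loop starts
    org.codehaus.jackson.map.ObjectMapper mapper =
        (org.codehaus.jackson.map.ObjectMapper) globalMap.get("mapper");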


We have used the Jackson mapper to parse the JSON strings into the target MongoDB classes. BasicDBList implements the regular Java List interface, and BasicDBObject supports the regular Java Map interface, so we can now manipulate the JSON with the standard Java List and Map APIs respectively.
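
To illustrate outside of a job, here is a small standalone sketch. It assumes only that the two Jackson jars and the MongoDB driver listed above are on the classpath; the class and variable names are illustrative. Note that Jackson binds the objects nested inside the array to plain java.util.Map implementations, not BasicDBObject.

    import java.util.Map;

    import org.codehaus.jackson.map.ObjectMapper;

    import com.mongodb.BasicDBList;
    import com.mongodb.BasicDBObject;

    public class EmbeddedJsonDemo {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();

            // bind the embedded strings to MongoDB classes, as the tJavaFlex does
            BasicDBList opportunities = mapper.readValue(
                    "[{\"name\":\"Acme\",\"value\":\"7\"}," +
                    "{\"name\":\"Coyote Rescue\",\"value\":\"1\"}]",
                    BasicDBList.class);
            BasicDBObject customerType = mapper.readValue(
                    "{\"code\":\"var\",\"description\":\"value added reseller\"}",
                    BasicDBObject.class);

            // List access: iterate the array elements
            for (Object element : opportunities) {
                Map<?, ?> opportunity = (Map<?, ?>) element;
                System.out.println(opportunity.get("name") + " = "
                        + opportunity.get("value"));
            }

            // Map access: read and modify fields on the embedded object
            System.out.println(customerType.get("description"));
            customerType.put("code", "VAR");
        }
    }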

When it is time to persist the data set, we do not need to do anything special. We can pass it as is to a tFileOutputJSON component, since the MongoDB classes serialize to JSON by default.
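
For example (continuing the sketch above), the MongoDB classes render themselves as JSON text, which is why no explicit serialization step is needed:

    // toString() on BasicDBObject and BasicDBList emits the value as JSON
    System.out.println(customerType.toString());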