Working with JSON Embedded Arrays and Objects with Talend
When ingesting JSON documents, Talend can only handle one looping array, so even a modest document with one or two embedded arrays can be problematic. This article provides a workaround for those cases. It addresses edge cases where the document structure is mostly flat but still contains a few arrays. When more complex structures are involved, TDM (Talend Data Mapper) should be the first choice.
A sample job and the sample data file are attached in the Downloads tab.
Environment
Talend Studio with Big Data is used. In the Big Data Studio, the Jackson and MongoDB libraries are already included in the set of provided jars. If you are using a Studio without Big Data, you need to find and download these jar files yourself.
Ingesting JSON Files
In the example below, customerType is an embedded object and opportunities is an embedded array.
{
  "_id" : "551db6b46896ea0079f28a33",
  "customerId" : "c1000",
  "customerType" : {
    "code" : "var",
    "description" : "value added reseller"
  },
  "createdDate" : "2015-04-02T21:37:56.058Z",
  "modifiedDate" : "2015-04-02T21:37:56.058Z",
  "activeFlag" : 1,
  "opportunities" : [
    { "name" : "Acme", "value" : "7" },
    { "name" : "Coyote Rescue", "value" : "1" }
  ]
}
When reading this with tFileInputJSON, we can map the document to a flat schema. If we set a JsonPath loop query, we can loop over an array such as opportunities, but this produces multiple rows per JSON document, which is not always desirable. In the screenshot below, notice that a single input document yields two output records.
An alternative approach is to embed the complex arrays or objects as strings in the flat data set, and then parse them into richer structures later if necessary. In the screenshot below we are using JsonPath rather than XPath. In this case we do not need the looping option, since we retrieve opportunities as a single string.
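Outside of Talend, the idea can be sketched in plain Java with the Jackson 1.x tree API (the same library used later in this article); the class and variable names below are illustrative:

```java
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;

public class EmbedAsString {
    public static void main(String[] args) throws Exception {
        String doc = "{ \"customerId\" : \"c1000\", "
                   + "\"opportunities\" : [{ \"name\" : \"Acme\", \"value\" : \"7\" }] }";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(doc);

        // Flat columns keep their scalar values...
        String customerId = root.get("customerId").getTextValue();
        // ...while the embedded array is carried along as its raw JSON string.
        String opportunities = root.get("opportunities").toString();

        System.out.println(customerId);
        System.out.println(opportunities);
    }
}
```

This is what the JsonPath mapping achieves inside tFileInputJSON: one output row per document, with the array deferred as a string column.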
Note that because of the gap between XML and JSON, there is no easy mechanism for using XPath to query for the entire array, so a similar approach does not work with the XPath option. While the XPath approach does not work for an array, it does work for an object.
Parsing Embedded JSON Fields
The next challenge is parsing the embedded JSON string. We could use tExtractJSONField for this, but that would not give us any capabilities we did not already have with tFileInputJSON at the ingest stage. Moreover, while tExtractJSONField works for embedded objects, it does not work for arrays. We do not want to create multiple records for the arrays; we just want a richer Java List object within our existing record.
For this we can use the Jackson JSON parsing library together with the MongoDB classes. Jackson provides an easy-to-use parser, and MongoDB provides utility classes for anonymous data binding to generic map structures. Since the MongoDB classes use JSON as their default serialization format, these complex objects will serialize correctly without any additional work when passed to tFileOutputJSON.
The Jackson and MongoDB libraries need to be loaded via tLibraryLoad; they are already in the set of provided jars. For Talend 6.2.1 with Big Data, mongo-java-driver-3.2.1.jar can be used. For Jackson there are two jars: jackson-core-asl-1.9.13.jar and jackson-mapper-asl-1.9.13.jar.
With these libraries loaded we can instantiate a Jackson Mapper object using tSetGlobalVar as part of the pre-Job. The Jackson Object Mapper can be re-used, so we only want to instantiate one of these objects.
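The pre-Job step amounts to putting one shared ObjectMapper into globalMap. The sketch below mimics that with a plain HashMap standing in for Talend's globalMap; the key name "mapper" is our choice and any key works, as long as the downstream code uses the same one:

```java
import java.util.HashMap;
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class PreJobSketch {
    public static void main(String[] args) {
        // In a Talend job, globalMap is provided by the generated code;
        // we mimic it here with a plain Map.
        Map<String, Object> globalMap = new HashMap<String, Object>();

        // Equivalent of tSetGlobalVar with key "mapper".
        // The ObjectMapper is reusable, so one instance serves the whole job.
        globalMap.put("mapper", new ObjectMapper());

        // Later components retrieve the shared instance with a cast.
        ObjectMapper mapper = (ObjectMapper) globalMap.get("mapper");
        System.out.println(mapper != null);
    }
}
```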
tJavaFlex can then be used to parse the embedded JSON string into a rich object using the Jackson Mapper with the code below.
...
row17.opportunities = mapper.readValue(row9.opportunities, BasicDBList.class);
row17.customerType = mapper.readValue(row9.customerType, BasicDBObject.class);
In the screenshot below, notice that the input and output schemas of the tJavaFlex have changed. customerType was ingested as a String and is now an Object (realized by BasicDBObject). Likewise, opportunities was ingested as a String and is now a List (realized by BasicDBList).
It is worth noting that we created a local Jackson mapper variable in the start code section so that we would not need to look it up for every row.
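Putting those pieces together, the tJavaFlex sections might look like the fragment below. This is a sketch, not runnable on its own: globalMap and the row9/row17 connections come from the Talend-generated job, and the globalMap key "mapper" assumes the pre-Job step above.

```java
// tJavaFlex start code: runs once, before the first row.
// Look up the shared mapper a single time instead of per row.
org.codehaus.jackson.map.ObjectMapper mapper =
    (org.codehaus.jackson.map.ObjectMapper) globalMap.get("mapper");

// tJavaFlex main code: runs once per row.
// Pass the flat columns through, then parse the embedded JSON strings.
row17.customerId = row9.customerId;
row17.customerType = mapper.readValue(row9.customerType,
    com.mongodb.BasicDBObject.class);
row17.opportunities = mapper.readValue(row9.opportunities,
    com.mongodb.BasicDBList.class);
```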
Conclusion
We have used the Jackson mapper to parse the JSON strings into the target MongoDB classes. BasicDBList implements the regular Java List interface, and BasicDBObject supports the regular Java Map interface, so we can now manipulate the JSON with the Java List or Map APIs respectively.
When it is time to persist the data set, we do not need to do anything special. We can pass it as-is to a tFileOutputJSON component, since the MongoDB classes serialize to JSON.
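As a quick illustration outside of Talend, the following sketch (using the same jars listed above) parses the sample strings, reads them through the standard Map and List interfaces, and shows that toString() yields JSON again; the class and variable names are illustrative:

```java
import org.codehaus.jackson.map.ObjectMapper;
import com.mongodb.BasicDBList;
import com.mongodb.BasicDBObject;

public class ParseEmbedded {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // The embedded strings as ingested by tFileInputJSON.
        String customerTypeJson =
            "{ \"code\" : \"var\", \"description\" : \"value added reseller\" }";
        String opportunitiesJson =
            "[{ \"name\" : \"Acme\", \"value\" : \"7\" },"
          + " { \"name\" : \"Coyote Rescue\", \"value\" : \"1\" }]";

        // Anonymous data binding into the MongoDB utility classes.
        BasicDBObject customerType =
            mapper.readValue(customerTypeJson, BasicDBObject.class);
        BasicDBList opportunities =
            mapper.readValue(opportunitiesJson, BasicDBList.class);

        // BasicDBObject is a Map, BasicDBList is a List.
        System.out.println(customerType.get("code"));
        System.out.println(opportunities.size());

        // toString() serializes back to JSON, which is why
        // tFileOutputJSON needs no extra work downstream.
        System.out.println(customerType.toString());
    }
}
```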