How to Pivot a Data File

author
Dale Anderson
EnrichVersion
6.4
6.3
6.2
EnrichProdName
Talend Data Fabric
Talend Big Data Platform
Talend Real-Time Big Data Platform
Talend Big Data
Talend MDM Platform
Talend Data Integration
Talend Data Services Platform
Talend Data Management Platform
Talend ESB
task
Data Governance > Third-party systems > Logs and errors components (Integration)
Design and Development > Third-party systems > Logs and errors components (Integration)
Design and Development > Third-party systems > Custom code components (Integration)
Data Quality and Preparation > Third-party systems > Custom code components (Integration)
Data Quality and Preparation > Third-party systems > Logs and errors components (Integration)
Data Governance > Third-party systems > Custom code components (Integration)
EnrichPlatform
Talend Studio

How to Pivot a Data File

There are times when an input file needs to be pivoted on output to satisfy downstream requirements. As Talend does not supply a 'File Pivot' component, this article demonstrates a functional way to accomplish this using the tJavaFlex component.
Use Case

Let's examine a specific use case to show how this is done. Say you have an input file in which each row supplies the value of one column in a pivoted target table. In this example the input file has 3 columns and the desired output has 25 columns:

  • Key (obfuscated)
  • Column Number (boxed in red)
  • Value

Here is a sample:
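(The original article shows a screenshot of the input file at this point. The values below are invented purely for illustration, but follow the same shape: a key, a column number, and a value, with some column numbers absent from a grouping.)

    KEY       COLNUM  VALUE
    A1B2C3    1       Smith
    A1B2C3    2       John
    A1B2C3    4       555-1234
    D4E5F6    1       Jones
    D4E5F6    3       mjones@example.com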

Notice that not all the columns are available for each logical grouping in the pivoted output. This complicates the problem, yet the solution design below works regardless. The one requirement is that the input file be properly sorted on the 'key' and 'column number' columns, as it will be processed sequentially. The sort therefore becomes part of the solution design.

The following job design shows the solution. First, the input file is sorted and stored in a temporary file. If that subjob succeeds, the sorted file is processed. The tJavaFlex component wraps the data flow as records are read and checks the 'key' value to detect logical record breaks.
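As a rough sketch of that flow (only tJavaFlex and the 'logPROCESSOR' joblet are named in the article, so the reader, sorter, deduplication, and writer components below are assumptions):

    tFileInputDelimited --> tSortRow --> tFileOutputDelimited  (sorted temp file)
                |
           OnSubjobOk
                |
    tFileInputDelimited --> tJavaFlex --> tUniqRow --> tFileOutputDelimited  (pivoted file)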

Look carefully at the Java code in the 'Start', 'Main', and 'End' sections of the tJavaFlex component settings. This is where the magic happens.

The 'Start code' section outputs a message indicating the data flow processing has begun. An 'isFirstROW' flag is set to 'true' for controlling the logical record break at the beginning of the process. Two variables, 'rowCOUNT' and 'fieldCOUNT', are declared and initialized to capture how many records are processed from the input file and how many fields are found. These values are available for output messaging, which is useful for debugging. A series of string variables, including 'curKEY' and 'field1' through 'field25', are declared and initialized.
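A minimal sketch of what the 'Start code' section might contain (the variable names come from the article; the message text and exact declarations are illustrative):

    // Start code: runs once, before the first record is read
    System.out.println("Pivot processing started...");

    // Controls the logical record break on the very first row
    boolean isFirstROW = true;

    // Counters used in the output messaging for debugging
    int rowCOUNT   = 0;
    int fieldCOUNT = 0;

    // Current logical key plus one holder per pivoted column
    String curKEY = "";
    String field1 = null;
    String field2 = null;
    // ... declarations continue the same way ...
    String field25 = null;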

The 'Main code' section is the important part. First, 'rowCOUNT' is incremented. Then the 'curKEY' value is compared to the key of the current input record. When the values are NOT the same, and it is NOT the first row, the 'field1' through 'field25' values are assigned to the output schema. Otherwise, a subordinate test determines the record's column number and saves its value in the corresponding 'field1-25' variable. Notice that once the logical record break occurs, 'curKEY' takes the new value and all the 'field1-25' variables are reset to null.
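Here is a hedged sketch of that 'Main code' logic. The input/output row names ('row1', 'row2') and schema column names ('key', 'colnum', 'value') are assumptions, and only three of the 25 field branches are written out:

    // Main code: runs once for every input record
    rowCOUNT++;

    if (!isFirstROW && !curKEY.equals(row1.key)) {
        // Logical record break: hand the accumulated values to the output schema
        row2.key     = curKEY;
        row2.field1  = field1;
        row2.field2  = field2;
        // ... assignments continue through field25 ...
        row2.field25 = field25;

        // Start the next logical record: new key, all field holders back to null
        curKEY  = row1.key;
        field1  = null;
        field2  = null;
        // ... resets continue through field25 ...
        field25 = null;
    } else if (isFirstROW) {
        curKEY     = row1.key;
        isFirstROW = false;
    }

    // Subordinate test: save this record's value under its column number
    if (row1.colnum == 1)       { field1  = row1.value; fieldCOUNT++; }
    else if (row1.colnum == 2)  { field2  = row1.value; fieldCOUNT++; }
    // ... branches continue through column 25 ...
    else if (row1.colnum == 25) { field25 = row1.value; fieldCOUNT++; }

Because tJavaFlex pushes an output row on every iteration of the Main code, the completed record is repeated between breaks; this is presumably why the job filters duplicates downstream as a precaution.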

The 'End code' section outputs a message indicating the data flow processing has completed.
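A sketch of the 'End code' section, reusing the counters declared in the Start code (the message text is illustrative; note that emitting the final accumulated logical record is not covered by the article's description, so a complete job would need to handle it):

    // End code: runs once, after the last record has been processed
    System.out.println("Pivot processing completed: "
        + rowCOUNT + " rows read, " + fieldCOUNT + " fields captured.");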

Once the fully pivoted record values are assigned to the output schema, the job pushes them downstream, where any duplicates are filtered out (as a precaution) and the result is written to a newly pivoted file.

There you go!

Notice also that specific error handling is incorporated into the job through the 'logPROCESSOR' joblet, which contains a tLogCatcher component that processes any exceptions encountered.

Conclusion

While other methods can surely be devised to pivot input data, this one is straightforward and requires minimal code. Thanks, 'tJavaFlex'!