Testing Spark Jobs using test cases - 6.5

Talend Big Data Studio User Guide

The test framework described in Testing Jobs using test cases also applies to Spark Jobs. Use it during Continuous Integration development to make sure that a Spark Job functions as expected when it is actually executed on large datasets.

To create a Spark test case, follow the same steps detailed in Testing Jobs using test cases, but be aware that a different Test Skeleton is dedicated to Spark Jobs.

By default, a Spark Test Skeleton includes:

  • one or more tFixedFlowInput components (or tBoundedStreamInput for a Spark Streaming Job), depending on the number of input flows in the Job, to load the input file(s),

  • the read-only INPUT and OUTPUT icons that are used to indicate the beginning and the end of the part to be tested,

  • one or more tCollectAndCheck components, depending on the number of output flows in the Job, to compare the temporary output file(s) with the reference file(s). The test succeeds if each compared pair of files is identical, and fails otherwise.
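
The pass/fail rule applied by the comparison step can be pictured with a minimal sketch in plain Java. This is not the code that Talend Studio generates and it does not use tCollectAndCheck: the class name ReferenceCheck, the method sameContent and the sample rows are hypothetical. The sketch also assumes that row order is ignored, because Spark does not guarantee the order of rows across partitions; the actual component may compare files differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Talend: illustrates the
// "identical means success, anything else means failure" rule.
class ReferenceCheck {

    // Returns true when both lists hold exactly the same rows.
    // Assumption: row order is ignored, so both copies are sorted first.
    static boolean sameContent(List<String> actual, List<String> expected) {
        List<String> a = new ArrayList<>(actual);
        List<String> e = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(e);
        return a.equals(e);
    }

    public static void main(String[] args) {
        // Sample rows standing in for the reference file and the temporary output file.
        List<String> reference = Arrays.asList("1;alice", "2;bob");
        List<String> output = Arrays.asList("2;bob", "1;alice");

        System.out.println(sameContent(output, reference) ? "Test passed" : "Test failed");
    }
}
```

With the sample rows above, the two lists contain the same rows in a different order, so the sketch reports a passed test. A missing, extra or modified row in the output would make it report a failure.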

In addition, Local mode is used by default in the Spark configuration tab. Depending on the number of input and output flows, context variables are automatically created to specify the input and reference files. In the Basic settings tab of tFixedFlowInput or tBoundedStreamInput, a Use context variable radio button is available and automatically selected, so that you can choose which of these new context variables to use.

Before creating a test case for a Job, make sure that all the components of your Job have been configured.

For further information about Continuous Integration and how you can implement it with Talend, see the Software Development Life Cycle Best Practices Guide.