Setting up the Job

Procedure

1. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tMatchPredict, and tFileOutputDelimited.
2. Connect tFileInputDelimited to tMatchPredict using the Main link.
3. Connect tMatchPredict to tFileOutputDelimited using the Suspect duplicates link.
4. Check that you have defined the connection to the Spark cluster and activated checkpointing in the Run > Spark Configuration view, as described in "Computing suspect pairs and suspect sample from source data".
