Extracting distinct values - 6.1

Talend Real-time Big Data Platform Studio User Guide

From the Profiling perspective of the studio, you can create a column analysis that counts the occurrences of each distinct value in a column and identifies the most frequent ones. After executing the column analysis, you can generate a ready-to-use Job that extracts the distinct values from the frequency table and writes them to an output file.

You can then use these distinct values as a reference data set for other data standardization processes.
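Outside the studio, the idea behind a frequency table and the extraction of its distinct values can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that the generated Job runs: the sample postal codes, the function names and the two-column output layout (postal_code, count) are all hypothetical.

```python
import csv
import io
from collections import Counter


def frequency_table(values):
    """Return (value, count) pairs for each distinct value, most frequent first."""
    return Counter(values).most_common()


def write_distinct_values(table, out):
    """Write the distinct values and their counts as CSV rows (assumed layout)."""
    writer = csv.writer(out)
    writer.writerow(["postal_code", "count"])
    for value, count in table:
        writer.writerow([value, count])


# Hypothetical sample standing in for the analyzed column.
postal_codes = ["75001", "69002", "75001", "13001", "69002", "75001"]

table = frequency_table(postal_codes)

# An in-memory buffer stands in for the output file of the Job.
buffer = io.StringIO()
write_distinct_values(table, buffer)
print(buffer.getvalue())
```

Each distinct postal code appears exactly once in the output, which is what makes the result usable as a reference data set.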

In the example below, a column analysis on the postal_code column of a MySQL database has already been created and executed in the Profiling perspective of the studio.

Prerequisites: You have already created and executed a column analysis that uses the Frequency Table indicator.

To generate a Job that extracts distinct values from a frequency table, do the following:

  1. In the analysis editor, right-click the Frequency Table indicator.

  2. Select Generate Job.

    The Integration perspective opens on the generated Job.

    The basic settings for the database component are already defined according to the database connection used in the column analysis.

    The basic settings for the tAggregateRow component are already defined to count the distinct values from the frequency table of the postal_code column.

  3. If required, use a different output component to write the distinct values to a different type of file or to a database.

  4. Save your Job and press F6 to execute it.

    The Job extracts the distinct values from the frequency table and writes them to the defined output file.

    You can then use this file as a reference file in your data quality Jobs. For instance, you can use the zip codes in the file when matching data on zip codes.

    For further information on the data quality components and Jobs, see the data quality chapter in the Talend Components Reference Guide.
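To illustrate how such a reference file might be used when matching data on zip codes, here is a minimal Python sketch. It is not a Talend component or Job: the file layout (a single postal_code column), the customer records and the function names are assumptions made for the example.

```python
import csv
import io


def load_reference(csv_file, column="postal_code"):
    """Load the distinct values of the reference file into a set."""
    reader = csv.DictReader(csv_file)
    return {row[column].strip() for row in reader if row[column].strip()}


def split_by_reference(records, reference, key="zip"):
    """Separate the records whose zip code is in the reference set from the others."""
    matched, rejected = [], []
    for record in records:
        if record[key].strip() in reference:
            matched.append(record)
        else:
            rejected.append(record)
    return matched, rejected


# Hypothetical reference file content, as extracted from a frequency table.
reference_csv = io.StringIO("postal_code\n75001\n69002\n13001\n")
reference = load_reference(reference_csv)

# Hypothetical records to check against the reference values.
customers = [
    {"name": "A", "zip": "75001"},
    {"name": "B", "zip": "7501"},
    {"name": "C", "zip": " 13001 "},
]

matched, rejected = split_by_reference(customers, reference)
print("matched:", [c["name"] for c in matched])
print("rejected:", [c["name"] for c in rejected])
```

Loading the reference values into a set keeps each lookup cheap, and trimming whitespace before the comparison avoids rejecting values that differ only by padding.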