Analyzing discrete data - 6.1

Talend Real-time Big Data Platform Studio User Guide


This analysis enables you to analyze numerical data. It creates a column analysis in which indicators appropriate for numeric data are assigned to the column by default.

Discrete data can take only particular, separate values, even though the number of possible values may be infinite. Continuous data is the opposite: it is not restricted to defined separate values and can take any value within a continuous range.

This analysis uses the Bin Frequency Table indicator, which you must configure further to convert continuous data into discrete bins (ranges) according to your needs.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Defining the analysis

  1. In the DQ Repository tree view, expand Metadata and browse to the numerical column you want to analyze.

  2. Right-click the numerical column and select Column Analysis > Discrete data Analysis.

    In this example, you want to convert customer age into a number of discrete bins, that is, ranges of age values.

    The [New Analysis] wizard opens.

  3. In the Name field, enter a name for the analysis.


    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  4. Set the analysis metadata and click Finish.

    The analysis opens in the analysis editor and the Simple Statistics and the Bin Frequency Table indicators are automatically assigned to the numeric column.

  5. Double-click the Bin Frequency Table indicator to open the [Indicator settings] dialog box.

  6. Set the bins minimum and maximum values and the number of bins in the corresponding fields.

    If the number of bins is set to 0, no bin is created and the indicator computes the frequency of each value of the column.

  7. Select the Set ranges manually check box.

    Continuous numeric data is aggregated into discrete bins. Four ranges are listed in the table with a suggested bin size. The minimal value is the beginning of the first bin, and the maximal value is the end of the last bin. The size of each bin is determined by dividing the difference between the largest and the smallest values by the number of bins.

    You can modify these values to set the bin sizes manually. The value in the number of bins field is updated automatically with the new number of ranges.
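
The bin arithmetic described in the steps above can be sketched in a few lines of Python. This is only an illustration, not the studio's implementation: the function names and the sample ages are hypothetical, and the boundary rule (each bin includes its lower bound and excludes its upper bound, except the last bin, which includes the maximal value) is an assumption.

```python
from collections import Counter


def bin_edges(minimum, maximum, num_bins):
    """Return num_bins + 1 equally spaced edges from minimum to maximum."""
    # Bin size = (largest value - smallest value) / number of bins.
    width = (maximum - minimum) / num_bins
    return [minimum + i * width for i in range(num_bins + 1)]


def bin_frequencies(values, minimum, maximum, num_bins):
    """Count values per bin; with num_bins == 0, count each distinct value."""
    if num_bins == 0:
        # No bin is created: frequency of each individual value.
        return dict(Counter(values))
    edges = bin_edges(minimum, maximum, num_bins)
    width = (maximum - minimum) / num_bins
    counts = {(edges[i], edges[i + 1]): 0 for i in range(num_bins)}
    for v in values:
        if minimum <= v <= maximum:
            # Assumed rule: [low, high) bins, last bin also includes maximum.
            index = min(int((v - minimum) / width), num_bins - 1)
            counts[(edges[index], edges[index + 1])] += 1
    return counts


# Hypothetical sample data: customer ages split into four bins.
ages = [28, 31, 39, 40, 52, 63, 75]
for (low, high), count in bin_frequencies(ages, 28, 76, 4).items():
    print(f"[{low}, {high}): {count}")
```

With a minimum of 28, a maximum of 76 and four bins, each bin in this sketch is 12 units wide, so the first bin covers integer ages 28 to 39.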

Running the analysis and accessing the detail analysis results

  1. Run the analysis.

    Each bin acts as a container that summarizes numeric data for a specific range of values.

    A group of charts displays the results in the Graphics panel to the right of the analysis editor.

  2. Click the Analysis Results tab at the bottom of the analysis editor to open the corresponding view.

    The analysis creates age ranges with a limited, discrete set of possible values out of an unlimited, continuous range of age values.

  3. Right-click any data row in the result tables or in the charts (the first age range in this example) and select View rows to access a view of the analyzed data.

    The SQL Editor opens listing all customers whose age is between 28 and 39.
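
The statement that the studio generates for View rows is not reproduced here. As a rough sketch of the kind of range filter involved, the Python example below runs a comparable query against an in-memory SQLite database. The customer table, its columns, the sample rows and the `rows_in_bin` helper are all hypothetical, and the filter assumes that a bin includes its lower bound and excludes its upper bound.

```python
import sqlite3

# Hypothetical table and data, standing in for the analyzed column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Ana", 28), ("Ben", 39), ("Cy", 40), ("Dee", 55)],
)


def rows_in_bin(connection, low, high):
    """Return the rows whose age falls in the assumed range [low, high)."""
    cursor = connection.execute(
        "SELECT name, age FROM customer "
        "WHERE age >= ? AND age < ? ORDER BY age",
        (low, high),
    )
    return cursor.fetchall()


# Rows behind a first bin that starts at 28 and ends before 40.
first_bin_rows = rows_in_bin(conn, 28, 40)
print(first_bin_rows)
```

For integer ages, the range from 28 up to but not including 40 selects the same customers as "between 28 and 39".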