Analyzing columns in a delimited file - 6.1

Talend Real-time Big Data Platform Studio User Guide

EnrichVersion
6.1
EnrichProdName
Talend Real-Time Big Data Platform
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

The sequence of profiling data in a delimited file involves the following steps:

  1. defining the columns to be analyzed.

    For more information, see How to define the columns to be analyzed.

  2. settings predefined system indicators for the defined columns.

    For more information, see How to set indicators on columns. For more information on indicator types and indicator management, see Indicators.

  3. setting patterns for the defined columns. For more information, see Patterns.

    You can also use Java user-defined indicators when analyzing columns in a delimited file on the condition that a Java user-defined indicator is already created. For further information, see How to define Java user-defined indicators.

The following sections provide a detail description on each of the preceding steps.

How to define the columns to be analyzed

The first step in analyzing the content of a delimited file is to define the columns to be analyzed.

Prerequisite(s): At least one connection to a delimited file is set in the Profiling perspective of the studio. For further information, see How to connect to a delimited file.

Defining the analysis

  1. In the DQ Repository tree view, expand the Data Profiling folder.

  2. Right-click the Analysis folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing Basic Column Analysis, select Basic Column Analysis and then click Next.

  4. In the Name field, enter a name for the current column analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. If required, set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

Selecting the file columns and setting sample data

  1. Expand FileDelimited connections and then browse to the columns you want to analyze.

    In this example, you want to analyze the id, first_name and age columns from the selected connection.

  2. Select the columns and then click Finish to close the wizard.

    A file for the newly created analysis is displayed under the Analyses node in the DQ Repository tree view, and the analysis editor opens with the analysis metadata.

  3. In the Data preview view, click Refresh Data.

    The data in the selected columns is displayed in the table.

    You can change your data source and your selected columns by using the New Connection and Select Data buttons respectively.

  4. In the Limit field, set the number for the data records you want to display in the table and use as sample data, 50 records for example.

  5. Select n first rows to list the first 50records from the selected columns.

  6. In the Analyzed Columns view, use the arrows in the top right corner to open different pages in the view if you analyze large number of columns..

    You can also drop the columns to analyze directly from the DQ Repository tree view to the analysis editor.

  7. Use the delete, move up or move down buttons to manage the analyzed columns.

  8. If required, right-click any of the listed columns and select Show in DQ Repository view to locate the selected column under the corresponding delimited file connection in the tree view.

When you select to analyze Date columns and run the analysis with the Java engine, the date information is stored in the studio and in the datamart as regular date/time of format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp and of format HH:mm:ss.SSS for time. The date and time formats are slightly different when you run the analysis with the SQL engine.

How to set system and user-defined indicators

The second step after defining the columns to be analyzed is to set statistics indicators for each of the defined columns.

Note

You can also use Java user-defined indicators when analyzing columns in a delimited file on the condition that a Java user-defined indicator is already created. For further information, see How to define Java user-defined indicators.

Prerequisite(s): An analysis of a delimited file is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed.

To set system indicators for the column(s) to be analyzed, do the following:

  1. Follow the procedure outlined in How to define the columns to be analyzed.

  2. From the Data preview view in the analysis editor, click Select indicators to open the [Indicator Selection] dialog box.

  3. Set the indicators using the [Indicator Selection] dialog box as outlined in How to set system or user-defined indicators.

    In this example, you want to set the Simple Statistics indicators on all columns, the Text Statistics indicators on the first_name column and the Soundex Frequency Table on the first_name column as well.

    Note

    You can set the text statistics indicators on a column only if its data mining type is set to nominal. Otherwise, these indicators will be grayed out in the [Indicator Selection] dialog box.

    The selected indicators are attached to the analyzed columns in the Analyzed Columns view.

  4. Save the analysis.

How to set options for system indicators

Prerequisite(s): An analysis of a delimited file is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed, How to set indicators on columns.

To set options for system indicators used on the columns to be analyzed, do the following:

  1. Follow the procedures outlined in How to define the columns to be analyzed and How to set indicators on columns.

  2. In the analysis editor, click Analyzed Columns to open the analyzed columns view.

  3. In the Analyzed Columns list, click the option icon next to the indicator to open the dialog box where you can set options for the given indicator.

    Note

    Indicators settings dialog boxes differ according to the parameters specific for each indicator. For more information about different indicator parameters, see Indicator parameters.

  4. Set the parameters for the given indicator.

  5. Click Finish to close the dialog box.

  6. Click the save icon on the toolbar of the analysis editor.

How to set regular expressions and finalize the analysis

You can add one or more regular expressions to one or more of the analyzed columns.

Prerequisite(s): An analysis of a delimited file is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed, How to set indicators on columns and How to set options for system indicators.

To set regular expressions to the analyzed columns, do the following:

  1. Define the regular expression you want to add to the analyzed column. For further information on creating regular expressions, see How to create a new regular expression or SQL pattern.

    In this example, the regular expression checks for all words that start with uppercase.

  2. Add the regular expression to the analyzed column in the open analysis editor, the first_name column in this example. For further information, see How to add a regular expression or an SQL pattern to a column analysis.

  3. Click the save icon on the toolbar of the analysis editor and then press F6 to execute the analysis.

    Note

    If the format of the file you are using has problems, you will have an error message to indicate which row causes the problem.

    The Graphics panel to the right of the analysis editor displays a group of graphic(s), each corresponding to one of the analyzed columns.

  4. If you analyze more than one column, navigate through the different pages in the Graphics panel using the toolbar on the upper-right corner in order to view the different graphics associated with all analyzed columns.

In order to view detail results of the analyzed columns, see How to access the detailed view of the analysis results.

How to access the detailed view of the file analysis

Prerequisite(s): An analysis of a delimited file is defined and executed in the Profiling perspective of the studio. For more information, see Analyzing columns in a delimited file.

To access a more detailed view of the analysis results, do the following:

  1. Click the Analysis Results tab at the bottom of the analysis editor to open the corresponding view.

  2. Click Analysis Result and then the name of the analyzed column for which you want to display the detailed results.

The detailed analysis results view shows the generated graphics for the analyzed columns accompanied with tables that detail the statistic results.

Below are the tables that accompany the statistics graphics in the Analysis Results view for the analyzed first_name column in the procedure outlined in Analyzing columns in a delimited file.