How to define the columns to be analyzed

Talend Data Management Platform Studio User Guide

Version: 6.2
Product: Talend Data Management Platform
Task: Data Quality and Preparation; Design and Development
Platform: Talend Studio

The first step in analyzing the content of a delimited file is to define the columns to be analyzed.

Prerequisite(s): At least one connection to a delimited file is set in the Profiling perspective of the studio. For further information, see Connecting to a file.

Defining the analysis

  1. In the DQ Repository tree view, expand the Data Profiling folder.

  2. Right-click the Analysis folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing Basic Column Analysis, select Basic Column Analysis and then click Next.

  4. In the Name field, enter a name for the current column analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. If required, set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.
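The note in step 4 warns that special characters in item names become "_" in the file system. The sketch below illustrates why that can produce duplicates. It is only an illustration, not the Studio's actual implementation: the helper name `to_file_system_name` and the sample names are hypothetical, and the character set covers only the plain (non-curly-quote) characters from the note.

```python
# Characters taken from the note in step 4 (curly-quote variants omitted).
SPECIAL_CHARS = set('~!`#^&*\\/?:;".()\'¥«»<>')


def to_file_system_name(item_name: str) -> str:
    """Hypothetical helper: replace each listed special character with "_"."""
    return "".join("_" if ch in SPECIAL_CHARS else ch for ch in item_name)


# Three distinct item names (hypothetical examples).
names = ["customer:age", "customer;age", "customer_age"]
file_names = [to_file_system_name(name) for name in names]

# All three collapse to the same file-system name, hence the duplicate risk.
for original, stored in zip(names, file_names):
    print(f"{original!r} -> {stored!r}")
```

Sticking to letters, digits and underscores in analysis names avoids this kind of collision.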

Selecting the file columns and setting sample data

  1. Expand FileDelimited connections and then browse to the columns you want to analyze.

    In this example, you want to analyze the id, first_name and age columns from the selected connection.

  2. Select the columns and then click Finish to close the wizard.

    A file for the newly created analysis is displayed under the Analyses node in the DQ Repository tree view, and the analysis editor opens with the analysis metadata.

  3. In the Data preview view, click Refresh Data.

    The data in the selected columns is displayed in the table.

    You can change your data source and your selected columns by using the New Connection and Select Columns buttons respectively.

  4. In the Limit field, set the number of data records you want to display in the table and use as sample data, for example 50 records.

  5. Select n first rows to list the first 50 records from the selected columns.

  6. In the Analyzed Columns view, use the arrows in the top right corner to open different pages in the view if you analyze a large number of columns.

    You can also drop the columns to analyze directly from the DQ Repository tree view to the analysis editor.

  7. Use the delete, move up or move down buttons to manage the analyzed columns.

  8. If required, right-click any of the listed columns and select Show in DQ Repository view to locate the selected column under the corresponding delimited file connection in the tree view.
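To get a feel for what the Limit field and the n first rows option yield, the sample can be approximated outside the studio. The following is a minimal sketch, assuming a semicolon-delimited file with a header row. The sample content, the `city` column and the `preview` helper are hypothetical, and only the id, first_name and age column names come from the example above.

```python
import csv
import io
from itertools import islice

# Hypothetical content standing in for the delimited file behind the connection.
SAMPLE = (
    "id;first_name;age;city\n"
    "1;Ana;34;Lyon\n"
    "2;Ben;27;Oslo\n"
    "3;Cho;41;Kyoto\n"
)


def preview(source, columns, limit, delimiter=";"):
    """Return the first `limit` records, keeping only the selected columns."""
    reader = csv.DictReader(source, delimiter=delimiter)
    return [{col: row[col] for col in columns} for row in islice(reader, limit)]


# Equivalent of selecting three columns and asking for the first 2 rows.
rows = preview(io.StringIO(SAMPLE), ["id", "first_name", "age"], limit=2)
for row in rows:
    print(row)
```

With a real file, an open file handle can be passed in place of the `io.StringIO` object, and `limit=50` matches the value used in step 4.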

When you analyze Date columns and run the analysis with the Java engine, the date information is stored in the studio and in the datamart as regular date/time values, in the format YYYY-MM-DD HH:mm:ss for date/timestamp and HH:mm:ss.SSS for time. The date and time formats are slightly different when you run the analysis with the SQL engine.
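As a rough illustration of the two Java-engine patterns, the sketch below renders a value in each format. The sample value is hypothetical, and the code only mirrors the patterns named above; it does not reproduce how the studio or the datamart stores the data.

```python
from datetime import datetime

# Hypothetical sample value; any datetime works.
value = datetime(2016, 3, 7, 14, 5, 9, 123000)

# YYYY-MM-DD HH:mm:ss, the pattern described for date/timestamp values.
date_text = value.strftime("%Y-%m-%d %H:%M:%S")

# HH:mm:ss.SSS, the pattern described for time values.
# %f yields six-digit microseconds, so the last three digits are dropped.
time_text = value.strftime("%H:%M:%S.%f")[:-3]

print(date_text)
print(time_text)
```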