You can create a semantic type based on a regular expression in Talend Dictionary Service and add it to the list
of recognized data types in Talend Data Preparation.
In Talend Data Preparation, not every
type of data can currently be matched with one of the predefined semantic types. For
example, Italian social security numbers, also known as codice fiscale, are not
currently recognized.
Let's say that you work for an Italian company that deals only with Italian customers. In
this example, you need to clean some customer data, such as names, email addresses,
and social security numbers. By default, the semantic type of the column containing the
social security numbers is set to text. This is not
specific enough, so you want to create a new category to match this type
of data: a codice fiscale semantic type in this case.
You will create this new semantic type in Talend Dictionary Service, and it will be
automatically available in Talend Data Preparation so that your data can be
matched with a proper type.
Important: For security reasons, some regular expression features cannot be used,
in particular backreferences. For more information, see the RE2/J documentation.
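To make this restriction concrete, the following Python sketch shows what a backreference looks like. It uses Python's standard `re` module, which does accept backreferences, purely for illustration: it is not the engine used by Talend Dictionary Service, and the `DOUBLED_LETTER` pattern and helper function are hypothetical. An RE2-style engine would reject a pattern of this kind, whereas the codice fiscale pattern used in this procedure relies only on character classes and counted repetitions.

```python
import re

# Hypothetical pattern with a backreference: \1 must repeat whatever
# group 1 captured, so it finds a doubled letter such as "SS".
# RE2-style engines do not support this construct.
DOUBLED_LETTER = re.compile(r"([A-Z])\1")

# The pattern from this procedure: character classes and counted
# repetitions only, no backreference.
CODICE_FISCALE = re.compile(
    r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$"
)


def has_doubled_letter(value: str) -> bool:
    """Return True if the value contains two identical consecutive letters."""
    return DOUBLED_LETTER.search(value) is not None
```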
Procedure
-
Open the Semantic types view from the left panel of the
Talend Data Preparation
homepage and click Add semantic type.
-
In the Name field, enter codice
fiscale.
-
In the Description field, enter Italian
social security number.
-
In the Type drop-down list, select Regular
expression.
-
Keep the Use for validation switch activated.
Using a regular expression, a dictionary, or a compound type for validation
means that it is used to determine which values are considered valid or
invalid in a given column. The result of this validation is shown
in the quality bar of each column in your datasets.
In any case, regular expressions and dictionaries of values are used for data
discovery, which calculates the matching percentage between the reference
values and your data to define the semantic type of each column.
In this example, if you were to deactivate the switch, the regular expression
would only be used for data discovery, and no value would be considered
invalid.
-
In the Content drop-down list, select the type of
content that you want to validate, Any character in this
case.
This option helps optimize performance. Only the data that matches the
selected content type is validated. You can choose to validate only
Alphabetic or Numeric values
against a regular expression, but because Italian social security numbers
contain both letters and digits, you have to select Any character.
-
In the Validation pattern field, enter
^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$.
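This regular expression is designed to match the Italian codice fiscale,
which is a 16-character alphanumeric code. The pattern expects six letters,
two digits, one letter, two digits, one letter, three digits, and a final
letter, all in uppercase. Data that matches this
pattern in Talend Data Preparation will be
identified as codice fiscale.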
-
Click Save and publish to send the new semantic type to
the Talend Dictionary Service
server and make it available to the Talend Data Preparation users.
Clicking Save as draft stores the semantic type
in Talend Dictionary Service without
broadcasting it to the Talend web applications.
This allows you to choose when to make your semantic
types public.
The codice fiscale type is now available in the list
of semantic types with the status set as
Published.
The change in semantic types takes effect immediately in Talend Data Preparation for every
new dataset that you import. For existing datasets, you need to manually
change the column type or reimport the dataset.
-
Go back to your dataset containing the Italian social security numbers.
-
Click the menu icon in the codice_fiscale column header
and select .
The column type now matches the newly created category.
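Before publishing a pattern, it can be useful to try it on a few values outside the product. The following is a minimal sketch in Python, assuming the standard `re` module behaves like the RE2/J engine for this simple pattern. The sample values are made up for illustration, and the `is_valid` and `split_by_validity` helpers are hypothetical names that are not part of any Talend API. The split into valid and invalid values mirrors what the Use for validation switch controls.

```python
import re

# Validation pattern from the procedure above.
PATTERN = re.compile(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$")


def is_valid(value: str) -> bool:
    """Return True if the whole value matches the codice fiscale pattern."""
    return PATTERN.fullmatch(value) is not None


def split_by_validity(values):
    """Separate values into (valid, invalid) lists, keeping their order."""
    valid = [v for v in values if is_valid(v)]
    invalid = [v for v in values if not is_valid(v)]
    return valid, invalid


# Made-up sample values, for illustration only.
samples = [
    "RSSMRA85T10A562S",   # 16 characters in the expected layout
    "rssmra85t10a562s",   # lowercase letters are not accepted
    "RSSMRA85T10A562",    # one character short
    "12345678901",        # digits only
]

valid, invalid = split_by_validity(samples)
```

Note that the pattern only checks the layout of the code. It does not verify the final check character, so a well-formed but incorrect code would still be considered valid.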
Results
Your data is now matched with the codice_fiscale
semantic type, which
you manually created in Talend Dictionary Service. From now on, when
you import new datasets containing Italian social security numbers, the values will
automatically be matched with the proper type.
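The automatic matching relies on data discovery, which compares the values of a column with the reference pattern and calculates a matching percentage. The sketch below illustrates that idea in Python under stated assumptions: the `matching_percentage` function is a hypothetical helper, ignoring empty values is an assumption, and the actual computation performed by Talend Data Preparation may differ.

```python
import re

# Validation pattern of the codice fiscale semantic type.
PATTERN = re.compile(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$")


def matching_percentage(values) -> float:
    """Return the percentage of non-empty values that match the pattern.

    Assumption: empty values are ignored. A column with no non-empty
    values yields 0.0.
    """
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    matched = sum(1 for v in non_empty if PATTERN.fullmatch(v))
    return 100.0 * matched / len(non_empty)


# Made-up column content, for illustration only.
column = [
    "RSSMRA85T10A562S",
    "VRDLGU70A01F205X",
    "BNCGNN92C15H501Z",
    "unknown",
    "",
]

percentage = matching_percentage(column)
```

With this sample column, three of the four non-empty values match the pattern, so the computed percentage is 75.0.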