Formatting social security numbers using Magic Fill - Cloud

Talend Cloud Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
Cloud
EnrichProdName
Talend Cloud
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

You can use the Magic Fill function to automatically format social security numbers, based on a pattern defined by examples.

In this example, you are working on a dataset containing customer information such as names, email addresses, but most importantly, social security numbers. The problem is that those social security numbers have been entered in the wrong format, without the hyphens. For example, a number such as 334-20-6259, has been entered as 334206259 in this dataset. Because the format is wrong, most of the data is interpreted as Bank Routing Transit Numbers by the application, and considered invalid on top of that.

To fix this mistake, you are going to use the Magic Fill function and apply the correct social security number format on your data, following a pattern based on a few examples you will input.

Procedure

  1. Click the header of the ssn column to select its content.
  2. In the functions panel, type Magic fill and click the result to display the options of the associated function.
  3. In the Input 1 field, enter one of the value from the ssn column that you would like to transform, 472989590 for example.
  4. In the Output 1 field, enter the same value, but with the correct format this time: 472-98-9590.
    For the function to work, you need to enter at least two complete examples of the transformation you want to apply. You can then add up to three other examples. Examples can either be taken from your dataset, or made up. The more examples you input, the more accurately the pattern will be identified by the function.
  5. Enter more before and after examples in the remaining fields:
    • 428990989 as Input 2 and 428-99-0989 as Output 2
    • 281066499 as Input 3 and 281-06-6499 as Output 3
    • 244356471 as Input 4 and 244-35-6471 as Output 4
    • 233262165 as Input 5 and 233-26-2165 as Output 5
  6. Click Submit.

Results

A new column is created, where the transformation defined by your examples, has been applied to the rest of the numbers. The social security numbers are now in the correct format and are automatically detected as such, as shown by the semantic type in the header of this new column. You can now delete the other column with the invalid data.