Indexing clean and deduplicated data in Elasticsearch - Cloud - 8.0

Data matching with Talend tools

Version
Cloud
8.0
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-06

Before you begin

  • The Elasticsearch cluster and Elasticsearch-head must be running before you execute the Job.

    For more information about Elasticsearch-head, which is a plugin for browsing an Elasticsearch cluster, see https://mobz.github.io/elasticsearch-head/.
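
As a complement to Elasticsearch-head, the prerequisite above can be checked from a script. The following is a minimal sketch, not part of Talend Studio: it assumes the cluster is reachable over plain HTTP without authentication at a host:port value such as the one used later in the Nodes field, and it queries the Elasticsearch `_cluster/health` endpoint. The function names `health_url` and `cluster_status` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def health_url(node):
    """Build the cluster health URL from a host:port string.

    Surrounding double quotes are stripped so that a value copied
    from the Studio field (for example "localhost:9200") also works.
    Assumption: plain HTTP, no authentication.
    """
    host = node.strip().strip('"')
    return "http://" + host + "/_cluster/health"


def cluster_status(node, timeout=5.0):
    """Return the cluster status ('green', 'yellow' or 'red').

    Returns None when the cluster cannot be reached or the
    response is not valid JSON.
    """
    try:
        with urllib.request.urlopen(health_url(node), timeout=timeout) as resp:
            return json.load(resp).get("status")
    except (OSError, ValueError):
        # URLError (connection refused, DNS failure) is an OSError;
        # a malformed JSON body raises a ValueError subclass.
        return None
```

Calling `cluster_status("localhost:9200")` returns `None` if nothing is listening on that address, which suggests the cluster still needs to be started before executing the Job.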

Procedure

  1. Double-click the tMatchIndex component to open its Basic settings view and define its properties.
  2. In the Elasticsearch configuration area, go to the Nodes field and enter the location of the cluster hosting the Elasticsearch system to be used, for example:

    "localhost:9200"

  3. In the Index field, enter the name of the index to be created in Elasticsearch, for example:

    education-agencies-chicago

  4. If you need to clean the Elasticsearch index specified in the Index field, select the Reset index check box.
  5. In the Pairing model folder field, enter the path to the local folder from which you want to retrieve the pairing model files.
  6. Press F6 to save and execute the Job.
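
The index name entered in the Index field must be a name that Elasticsearch accepts. The sketch below is an illustrative check based on the general Elasticsearch index naming rules (lowercase only, no reserved characters, no leading `-`, `_` or `+`, at most 255 bytes). It is not something tMatchIndex runs, and the helper name `is_valid_index_name` is hypothetical.

```python
# Characters that Elasticsearch does not allow in index names
# (backslash, slash, *, ?, ", <, >, |, space, comma, #, colon).
FORBIDDEN = set('\\/*?"<>| ,#:')


def is_valid_index_name(name):
    """Return True if `name` follows the common Elasticsearch naming rules."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():
        # Index names must be lowercase.
        return False
    if name[0] in "-_+":
        # Index names cannot start with -, _ or +.
        return False
    if any(ch in FORBIDDEN for ch in name):
        return False
    # The limit is expressed in bytes, not characters.
    return len(name.encode("utf-8")) <= 255
```

With this check, `education-agencies-chicago` is accepted, while a name such as `Education Agencies` would be rejected because of the uppercase letters and the space.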

Results

tMatchIndex created the education-agencies-chicago index in Elasticsearch, populated it with the clean data, and computed the best suffixes based on the blocking key values.

You can browse the index created by tMatchIndex using the Elasticsearch-head plugin.
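
If you prefer a scripted check over browsing, the Elasticsearch `_count` endpoint reports how many documents an index holds. The following is a minimal sketch under the same assumptions as the example values in this procedure (a cluster at localhost:9200 over plain HTTP without authentication, and the education-agencies-chicago index). The function names `count_url` and `document_count` are hypothetical and are not part of any Talend component.

```python
import json
import urllib.error
import urllib.request


def count_url(node, index):
    """Build the _count URL for an index on a host:port node (plain HTTP assumed)."""
    return "http://" + node + "/" + index + "/_count"


def document_count(node, index, timeout=5.0):
    """Return the number of documents in the index.

    Returns None if the index does not exist (HTTP 404) or if the
    cluster cannot be reached. Other HTTP errors are re-raised.
    """
    try:
        with urllib.request.urlopen(count_url(node, index), timeout=timeout) as resp:
            return json.load(resp)["count"]
    except urllib.error.HTTPError as err:
        # HTTPError must be handled before the broader OSError.
        if err.code == 404:
            return None
        raise
    except OSError:
        # Connection refused, timeout, DNS failure, and so on.
        return None
```

A call such as `document_count("localhost:9200", "education-agencies-chicago")` should return a positive number once the Job has populated the index, and `None` if the index was never created.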

You can now use the indexed data as a reference data set for the tMatchIndexPredict component.

For an example of how to do continuous matching, see Doing continuous matching using tMatchIndexPredict.