Creating a custom matching algorithm - 7.0

Data matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

The tRecordMatching component enables you to use a user-defined matching algorithm for obtaining the results you need.

A custom matching algorithm is written manually and stored in a .jar file (Java archive). Talend provides an example .jar file that you can use as the basis for developing your own.

Procedure

  1. In Eclipse, check out the test.mydistance project from SVN.
  2. In this project, navigate to the Java class named MyDistance.java at: http://talendforge.org/svn/top/trunk/test.mydistance/src/main/java/org/talend/mydistance//.
  3. Open this file, which contains the following code:
    package org.talend.mydistance;
    
    import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;
    import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;
    
    /**
     * @author scorreia
     * 
     * Example of Matching distance.
     */
    public class MyDistance extends AbstractAttributeMatcher {
    
        /*
         * (non-Javadoc)
         * 
         * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()
         */
        @Override
        public AttributeMatcherType getMatchType() {
            // a custom implementation should return the type AttributeMatcherType.CUSTOM
            return AttributeMatcherType.CUSTOM;
        }
    
        /*
         * (non-Javadoc)
         * 
         * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchingWeight(java.lang.String,
         * java.lang.String)
         */
        @Override
        public double getWeight(String arg0, String arg1) {
            // Here goes the custom implementation of the matching distance between the two given strings.
            // the algorithm should return a value between 0 and 1.
    
            // in this example, we consider that 2 strings match if their first 4 characters are identical
            // the arguments are not null (the check for nullity is done by the caller)
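            final int max = 4;
            // compare only the positions that exist in both strings, so that
            // inputs shorter than max do not cause a StringIndexOutOfBoundsException
            final int limit = Math.min(max, Math.min(arg0.length(), arg1.length()));
            int nbIdenticalChar = limit;
            for (int c = 0; c < limit; c++) {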
                if (arg0.charAt(c) != arg1.charAt(c)) {
                    nbIdenticalChar = c;
                    break;
                }
            }
            return (max - nbIdenticalChar) / ((double) max);
        }
    
    }
  4. In this file, replace the default class name, MyDistance, with the name of the custom algorithm you are creating. The default name appears in the line: public class MyDistance extends AbstractAttributeMatcher. Because a public Java class must be declared in a file of the same name, rename the file accordingly.
  5. Replace the default algorithm in the file with the algorithm you need. The default algorithm reads as follows:
    final int max = 4;
    // compare only the positions that exist in both strings, so that
    // inputs shorter than max do not cause a StringIndexOutOfBoundsException
    final int limit = Math.min(max, Math.min(arg0.length(), arg1.length()));
    int nbIdenticalChar = limit;
    for (int c = 0; c < limit; c++) {
        if (arg0.charAt(c) != arg1.charAt(c)) {
            nbIdenticalChar = c;
            break;
        }
    }
    return (max - nbIdenticalChar) / ((double) max);
  6. Save your modifications.
  7. Using Eclipse, export the project as a new .jar file.
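
As an illustration of the kind of algorithm that could replace the default one, the following is a minimal sketch of a normalized edit-distance weight. It is not part of the Talend example project: the class name EditDistanceWeight and its methods are hypothetical, and the sketch is written as a stand-alone class so that it can be run without the Talend libraries. It assumes that a higher value means a closer match, which is the opposite direction from the default example above, so check which direction your Job expects before adopting it. In the real project, the body of weight would go inside getWeight(String, String).

```java
// Hypothetical stand-alone sketch; not part of the test.mydistance project.
// In the real class, the logic of weight() would live in getWeight(String, String).
final class EditDistanceWeight {

    private EditDistanceWeight() {
    }

    /**
     * Returns a value between 0 and 1: 1.0 for identical strings,
     * lower values as more single-character edits are needed.
     * Arguments are assumed to be non-null, as in the example class.
     */
    static double weight(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - levenshtein(a, b) / (double) longest;
    }

    /** Levenshtein distance computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(current[j - 1] + 1, previous[j] + 1);
                current[j] = Math.min(insertOrDelete, previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(weight("Talend", "Talend"));
        System.out.println(weight("kitten", "sitting"));
    }
}
```

Because the distance can never exceed the length of the longer string, the result always stays between 0 and 1, as the comments in the example class require.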

Results

The user-defined algorithm is now ready to be used by the tRecordMatching component.
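
Before exporting, it can help to confirm that an algorithm stays between 0 and 1 for a few sample inputs, including strings shorter than the compared prefix. The following is a minimal sketch of such a check. The class WeightRangeCheck and its helper methods are hypothetical and independent of the Talend libraries; prefixWeight is a copy of the example prefix logic, with the loop bounded by the shorter input so that short strings are handled safely.

```java
// Hypothetical stand-alone check; not part of the test.mydistance project.
final class WeightRangeCheck {

    private WeightRangeCheck() {
    }

    // Copy of the example prefix logic, with the loop bounded by the shorter input.
    static double prefixWeight(String arg0, String arg1) {
        final int max = 4;
        final int limit = Math.min(max, Math.min(arg0.length(), arg1.length()));
        int nbIdenticalChar = limit;
        for (int c = 0; c < limit; c++) {
            if (arg0.charAt(c) != arg1.charAt(c)) {
                nbIdenticalChar = c;
                break;
            }
        }
        return (max - nbIdenticalChar) / ((double) max);
    }

    // True when the value lies in the range the example class asks for.
    static boolean inUnitRange(double weight) {
        return weight >= 0.0 && weight <= 1.0;
    }

    public static void main(String[] args) {
        String[][] samples = { { "Talend", "Talent" }, { "ab", "abcd" }, { "", "x" }, { "data", "date" } };
        for (String[] pair : samples) {
            double w = prefixWeight(pair[0], pair[1]);
            String flag = inUnitRange(w) ? "" : "  (out of range)";
            System.out.println("'" + pair[0] + "' / '" + pair[1] + "' -> " + w + flag);
        }
    }
}
```

The same harness can be pointed at any replacement algorithm by swapping the body of prefixWeight.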