The tRecordMatching component lets you use
a user-defined matching algorithm to obtain the results you need.
A custom matching algorithm is written manually and stored in a
.jar file (Java archive).
Talend provides an example .jar file that you can use as the basis
for developing your own.
Procedure
-
In Eclipse, check out the test.mydistance project
from svn at:
-
In this project, navigate to the Java class
named MyDistance.java: https://github.com/Talend/tdq-studio-se/tree/master/sample/test.mydistance/src/main/java/org/talend/mydistance.
-
Open this file, which contains the following code:
package org.talend.mydistance;

import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;
import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;

/**
 * @author scorreia
 *
 * Example of Matching distance.
 */
public class MyDistance extends AbstractAttributeMatcher {

    /*
     * (non-Javadoc)
     *
     * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()
     */
    @Override
    public AttributeMatcherType getMatchType() {
        // a custom implementation should return the type AttributeMatcherType.CUSTOM
        return AttributeMatcherType.CUSTOM;
    }

    /*
     * (non-Javadoc)
     *
     * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchingWeight(java.lang.String,
     * java.lang.String)
     */
    @Override
    public double getWeight(String arg0, String arg1) {
        // Here goes the custom implementation of the matching distance between the two given strings.
        // The algorithm should return a value between 0 and 1.
        // In this example, we consider that 2 strings match if their first 4 characters are identical.
        // The arguments are not null (the check for nullity is done by the caller).
        int MAX_CHAR = 4;
        final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));
        int nbIdenticalChar = 0;
        for (; nbIdenticalChar < max; nbIdenticalChar++) {
            if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {
                break;
            }
        }
        if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {
            MAX_CHAR = Math.max(arg0.length(), arg1.length());
        }
        return (nbIdenticalChar) / ((double) MAX_CHAR);
    }
}
-
In this file, replace the default class name with the name of the
custom algorithm you are creating. The default name is
MyDistance and it appears in the line:
public class MyDistance extends AbstractAttributeMatcher {
-
In the body of the getWeight method, replace the default algorithm
with the algorithm you need. The default
algorithm reads as follows:
int MAX_CHAR = 4;
final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));
int nbIdenticalChar = 0;
for (; nbIdenticalChar < max; nbIdenticalChar++) {
    if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {
        break;
    }
}
if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {
    MAX_CHAR = Math.max(arg0.length(), arg1.length());
}
return (nbIdenticalChar) / ((double) MAX_CHAR);
-
Save your modifications.
-
Using Eclipse, export the project as a new .jar file.
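Before replacing the default algorithm, it can help to see what values it returns. The following is a minimal standalone sketch that copies the default prefix logic into a plain class so that it can be run without the Talend libraries. The names PrefixWeightDemo and weight are hypothetical and exist only for this illustration; in the sample project the logic lives in the getWeight method.

```java
// Standalone copy of the default prefix logic, for experimentation only.
// PrefixWeightDemo and weight() are hypothetical names, not part of the Talend sample.
final class PrefixWeightDemo {

    // Same computation as the default getWeight body: share of the first
    // (up to) 4 characters that are identical in both strings.
    static double weight(String arg0, String arg1) {
        int maxChar = 4;
        final int max = Math.min(maxChar, Math.min(arg0.length(), arg1.length()));
        int nbIdenticalChar = 0;
        for (; nbIdenticalChar < max; nbIdenticalChar++) {
            if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {
                break;
            }
        }
        // When both strings are shorter than 4 characters, divide by the longer length instead.
        if (arg0.length() < maxChar && arg1.length() < maxChar) {
            maxChar = Math.max(arg0.length(), arg1.length());
        }
        return nbIdenticalChar / ((double) maxChar);
    }

    public static void main(String[] args) {
        System.out.println(weight("Talend", "Talent")); // first 4 characters identical: 1.0
        System.out.println(weight("Talend", "Tablet")); // 2 of 4 characters identical: 0.5
        System.out.println(weight("ab", "abcdef"));     // only one string is short: 2 / 4 = 0.5
        System.out.println(weight("abc", "abd"));       // both strings are short: 2 / 3
    }
}
```

Note that with this logic two empty strings would lead to a division of zero by zero, which yields NaN in Java. Whether the caller can ever pass empty strings is not stated here, so a custom algorithm may want to handle that case explicitly.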
Results
The user-defined algorithm is now ready to be used by the tRecordMatching component.
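As an illustration of the kind of logic that could replace the default algorithm, the sketch below computes a similarity based on the Levenshtein edit distance, scaled to a value between 0 and 1. This is not an algorithm supplied by Talend: the class name EditSimilaritySketch and its methods are hypothetical, and the choice to treat two empty strings as identical is an assumption. It is written as a plain class so that it runs on its own; in a real project the body of weight would go into getWeight of your class extending AbstractAttributeMatcher.

```java
// Hypothetical replacement logic: normalized Levenshtein similarity.
// Not part of the Talend sample; shown only as a sketch of a custom algorithm.
final class EditSimilaritySketch {

    // Returns 1.0 for identical strings and 0.0 when every character must change.
    static double weight(String a, String b) {
        final int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // assumption: two empty strings count as a perfect match
        }
        return 1.0 - levenshtein(a, b) / (double) longest;
    }

    // Classic dynamic-programming edit distance, keeping only two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(weight("kitten", "sitting")); // 3 edits over 7 characters
        System.out.println(weight("abc", "abc"));        // identical strings
    }
}
```

Unlike the default prefix comparison, this variant takes the whole string into account, so a difference near the end of a value lowers the weight as well.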