Most translators have probably heard of fuzzy matches (a similar, but not identical, match to the source segment found in a translator's translation memory); however, few understand exactly how a fuzzy match is calculated. This is for good reason, as CAT tool vendors seem to be quite protective of their calculations.
Today's post will look at some of the factors that might come into play when calculating a fuzzy match. Although I can't give you the answer to how other tools calculate their fuzzy matches, hopefully I can shed some light on the different possible ways a fuzzy match could be calculated.
With your favorite CAT or TEnT tool you can utilize fuzzy matches to get more leverage from your translation memory. If your tool only showed you exact matches (aka 100% matches) you would be missing out.
Take for example the following sentences:
As you can see, except for the name highlighted in orange, the sentences are the same. By leveraging fuzzy matching in this example, a translator would only have to translate one word instead of the whole sentence.
A fuzzy match can be any match between 1% and 99%. What exactly does the percentage mean? That is the million-dollar question that I'll explore in more detail below.
An exact match (aka 100% match) is when your source segment matches word for word a segment in your translation memory.
A context match (aka in-context match, 101% match) is when your segment is an exact match to a segment in your translation memory, including its context (i.e. the preceding and following segments are also exact matches).
Note: some tools may only look at the preceding segment when determining a context match.
As fuzzy matches with a low percentage are usually not particularly useful for a translator, CAT and TEnT tools typically allow you to set a fuzzy match threshold. For example, if you set your fuzzy match threshold at 75% the tool will not show you any matches that are below 75%.
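As a rough illustration of how a threshold behaves, here is a minimal Python sketch. The function name `filter_by_threshold` and the candidate scores are my own inventions for the example, not taken from any particular tool.

```python
def filter_by_threshold(matches, threshold=75):
    """Keep only matches scoring at or above the threshold, best first."""
    kept = [m for m in matches if m[1] >= threshold]
    return sorted(kept, key=lambda m: m[1], reverse=True)


# Hypothetical (segment, score) pairs; the scores are made up for illustration.
candidates = [
    ("John went to the store.", 92),
    ("John went home.", 58),
    ("Mary went to the store.", 80),
]

# With a 75% threshold the 58% match is hidden from the translator.
for segment, score in filter_by_threshold(candidates, 75):
    print(f"{score}%  {segment}")
```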
An internal fuzzy match is a fuzzy match that is found within the source document you are translating. So, for example, if the first sentence in the document you are translating is "John went to the store." and the last sentence is "John went to the store again." you have an internal fuzzy match. The fuzzy match isn't coming from your translation memory, but from the current document you are translating.
Fuzzy match repair involves using machine translation to try and "repair" the portion of the segment from the translation memory that does not match the source segment. This is a more recent innovation that may not yet be available in all tools.
There is no set method or formula for calculating a fuzzy match. Although tool makers do not release their algorithms, it is easy to see that they use slightly different calculations: run the same test example through several tools and each one produces a different fuzzy match score.
So what are the factors that influence these variations in fuzzy match scores?
Common factors that influence fuzzy match score calculations include:
When determining the fuzzy match percentage it is important to take into account the word order. Take for example the two segments below. If word order is ignored, these segments would be an exact match (each segment has the same 4 words and one question mark). However, it is clear that this should not be an exact match. The segments have very different meanings.
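To make the effect of word order concrete, here is a small Python sketch that scores the same pair two ways. The two segments are hypothetical stand-ins that I chose for the sketch (four identical words plus a question mark), and both scoring functions are illustrative assumptions, not any vendor's method. The ordered score uses `difflib.SequenceMatcher` from the standard library.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(segment):
    """Split into lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", segment.lower())


def bag_of_words_score(a, b):
    """Order-insensitive: shared tokens divided by the larger token count."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    shared = sum((ta & tb).values())
    return shared / max(sum(ta.values()), sum(tb.values()))


def ordered_score(a, b):
    """Order-sensitive: difflib's ratio over the token sequences."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


# Hypothetical segments: same tokens, different meaning.
a = "Did John call Mary?"
b = "Did Mary call John?"

print(bag_of_words_score(a, b))  # 1.0, i.e. a false "exact" match
print(ordered_score(a, b))       # below 1.0 once order is considered
```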
When calculating a fuzzy match, should punctuation be:
Punctuation is a difficult issue. Sometimes punctuation is not important in determining whether a translator can leverage a segment.
Other times punctuation can completely change the meaning of the segment, and thus ignoring the punctuation may have negative consequences. For example:
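The trade-off can be sketched in Python by scoring one pair with punctuation ignored and again with each punctuation mark treated as its own token. The segments and the `keep_punctuation` switch are assumptions I made up for the sketch; real tools may handle punctuation quite differently.

```python
import re
from difflib import SequenceMatcher


def tokens(segment, keep_punctuation):
    """Tokenize, optionally keeping each punctuation mark as a token."""
    pattern = r"[\w']+|[^\w\s]" if keep_punctuation else r"[\w']+"
    return re.findall(pattern, segment.lower())


def score(a, b, keep_punctuation):
    """Similarity ratio over the chosen token sequences."""
    ta = tokens(a, keep_punctuation)
    tb = tokens(b, keep_punctuation)
    return SequenceMatcher(None, ta, tb).ratio()


# Hypothetical pair where a single comma changes the meaning.
a = "Let's eat, Grandma!"
b = "Let's eat Grandma!"

print(score(a, b, keep_punctuation=False))  # 1.0: the comma is invisible
print(score(a, b, keep_punctuation=True))   # below 1.0: the comma counts
```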
Should stop words be given the same weight as other words, or should their weight be adjusted?
Take the following example:
In the above example, if stop words are weighted the same as regular words we might end up with a fuzzy match score of 65%. 3 of the 6 words are the same and 1 word has one additional character (Service vs. Services). Counting that near match as a match gives 4/6 ≈ 66%, and subtracting a 1% penalty for the additional character is one potential way to arrive at the above score. If we ignore stop words it quickly becomes obvious that the two segments are not a useful match.
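Here is a hedged Python sketch of stop word weighting. The two six-word segments are hypothetical ones I picked to fit the description (three shared stop words, plus Service vs. Services). The stop word list and the `stop_weight` parameter are also assumptions. The sketch compares words position by position and counts exact word matches only, so with equal weights it gives 50% rather than the 65% discussed above.

```python
# A tiny, hypothetical stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and"}


def weighted_score(a, b, stop_weight=1.0):
    """Position-by-position exact-word score with adjustable stop word weight.

    Assumes both segments have the same number of words; zip() silently
    ignores any extra words in the longer segment.
    """
    total = matched = 0.0
    for x, y in zip(a.lower().split(), b.lower().split()):
        weight = stop_weight if x in STOP_WORDS else 1.0
        total += weight
        if x == y:
            matched += weight
    return matched / total if total else 0.0


a = "Terms of Service for the website"
b = "Department of Services for the elderly"

print(weighted_score(a, b, stop_weight=1.0))  # 0.5: stop words count fully
print(weighted_score(a, b, stop_weight=0.0))  # 0.0: stop words ignored
```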
One method of calculating a fuzzy match score could be to think of each individual word as a unit that can have only two states - it is either a match or it is not a match.
Thus Service and Services would not be a match.
Another method would be to examine the words that are not a match more closely and see how similar they are (i.e. are they a true 0% match, or is there a partial substring match?). This involves techniques similar to those used by spell checkers, such as the Levenshtein distance, which counts how many single-character edits (insertions, deletions or substitutions) it takes to change one word into the other.
In our example above it takes one insertion to change Service into Services. Thus, instead of treating words as binary (match or not a match), a potential fuzzy match algorithm could give a weight to words that show a partial substring match.
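For readers who want to see the idea in code, here is a minimal Python sketch of the Levenshtein distance using the standard dynamic programming approach. The `word_similarity` helper, which divides the distance by the longer word's length, is one possible normalization that I am assuming for illustration. It is not a formula any tool is known to use.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_similarity(a, b):
    """Partial-match weight in [0, 1]; an assumed normalization."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("Service", "Services"))      # 1
print(word_similarity("Service", "Services"))  # 0.875
```

With a weight like this, Service vs. Services contributes 0.875 of a word to the score instead of 0.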
Many CAT and TEnT tools use the concept of tags to encode information about the format or structure of a word, phrase or segment.
Formatting and tags are another area where different fuzzy match algorithms might produce different results.
Take for example the following two segments. Should they be treated as a 100% match, or should there be a penalty applied as the formatting is not exactly the same?
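One way to picture a formatting penalty is the Python sketch below. It strips the tags, scores the remaining text, and then subtracts a small fixed penalty if the tag sequences differ. The HTML-style `<b>` tags, the example segments and the 1% `tag_penalty` are all hypothetical choices for the sketch. Real tools use their own tag representations and penalties.

```python
import re
from difflib import SequenceMatcher

TAG = re.compile(r"<[^>]+>")  # assumes simple angle-bracket inline tags


def split_tags(segment):
    """Return (text without tags, list of tags in order)."""
    tags = TAG.findall(segment)
    text = " ".join(TAG.sub("", segment).split())
    return text, tags


def tag_aware_score(a, b, tag_penalty=0.01):
    """Score the text word by word, then penalize differing tags."""
    text_a, tags_a = split_tags(a)
    text_b, tags_b = split_tags(b)
    score = SequenceMatcher(None, text_a.split(), text_b.split()).ratio()
    if tags_a != tags_b:
        score = max(0.0, score - tag_penalty)
    return score


a = "Click <b>Save</b> to continue."
b = "Click Save to continue."

print(tag_aware_score(a, b))                 # text matches, tags differ
print(tag_aware_score(a, b, tag_penalty=0))  # 1.0 if formatting is ignored
```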
Inevitably there will be cases where the source segment is an exact match to a substring of a segment in the translation memory. For example:
How should the above be treated? Although there is an exact match within the translation memory segment, there is still work that the translator needs to do to extract that information. Should this be a 100% match (probably not), a 99% match (one may argue), or something else?
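A tool would first need to detect this situation before deciding how to score it. The Python sketch below checks whether the source's words appear as one contiguous run inside the translation memory segment and reports how much of that segment the source covers. The example segments and the coverage ratio are my own assumptions. How that ratio should map to a match percentage is exactly the open question above.

```python
def find_sublist(needle, haystack):
    """Index where needle occurs as a contiguous run in haystack, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def containment(source, tm_segment):
    """Fraction of the TM segment covered by the source, or 0.0 if absent."""
    s = source.lower().split()
    t = tm_segment.lower().split()
    if not s or find_sublist(s, t) < 0:
        return 0.0
    return len(s) / len(t)


# Hypothetical segments, without punctuation to keep tokenizing simple.
print(containment("save the file", "Click OK to save the file"))  # 0.5
print(containment("open the file", "Click OK to save the file"))  # 0.0
```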
When you are working in your favorite CAT or TEnT tool, the performance (speed) of that tool is very important. Especially as the size of your translation memory grows, the performance of a fuzzy match algorithm can have a big impact on your workflow.
This is one thing to keep in mind when evaluating different options for fuzzy match algorithms. It might be worth forgoing some "accuracy" (i.e. is this a 67% or a 60% match) for speed.
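One common way to buy speed is to rule out hopeless candidates cheaply before running an expensive comparison. The Python sketch below assumes a score of the form 1 - (edit distance / longer length), computed on characters, which is my assumption and not any tool's documented formula. Because the edit distance can never be smaller than the difference in length, the ratio of the shorter length to the longer one is an upper bound on that score.

```python
def could_reach_threshold(source, candidate, threshold):
    """Cheap length check: can this candidate possibly score >= threshold?

    Assumes score = 1 - edit_distance / max_length, so the best possible
    score is min_length / max_length.
    """
    longest = max(len(source), len(candidate))
    if longest == 0:
        return True  # two empty segments are identical
    return min(len(source), len(candidate)) / longest >= threshold


source = "John went to the store."
memory = ["John.", "Mary went to the store.", "John went to the store again."]

# Only candidates that pass the length check need a full comparison.
shortlist = [seg for seg in memory if could_reach_threshold(source, seg, 0.75)]
print(shortlist)
```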
What works for English or Spanish or French may not work for Chinese or Japanese or Korean. A generic fuzzy match algorithm might have cost or performance benefits, while a solution tailored by language might be more accurate but more expensive to develop.
There are different factors to take into account when thinking about an ideal fuzzy match algorithm. Additionally, the ideal algorithm with respect to accuracy might not be the ideal algorithm when thinking about performance on today's hardware.
Over the coming weeks and months TM-Town will be building out more functionality to search and analyze your translation memories and termbases. Once that is implemented, I will follow up with a more detailed description of how TM-Town scores fuzzy matches. As with other parts of TM-Town (such as rates), I believe transparency is important to build trust within the community.
If you'd like to be notified of future blog posts, please sign up for TM-Town's mailing list. If you are a translator, give TM-Town a try today - it is free to register.