Most translators have probably heard of fuzzy matches (matches from a translator's translation memory that are similar, but not identical, to the source segment); however, few understand exactly how a fuzzy match is calculated. That is for good reason: CAT tool vendors tend to be protective of their calculations.
Today's post will look at some of the factors that might come into play when calculating a fuzzy match. Although I can't give you the answer to how other tools calculate their fuzzy matches, hopefully I can shed some light on the different possible ways a fuzzy match could be calculated.
With your favorite CAT or TEnT tool you can utilize fuzzy matches to get more leverage from your translation memory. If your tool only showed you exact matches (aka 100% matches) you would be missing out.
As you can see, except for the name highlighted in orange, the sentences are the same. By leveraging fuzzy matching in this example, a translator would only have to translate one word instead of the whole sentence.
A fuzzy match can be any match between 1% and 99%. What exactly does the percentage mean? That is the million-dollar question, which I'll explore in more detail below.
An exact match (aka 100% match) is when your source segment matches word for word a segment in your translation memory.
Note: some tools may only look at the preceding segment when determining a context match.
As fuzzy matches with a low percentage are usually not particularly useful for a translator, CAT and TEnT tools typically allow you to set a fuzzy match threshold. For example, if you set your fuzzy match threshold at 75% the tool will not show you any matches that are below 75%.
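As a rough illustration of the threshold idea, here is a minimal Python sketch. The `Match` type, the `filter_matches` function, and the scores are all made up for this example; real tools compute the scores themselves and expose the threshold as a setting.

```python
from typing import NamedTuple


class Match(NamedTuple):
    tm_source: str  # source segment stored in the translation memory
    score: int      # fuzzy match score in percent (hypothetical values)


def filter_matches(matches, threshold=75):
    """Keep only matches at or above the threshold, best match first."""
    kept = [m for m in matches if m.score >= threshold]
    return sorted(kept, key=lambda m: m.score, reverse=True)


candidates = [
    Match("John went to the store.", 92),
    Match("John went home.", 60),
    Match("Mary went to the store.", 75),
]

for match in filter_matches(candidates):
    print(match.score, match.tm_source)
```

With the threshold at 75, the 60% match is hidden while the 75% and 92% matches are still shown.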
An internal fuzzy match is a fuzzy match that is found within the source document you are translating. So, for example, if the first sentence in the document you are translating is "John went to the store." and the last sentence is "John went to the store again." you have an internal fuzzy match. The fuzzy match isn't coming from your translation memory, but from the current document you are translating.
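To make the idea concrete, the sketch below compares every segment of a document against the segments that come before it. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption on my part and not how any particular CAT tool scores matches. The function name, the 0.75 threshold, and the middle sentence of the sample document are also just illustrative.

```python
from difflib import SequenceMatcher


def internal_fuzzy_matches(segments, threshold=0.75):
    """Return (earlier_index, later_index, ratio) for similar segment pairs."""
    found = []
    for j, later in enumerate(segments):
        for i in range(j):
            ratio = SequenceMatcher(None, segments[i], later).ratio()
            # A ratio of 1.0 means identical text (a repetition), not a fuzzy match.
            if threshold <= ratio < 1.0:
                found.append((i, j, ratio))
    return found


document = [
    "John went to the store.",
    "He bought some milk.",
    "John went to the store again.",
]

for earlier, later, ratio in internal_fuzzy_matches(document):
    print(f"segment {later} resembles segment {earlier} ({ratio:.0%})")
```

Here the last sentence is reported as an internal fuzzy match of the first one, without any translation memory being consulted.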
Fuzzy match repair involves using machine translation to try and "repair" the portion of the segment from the translation memory that does not match the source segment. This is a more recent innovation that may not yet be available in all tools.
There is no set method or formula for how to calculate a fuzzy match. Although tool makers do not release their algorithms, it is easy to see that they all use slightly different calculations just by trying out some tests and seeing that each tool produces a different fuzzy match score for the same test example.
So what are the factors that influence these variations in fuzzy match scores?
Should stop words be given the same weight as other words, or should their weight be adjusted?
In the above example, if stop words are weighted the same as regular words we might end up with a fuzzy match score of 65%: 3 of the 6 words are the same and 1 word has one additional character (Service vs. Services). Thus 4/6 ≈ 66% (rounding down), minus a 1% penalty for the additional character, is one potential way the above fuzzy match score could be calculated. If we ignore stop words, it quickly becomes more obvious that the above two segments are not a useful match.
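The effect of stop word weighting can be sketched in a few lines of Python. Everything here is hypothetical: the two segments are invented to have the same shape as the case described above (six words, three shared stop words, Service vs. Services), the stop word list is tiny, and the scoring simply compares words position by position. It also counts exact word matches only, so it gives no partial credit for Service vs. Services.

```python
STOP_WORDS = {"the", "at", "of", "a", "an", "to", "in"}  # tiny illustrative list


def weighted_score(source, tm_segment, stop_weight=1.0):
    """Position-by-position word overlap; stop words count as stop_weight.

    Assumes both segments have the same number of words (zip stops at the
    shorter one), which is enough for this illustration.
    """
    src = source.lower().split()
    tm = tm_segment.lower().split()
    total = matched = 0.0
    for s, t in zip(src, tm):
        weight = stop_weight if s in STOP_WORDS else 1.0
        total += weight
        if s == t:
            matched += weight
    return matched / total if total else 0.0


SOURCE = "Cancel the Service at the office"     # hypothetical segment
TM_SEGMENT = "Visit the Services at the beach"  # hypothetical segment

print(weighted_score(SOURCE, TM_SEGMENT))                   # 0.5
print(weighted_score(SOURCE, TM_SEGMENT, stop_weight=0.0))  # 0.0
```

With equal weights the shared stop words alone push the score to 50%; once they are ignored, nothing matches and the score drops to 0%.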
One method of calculating a fuzzy match score could be to think of each individual word as a unit that can have only two states - it is either a match or it is not a match.
Thus Service and Services would not be a match.
Another method would be to examine the words that are not a match more closely and see how close they might be (i.e. are they a true 0% match, or could there be a partial substring match). This involves techniques similar to those used by spell checkers, such as the Levenshtein distance, which counts how many single-character edits (insertions, deletions or substitutions) it takes to change one word into the other.
In our example above it takes one insertion to change Service into Services. Thus, instead of treating words as binary (match or no match), a potential fuzzy match algorithm could give a weight to words that show a partial substring match.
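For readers who want to see the Levenshtein distance in action, here is a minimal Python sketch of the standard dynamic-programming version. The `word_similarity` helper, which turns the distance into a value between 0 and 1, is my own illustrative normalization and only one of several reasonable choices.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def word_similarity(a, b):
    """1.0 for identical words, falling toward 0.0 as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Service", "Services"))      # 1
print(word_similarity("Service", "Services"))  # 0.875
```

A binary comparison would score Service vs. Services as 0, while this partial credit approach scores it at 0.875, which could then feed into the overall segment score.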
Many CAT and TEnT tools use the concept of tags to encode information about the format or structure of a word, phrase or segment.
Formatting and tags are another area where different fuzzy match algorithms might produce different results.
How should the above be treated? Although there is an exact match within the translation memory segment, there is still work that the translator needs to do to extract that information. Should this be a 100% match (probably not), a 99% match (one may argue), or something else?
When you are working in your favorite CAT or TEnT tool, the performance (speed) of that tool is very important. Especially as the size of your translation memory grows, the performance of a fuzzy match algorithm can have a big impact on your workflow.
This is one thing to keep in mind when evaluating different options for fuzzy match algorithms. It might be worth forgoing some "accuracy" (i.e. is this a 67% or a 60% match) for speed.
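One common way to trade a little work for a lot of speed is to run cheap checks first and only do the expensive comparison on candidates that pass them. The sketch below shows the idea with Python's standard `difflib.SequenceMatcher`, whose `real_quick_ratio()` and `quick_ratio()` methods are documented as fast upper bounds on `ratio()`. Using `difflib` as the scorer, the `search_tm` name, and the 0.75 threshold are assumptions for illustration, not a description of how any CAT tool works.

```python
from difflib import SequenceMatcher


def search_tm(source, tm_segments, threshold=0.75):
    """Return (ratio, segment) pairs at or above threshold, best first."""
    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches details about the second sequence, so set the
    # source once and swap in each translation memory segment as seq1.
    matcher.set_seq2(source)
    hits = []
    for candidate in tm_segments:
        matcher.set_seq1(candidate)
        # Cheap upper bounds: if even these are too low, skip the real work.
        if matcher.real_quick_ratio() < threshold:
            continue
        if matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()  # the expensive comparison
        if ratio >= threshold:
            hits.append((ratio, candidate))
    return sorted(hits, reverse=True)


translation_memory = [
    "John went to the store.",
    "Hi.",
    "Completely unrelated sentence about taxes.",
]

for ratio, segment in search_tm("John went to the store again.", translation_memory):
    print(f"{ratio:.0%} {segment}")
```

Segments that are far too short or too long are rejected by the length-based check alone, so the full comparison only runs on plausible candidates.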
What works for English or Spanish or French may not work for Chinese or Japanese or Korean. A generic fuzzy match algorithm might have cost or performance benefits, while a solution tailored by language might be more accurate, but be more expensive to develop.
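A small Python sketch shows why a one-size-fits-all approach can break down. Splitting on spaces works for English, but a Chinese sentence contains no spaces, so a word-based comparison sees each sentence as a single token. The character-based fallback below is a deliberate simplification; a tailored solution would more likely use a proper word segmenter for each language. The function names and sample sentences are my own.

```python
def tokens(segment, by_character=False):
    """Split on whitespace, or into single characters when requested."""
    return list(segment) if by_character else segment.split()


def overlap(a, b, by_character=False):
    """Share of positions where both segments have the same token."""
    ta, tb = tokens(a, by_character), tokens(b, by_character)
    same = sum(x == y for x, y in zip(ta, tb))
    return same / max(len(ta), len(tb), 1)


# English: whitespace tokens work reasonably well.
print(overlap("John went to the store", "Mary went to the store"))  # 0.8

# Chinese ("I went to the store" vs. "He went to the store"): no spaces,
# so each sentence is one token and the word-based score collapses to 0.
print(overlap("我去了商店", "他去了商店"))                      # 0.0
print(overlap("我去了商店", "他去了商店", by_character=True))   # 0.8
```

The two Chinese sentences differ by a single character, yet the whitespace-based comparison treats them as entirely different.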
There are different factors to take into account when thinking about an ideal fuzzy match algorithm. Additionally, the ideal algorithm with respect to accuracy might not be the ideal algorithm when thinking about performance on today's hardware.
Over the coming weeks and months TM-Town will be building out more functionality to search and analyze your translation memories and termbases. Once that is implemented, I will follow up with a more detailed description of how TM-Town scores fuzzy matches. As with other parts of TM-Town (such as rates), I believe transparency is important to build trust within the community.
If you'd like to be notified of future blog posts, please sign up for TM-Town's mailing list. If you are a translator, give TM-Town a try today - it is free to register. Finally, if you know of any other factors that impact fuzzy match calculations, please leave a comment below!