The Fuzziness of Fuzzy Matches

Most translators have probably heard of fuzzy matches (matches found in a translator's translation memory that are similar, but not identical, to the source segment [1] [2]); however, few understand exactly how a fuzzy match is calculated. That is for good reason: CAT tool vendors seem to be protective of their calculations.

Today's post will look at some of the factors that might come into play when calculating a fuzzy match. Although I can't give you the answer to how other tools calculate their fuzzy matches, hopefully I can shed some light on the different possible ways a fuzzy match could be calculated.

A little background

With your favorite CAT or TEnT tool you can utilize fuzzy matches to get more leverage from your translation memory. If your tool only showed you exact matches (aka 100% matches) you would be missing out.

Take for example the following sentences:

John went to the store. (source segment)
Bob went to the store. (translation memory segment)

As you can see, except for the name (John vs. Bob), the sentences are the same. By leveraging fuzzy matching in this example, a translator would only have to translate one word instead of the whole sentence.

Types of matches

Fuzzy Match (1% - 99%)

A fuzzy match can be any match between 1% and 99%. What exactly does the percentage mean? That is the million dollar question that I'll explore in more detail below.

Example:

John went to the store. (source segment)
Bob went to the store. (translation memory segment)

Exact Match (100%)

An exact match (aka 100% match) is when your source segment matches word for word a segment in your translation memory.

Example:

John went to the store. (source segment)
John went to the store. (translation memory segment)

Context Match (101%)

A context match (aka in-context match, 101% match) is when your segment is an exact match to a segment in your translation memory, including its context (i.e. the preceding and following segments are also exact matches).

Note: some tools may only look at the preceding segment when determining a context match.

Example:

It was a rainy day. John went to the store. He needed to buy some eggs. (source segment)
It was a rainy day. John went to the store. He needed to buy some eggs. (translation memory segment)
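
The difference between an exact match and a context match can be sketched in a few lines of Python. This is only an illustration, not how any particular tool works: the function name match_level, the check_following switch, and the idea of storing each translation memory entry as a (previous, text, following) tuple are all assumptions made for the example.

```python
def match_level(prev_seg, seg, next_seg, tm_units, check_following=True):
    """Return 101 for a context match, 100 for an exact match, else None.

    tm_units is a list of (previous, text, following) tuples, a hypothetical
    layout for translation memory entries that remember their neighbors.
    """
    level = None
    for tm_prev, tm_text, tm_next in tm_units:
        if tm_text != seg:
            continue  # not even an exact match on the segment itself
        level = 100
        # Some tools may only compare the preceding segment.
        following_ok = (not check_following) or tm_next == next_seg
        if tm_prev == prev_seg and following_ok:
            return 101
    return level


tm = [("It was a rainy day.", "John went to the store.", "He needed to buy some eggs.")]
print(match_level("It was a rainy day.", "John went to the store.",
                  "He needed to buy some eggs.", tm))
```

A fuzzy match score would only need to be calculated when this function returns None.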

Other terms you may come across

Fuzzy Match Threshold

As fuzzy matches with a low percentage are usually not particularly useful for a translator, CAT and TEnT tools typically allow you to set a fuzzy match threshold. For example, if you set your fuzzy match threshold at 75% the tool will not show you any matches that are below 75%.
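
As a minimal sketch, applying a threshold is just a filter over scored candidates. The function name apply_threshold and the (segment, score) pair layout are assumptions for illustration; the scores in the example are made up.

```python
def apply_threshold(matches, threshold=75):
    """Keep matches scoring at or above the threshold, best first.

    matches is a list of (tm_segment, score_percent) pairs.
    """
    kept = [m for m in matches if m[1] >= threshold]
    return sorted(kept, key=lambda m: m[1], reverse=True)


# Illustrative scores only; they do not come from any real tool.
candidates = [
    ("Bob went to the store.", 80),
    ("Bob ran to a shop.", 40),
    ("John went to the store", 95),
]
print(apply_threshold(candidates, threshold=75))
```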

Internal Fuzzy Match

An internal fuzzy match is a fuzzy match that is found within the source document you are translating. So, for example, if the first sentence in the document you are translating is "John went to the store." and the last sentence is "John went to the store again." you have an internal fuzzy match. The fuzzy match isn't coming from your translation memory, but from the current document you are translating.
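
One way to picture this is to compare every segment in the document with the segments that come before it. The sketch below uses Python's difflib.SequenceMatcher as a stand-in similarity measure, which is an assumption on my part and not what any specific CAT tool uses; the function names and the 0.75 threshold are likewise only illustrative.

```python
import re
from difflib import SequenceMatcher


def word_tokens(text):
    """Lowercase words only; punctuation is dropped for this sketch."""
    return re.findall(r"\w+", text.lower())


def internal_fuzzy_matches(segments, threshold=0.75):
    """Return (earlier_index, later_index, ratio) for similar segment pairs.

    Pairs scoring exactly 1.0 are skipped because, at the word level,
    they are repetitions rather than fuzzy matches.
    """
    found = []
    for i, current in enumerate(segments):
        for j in range(i):
            ratio = SequenceMatcher(
                None, word_tokens(segments[j]), word_tokens(current)
            ).ratio()
            if threshold <= ratio < 1.0:
                found.append((j, i, ratio))
    return found


document = [
    "John went to the store.",
    "He needed to buy some eggs.",
    "John went to the store again.",
]
print(internal_fuzzy_matches(document))
```

Comparing every pair of segments grows quadratically with document length, so a real tool would likely need something smarter for long documents.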

Fuzzy Match Repair

Fuzzy match repair involves using machine translation to try and "repair" the portion of the segment from the translation memory that does not match the source segment. This is a more recent innovation that may not yet be available in all tools.

Factors that may influence a fuzzy match calculation

There is no set method or formula for how to calculate a fuzzy match. Although tool makers do not release their algorithms, a few simple tests suggest that they use different calculations: each tool produces a different fuzzy match score for the same test example.

So what are the factors that influence these variations in fuzzy match scores?

Common factors that influence fuzzy match score calculations include:

Word order
Punctuation
Stop words
Partial substring matches
Formatting and tags
Matches longer than the source segment

Word order

When determining the fuzzy match percentage it is important to take word order into account. Take for example the two segments below [1]. If word order is ignored, these segments would be an exact match (each segment has the same 4 words and one question mark). However, this clearly should not be an exact match, because the segments have very different meanings.

Will you do it?
Do you will it?
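
To make the point concrete, here is a small sketch that scores the two segments above in two ways: once as an unordered bag of words and once as an ordered sequence. Both scoring functions are my own illustrations, and difflib.SequenceMatcher is used only as a convenient order-aware comparison, not because any tool is known to use it.

```python
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words_score(a, b):
    """Share of words in common, ignoring the order they appear in."""
    count_a, count_b = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((count_a & count_b).values())
    longest = max(sum(count_a.values()), sum(count_b.values()))
    return shared / longest if longest else 1.0


def ordered_score(a, b):
    """Similarity of the two word sequences, taking order into account."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


source = "Will you do it?"
tm_segment = "Do you will it?"
print(bag_of_words_score(source, tm_segment))  # order ignored
print(ordered_score(source, tm_segment))       # order respected
```

The bag-of-words score treats the two segments as identical, while the order-aware score is clearly lower.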

Punctuation

When calculating a fuzzy match, should punctuation be:

ignored?
weighted less than a regular word?
weighted the same as a regular word?

Punctuation is a difficult issue. Sometimes punctuation is not important in determining whether a translator can leverage a segment.

John went to the store. (source segment)
John went to the store (translation memory segment)

Other times punctuation can completely change the meaning of the segment, and thus ignoring the punctuation may have negative consequences. For example [1]:

Woman, without her man, is helpless. (source segment)
Woman! Without her, man is helpless! (translation memory segment)
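
The sketch below contrasts two of the options listed above: ignoring punctuation entirely, and weighting each punctuation mark the same as a word. The intermediate option (a reduced weight) is not shown. The tokenizer and the use of difflib.SequenceMatcher are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text, keep_punctuation=True):
    """Split into lowercase words, optionally keeping each punctuation mark as a token."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text.lower())


def score(a, b, keep_punctuation=True):
    """Order-aware similarity between the token sequences of a and b."""
    return SequenceMatcher(
        None, tokenize(a, keep_punctuation), tokenize(b, keep_punctuation)
    ).ratio()


source = "Woman, without her man, is helpless."
tm_segment = "Woman! Without her, man is helpless!"
print(score(source, tm_segment, keep_punctuation=False))  # punctuation ignored
print(score(source, tm_segment, keep_punctuation=True))   # punctuation counted
```

With punctuation ignored, the two segments look identical even though their meanings are opposite. With punctuation counted, the score drops, but so does the score for the harmless "John went to the store" example above.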

Stop words

Should stop words be given the same weight as other words, or should their weight be adjusted?

Take the following example [1]:

Description of the Service and Definitions. (source segment)
Ownership of the Services and Marks. (translation memory segment)

In the above example, if stop words are weighted the same as regular words, we might end up with a fuzzy match score of 65%: 3 of the 6 words are the same (of, the, and) and 1 word differs by one additional character (Service vs. Services). Counting that word as a match gives 4/6, or about 66% (rounding down), and subtracting a 1% penalty for the additional character yields 65%. This is only one potential way such a score could be calculated. If we ignore stop words, it quickly becomes obvious that the two segments are not a useful match: none of the remaining words (Description, Service, Definitions vs. Ownership, Services, Marks) match exactly.
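
Here is a rough sketch of the effect of stop words on the example above. It counts only exact word matches at the same position, so it treats Service and Services as different words and gives 50% instead of the 65% estimated above. The three-word stop list and the position-by-position comparison are simplifying assumptions, not a description of any real tool.

```python
STOP_WORDS = {"of", "the", "and"}  # tiny illustrative list, an assumption


def words(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,!?").lower() for w in text.split()]


def positional_score(a, b, ignore_stop_words=False):
    """Fraction of positions where both segments have exactly the same word."""
    words_a, words_b = words(a), words(b)
    if ignore_stop_words:
        words_a = [w for w in words_a if w not in STOP_WORDS]
        words_b = [w for w in words_b if w not in STOP_WORDS]
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    same = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return same / longest


source = "Description of the Service and Definitions."
tm_segment = "Ownership of the Services and Marks."
print(positional_score(source, tm_segment))                          # 0.5
print(positional_score(source, tm_segment, ignore_stop_words=True))  # 0.0
```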

Partial substring matches

One method of calculating a fuzzy match score could be to think of each individual word as a unit that can have only two states - it is either a match or it is not a match.

Thus Service and Services would not be a match.

Another method would be to examine the words that are not a match more closely and see how similar they are (i.e. are they a true 0% match, or could there be a partial substring match?). This involves techniques similar to those used by spell checkers, such as the Levenshtein distance, which counts how many single-character edits (insertions, deletions or substitutions) it takes to change one word into the other.

In our example above it takes one insertion to change Service into Services. Thus, instead of treating words as a binary (match or not match) a potential fuzzy match algorithm could give a weight to words that show a partial substring match.
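
For the curious, the Levenshtein distance can be computed with a short dynamic-programming routine. The levenshtein function below is the standard textbook algorithm; the word_similarity helper that turns the distance into a 0 to 1 weight is just one possible normalization and is my own assumption.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_similarity(a, b):
    """Hypothetical partial-match weight: 1.0 for identical words, 0.0 for
    words that share nothing."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("Service", "Services"))      # 1
print(word_similarity("Service", "Services"))  # 0.875
```

Under this normalization, Service vs. Services would count as 0.875 of a matching word rather than 0.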

Formatting and tags

Many CAT and TEnT tools use the concept of tags to encode information about the format or structure of a word, phrase or segment.

Formatting and tags are another area where different fuzzy match algorithms might produce different results.

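Take for example the following two segments, whose text is identical but whose formatting differs (for instance, a word that is bold in one segment but not in the other). Should they be treated as a 100% match, or should there be a penalty applied as the formatting is not exactly the same?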

John went to that store. (source segment)
John went to that store. (translation memory segment)
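
One possible treatment, sketched below, is to compare the text with tags removed and then subtract a small penalty if the tags differ. The inline tag syntax (<b>...</b>), the function name and the 1% penalty are all assumptions for illustration; real tools have their own tag representations and penalties.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumes simple inline tags such as <b> and </b>


def tag_aware_score(source, tm_segment, tag_penalty=1):
    """Return 100 if text and tags agree, 100 minus the penalty if only the
    tags differ, or None if the text itself differs."""
    if TAG.sub("", source) != TAG.sub("", tm_segment):
        return None  # the text differs, so a fuzzy score would be needed
    if TAG.findall(source) == TAG.findall(tm_segment):
        return 100
    return 100 - tag_penalty


print(tag_aware_score("John went to <b>that</b> store.", "John went to that store."))
```

Setting tag_penalty to 0 corresponds to treating the pair as a 100% match.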

Matches longer than the source segment

Inevitably there will be cases where the source segment is an exact match to a substring of a segment in the translation memory. For example:

John went to the store. (source segment)
It was Friday and Betsy needed some eggs so John went to the store and picked up a dozen for her. (translation memory segment)

How should the above be treated? Although there is an exact match within the translation memory segment, there is still work that the translator needs to do to extract that information. Should this be a 100% match (probably not), a 99% match (one may argue), or something else?
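
As one illustration of the "something else" option, a score could reflect how much of the translation memory segment the source actually covers. The containment_score function below is purely hypothetical: it compares character lengths after stripping final punctuation, which is only one of many ways this could be weighted.

```python
def containment_score(source, tm_segment):
    """Share of the TM segment covered by the source, or 0.0 if the source
    does not appear inside the TM segment at all."""
    needle = source.rstrip(".!?").lower()
    haystack = tm_segment.rstrip(".!?").lower()
    if not needle or needle not in haystack:
        return 0.0
    return len(needle) / len(haystack)


source = "John went to the store."
tm_segment = ("It was Friday and Betsy needed some eggs so John went to the "
              "store and picked up a dozen for her.")
print(round(containment_score(source, tm_segment) * 100))
```

A score like this stays well below 100% for the example above, which reflects the extra work the translator has to do to extract the relevant part.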

Other difficulties and trade-offs

Performance

When you are working in your favorite CAT or TEnT tool, the performance (speed) of that tool is very important. Especially as the size of your translation memory grows, the performance of a fuzzy match algorithm can have a big impact on your workflow.

This is one thing to keep in mind when evaluating different options for fuzzy match algorithms. It might be worth forgoing some "accuracy" (i.e. is this a 67% or a 60% match) for speed.

Language differences

What works for English or Spanish or French may not work for Chinese or Japanese or Korean. A generic fuzzy match algorithm might have cost or performance benefits, while a solution tailored by language might be more accurate, but be more expensive to develop.

Bringing it all together

There are different factors to take into account when thinking about an ideal fuzzy match algorithm. Additionally, the ideal algorithm with respect to accuracy might not be the ideal algorithm when thinking about performance on today's hardware.

Over the coming weeks and months TM-Town will be building out more functionality to search and analyze your translation memories and termbases. Once that is implemented, I will follow up with a more detailed description of how TM-Town scores fuzzy matches. As with other parts of TM-Town (such as rates), I believe transparency is important to build trust within the community.

If you'd like to be notified of future blog posts, please sign up for TM-Town's mailing list. If you are a translator, give TM-Town a try today - it is free to register.
