Many in the translation industry seem to think that sentence segmentation is a solved problem; however, I tend to disagree. While there are tools that can obtain high accuracy in specific languages and specific domains, there still does not exist a free, open-source tool that can handle many languages as well as handle ill-formatted content across any incoming domain.
On TM-Town, translators upload documents across many different fields of expertise and many languages. As such, it is important that TM-Town's segmentation engine be able to handle many different languages as well as deal with potentially ill-formatted content (e.g. text imported from PDFs often has line-breaks that fall in the middle of sentences). To deal with these issues I have worked on developing a new sentence segmentation engine.
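To make the preprocessing problem concrete, here is a minimal sketch, in Ruby, of one way to rejoin lines that a PDF export broke mid-sentence. This is purely illustrative and is not TM-Town's actual preprocessing logic; the heuristic is simply that a line not ending in sentence-final punctuation is assumed to continue on the next line.

```ruby
# Illustrative sketch (not TM-Town's actual implementation): rejoin lines
# that were broken mid-sentence, e.g. by PDF text extraction. A line that
# does not end in ".", "!", or "?" is assumed to continue on the next line.
def rejoin_broken_lines(text)
  text.split("\n").reduce("") do |acc, line|
    line = line.strip
    next acc if line.empty?
    if acc.empty?
      line
    elsif acc.end_with?(".", "!", "?")
      "#{acc}\n#{line}"   # previous line ended a sentence: keep the break
    else
      "#{acc} #{line}"    # previous line broke mid-sentence: join with a space
    end
  end
end

pdf_text = "The agreement was signed in\nTokyo last year. It covers\nthree new markets."
puts rejoin_broken_lines(pdf_text)
# → The agreement was signed in Tokyo last year. It covers three new markets.
```

A real engine needs more care than this (hyphenated words, headings, lists), but the sketch shows why segmentation and input clean-up are intertwined.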
As TM-Town benefits from a lot of open-source technology, I have decided to open-source TM-Town's sentence segmentation library in the hope that it might benefit the community. TM-Town’s segmentation tool, called Pragmatic Segmenter, is a rule-based sentence boundary detection library written in Ruby.
The goal of Pragmatic Segmenter is to provide a "real-world" segmenter that works out of the box across many languages and does a reasonable job when the format and domain of the input text is unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.
According to Wikipedia, sentence boundary disambiguation (also known as sentence segmentation) is defined as:
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.
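To illustrate the ambiguity the definition describes, here is a naive period-based splitter in Ruby (purely illustrative, not any real tool's logic). It wrongly treats the period in the abbreviation "Mr." as a sentence boundary:

```ruby
# Naive splitter: break after ".", "!", or "?" followed by whitespace.
naive = ->(text) { text.split(/(?<=[.!?])\s+/) }

segments = naive.call("Mr. Smith paid $3.5 million. He was pleased.")
p segments
# → ["Mr.", "Smith paid $3.5 million.", "He was pleased."]
```

The correct answer is two sentences, not three. And a splitter that did not require trailing whitespace would also break inside the decimal "3.5" - each "fix" to a naive rule tends to expose another ambiguity.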
Translation memory data is typically stored and exchanged at the segment level. Poor segmentation will at best give the translator extra work within their CAT tool of choice to fix the segmentation issues manually and at worst will reduce the usefulness of a translator’s translation memory.
Although segmentation is not a "sexy" problem, it is very important and forms the basis of many other NLP functions and tasks (e.g. machine translation, bitext alignment, named entity extraction, part-of-speech tagging, summarization, classification, etc.). As segmentation is often the first step in these NLP tasks, poor accuracy in segmentation can lead to poor end results.
TM-Town’s sentence segmenter uses a rule-based method designed to work out of the box across many languages. Users of the library do not need to create any rules manually; the rules are already built in. Additionally, TM-Town's library does not require any training data, unlike libraries that use machine-learning methods, which typically do.
To evaluate segmenters, I created a set of distinct edge cases to compare segmentation tools on. As most segmentation tools achieve very high accuracy on ordinary text, what is really important to test, in my opinion, is how a segmenter handles the edge cases - not whether it can segment 20,000 sentences that end with a regular word followed by a period. I have named these example tests the "Golden Rules". The list is by no means complete and will evolve and expand over time. To view the Golden Rules, visit TM-Town's Natural Language Processing resource page.
The Holy Grail of sentence segmentation appears to be Golden Rule #18, as no segmenter I tested was able to correctly segment that text. The difficulty is that an abbreviation (in this case a.m./A.M./p.m./P.M.) followed by a capitalized abbreviation (such as Mr., Mrs., etc.) or by a proper noun such as a name can be either a sentence boundary or a non-sentence boundary.
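The general rule-based strategy can be sketched in a few lines of Ruby. This is an illustrative toy, not Pragmatic Segmenter's actual rules: it protects the periods of a small list of known abbreviations with a placeholder character, splits on the remaining sentence-final punctuation, then restores the periods. Note that, by construction, this sketch can never get Golden Rule #18 right, because it always treats "p.m." as non-final:

```ruby
# Toy rule-based segmenter (not Pragmatic Segmenter's real rules):
# 1. replace periods inside known abbreviations with a placeholder,
# 2. split on the remaining sentence-final punctuation,
# 3. restore the original periods.
ABBREVIATIONS = %w[Mr. Mrs. Dr. U.S. a.m. p.m.].freeze
PLACEHOLDER   = "\u2024" # "one dot leader", unlikely to occur in real input

def segment(text)
  protected_text = ABBREVIATIONS.reduce(text) do |t, abbr|
    t.gsub(abbr, abbr.tr(".", PLACEHOLDER))
  end
  protected_text.split(/(?<=[.!?])\s+/).map { |s| s.tr(PLACEHOLDER, ".") }
end

p segment("Dr. Jones arrived at 4 p.m. sharp. She left at 6 p.m.")
# → ["Dr. Jones arrived at 4 p.m. sharp.", "She left at 6 p.m."]
```

Handling cases like "She left at 6 p.m. Mr. Jones stayed." requires rules that look past the abbreviation at the following token - which is exactly what makes Golden Rule #18 so hard.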
The results of how various popular segmentation tools fared against the Golden Rules test set can be found in the table below:
[Table: Golden Rules results by segmentation tool, including GRS (Other Languages)† and Speed‡ columns]
† GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above.
‡ Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.
In my opinion, sentence segmentation is not yet a solved problem: more work needs to be done to improve accuracy rates across all languages and to add support for even more languages. Hopefully TM-Town can help push the community forward on both fronts.
Give TM-Town's segmentation engine a try by adding some text below that you would like segmented. If you find any text that does not segment the way you would expect, please let me know by opening an issue. If you are a translator, be sure to try out TM-Town - it's free. You'll be able to see the results of the segmentation engine at work when you upload a document into TM-Town's system.