When writing blogs, people say you should write about what you know. So, since I’ve spent quite a long time working on translation tools and figure I know something about them, I thought I’d do exactly that – I’m starting a series of posts on how to write a translation memory tool. Hope this is of interest to some people out there ?

When we were looking at all the other systems on the market, hoping to “buy, not build”, it was surprising how many of them fell short of our requirements. In particular, the market leaders at the time were focused on the small translation office: they all had client-side systems which didn’t scale to the size we were looking for (millions of segments, all stored in one large RDBMS). So, after a few false starts, we built our own.

The objective here is to come up with a system that allows you to reuse the translations that you’ve done in the past. Technical manuals and software messages are prime examples of the sorts of document we’re interested in, especially when a new version of that manual or software message file isn’t drastically different from the previous version.

We can break a TM system down into a few components :

1. a segmenter
2. file-type specific segmentation
3. a database schema
4. a TM search/lookup mechanism
5. output formats

So, here’s the first thing you need (I’ll cover the rest in separate entries sometime):

1. A segmenter

A segmenter is an algorithm that runs over blocks of text and chops them into segments, which you can then store in (or search for in) your translation memory database. We use a sentence-level segmenter: that is, each resulting segment is a sentence, but you could also choose to use paragraphs as your segments. The advantage of choosing sentences over paragraphs is that you can deal with a finer granularity of change. If the original author changes one sentence, ideally you should only have to retranslate one sentence. Of course, there’s a trade-off – perhaps by retranslating or rephrasing the other sentences in the paragraph that has changed, you could end up with a better translation. However, it’ll take longer to do, so if time is a constraint, this might not be the best option. I’m not a linguist, so I can’t really argue the finer points of this, but our translators seem to be quite happy with the sentence-level segmentation behaviour we use. One other thing to point out is that the segment size you choose has an impact on other parts of the system, particularly when choosing your search mechanism.
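To make the granularity point concrete, here is a minimal sketch in Java. It is not our actual system: the class name GranularitySketch, the sample sentences and the idea of treating the "TM" as a plain set of exact-match strings are all assumptions made for illustration. The input is segmented by hand, so no real segmenter is involved yet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GranularitySketch {
    // Returns the segments of the new version that have no exact match
    // among the segments of the old version (a stand-in for a real TM lookup).
    static List<String> needsTranslation(List<String> oldSegments, List<String> newSegments) {
        Set<String> tm = new HashSet<>(oldSegments);
        List<String> misses = new ArrayList<>();
        for (String segment : newSegments) {
            if (!tm.contains(segment)) {
                misses.add(segment);
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, already segmented by hand.
        List<String> oldSentences = Arrays.asList("Open the File menu.", "Choose Save.", "Click OK.");
        List<String> newSentences = Arrays.asList("Open the File menu.", "Choose Save As.", "Click OK.");

        // The same text treated as one paragraph-level segment.
        List<String> oldParagraph = Arrays.asList(String.join(" ", oldSentences));
        List<String> newParagraph = Arrays.asList(String.join(" ", newSentences));

        System.out.println("Sentence-level misses:  " + needsTranslation(oldSentences, newSentences));
        System.out.println("Paragraph-level misses: " + needsTranslation(oldParagraph, newParagraph));
    }
}
```

With sentence-level segments only the edited sentence misses the lookup; with a paragraph-level segment the whole paragraph misses, even though two of its three sentences are unchanged.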

The segmenter should be language specific, and should be able to detect the segment boundaries that come up in your input documents. For example, it should have no trouble segmenting the text :

He can be reached at
john.smith@sun.com. The class was written to be part of the java.util.Collections framework.
He watched "E!" channel.
There is a Jos. A. Banks
clothes store on Newbury St.
He worked at Smith Corp.
"U.S. News reports that he
was a model employee," said
the anchor.
She prefers the title MS. and
don't forget it!
He saw a 600-lb. gorilla.
He flew at 500 m./h. and
really high.
He had a Ph.D. from UCLA.
His name was R. J. Smith,
Esq., and don't forget it.
I bought fish, fruits, etc.
Do you want some?
Use a period ("\.") to
indicate end of sentence.
With $15.7 billion in annual
revenues, Sun can be found in
more than 170 countries and
on the World Wide Web at

From the above, you can see that we have to deal with numeric values, abbreviations, quoted text and all sorts of sentence breaks. It goes without saying that we don’t ever want to lose any of the incoming text. The next thing that you need when writing the segmenter is a way to have it ignore pieces of text that a previous text-processing component has identified as being special (for example, in the “… java.util.Collections …” section above, you may have spotted a Java class name that would normally trip up the segmenter, so you need to protect it somehow). All text that passes through your TM system, no matter what format it was in originally, will pass through this component, so it’s worthwhile spending time on getting it right.
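One way to protect such text, sketched below in Java, is to swap each protected span for a placeholder before segmenting and put the original back afterwards. This is only an illustration under stated assumptions, not a description of our implementation: the class ProtectedSpanSketch, the PROTECTED regex (standing in for the "previous text-processing component"), the private-use placeholder characters and the deliberately naive segmenter are all made up for the example. It also assumes the input never contains the characters U+E000 or U+E001.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProtectedSpanSketch {
    // Hypothetical detector: lowercase dotted packages followed by a class name,
    // e.g. java.util.Collections. A real system would get these spans from an
    // earlier processing step rather than from a regex.
    private static final Pattern PROTECTED =
            Pattern.compile("\\b[a-z]+(?:\\.[a-z]+)+\\.[A-Z][A-Za-z]*\\b");
    // Placeholder format: U+E000, the index of the saved span, U+E001.
    private static final Pattern PLACEHOLDER = Pattern.compile("\uE000(\\d+)\uE001");

    // Deliberately naive: ends a segment after '.', '!' or '?' plus trailing whitespace.
    static List<String> naiveSegment(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i++);
            if (c == '.' || c == '!' || c == '?') {
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                out.add(text.substring(start, i));
                start = i;
            }
        }
        if (start < text.length()) {
            out.add(text.substring(start)); // keep any trailing text: nothing is dropped
        }
        return out;
    }

    // Masks protected spans, segments the masked text, then restores the spans.
    static List<String> segmentProtected(String text) {
        List<String> saved = new ArrayList<>();
        StringBuilder masked = new StringBuilder();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            masked.append(text, last, m.start());
            masked.append('\uE000').append(saved.size()).append('\uE001');
            saved.add(m.group());
            last = m.end();
        }
        masked.append(text, last, text.length());

        List<String> out = new ArrayList<>();
        for (String segment : naiveSegment(masked.toString())) {
            out.add(restore(segment, saved));
        }
        return out;
    }

    private static String restore(String segment, List<String> saved) {
        Matcher m = PLACEHOLDER.matcher(segment);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String original = saved.get(Integer.parseInt(m.group(1)));
            m.appendReplacement(sb, Matcher.quoteReplacement(original));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The class is part of the java.util.Collections framework. It works.";
        System.out.println(naiveSegment(text));
        System.out.println(segmentProtected(text));
    }
}
```

Because every segment is a contiguous substring (whitespace included), concatenating the segments gives back the input exactly, which is a cheap check that no text has been lost.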

My example above is pretty difficult for most segmenters to deal with, especially the proper names (that elusive clothing store!). At this level of complexity, you’ll probably have to resort to full NLP techniques in order to parse the text and get the sentence boundaries correct. Is it worth that much effort? Not sure, but if I were writing one of these again, I might just look at something like this. However, typically when looking at computer documentation or software messages, you’re not dealing with plaintext directly. Input formats such as Docbook or HTML can give you very strong hints as to how the segmenter should behave, and that’s what I’ll deal with in the next part of this series. [ In particular, this is why the default sentence break java.text.BreakIterator isn’t enough ]
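If you want to see for yourself how the stock java.text.BreakIterator copes with text like the example above, a small harness along these lines will do. The class name BreakIteratorSketch and the sample string are just for illustration, and I haven’t listed the output because the exact boundaries depend on the JDK version and locale data you run it with.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {
    // Splits text at the boundaries reported by the JDK's sentence BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // A few lines taken from the sample text above.
        String sample = "He had a Ph.D. from UCLA. His name was R. J. Smith, "
                + "Esq., and don't forget it. He worked at Smith Corp.";
        for (String sentence : sentences(sample, Locale.US)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Whatever boundaries it picks, the segments are contiguous substrings of the input, so at least the "never lose text" requirement holds; whether abbreviations and initials are handled well enough is the part to inspect.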