We’re nearing the end of this series on how to write a translation memory system. At this stage, you’ve probably gathered that they’re really not all that hard to write – at least, the concepts involved are simple enough. Sometimes, though, it’s possible to get caught up in implementation details. The components I’ve talked about previously were fairly straightforward – there are only so many ways you can implement a DocBook-to-XLIFF filter, for example. This time, I’m going to be talking about the TM lookup system, and that’s where feeping creaturism can really take over (in fact, that subject alone would be worth covering in a future log entry).
The TM lookup system
So, having built a segmenter, a set of file-type-specific segmenters, and a really efficient database schema, one of the final things you need to implement is something that can intelligently search your translation memory for matches against new documents presented for translation.
The essence of this is pretty simple: you’re just going to be doing a set of string lookups against a database. Take the input document that has been presented for translation and chop it into sentences using the same segmenter algorithm and file-type-specific segmentation you’ve written. Once you’ve done that, you can search for each of those source-language strings in your database. When you find exact or fuzzy matches in the database, simply add the localised version of each segment (also stored in the database, remember) to your output document – I’ll talk about the output document format in my final log entry in this series, since it could probably use some additional explanation.
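The overall flow can be sketched in a few lines of Python. Everything here is a hypothetical stand-in for the components discussed earlier in the series – `segment_document` is a naive illustrative splitter, not the real segmenter, and a plain dictionary stands in for the database:

```python
import re

def segment_document(text):
    # Stand-in for the real segmenter: naive split on sentence-ending
    # punctuation, purely for illustration.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def lookup(tm, segment):
    # Exact match only; a real system would fall back to fuzzy search.
    return tm.get(segment)

def translate_document(tm, source_text):
    # Segment the input, then look each segment up in the memory,
    # keeping the source text where no match is found.
    output = []
    for segment in segment_document(source_text):
        match = lookup(tm, segment)
        output.append(match if match is not None else segment)
    return output

tm = {"Hello world.": "Bonjour le monde."}
print(translate_document(tm, "Hello world. No match here."))
# -> ['Bonjour le monde.', 'No match here.']
```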
I’ve glossed over a few things there, though. What do I mean by “exact or fuzzy matches”? Well, while most string queries against a database look for an exact match of the input string, a TM system really comes into its own because it can find fuzzy matches from the strings in the database. That is, if your input string is:
I went to the shops yesterday to buy some eggs
you might expect to find the following matches:
I went to the shops yesterday to buy some eggs
I went to the shops yesterday to buy some butter
I went to the big shop yesterday to buy an egg
He goes to the shops often to buy some eggs
– you get the general idea. Exact matches, sometimes called 100% matches, are straightforward enough, but what exactly do you classify as a “fuzzy match”, and how should you determine what’s fuzzy and what’s not? It’s helpful to take a step back and think about why we’re looking for these matches from the system.
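One common way to put a number on fuzziness is edit distance – the minimum number of insertions, deletions, and substitutions needed to turn one string into the other – normalised by the longer string’s length. This is a generic technique for illustration, not necessarily what any particular TM product uses:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 1.0 is an exact (100%) match; lower scores are fuzzier.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

query = "I went to the shops yesterday to buy some eggs"
for candidate in ["I went to the shops yesterday to buy some eggs",
                  "I went to the shops yesterday to buy some butter",
                  "He goes to the shops often to buy some eggs"]:
    print(round(similarity(query, candidate), 2), candidate)
```

A real system would pick a cut-off score below which candidates aren’t shown to the translator at all.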
The ultimate aim is to make life easier for the translator. So what we’re trying to do is suggest translations of text that closely matches the input text, which in turn will need as few changes as possible to become correct translations of the queried string. A number of factors influence the “fuzziness” of a match – certainly how close the source-language text is, but formatting differences should also be taken into account. For example, based purely on string length, the strings:
this is a <b>sentence</b> so there.
this is a <a href="http://www.webster.com/cgi-bin/dictionary?va=sentence"> big sentence</a>, so there.
would be very different, but if you’ve done your job properly during the segmentation phase, you’ll recognise that the strings differ mostly in formatting, so perhaps you’d want to rank them as closer match candidates than string length alone would suggest.
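One simple way to treat markup as a “cheap” difference is to collapse inline tags to a neutral placeholder before measuring similarity, so a long `href` attribute can’t dominate the comparison. A minimal sketch – the tag-matching regex here is illustrative, not a robust HTML parser:

```python
import re

def normalise_markup(s):
    # Collapse any tag, whatever its attributes, to a single token.
    return re.sub(r'<[^>]+>', '<tag>', s)

a = 'this is a <b>sentence</b> so there.'
b = ('this is a <a href="http://www.webster.com/cgi-bin/'
     'dictionary?va=sentence"> big sentence</a>, so there.')

print(normalise_markup(a))
print(normalise_markup(b))
# After normalisation, the two strings differ by only a few words,
# so a length- or edit-distance-based score rates them much closer.
```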
Thankfully, in the last decade a new science has emerged that’s been looking very carefully at matters like these: bioinformatics. Where those folks are interested in finding strings of DNA sequences that match quite closely (but perhaps not exactly), we’re after almost exactly the same thing… My esteemed colleague JohnC has a copy of a book that I’ll point you to at some stage that explains the process quite well. I have to admit I haven’t learnt enough about this field yet, but I believe it could be really beneficial to folks writing applications like the ones I’ve been describing. Perhaps more people from translation tools companies should be attending conferences like SPIRE?
So, not looking a gift horse in the mouth, we’re applying a mixture of both techniques to give us pretty fast, accurate fuzzy search against a translation memory database. From these string queries, you can then narrow the search using the metadata you’ve stored in the database – perhaps you’re only interested in translations from a particular product group, or ones done in the last 10 months (this is where we start hearing cries of “feetch, feetch!” from the little monster on our shoulders, so maybe here’s a good time to stop for today.)
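That metadata narrowing is just an extra `WHERE` clause on the candidate query. A sketch against a hypothetical schema (the table and column names here are invented for illustration, not taken from the schema discussed earlier in the series):

```python
import sqlite3

# Hypothetical segments table with the metadata mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE segments (
    source TEXT, target TEXT, product_group TEXT, created_at TEXT)""")
conn.executemany("INSERT INTO segments VALUES (?, ?, ?, ?)", [
    ("Hello", "Bonjour", "office-suite", "2005-01-10"),
    ("Hello", "Salut",   "games",        "2001-06-01"),
])

# Only consider matches from one product group, translated recently.
rows = conn.execute("""
    SELECT source, target FROM segments
    WHERE product_group = ? AND created_at >= ?""",
    ("office-suite", "2004-01-01")).fetchall()
print(rows)  # -> [('Hello', 'Bonjour')]
```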
 the feature creature :-)