This is the final part in my series on how to write a TM System. We’ve covered most of the major components now, but I haven’t yet talked about one of the most important ones.

The output format

Now that you’ve got a method of taking an input document, splitting it up into sentences, looking up each sentence in a database of translations and returning exact and fuzzy matches, the final question remains: what do you do with all of this data?

Well, you obviously need to present this data to translators in a way that makes it simple for them to review and complete the translations. Remember, our aim here is to make life as easy as possible for the translator. Suggesting translations and automatically re-using old ones is certainly a step in the right direction, but there’s one more thing we need to take care of. Consider the following:

click on image for a larger version

There’s a load of file formats represented there (can anyone name them all?), and a translator would need both to understand each of them and to have tools that can process them. I don’t think that’s the best use of a translator’s time. I think a translator should be able to concentrate on the text that’s being translated, rather than having to be proficient at Frame or Docbook, or to understand the nuances of every different XML format you throw at them. Having them edit the original file format means that they have to ensure that they don’t corrupt files, that the encoding is correct and that the file displays correctly (which may mean compiling software message files and building them into an application): a host of things that could perhaps be done better by people more experienced in the product being translated.

With this problem (and many others facing the translation industry) in mind, a small group of dedicated individuals from Sun, Oracle, Novell and some translation vendors came up with XLIFF, which has since been put under the umbrella of OASIS. XLIFF aims to solve the above problem of dealing with multiple formats by abstracting the translatable text from the formatting of the document. With a little bit of work, it’s possible to come up with something like this, which I’m sure you’ll agree is a little easier for the translator to deal with:

click on the image for a larger version
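
To give a rough idea of what sits behind a view like that, here is a minimal sketch in Python of writing segments and their suggested translations out as an XLIFF 1.2 document. The function name `build_xliff` and the shape of the `segments` list are my own assumptions for illustration, not part of any particular tool, and the sketch ignores inline formatting tags and match-quality metadata entirely.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(segments, original, datatype, source_lang, target_lang):
    """Serialise segments as a bare-bones XLIFF 1.2 document.

    segments is a list of (source_text, suggested_target_or_None) pairs;
    this layout is an assumption made for the sketch.
    """
    # Emit the XLIFF namespace as the default namespace (no prefix).
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,          # name of the file the text came from
        "datatype": datatype,          # e.g. "html" or "plaintext"
        "source-language": source_lang,
        "target-language": target_lang,
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for number, (source, target) in enumerate(segments, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit",
                             {"id": str(number)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = source
        # Only write a <target> when the TM lookup suggested something.
        if target is not None:
            ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = target
    return ET.tostring(root, encoding="unicode")
```

Calling `build_xliff([("Hello", "Bonjour"), ("Goodbye", None)], "index.html", "html", "en", "fr")` would give one trans-unit with a suggested target and one left empty for the translator to fill in.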

Of course, that’s only one side of the story. We also need to be able to convert XLIFF documents back to their original file format, and we’d like to be able to generate TMX files from the completed translations, so that we can import them into our translation memory database for use by other projects.
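
The TMX half of that could look something like the following sketch, which reads a completed XLIFF 1.2 file and emits one TMX translation unit per translated segment. The function name `xliff_to_tmx` and the header values are assumptions of mine; the sketch also assumes a single `<file>` element and plain-text segments, so inline tags would need extra handling.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialises as xml:lang


def xliff_to_tmx(xliff_text):
    """Turn the translated trans-units of an XLIFF 1.2 document into TMX 1.4."""
    root = ET.fromstring(xliff_text)
    file_el = root.find(f"{{{XLIFF_NS}}}file")  # assumes a single <file>
    src_lang = file_el.get("source-language")
    tgt_lang = file_el.get("target-language")
    if not src_lang or not tgt_lang:
        raise ValueError("XLIFF file must declare source and target languages")

    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "tm-sketch",       # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "xliff",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": file_el.get("datatype", "plaintext"),
    })
    body = ET.SubElement(tmx, "body")
    for unit in file_el.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        target = unit.find(f"{{{XLIFF_NS}}}target")
        # Skip segments the translator has not completed.
        if source is None or target is None or not (target.text or "").strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, element in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            # .text drops any inline markup; fine for plain-text segments only.
            ET.SubElement(tuv, "seg").text = element.text
    return ET.tostring(tmx, encoding="unicode")
```

Untranslated segments are skipped on purpose here, so that half-finished work doesn’t end up polluting the translation memory.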

That’s it! With this work, we can really increase the productivity of our translators and save time and effort when producing localised products.

What’s more, now that we have a central database of translations, there are all sorts of other interesting things we could do to increase efficiency even more. For example, perhaps we haven’t found an exact match for a segment. Given the large amount of data we’ve accumulated, wouldn’t it be nice if we could search inside that segment for terms that have been translated elsewhere, and suggest those terms to the translator via the translation editor?
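
As a very rough illustration of that term-suggestion idea, the sketch below checks a segment against a dictionary of known terms and returns the ones that occur in it. The `suggest_terms` function and the dictionary-shaped `term_base` are assumptions for the example; a real system would presumably query the database and would need to cope with inflected forms, which plain string matching does not.

```python
import re


def suggest_terms(segment, term_base):
    """Return (source_term, translation) pairs for terms found in segment.

    term_base is a plain dict mapping source terms to translations, a
    stand-in for whatever the translation database actually exposes.
    """
    suggestions = []
    # Try longer terms first so multi-word terms are listed before
    # the shorter terms they may contain.
    for term in sorted(term_base, key=len, reverse=True):
        # Match whole words only, ignoring case.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, segment, flags=re.IGNORECASE):
            suggestions.append((term, term_base[term]))
    return suggestions


terms = {
    "translation memory": "mémoire de traduction",
    "file": "fichier",
    "memo": "note",
}
found = suggest_terms("Import the file into the Translation Memory.", terms)
```

Here `found` contains the entries for “translation memory” and “file”, but not “memo”, because the whole-word check stops it matching inside “Memory”.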

How about machine translation? Since we’ve got a large corpus, we might even be able to apply example-based machine translation techniques to translate things automatically.

I may well cover these in future posts. Stay tuned!