Over the weekend, we deployed five new filters to our production TM system. They are:

  • .po – gettext() message files
  • .msg – catgets() message files
  • .properties – Java properties files
  • .java – ListResourceBundle
  • XLIFF – files segmented at paragraph level

The first four are fairly self-explanatory: we can now convert these file formats to XLIFF, look up the messages in those files against our translation memory system, and convert back to the original format once translation has been completed.
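To give a feel for what such a filter does, here is a minimal sketch in Python of turning a .properties file into XLIFF. It is not our production filter: the function name `properties_to_xliff`, the default file name and the choice of XLIFF version are all assumptions made for illustration, and the parsing ignores line continuations, escapes and the `:` separator that real properties files allow.

```python
import xml.etree.ElementTree as ET


def properties_to_xliff(text, original="messages.properties", source_lang="en"):
    """Convert simple key=value properties text into a bare-bones XLIFF string.

    Hypothetical helper: handles only single-line 'key = value' entries.
    """
    # Version and datatype values are illustrative assumptions.
    xliff = ET.Element("xliff", version="1.2")
    file_el = ET.SubElement(
        xliff,
        "file",
        {"original": original, "source-language": source_lang, "datatype": "plaintext"},
    )
    body = ET.SubElement(file_el, "body")

    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # The message key becomes the trans-unit id so the file can be rebuilt later.
        unit = ET.SubElement(body, "trans-unit", id=key.strip())
        ET.SubElement(unit, "source").text = value.strip()

    return ET.tostring(xliff, encoding="unicode")
```

Keeping the message key as the trans-unit id is what makes the return trip possible: once the targets are filled in, a reverse filter can write `key=translation` lines back out.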

The last one requires a bit of explanation. Despite the fact that our system already deals with XLIFF natively (without conversion to any other format), we’re still tied down by the type of segments in our database. As I mentioned here, one of the things you have to decide fairly early on when implementing a TM system is what sort of segments you’re going to work with. Obviously, if you have sentences in your database and the segment you’re looking up is a paragraph, you’re not going to get many matches. Likewise, if you have a database full of paragraphs, searching for individual sentences is going to be tricky (possible, yes – but then you need to do some fairly complex alignment work to decide which sentence in the resulting localised paragraph corresponds to that source-language sentence).

So, what do you do when presented with an XLIFF file that has been produced at paragraph-level segmentation? We were faced with that exact decision… The folks working on StarOffice already have a large database of translations, all of them paragraphs of text from their documentation, online help and software. They were always able to perform simple exact matches on this database, producing partially translated documentation that translators could complete, but they couldn’t do fuzzy matches on it. To make matters worse, since they’d chosen paragraph-level segments, if even a single sentence changed from release to release, they’d have to send the whole paragraph out for translation.
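The difference between exact and fuzzy matching is easier to see in code. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; that is an assumption for illustration only, not the algorithm our TM system uses, and the `best_match` function, the dictionary-shaped memory and the 0.75 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.75):
    """Return (source, translation, score) for the closest entry, or None.

    'memory' is a toy translation memory: a dict of source -> translation.
    A score of 1.0 is an exact match; anything lower is a fuzzy match.
    """
    best = None
    for source, translation in memory.items():
        # ratio() is a character-based similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, segment, source).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    # Discard candidates that are too dissimilar to be useful to a translator.
    if best is not None and best[2] >= threshold:
        return best
    return None
```

With a memory containing only "Click the Save button.", looking up that same sentence scores 1.0, while "Click the Save icon." still comes back as a fuzzy match. An exact-only lookup would return nothing for the second sentence, which is the situation the paragraph-level database was stuck in.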

So, our solution is to take single-language XLIFF documents exported from their database as paragraphs, segment them further down into sentences, get some matches (both exact and fuzzy) from our TM system, and send the partially translated files out to translators. When we receive the completed translations, we can then convert the text back into paragraphs and return it to the StarOffice database. Simple really. Here’s hoping it’ll make a big difference to the efficiency of the internal StarOffice translation process. If things all work out here, it’d be great to try our hand at OpenOffice translations as well – I wonder would the folks working on OpenOffice be interested in our tools?
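The split-and-rejoin step can be sketched roughly as follows. This is a toy version, not our filter: the boundary rule (sentence-ending punctuation followed by whitespace) is an assumption that will trip over abbreviations such as "e.g.", and the names `split_paragraph` and `join_sentences` are made up for the example.

```python
import re

# Naive boundary rule (an assumption): '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_paragraph(paragraph):
    """Split a paragraph into sentences, remembering the whitespace between them."""
    sentences = _BOUNDARY.split(paragraph)
    separators = _BOUNDARY.findall(paragraph)
    return sentences, separators


def join_sentences(sentences, separators):
    """Rebuild a paragraph from sentences and the separators saved at split time."""
    parts = []
    for index, sentence in enumerate(sentences):
        parts.append(sentence)
        # There is one separator fewer than there are sentences.
        if index < len(separators):
            parts.append(separators[index])
    return "".join(parts)
```

Saving the separators means an untouched paragraph rejoins to exactly the text that was exported. For the real round trip, the translated sentences would be passed to `join_sentences` in place of the source ones, which assumes translators keep one target sentence per source sentence.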