In my last blog entry, I had a footnote mentioning some glitches we ran into that made life hard when dealing with files from the StarOffice team. I thought it might be interesting to expand on what the problem is.

The initial problem we were trying to solve was to let the StarOffice team get fuzzy matches from their own material. While they’ve had their own DB of translations for a long time, they’ve only been able to get exact matches from it. Our solution was to have them use our translation memory server, allowing them to get fuzzy matches against their own material (previously imported into our TM), and also to get matches from the translations done for other Sun product groups.

Unfortunately, we were dealing with different levels of segmentation – the StarOffice database contains paragraphs of text, while our TM system works at the sentence level. So if we look up a whole paragraph of text against our database, we’ll get very few matches. Not good.

Our solution to this was to have the StarOffice team export their text for translation as XLIFF – they’d just write out the paragraphs for which they got no 100% matches internally. We’d then take that XLIFF, sub-segment it further into sentences, look up each sentence in our system, and return fuzzy and exact matches (exact, because our database contains strings from products right across the Sun product line, not just StarOffice). We’d then send these partially translated files out to translation, receive the results, and finally recombine the sentences into paragraphs and return those files to the StarOffice database.
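To make that workflow a bit more concrete, here’s a rough Python sketch of the sub-segment, look-up and recombine idea. It isn’t our actual code: the tiny `TM` dictionary, the naive regex splitter, the 0.75 threshold and the use of `difflib` for fuzzy scoring are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sentence-level translation memory: source -> translation.
TM = {
    "Open the file.": "Ouvrez le fichier.",
    "Save your changes.": "Enregistrez vos modifications.",
}

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())

def best_match(text, threshold=0.75):
    """Return (score, translation) for the closest TM entry, or None."""
    best = None
    for source, target in TM.items():
        score = SequenceMatcher(None, text, source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, target)
    return best

def pretranslate(paragraph):
    """Look up each sentence; a match of None is left for the translator."""
    return [(s, best_match(s)) for s in split_sentences(paragraph)]

def recombine(translated_sentences):
    """Join translated sentences back into one paragraph (single spaces)."""
    return " ".join(translated_sentences)

paragraph = "Open the file. Save all changes."
print(best_match(paragraph))    # whole-paragraph lookup
print(pretranslate(paragraph))  # sentence-by-sentence lookup
```

In this toy setup the whole-paragraph lookup falls below the threshold and returns nothing, while the sentence-level lookup comes back with one exact and one fuzzy match, which is the whole reason for sub-segmenting. Note that `recombine` here flattens whitespace to single spaces, which a real round trip could not afford to do.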

The problem we encountered was subtle, and the fault of the implementation we used to get this sub-segmentation behaviour. Since we already had a good XML->XLIFF filter, we thought we could use that filter to convert the XLIFF files at paragraph-level segmentation into XLIFF files at sentence-level segmentation. As you know by now, the neat thing with XLIFF is that it allows you to separate content from formatting. The trouble was that the StarOffice translators wanted to be able to see the original context information exported from the StarOffice database. That information isn’t translatable; it’s there to help the translator. Our XML filter was dropping it (or at least, storing it in the skeleton file, which our editor doesn’t look at or display to translators).

So, today, I’m going to re-write our StarOffice solution. Examining the XLIFF spec, there’s support for <group> elements, which can contain <trans-unit> elements. I think I should be able to re-write the paragraphs as groups of sentences, keeping the context information and hopefully making everyone happy. The only thing to worry about will be what to do with mid-sentence whitespace… Hmm, must get some coffee and work more on that one.
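Here’s roughly what I have in mind, as a Python sketch rather than the finished filter. It makes several assumptions: the elements are un-namespaced, the <source> holds plain text with no inline markup, the context lives in child elements such as <context-group>, and the `p1.1`-style ids and the function names are made up for this example. To sidestep the whitespace worry, it keeps each run of whitespace attached to the sentence before it and marks the unit with `xml:space="preserve"`.

```python
import copy
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def paragraph_to_group(trans_unit):
    """Turn one paragraph-level <trans-unit> into a <group> of sentence units."""
    para_id = trans_unit.get("id", "")
    group = ET.Element("group", {"id": para_id})

    # Copy everything except <source>/<target> (e.g. <context-group>, <note>)
    # so the context travels with the group instead of being dropped.
    for child in trans_unit:
        if child.tag not in ("source", "target"):
            group.append(copy.deepcopy(child))

    # Assumption: the source is plain text with no inline markup.
    text = trans_unit.findtext("source", default="")

    # The capturing group keeps the whitespace runs, so nothing is lost.
    parts = re.split(r"(?<=[.!?])(\s+)", text)
    sentences = parts[0::2]
    gaps = parts[1::2] + [""]

    number = 0
    for sentence, gap in zip(sentences, gaps):
        if not sentence and not gap:
            continue  # nothing left after trailing whitespace
        number += 1
        unit = ET.SubElement(
            group, "trans-unit",
            {"id": f"{para_id}.{number}", XML_SPACE: "preserve"},
        )
        # Trailing whitespace stays attached to its sentence.
        ET.SubElement(unit, "source").text = sentence + gap
    return group

def group_to_paragraph(group):
    """Concatenate the sentence sources back into the original paragraph."""
    units = group.findall("trans-unit")
    return "".join(u.findtext("source", default="") for u in units)
```

Because the whitespace is never thrown away, concatenating the sentence sources gives back the original paragraph exactly. Doing the same for translated <target> text is harder, since the translator may legitimately change the spacing.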

Anyway, the point of this blog entry (yes, it really has one!) was to point out that just because you’re communicating using a common file format, it doesn’t mean that all your problems are over. Having two incompatible segmentation levels in different TM systems can really throw a spanner in the works. On the plus side, I suppose, I get to use more of the XLIFF spec (we haven’t really done much with group elements to date).

There’s some work going on in the LISA OSCAR subgroup to try to address some of these problems – SRX is a file format for exchanging segmentation rules between translation tools. However, all it does is document segmentation rules: given a TM at one level of segmentation and an SRX file that describes that segmentation, you should be able to hand both to a different tool that understands SRX. That tool can then change its segmentation behaviour to match, letting you re-use the legacy TM and avoid vendor lock-in.
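To give a feel for what “documenting segmentation rules” means, here’s a small Python sketch loosely modelled on the SRX idea of break and no-break rules, each with a “before” and an “after” pattern. It doesn’t parse real SRX files, and the rule sets, the first-match-wins ordering and the `segment` function are my own simplifications for illustration.

```python
import re

# Each rule is (is_break, before_pattern, after_pattern).
# Rules are checked in order; the first one that matches decides.
SENTENCE_RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: abbreviations
    (True, r"[.!?]", r"\s"),               # break after end punctuation
]

# A coarser, hypothetical rule set: break only after a newline.
PARAGRAPH_RULES = [
    (True, r"\n", r""),
]

def segment(text, rules=SENTENCE_RULES):
    """Split text at every position where a break rule is the first match."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            ends_here = re.search("(?:" + before + r")\Z", text[:pos])
            if ends_here and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides this position
    segments.append(text[start:])
    return segments

text = "Dr. Smith arrived. He sat down.\nThen he left."
print(segment(text, SENTENCE_RULES))
print(segment(text, PARAGRAPH_RULES))
```

The same text comes out in different pieces depending on which rule set you hand in, and a tool that can read the other tool’s rules can reproduce its segmentation. The leading whitespace is kept on each segment so the pieces always join back into the original text.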

However, it doesn’t help you at all if you have one set of data segmented at one level and a new set of data segmented at another: you can get matches from one set or the other, but doing both is hard.