In a previous post, I introduced the first thing you need in order to write a really good TM system. The good news is that having written the segmenter, I think your work gets easier from here on in :-)

File type specific segmentation

Using a segmentation algorithm usually isn’t enough to determine how best to segment incoming text. Just look at the HTML on this page – you can see that there are lots of things that aren’t entire sentences, but should probably be translated separately from other pieces of text on the page. So the thing to do is to have a file-type-specific layer that knows about the incoming file type, and uses the formatting statements provided to help out your segmenter. For example, in the case of HTML, you might want to treat <h1>, <h2>, <h3>, etc. as complete segments. Perhaps you’d also like to consider inline markup, making sure that whenever you segment text in the middle of an inline element, you propagate those inline elements across both segments. For example:

<b>This is a sentence. This is a new</b> sentence.

should result in the segments:

1. <b>This is a sentence.</b>

2. <b>This is a new</b> sentence.

Each file format can have sections of text that could potentially trip up your segmenter, but if you’ve written your segmenter according to my suggestions, you’ll already have a way to protect certain pieces of text from segmentation. For example, in DocBook, the <programlisting> element tends to contain source-code examples: perhaps there’s translatable text in the middle of the program listing (most likely a code comment, or a string literal in the code) – but since your segmenter wasn’t written to segment Java or C++, you should probably take the entire program listing and treat it as a single segment.
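A minimal sketch of that protection layer: before running the sentence segmenter, split the document into chunks, marking the contents of protected elements so they pass through as single segments. The function name and the set of protected elements here are assumptions for illustration:

```python
import re

# DocBook elements to keep whole (an assumed, illustrative set)
PROTECTED = ("programlisting", "screen")

def split_protected(doc):
    """Split a DocBook fragment into (text, is_protected) chunks.
    Protected chunks become single segments; the rest is handed to
    the normal sentence segmenter."""
    pattern = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(PROTECTED), re.S)
    chunks, pos = [], 0
    for m in pattern.finditer(doc):
        if m.start() > pos:
            chunks.append((doc[pos:m.start()], False))
        chunks.append((m.group(0), True))   # entire listing, one segment
        pos = m.end()
    if pos < len(doc):
        chunks.append((doc[pos:], False))
    return chunks
```

A real DocBook filter would use an XML parser rather than a regex, but the shape of the idea is the same: the segmenter only ever sees the unprotected chunks.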

At this stage, you should be thinking that you could do with a common file format that can contain both the translatable and non-translatable parts of the input document, so that the translator sees only the translatable bits, and so that the rest of your system doesn’t have to be changed each time a new file format is introduced. Well, this is exactly the problem XLIFF was designed to solve. All of our filters are written to take an input document, work out what’s translatable and what isn’t, create segments from the translatable parts, and produce XLIFF output. From this point on, we can deal in a common file format, and there aren’t too many file-format-specific issues to worry about.
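To make that concrete, here is a sketch of the last step of such a filter: wrapping a list of translatable segments in a minimal XLIFF 1.2 skeleton. The function name and defaults are hypothetical, and a real filter would also record the non-translatable skeleton so the document can be rebuilt after translation:

```python
import xml.etree.ElementTree as ET

def segments_to_xliff(segments, source_lang="en", original="input.html"):
    """Wrap translatable segments in a minimal XLIFF 1.2 document
    (illustrative sketch; omits the skeleton needed to rebuild the
    original file)."""
    xliff = ET.Element("xliff", version="1.2")
    f = ET.SubElement(xliff, "file", {
        "original": original,
        "source-language": source_lang,
        "datatype": "html",
    })
    body = ET.SubElement(f, "body")
    for i, seg in enumerate(segments, 1):
        tu = ET.SubElement(body, "trans-unit", id=str(i))
        ET.SubElement(tu, "source").text = seg
    return ET.tostring(xliff, encoding="unicode")
```

Each segment becomes a <trans-unit> with a <source> element; the translation tool later adds a matching <target> alongside each source.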

The eventual aim in all of this is that you only create segments containing translatable text, and that you’re consistent in the way you handle your input format. Once you’ve done that, you can take each segment and look it up against your translation memory. For that, though, you’ll need a database – and that’s what I’ll talk about in the next post in this series.