We came across two bugs in our software message file support this week: one interesting, and one that was just a bit dull.

The duller bug was that when we were writing context information into the XLIFF file, we were sometimes forgetting to escape the ‘<’, ‘>’ and ‘&’ characters in the context elements that we use to store the message key field and the software comments for translators – unforgivable really :-(
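For anyone who hasn't been bitten by this one, here's roughly what the missing escaping step looks like. This is only a minimal sketch, not the code from our tool: the class name XmlEscape and its escape method are made up for illustration.

```java
// Hypothetical helper, for illustration only (not the tool's real code).
final class XmlEscape {
    private XmlEscape() {}

    /** Replaces the markup characters '<', '>' and '&' with their XML entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

Anything that ends up as text inside a context element would go through something like this before being written out.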

The more interesting bug was one we thought we’d solved before, but it turned out to be a regression (my fault). The original problem was that we needed some way to store low-ASCII control characters that are normally forbidden in XML files. For example, one of the Solaris message files has lots of “^G” characters (ASCII BEL), so that the terminal beeps when a particular message is displayed. The question was, how could we represent these in XML?

Well, doing a bit of a Google, and scouring the XLIFF spec, we found that it hadn’t been addressed, so after a little more hunting around, we came across the following, rather unpleasant solution (but hey, it works!):

<trans-unit id="a1176">
<source><?suntrans2-ascii-character 0007?><it id="1" pos="open">\n</it>
WARNING: Disabling duplicate IP address detection!\n\n
<it id="2" pos="open">\n</it><it id="3" pos="open">\n</it><
<count-group name="word count">
<count count-type="word count" unit="word">7</count>
<context-group name="message id">
<context context-type="record"><?suntrans2-ascii-character 0007?>\nWARNING:
Disabling duplicate IP address detection!\n\n</context>

The problem here, of course, is that by using processing instructions to represent these characters, we’ve created an XLIFF file that deviates from the standard, and so is probably incomprehensible to tools that stick to the spec. Sorry! If anyone has any ideas as to how we could do this and still stick to the standard, I’d love to hear about them.
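To make the encoding in the fragment above a bit more concrete, here's a rough sketch of the writing side. The processing instruction target is the one shown in the fragment, but the class name ControlCharEncoder is invented for this post, and I'm assuming here that the four digits are the character code in hexadecimal (for BEL, hex and decimal both give 0007, so the fragment alone doesn't settle it).

```java
// Hypothetical sketch: turns characters that XML 1.0 forbids into
// processing instructions like the ones in the fragment above.
final class ControlCharEncoder {
    static final String PI_TARGET = "suntrans2-ascii-character";

    private ControlCharEncoder() {}

    /** True for control characters below 0x20 other than tab, LF and CR. */
    static boolean isForbidden(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    /** Encodes plain text for use as XML element content. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isForbidden(c)) {
                // ASSUMPTION: four-digit hexadecimal character code.
                out.append("<?").append(PI_TARGET).append(' ')
                   .append(String.format("%04x", (int) c)).append("?>");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else if (c == '&') {
                out.append("&amp;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

So a message that starts with a BEL comes out with the processing instruction at the front, followed by the ordinary (escaped) text.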

The specific regression came about because I’d misunderstood how the SAX parsers generated by NetBeans deal with processing instructions. When reading fragments like the above into our TM tool, I was using a processing-instruction handler and a characters handler for the <source> element, buffering the contents before using them. The problem was that this would always put the control characters at the front of the string: shoddy coding on my part really, me being a moron. (I won’t bore you with the details; suffice to say, it’s all working now…)
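For the curious, one way to keep the control characters where they belong is to have both callbacks append to the same buffer, so everything lands in document order. The sketch below uses the plain JDK SAX API rather than the NetBeans-generated parser, the class name SourceTextHandler is made up, and it again assumes the digits are hexadecimal, so treat it as an illustration of the idea rather than the actual fix.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of a <source> element, decoding
// control-character processing instructions at the point where they occur.
final class SourceTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inSource;
    private String sourceText;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("source".equals(qName)) {
            inSource = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside nested inline elements is appended too.
        if (inSource) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void processingInstruction(String target, String data) {
        // Same buffer as characters(), so the order of events is preserved.
        if (inSource && "suntrans2-ascii-character".equals(target)) {
            // ASSUMPTION: the data is a hexadecimal character code.
            buffer.append((char) Integer.parseInt(data.trim(), 16));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("source".equals(qName)) {
            inSource = false;
            sourceText = buffer.toString();
        }
    }

    /** Parses an XML string and returns the decoded text of its last <source> element. */
    static String parse(String xml) {
        SourceTextHandler handler = new SourceTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.sourceText;
    }
}
```

The point is simply that SAX delivers processing instructions and character data as separate events in the order they appear, so keeping separate buffers for the two (and gluing them together afterwards) loses the position of the control characters.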

While deploying to our internal production server, I got the chance to fix another few niggly little things that had been annoying me for a while, so I’ve got a relatively clean slate now. I wonder what else I can break?