<!SGML “ISO 8879:1986”

Back in ’97 Jon Bosak came to visit the Dublin Sun office to give a talk on a new-fangled thing called “XML”. At the time I was quite underwhelmed by it all. I think I’d only ever really encountered SGML before via HTML, and perhaps caught little glimpses of it through Arbortext Adept and Solaris AnswerBooks. I was young and naive enough to think that XML really wasn’t all that different from SGML and wasn’t terribly special. I was wrong.

Somewhere in the middle of Jon’s very interesting talk, he held up Part 1 of the XML Specification – a copy of which is still on my bookshelf (nice red cover and only 33 pages long). Jon suggested that if he threw this at someone, it probably wouldn’t hurt, in comparison to the SGML specification, which would most likely put the target in hospital for a while.

Now fast-forward six years, to when I was helping to implement SGML support for our translation memory system. I needed to get my hands on a copy of the original SGML specification, which is contained in Goldfarb’s “The SGML Handbook”. It’s 688 pages long and very, very complex. I had some extremely interesting email exchanges with Tony Graham (a hardened SGML and XML hacker) from the Sun XML Technology Group about some nuances in SGML I never knew existed. I suggested that the book was complex, and he replied with, “Yes, even the index is confusing”. Too right – I mean, did *you* know you can redefine < and >?

Our TM system needed a filter to convert DocBook SGML into XLIFF, and for that, we needed to write a quick-and-dirty SGML lexer: just enough to be able to tell what’s a tag, what’s an entity reference, basic support for marked sections, that sort of thing, so we could segment the text in a tag-sensitive manner. That is, the text "This is <emphasis>text. This is a new sentence</emphasis>." should be chopped into the two segments "This is <emphasis>text.</emphasis>" and "<emphasis>This is a new sentence</emphasis>.". We can then look up each segment in the translation database, and return the correct translation for each sentence.
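To make the tag-sensitive chopping concrete, here is a minimal sketch in Python. It is not our actual filter (that one is a proper lexer), and the names `segment`, `TOKEN` and `SENTENCE_END` are made up for this illustration. It assumes tags never contain a literal ">", that a sentence ends at ".", "!" or "?" followed by whitespace inside a single run of text, and it reopens tags without their attributes. It knows nothing about entity references, marked sections or declarations, which is exactly where real SGML gets painful.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")          # a tag, or a run of text
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")    # whitespace after end punctuation


def segment(markup):
    """Split markup into sentences, re-balancing tags around each split."""
    segments = []
    open_tags = []   # names of the elements currently open, outermost first
    current = ""     # the segment being built

    for token in TOKEN.findall(markup):
        if token.startswith("</"):
            # End tag: emit it and forget the innermost open element.
            current += token
            if open_tags:
                open_tags.pop()
        elif token.startswith("<"):
            # Start tag: emit it and remember the element name only.
            current += token
            open_tags.append(token[1:-1].split()[0])
        else:
            pieces = SENTENCE_END.split(token)
            for piece in pieces[:-1]:
                # Close everything still open so this segment is balanced...
                closers = "".join(f"</{name}>" for name in reversed(open_tags))
                segments.append(current + piece + closers)
                # ...and reopen it at the start of the next segment.
                current = "".join(f"<{name}>" for name in open_tags)
            current += pieces[-1]

    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    for s in segment("This is <emphasis>text. This is a new sentence</emphasis>."):
        print(s)
```

Run on the example above, this gives back the two segments described: the first closes the emphasis early, the second reopens it.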

Well, thanks to the richness of the SGML standard, writing this filter was extremely complex, and took a long time to get right – without my colleague John’s help and Tony’s sage-like advice, I don’t know if we’d have managed it. However, today we came across a bug in it. The lexer/parser part of the filter is written in JavaCC, and so far has been able to process any SGML we’ve thrown at it, but today it croaked on this perfectly valid SGML:

<!ENTITY HWCollection "<citetitle>Sun Cluster 3.x Hardware Administration Collection</citetitle>">

(Of course, this is valid XML as well, right?)

So, the point of this post is just to say a very big “Thanks” to Jon for XML, on behalf of all the poor souls like me who’ve tried and failed to write things that process SGML. We’re not worthy!