Back here I was talking about the subtleties of SGML. Well, it turns out, I was wrong in thinking that our SGML lexer wasn’t up to scratch. It was working perfectly well!
The actual problem was that we were being presented with a file containing only <!ENTITY...> declarations, which isn’t valid SGML on its own. The declarations needed to be within the context of a <!DOCTYPE...> subset for the file contents to be valid. The file was referenced from the main .book DocBook file via a parameter entity, %textents;, declared in the doctype subset, so when read in the context of the .book file this was all perfectly legal and the lexer happily processed the file.
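For illustration, the arrangement looks roughly like this (the file names and entity names besides %textents; are made up for the example). On its own, the entities file is just a bare run of declarations; pulled in through the parameter entity in the doctype subset, those same declarations are perfectly legal:

```
<!-- chapters.ent: nothing but entity declarations.
     Not a valid SGML document by itself. -->
<!ENTITY chap1 SYSTEM "chapter1.sgm">
<!ENTITY chap2 SYSTEM "chapter2.sgm">

<!-- mybook.book: the doctype subset declares %textents;
     and expands it, so the declarations above end up
     inside the subset, where they are legal. -->
<!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
  <!ENTITY % textents SYSTEM "chapters.ent">
  %textents;
]>
<book>
  &chap1;
  &chap2;
</book>
```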
Of course, that doesn’t help our TM system, which isn’t smart enough to do that – I guess Norm and Tony were dead right: writing something to process SGML from scratch is quite an undertaking. Thankfully, we don’t encounter cases like this too often, so a workaround might be possible for this book (pubstool id 817-4414, if anyone’s interested?).
We’ve got another, non-conformant lexer that we could use in place of the strict SGML one; it can chew on pretty much anything (we needed it when writing the HTML filter – it’s shocking how much invalid HTML we have to process). So whenever the strict SGML parser throws errors on a file, we give the non-conformant one a go on the same input. Does anyone have any better ideas, given that we’re constrained by wanting to continue processing one file at a time, rather than reading in the whole book and resolving entity references as we come across them?
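The fallback we’ve ended up with can be sketched roughly like this (a hypothetical outline only: `parse_strict`, `parse_lenient`, and `StrictSGMLError` are invented stand-ins for our real lexers, which obviously aren’t Python functions):

```python
# Sketch of the strict-then-lenient fallback described above.
# All names here are invented stand-ins, not our real code.

class StrictSGMLError(Exception):
    """Raised when the strict lexer hits input it can't handle."""

def parse_strict(text):
    # Stand-in strict lexer: rejects a file that opens with entity
    # declarations outside any doctype subset, like the case above.
    if text.lstrip().startswith("<!ENTITY"):
        raise StrictSGMLError("entity declaration outside doctype subset")
    return ("strict", text)

def parse_lenient(text):
    # Stand-in non-conformant lexer: chews on pretty much anything.
    return ("lenient", text)

def parse_file(text):
    """Try the strict SGML lexer first; on error, retry leniently."""
    try:
        return parse_strict(text)
    except StrictSGMLError:
        return parse_lenient(text)
```

The appeal of this shape is that the strict path stays the default, and the lenient lexer is only ever consulted when the strict one has already refused the input.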