Micah was in touch recently, saying that our segmentation of Japanese text in the filters wasn’t very good. That was fair enough: the segmenters were really only ever tested on English text, and I didn’t think they’d do a good job for any other language. In a vague attempt to handle non-English scripts, I did scour the Unicode character list and include anything that sounded like it might indicate a sentence boundary. Unfortunately, I got it wrong: the Katakana middle dot (U+30FB) isn’t a sentence separator. Just shows how much Japanese I know!
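For the record, the characters that actually terminate a Japanese sentence are the ideographic full stop 。 (U+3002) and the full-width ！ and ？ marks; the Katakana middle dot just separates words, e.g. in transliterated names. A naive splitter on those terminators (only a sketch to illustrate the point, not what the filters do) would look like:

```java
public class NaiveSplitter {
    public static void main(String[] args) {
        // Split after the ideographic full stop 。, full-width ！ and ？.
        // The Katakana middle dot ・ (U+30FB) is deliberately absent:
        // it separates words in transliterated compounds, not sentences.
        String text = "ジョン・スミスが来た。次の文！";
        String[] sentences = text.split("(?<=[。！？])");
        for (String s : sentences) {
            System.out.println(s);
        }
    }
}
```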
Today, I put back some changes and mailed our dev@ list to see if these were okay – can anyone else help test Japanese segmentation? I’m not a Japanese speaker (though I did try learning it a few years ago; I didn’t have enough time or inclination to continue, unfortunately), so I’m a bit in the dark here…
Now, I had thought that using the default BreakIterator for the Japanese locale would be enough for sentence segmentation in Japanese, but other folks were saying this isn’t so good: can anyone explain more? (I’m now using that BreakIterator to produce more accurate word counts for Japanese, by the way.)
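For reference, here’s roughly how I’m using it – a minimal sketch of the JDK’s locale-aware BreakIterator for both sentence splitting and word counting, not the actual filter code:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JapaneseSegmenter {

    // Sentence segmentation via the Japanese-locale sentence BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.JAPANESE);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            result.add(text.substring(start, end));
            start = end;
            end = it.next();
        }
        return result;
    }

    // Word counting: only count segments containing at least one letter
    // or digit, so punctuation and whitespace runs aren't counted as words.
    static int wordCount(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.JAPANESE);
        it.setText(text);
        int count = 0;
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                count++;
            }
            start = end;
            end = it.next();
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "これはテストです。日本語の文分割を確認します。";
        System.out.println(sentences(text).size()); // two sentences, split at 。
        System.out.println(wordCount(text));
    }
}
```

The sentence iterator does at least break at the ideographic full stop; what I don’t know is how well it copes with quotes, brackets, and less formal text, which may be what people are objecting to.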
Oh, and thanks to Monma for unwittingly supplying me with some convenient Japanese text I used to check whether the segmenter worked – any more opinions would be gratefully received!