I’ve been sitting here this weekend trying to motivate myself to do some study, and it’s not easy. Over on sun.com, there are details of a beta they’re running for the Sun Certified Java Programmer Exam, which I thought I’d try my hand at. I’m not sure I completely agree with the worth of turning yourself into a Java compiler for the sake of certification (after all, isn’t that what NetBeans has those squiggly red lines for?). However, since taking the certification is free for Sun employees (and only $49 for everyone else!) and I’ll almost certainly learn stuff along the way, what have I got to lose? The disadvantage is that I have to get my skates on, as the beta closes on the 6th of March, so I’ve resolved to try and do some studying this weekend.

Here’s a problem though: I’m one of those people who’s easily distracted when I get tired. Usually, that’s a good time to take a break and return later to whatever I should be doing (at least that’s how I’m justifying my current bout of study-avoidance).

In my (ample) free time, over the last while, I’ve been messing about with writing a little media player for my laptop, just something that’s capable of playing music files copied over from my Mac. I don’t need to do any hardcore backend audio-decoding stuff; a simple GUI wrapped around some existing libraries is all that I need. Yes, there are lots of media players out there that I could use instead, but this is more for kicks (and to get a bit more experience writing Swing apps and using the NetBeans forms editor: I’m a server-side guy most of the time, and rarely get to write a huge amount of GUI code).

Here’s the thing that was diverting me this morning: having copied all of my music from the Mac (~9GB of my own CDs) to my Solaris laptop, doing a simple ls -sF on the directory containing the artists showed the following:

timf@argentum[566] pwd
/home/timf/media/audio/iTunes Music
timf@argentum[567] ls
total 254
2 AIR/                         2 Karen Ramirez/
2 Adam F/                      2 Lauryn Hill/
2 Adrian Legg/                 2 MC Frontalot/
2 Aphex Twin/                  2 Manic Street Preachers/
... (skipping stuff for brevity)
2 Eagle Eye Cherry/            2 Talvin Singh/
2 Erasure/                     2 The Avalanches/
2 Erykah Badu/                 2 The Beach Boys/
2 Esthero/                     2 The Cure/
2 Everything But The Girl/     2 The Darkness/
2 Faithless/                   2 The Divine Comedy/
2 Fat Boy Slim/                2 The Killers/
2 Fatboy Slim/                 2 The Lemonheads/
2 Finley Quaye/                2 The Police/
2 Foo Fighters/                2 The Proclaimers/
2 Fountains Of Wayne/          2 The Sundays/
...
(and more of the same)

Now, according to a dictionary sort, the above is absolutely spot on: the directories are sorted A..Z just as you’d expect. However, this isn’t the way a record store would sort these artists. They’d be far more likely to ignore the “The” in some band names and sort them under the first letter of the following word, so “The Police” would get sorted under “P”, right? Well, this was getting on my nerves, so I resolved to do something about it.
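
To make the goal concrete, here is a minimal Python sketch of the “record shop sort” described above. It is not what I actually did on Solaris (that happens in the locale, below); the function name record_store_key and the choice to compare case-insensitively are assumptions made purely for illustration.

```python
def record_store_key(name):
    """Sort key that files "The Police" under P, the way a record shop would.

    Hypothetical helper for illustration only; the real work in this post
    is done by the system locale, not by application code.
    """
    prefix = "The "
    # Drop a leading "The " only when a band name actually follows it.
    if name.startswith(prefix) and len(name) > len(prefix):
        stripped = name[len(prefix):]
    else:
        stripped = name
    # Compare case-insensitively; the original name breaks any ties.
    return (stripped.casefold(), name)


artists = ["The Police", "Portishead Parody", "The Proclaimers",
           "Karen Ramirez", "The Killers"]
ordered = sorted(artists, key=record_store_key)
print(ordered)
```

The drawback of doing it this way is that every program has to carry its own copy of the rule, which is exactly why pushing it down into the locale is more appealing.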

On Solaris and most other modern UNIX implementations, you can use localedef to define a locale. The locale is what takes care of the way the system does collation (sorting). Have a look at the locale man page for the gory details. Thankfully, on Solaris, we ship the source for some of our locales under /usr/lib/localedef/src so, armed with those, a text editor and a compiler, I set about defining a new locale: en_US.UTF-8@timf!

It turns out there wasn’t much to it. All I needed to do was define a series of collating elements in the localedef source file that tell the system about things that look like “The …”:

# Tim added this stuff
collating-element <TheA> from "<T><h><e><space><A>"
collating-element <TheB> from "<T><h><e><space><B>"
collating-element <TheC> from "<T><h><e><space><C>"
etc.
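
Typing out one of these lines per letter is tedious, so here is a small Python sketch that generates them. It assumes the remaining entries follow exactly the same pattern as the three shown above and that the charmap names the capital letters <A> to <Z>; the function name the_collating_elements is made up for this example.

```python
import string


def the_collating_elements(letters=string.ascii_uppercase):
    """Return one collating-element line per letter, in the pattern above.

    Assumes the charmap defines symbolic names <A>..<Z>, <T>, <h>, <e>
    and <space>, as the hand-written lines in this post do.
    """
    return [
        f'collating-element <The{c}> from "<T><h><e><space><{c}>"'
        for c in letters
    ]


lines = the_collating_elements()
print("\n".join(lines))
```

The output can be pasted into the localedef source in place of the hand-written entries.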

Having defined those, I then needed to define the sort routines to treat “The A..” as if it were “A”, “The B..” as if it were “B”, etc.

order_start forward;forward,position
# Tim's rules for sorting The differently, according
# to the way record stores do it
<TheA><A>
<TheB><B>
<TheC><C>
<TheD><D>
etc.

Then finally, run localedef to build the locale source file, and compile the sources. I was running into problems with localedef trying to call gcc, so the -c flag here tells the system to produce a source file regardless. As soon as it started trying to compile the C source file, I hit Ctrl-C and manually compiled the locale:

timf@argentum[583] localedef -c -f charmap.src -x extension.src \
-i localedef.src en_US.UTF-8@timf
(output omitted)
^C
timf@argentum[584] gcc -o en_US.UTF-8@timf.so.3 -shared -fpic \
-L/usr/lib/locale/common -R/usr/lib/locale/common \
/usr/lib/locale/common/methods_unicode.so.3 localeUha4Id.c
(output omitted)
mv en_US.UTF-8@timf.so.3 /usr/lib/locale/en_US.UTF-8@timf/en_US.UTF-8@timf.so.3

Having copied the rest of the files from the standard en_US.UTF-8 support, hey presto, I’ve a new locale:

timf@argentum[537] export LC_ALL=en_US.UTF-8@timf
timf@argentum[538] locale
LANG=C
LC_CTYPE="en_US.UTF-8@timf"
LC_NUMERIC="en_US.UTF-8@timf"
LC_TIME="en_US.UTF-8@timf"
LC_COLLATE="en_US.UTF-8@timf"
LC_MONETARY="en_US.UTF-8@timf"
LC_MESSAGES="en_US.UTF-8@timf"
LC_ALL=en_US.UTF-8@timf
timf@argentum[539] ls
total 254
2 Adam F/                      2 Jean Michel Jarre/
2 add/                         2 Jean-Michel Jarre/
2 Adrian Legg/                 2 Jools Holland/
2 AIR/                         2 Juliet Turner/
2 Aphex Twin/                  2 Justin Bacon/
2 Apocalyptica/                2 Justin Timberlake/
2 Arrested Development/        2 Karen Ramirez/
2 The Avalanches/              2 The Killers/
2 Badly Drawn Boy/             2 l5-l6 puzzle/
2 Basement Jaxx/               2 Lauryn Hill/
2 BBC Radio/                   2 The Lemonheads/
2 BBC Radio 4/                 2 Manic Street Preachers/
2 The Beach Boys/              2 Mary Margaret O'Hara/
(snip)
2 The Cure/                    2 The Police/
2 Daft Punk/                   2 Portishead Parody/
2 D'Angelo/                    2 The Proclaimers/
2 The Darkness/                2 R.E.M./
2 David Gray/                  2 Rodrigo y Gabriela/
(snip)
2 Fat Boy Slim/                2 The Sundays/
(etc.)

Now, of course, this is probably not something most people would do (in particular, I haven’t thought a whole lot about the larger ramifications of my weird collation sequence), and my system is the only one that’ll sort this way. I’ve defined a new locale without consulting any in-country standards bodies and have installed it on my system without doing a whole lot of testing. Furthermore, I haven’t told the windowing system (or anything else) about my new locale, but that’s okay, I’m only doing it to avoid studying anyway ;-)

Having said that, here’s a small rant about “Platform architecture”. The work that I’ve done here will work for all programs that are well behaved and use the standard system library calls. That is, it only applies to programs that do their sorting with the system-supplied routines, e.g. strcoll & friends (see the locale man page for more details, under the LC_COLLATE category). Programs like Java and OpenOffice.org are platforms themselves though, and have defined their own locale information, so my work won’t have any effect when I run on those platforms. That bugs me. I understand their reasons for defining their own locale information, being cross-platform applications which need to ensure common behaviour across different systems, but it doesn’t make my life any easier. If I wanted to complete the job of getting “record shop sort” on my desktop, I’d have to start investigating how Java and StarOffice provide collation support (hint: this might help for the former; I’ve no idea about StarOffice/OpenOffice.org).
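
As an illustration of what “well behaved” means here, the following Python sketch sorts a list by deferring to the platform’s strcoll, so whatever LC_COLLATE rules the chosen locale defines are the ones that get applied. The helper name sort_with_locale is hypothetical, and the example uses the "C" locale only because it exists everywhere; a name like en_US.UTF-8@timf would only work on a system where that locale has been installed.

```python
import functools
import locale


def sort_with_locale(names, locale_name):
    """Sort names using the system collation rules of locale_name.

    Hypothetical helper: it temporarily switches LC_COLLATE, lets the
    platform's strcoll do every comparison, then restores the old setting.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query, don't change
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# "C" is used here only because every system has it.
result = sort_with_locale(["add", "AIR", "Adam F"], "C")
print(result)
```

A program written like this picks up a custom collation sequence for free, whereas one that ships its own collation tables never sees it.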

We’ve been running into similar issues at work with other bits of locale information, where programs elect not to use the system defaults, deciding instead to include date information and other locale-sensitive formatting inside .po files. The result is a potentially inconsistent desktop, with different programs having their own ideas about how to display locale-sensitive information (currency, date formats, number formats, sorting, etc.).

I’ll end this (by now rather lengthy, sorry) post with a request: the next time you find you need to do something that’s locale-sensitive, please look at the support provided by the platform, and use that if at all possible. If the platform support isn’t sufficient for your needs, get in touch with your vendor and ask them to start enhancing it! That’s the only way we’ll be able to move things forward…

Anyway, end of rant. That was an interesting diversion, but I’d better get back to my studying!
