At Sun, and now Oracle, we have a server called Jurassic. The machine was previously hosted in MPK17, the office in Menlo Park, CA where many of the Solaris kernel developers worked (now occupied by a bunch of kids improving the lot of humanity through careful application of advertising technology). It now lives in the Santa Clara offices, to which many of us moved.

Every two weeks, jurassic is updated to the latest development builds of Solaris. Less frequently, it gets a forklift upgrade to more recent hardware to improve test coverage on that platform. The “Developing Solaris” document has this to say about jurassic:

You should assume that once you putback your change, the rest of the world will be running your code in production. More specifically, if you happen to work in MPK17, within three weeks of putback, your change will be running on the building server that everyone in MPK17 depends on. Should your change cause an outage during the middle of the day, some 750 people will be out of commission for on the order of an hour. Conservatively, every such outage costs Sun $30,000 in lost time [ed. note from timf: I strongly suspect this is lower now: newer jurassic hardware, massive improvements in Solaris boot time, and bootable ZFS mean that we can reboot jurassic with the last stable Solaris bits very quickly and easily nowadays, though that’s not an excuse to putback a changeset that causes jurassic to tip over] — and depending on the exact nature of who needed their file system, calendar or mail and for what exactly, it could cost much, much more.

If this costs us so much, why do we do it? In short, to avoid the Quality Death Spiral. The Quality Death Spiral is much more expensive than a handful of jurassic outages — so it’s worth the risk. But you must do your part by delivering FCS quality all the time.

Does this mean that you should contemplate ritual suicide if you introduce a serious bug? Of course not — everyone who has made enough modifications to delicate, critical subsystems has introduced a change that has induced expensive downtime somewhere. We know that this will be so because writing system software is just so damned tricky and hard. Indeed, it is because of this truism that you must demand of yourself that you not integrate a change until you are out of ideas of how to test it. Because you will one day introduce a bug of such subtlety that it will seem that no one could have caught it.

And what do you do when that awful, black day arrives? Here’s a quick coping manual from those of us who have been there:

  • Don’t pretend it didn’t happen — you screwed up, but your mother still loves you (unless, of course, her home directory is on jurassic)
  • Don’t minimize the problem, shrug it off or otherwise make light of it — this is serious business, and your coworkers take it seriously
  • If someone spent time debugging your bug, thank them
  • If someone was inconvenienced by your bug, apologize to them
  • Take responsibility for your bug — don’t bother to blame other subsystems, the inherent complexity of Solaris, your code reviewers, your testers, PIT, etc.
  • If it was caught internally, be thankful that a customer didn’t see it [ed. note from timf: emphasis mine – this is the most important bit for me]

But most importantly, you must ask yourself: what could I have done differently? If you honestly don’t know, ask a fellow engineer to help you. We’ve all been there, and we want to make sure that you are able to learn from it. Once you have an answer, take solace in it; no matter how bad you feel for having introduced a problem, you can know that the experience has improved you as an engineer — and that’s the most anyone can ask for.

So, naturally, my home directory in CA is on jurassic, and whenever I’m using lab machines in California, I too am subject to whatever bits are running on jurassic.

However, I don’t live in California – I work remotely from New Zealand, and as good as NFSv4 is, I don’t fancy accessing all my content over the Pacific link.

I strongly believe in the sentiment expressed in the Developing Solaris document, though, so my solution is to run a “mini-jurassic” at home, something I expect most other remote Solaris developers do as well.

My home server was previously my desktop machine – a little 1.6GHz Atom 330 box that I wrote about a while ago. Since Oracle took over, I run a much more capable workstation with a Xeon E3-1270 @ 3.40GHz, a few disks and a lot more RAM :) Even though the workstation also runs bits from the bi-weekly builds of Solaris, it doesn’t do enough to even vaguely stress the hardware, so when I got it at the beginning of the year, I repurposed the old Atom box as a mini-jurassic.

Here are the services I’ve got running at the moment:

ZFS

… well, obviously. The box is pretty limited in that it’s maxed out at 4GB of RAM, and non-ECC RAM at that (I know – I’ll definitely be looking for an ECC-capable board next time, though I haven’t checked whether there are any suitable mini-ITX, low-power boards out at the moment).

With only three disks available, I use a single disk for the bootable root pool and a mirrored pair of disks for the main data pool. I periodically use ZFS send/recv to copy the important datasets from the mirror to other machines on my home network. I suspect that whenever I next upgrade the system, I’ll buy more disks and move to a 3-way mirror: space hasn’t been a problem yet, as the main data pool uses just 1.5TB disks and is only at 24% capacity.
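
For what it’s worth, here’s a minimal sketch of that kind of send/receive backup, assuming a hypothetical data pool called tank, a backup host called backupbox and a target pool called backup on it; the real dataset and host names are whatever you care about:

```python
#!/usr/bin/env python
# Sketch: snapshot a dataset and replicate it to another machine by piping
# zfs send into zfs receive over ssh. All names below are hypothetical.
import subprocess
from datetime import datetime

DATASET = "tank/home"      # a dataset worth keeping safe
REMOTE = "backupbox"       # another machine on the home network
REMOTE_POOL = "backup"     # pool on the remote machine

snap = "%s@backup-%s" % (DATASET, datetime.now().strftime("%Y%m%d-%H%M"))

# Recursive snapshot of the dataset and its children.
subprocess.check_call(["zfs", "snapshot", "-r", snap])

# zfs send -R sends the dataset, its children and their snapshots;
# zfs receive -F -d rolls the remote copy back so the stream applies cleanly
# and names the received datasets under the remote pool.
send = subprocess.Popen(["zfs", "send", "-R", snap], stdout=subprocess.PIPE)
subprocess.check_call(["ssh", REMOTE, "zfs", "receive", "-F", "-d", REMOTE_POOL],
                      stdin=send.stdout)
send.stdout.close()
if send.wait() != 0:
    raise SystemExit("zfs send failed")
```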

Certain datasets containing source code also have ZFS encryption enabled, though most of the code I work on resides on non-encrypted storage (because it’s all still open source and freely available).

I run the old zfs-auto-snapshot service on the system so that I always have access to daily, hourly, and every-15-minute snapshots of the datasets I really care about.
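
For the record, here’s a rough sketch of how that hangs together, assuming the old SMF instance names from the OpenSolaris-era service and a hypothetical tank/home dataset:

```python
#!/usr/bin/env python
# Sketch: enable the zfs-auto-snapshot SMF instances and opt a dataset in.
# The service FMRIs and the dataset name here are assumptions.
import subprocess

SCHEDULES = ["frequent", "hourly", "daily"]   # every-15-minutes, hourly, daily
DATASET = "tank/home"                         # hypothetical dataset

for schedule in SCHEDULES:
    subprocess.check_call(
        ["svcadm", "enable",
         "svc:/system/filesystem/zfs/auto-snapshot:%s" % schedule])

# Datasets opt in (or out) via the com.sun:auto-snapshot user property.
subprocess.check_call(["zfs", "set", "com.sun:auto-snapshot=true", DATASET])
```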

NFS

I serve my home directory from here; it automounts onto my laptop and workstation, and is also shared to my Mac. Whenever I have to travel, I use ZFS to send/receive all of the datasets that make up my home directory over to my laptop, then send them back when I return.
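
The travel case is just the incremental flavour of the same send/receive dance; a minimal sketch, with hypothetical dataset, pool, host and snapshot names (coming home is the same thing in reverse, with the laptop as the sender):

```python
#!/usr/bin/env python
# Sketch: incrementally sync the home-directory datasets to the laptop before
# travelling. Dataset, pool, host and snapshot names are hypothetical.
import subprocess

DATASET = "tank/home"
LAPTOP = "laptop"
LAPTOP_POOL = "rpool"
LAST_SYNC = DATASET + "@synced"    # snapshot already present on both sides
NEW_SNAP = DATASET + "@travel"

subprocess.check_call(["zfs", "snapshot", "-r", NEW_SNAP])

# With -i, only the blocks changed since the last common snapshot cross the wire.
send = subprocess.Popen(["zfs", "send", "-R", "-i", LAST_SYNC, NEW_SNAP],
                        stdout=subprocess.PIPE)
subprocess.check_call(["ssh", LAPTOP, "zfs", "receive", "-F", "-d", LAPTOP_POOL],
                      stdin=send.stdout)
send.stdout.close()
if send.wait() != 0:
    raise SystemExit("zfs send failed")
```
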
CIFS

The Windows laptop mounts its guest Z: drive via the CIFS server, which shares a single dataset from the data pool (with a quota on that dataset, just in case). This dataset is also shared to my Mac.
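
A minimal sketch of that kind of share, assuming a hypothetical tank/windows dataset and that the sharesmb dataset property is the mechanism in use (the exact share syntax has varied between Solaris releases):

```python
#!/usr/bin/env python
# Sketch: carve out a quota-limited dataset and share it over SMB/CIFS.
# The dataset name, quota size and sharesmb syntax are assumptions.
import subprocess

DATASET = "tank/windows"

subprocess.call(["zfs", "create", DATASET])   # harmless if it already exists

for prop in ("quota=200G",     # cap it, just in case the laptop runs wild
             "sharesmb=on"):   # publish it via the in-kernel SMB server
    subprocess.check_call(["zfs", "set", prop, DATASET])
```
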
An Immutable Zone

Immutable zones are a new feature in Solaris 11. I have a very stripped-down, internet-facing zone running FeedPlus: a simple cron job that runs a Python script, plus a minimal web server. The zone has resource controls set to give it only 256MB of RAM, to prevent it from taking over the world. I really ought to configure Crossbow to limit its bandwidth as well.
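
For illustration, here’s a rough sketch of the corresponding zone configuration, assuming a hypothetical zone name and zonepath; file-mac-profile is what makes the zone read-only, and the capped-memory resource is the 256MB limit mentioned above:

```python
#!/usr/bin/env python
# Sketch: configure a small immutable zone with a physical memory cap.
# The zone name and zonepath are hypothetical; the zonecfg properties are
# the Solaris 11 ones, but treat the exact incantation as an assumption.
import subprocess

ZONE = "feedplus"

commands = "; ".join([
    "create",
    "set zonepath=/zones/%s" % ZONE,
    "set file-mac-profile=fixed-configuration",   # this is what makes it immutable
    "add capped-memory",
    "set physical=256m",                          # stop it taking over the world
    "end",
    "commit",
])

subprocess.check_call(["zonecfg", "-z", ZONE, commands])

# Bandwidth could be capped on the zone's VNIC with something like:
#   dladm set-linkprop -p maxbw=10M feedplus0
```
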
A read/write zone

The standard flavour of zones has been around for a while now. This one runs the web server for the house, sharing music and video content. All of the content actually resides in the global zone, but is shared into the zone using ZFS clones of the main datasets, which means that even if someone goes postal in the zone, all of my data is safe.
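
The clone trick is cheap because a clone shares its blocks with the snapshot it came from; a minimal sketch, with hypothetical dataset and zone names:

```python
#!/usr/bin/env python
# Sketch: expose media to a zone via a writable clone so the original
# dataset in the global zone stays untouched. Names are hypothetical.
import subprocess

SOURCE = "tank/media"                 # the real data, in the global zone
SNAP = SOURCE + "@for-webzone"
CLONE = "tank/zones/webzone/media"    # what the zone actually gets to scribble on

subprocess.check_call(["zfs", "snapshot", SNAP])
subprocess.check_call(["zfs", "clone", SNAP, CLONE])

# If someone goes postal inside the zone, only the clone suffers; recovery is
# just destroying the clone and cutting a fresh one from the snapshot.
```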

The zone also runs my IRC logger for #pkg5 on Freenode (helpful when you work in a different timezone).

IPS updates

The system gets upgraded every two weeks, creating a new boot environment for both the zones and the global zone. It updates through a caching HTTP proxy running on my workstation, which helps to further minimise bandwidth when I update all of my local machines once new bits become available (though IPS is already pretty good at keeping bandwidth to a minimum, only downloading the files that change during each update).
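
Here’s a sketch of that update, assuming a hypothetical proxy address on the workstation; IPS honours the standard http_proxy environment variable, and pkg update creates the new boot environment itself when it needs to:

```python
#!/usr/bin/env python
# Sketch: run an IPS update through a caching HTTP proxy. The proxy host and
# port are hypothetical; pkg(1) picks the proxy up from the environment.
import os
import subprocess

env = dict(os.environ, http_proxy="http://workstation:8080")

# Dry run first: -n shows what would change, -v says why.
subprocess.check_call(["pkg", "update", "-nv"], env=env)

# The real thing; a new boot environment is created automatically when the
# update touches anything that needs one.
subprocess.check_call(["pkg", "update"], env=env)
```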

I tend to run several other stable and experimental bits and pieces on my home systems, on both the little Atom box and my workstation. These mostly relate to my day job improving IPS in Solaris, and they have already proved their worth, both in terms of shaking out bugs and in making my life a lot easier as a remote worker. I hope to write more about some of those in a future post.

As more capabilities get added to Solaris, I try as much as I can to find ways to exercise the new bits, just as we do with the jurassic server in California, because as it says on the jurassic web page:

Every problem we find and fix here is a problem which a customer will not see.

and that’s a good thing.