It’s been a long time since I’ve talked about my actual day job – I used to write about it all the time when I worked in the G11n group, but these days, I tend to just write about ZFS, and not necessarily my contributions to it from a testing perspective.

Disclaimer: as a QE engineer, it’s really difficult to know how to write about software testing – just doing so acknowledges that our software has bugs. So, can we just take the fact that software has bugs as a given, and move on?

Hint: your software has bugs too, so look out!

I started working in ZFS-test back in October 2005 and it’s kept me really busy trying to keep on top of the performance improvements, feature additions and bug fixes that have been rolling in on a fairly constant basis from the developers since then. Life is far from dull. What’s more, with every bug comes a little more knowledge about how ZFS works – I’m still learning a new thing every day, and that’s good, I like learning new stuff.

Last week, I was doing what every good test developer should be doing – expanding our test coverage, and that’s what I wanted to write about here.

As you can see from the ZFS Test Suite we’ve released, we’ve got a lot of coverage of the functional areas of ZFS, but there are always places where we can improve test coverage.

I was concentrating last week (or trying to, more on that later) on increasing the level of coverage for functional testing of ZFS commands when run as non-root users. With the advent of ZFS delegated administration, along with the specific tests for that feature, we wanted to double-check that we’re doing the right thing for non-root users of ZFS utilities in a general sense – ensuring that they don’t have any more permissions than they should – so it made sense to build out our user tests. I’m thankful to say, the ZFS utilities are looking pretty solid here. Yeah!
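To give a flavour of what a non-root check looks like, here’s a minimal sketch in Python. It is not taken from the ZFS Test Suite: the dataset name, the helper names (`run_as`, `unexpectedly_allowed`) and the use of passwordless `sudo` to switch user are all assumptions made for illustration. The idea is simply to run a handful of privileged `zfs` subcommands as an unprivileged user with no delegated permissions and report any that succeed when they shouldn’t.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical dataset name: point this at a scratch pool you can afford to lose.
DATASET = "testpool/scratch"

# Subcommands that a user with no delegated permissions should NOT be able
# to run successfully against DATASET (an illustrative list, not exhaustive).
PRIVILEGED_COMMANDS: List[List[str]] = [
    ["zfs", "create", DATASET + "/child"],
    ["zfs", "snapshot", DATASET + "@snap"],
    ["zfs", "destroy", DATASET],
    ["zfs", "set", "compression=on", DATASET],
]


def run_as(user: str, argv: Sequence[str]) -> int:
    """Run argv as `user` and return its exit status.

    Assumption: passwordless sudo is configured on the test machine.
    """
    completed = subprocess.run(
        ["sudo", "-n", "-u", user, *argv],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def unexpectedly_allowed(
    user: str,
    commands: Sequence[Sequence[str]],
    runner: Callable[[str, Sequence[str]], int] = run_as,
) -> List[List[str]]:
    """Return every command that exited 0 when run as `user`."""
    return [list(cmd) for cmd in commands if runner(user, cmd) == 0]
```

A test along these lines passes when `unexpectedly_allowed("testuser", PRIVILEGED_COMMANDS)` comes back empty. Passing the runner in as a parameter means the checking logic can be exercised with a fake runner on a machine that has no pool at all.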

It goes without saying that just because we’re only now writing those tests, it doesn’t mean that non-root use of ZFS utilities has never been tested. Remember, these are just the formal automated test cases: for every test case we write, there are a great many people using and running ZFS in ways which we haven’t formally tested. In fact, just last week, I had a bug rear its head on my desktop machine which our test suite hadn’t caught! :-(

As a test development team, every time this happens to us, we feel a bit inadequate – it’s a fact of life that we can’t catch all the bugs of course, but that doesn’t make us feel any better.

This particular bug was an OS panic – which, I should say first, we haven’t seen on a production version of ZFS. Needless to say, this is somewhat distracting when it happens to your desktop (Windows blue screens, anyone?) – in fact, I think it might be the first time I’ve seen ZFS code that made it back into the Nevada gate actually panic my own machine. Ouch.

As every good developer should, I’m running the latest bits on my desktop, hoping to a) avoid the Quality Death Spiral, and b) expand coverage of ZFS testing.

As I said, our test suite covers a lot of functionality, but there are limits to what it can do to set up particular scenarios. In this case, the question was: how do you reproduce a particular bug that I’ve only been able to reproduce so far on a well-exercised 30GB pool? Its filesystems (most created back in May 2006, with lots of churn since then) contain my test development & OS development workspaces, home-directory backups, lots of BFU archives from developers, iPod contents, usr/local & blastwave bits, etc., amounting to 26 filesystems and ~800 snapshots, which are taken and deleted at constant 10-minute, daily, weekly and monthly intervals.

Even if we could generate the right set of circumstances to make this bug appear, if it took 4 days of constant filesystem activity just to set up the test case, we wouldn’t be able to roll it into the test suite, as it’d impact the runs of the rest of the suite. Short, easy-to-reproduce tests are what we’re after (though if such a long-running test existed, we could run it periodically).

So far, we’re still trying to get a test case for this bug, but this also demonstrates why formal test cases aren’t a substitute for running your own software on your own machines: they’re a complement to interacting with your code in as many ways as you possibly can. At Sun, as a company we have the “Sun on Sun” program, where we’re encouraged to run our own software in daily usage – indeed, you’ll see many folks talking about this, for example, Chris’ series “Good morning build…”. Likewise, a lot of the engineering folks (myself included) run with remote home directories on a machine called “jurassic” over in Menlo Park, which also runs ZFS and has also shaken a few more bugs out of the filesystem.

Finding and catching bugs before users do is the reason why we test developers exist, and we take a belt & braces approach to testing. Even if you’re not a test developer, you can still contribute to the effort – download your OpenSolaris distribution of choice, give it a spin and if you find anything amiss with ZFS, let us know via zfs-discuss (the same goes for other technologies in OpenSolaris, of course).

Your daily use of ZFS contributes to the quality of the filesystem, and that’s a good thing for everyone, so kick those tyres!