This is the next in the series of posts I’ve been writing about my role as ON technical lead for the Solaris organisation for the last 3.25 years.
What about maintaining productivity though? Providing new tools and processes is all very well, but how do we keep existing ones running and how do we communicate issues to the development community at large to let them know when they need to take action?
Heads up and flag day messages
Solaris engineering has long had a notion of formal “Heads up” and “Flag Day” notifications – each a slightly different form of organisation-wide email.
In a nutshell, a “Flag day” indicated something that a developer would have to take special action to deal with. Flag days were things like compiler version bumps, new pre-putback requirements or the addition of new source code repositories to the gate.
“Heads up” messages were less severe: for example, announcing that a major project had landed, that a new feature had arrived, that the way the OS could be administered had changed, or indeed that a change had been added to the gate to fix a build problem developers could encounter in the future (more on that later).
My predecessor, Artem Kachitchkine, did sterling work to further formalise these notifications, adding them to an Apache Solr index, broken down by biweekly build number. He also wrote a simple WSGI front-end so that the index could be searched easily from a web page and the results displayed. Along with the heads-up and flag day mails, the index also covered any source-tree integrations that included an ARC (Architecture Review Committee) case, which typically indicated that a larger body of work had just been pushed.
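To give a flavour of the idea, here is a minimal sketch of a searchable notification index behind a WSGI front-end. It is not Artem's code: the real index lived in Apache Solr, whereas this uses a plain in-memory list, and the `Notification` fields, the sample messages, the build numbers and the `q`/`build` query parameters are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass(frozen=True)
class Notification:
    kind: str      # "heads-up", "flag-day" or "arc-integration" (assumed labels)
    build: int     # biweekly build number the message belongs to
    subject: str
    body: str


# Hypothetical sample data standing in for the real Solr index.
NOTIFICATIONS = [
    Notification("flag-day", 42, "Compiler version bump",
                 "Rebuild with the new compiler; old objects fail to link."),
    Notification("heads-up", 42, "New feature landed",
                 "A major project has integrated into the gate."),
    Notification("flag-day", 43, "New pre-putback requirement",
                 "Run the extra check before integrating."),
]


def search(query, build=None):
    """Return notifications whose subject or body contains every query word."""
    words = query.lower().split()
    hits = []
    for note in NOTIFICATIONS:
        if build is not None and note.build != build:
            continue
        text = f"{note.subject} {note.body}".lower()
        if all(word in text for word in words):
            hits.append(note)
    return hits


def app(environ, start_response):
    """Minimal WSGI application: ?q=words&build=N lists matches as plain text."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("q", [""])[0]
    build_arg = params.get("build", [""])[0]
    build = int(build_arg) if build_arg.isdigit() else None
    lines = [f"build {n.build} [{n.kind}] {n.subject}"
             for n in search(query, build)]
    body = ("\n".join(lines) or "no matches").encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library would serve it, and a request for `/?q=compiler` would list the matching flag day.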
During my time as tech lead, we maintained this tradition, as it was clearly working well and developers were used to watching out for these messages.
Getting the quantity of heads-up messages right is an art: too few mass notifications, and you’ll spend time answering individual queries that would be better spent writing a single announcement. Too many, and people will lose track or stop reading them. Of course, this didn’t stop the phrase “Read the damned flag day!” being thrown about from time to time, despite our best efforts!
Similarly, getting the content of the messages correct was also an art. The core team and gatekeepers would typically ask for a pre-review of any heads-up or flag day message before it was sent out. Partly this was to look for mistakes in the content, but it was also to make sure the message was concise and to the point, and to avoid having to send out subsequent corrections.
Common things we’d look for in these messages would be:
- In the mail, please list your bug IDs and ARC cases, explain what you’ve changed in the system, what system components have been affected and why the change constitutes a flag day.
- If only certain systems are affected or only machines with a particular piece of hardware are affected, please mention that.
- Show what any error from not following this flag day looks like – flag days are indexed, so including excerpts of error messages makes it easier for people to search for the relevant flag day.
- Explain how users recover from the flag day. For example, are incremental ON builds broken so that gatelings need to do a ‘make clobber’ before doing a build? Will Install of a kernel with the change continue to work on a pre-change system?
- List bug categories for any new feature.
- Leave a contact email address.
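We did these pre-reviews by hand, but the checklist above is simple enough to sketch as code. The following is purely illustrative and was not part of our tooling: the `FlagDayDraft` class, its field names and the `missing_items` helper are all hypothetical, and I've assumed that affected systems and ARC cases are optional while bug categories are only needed when a new feature is introduced.

```python
from dataclasses import dataclass, field


@dataclass
class FlagDayDraft:
    """Hypothetical structured form of a flag day mail awaiting pre-review."""
    bug_ids: list = field(default_factory=list)
    arc_cases: list = field(default_factory=list)   # optional: not every change has one
    summary: str = ""            # what changed, which components, why it is a flag day
    affected_systems: str = ""   # optional: only if a subset of systems is affected
    error_excerpt: str = ""      # what failure looks like, which also helps searching
    recovery_steps: str = ""     # e.g. whether a 'make clobber' is needed
    introduces_feature: bool = False
    bug_categories: list = field(default_factory=list)
    contact: str = ""


def missing_items(draft):
    """Return the checklist items a reviewer would ask the author to add."""
    missing = []
    if not draft.bug_ids:
        missing.append("bug IDs")
    if not draft.summary.strip():
        missing.append("explanation of the change")
    if not draft.error_excerpt.strip():
        missing.append("example error output")
    if not draft.recovery_steps.strip():
        missing.append("recovery steps")
    if draft.introduces_feature and not draft.bug_categories:
        missing.append("bug categories for the new feature")
    if not draft.contact.strip():
        missing.append("contact email address")
    return missing
```

An empty result means the draft covers the basics; it says nothing about whether the message is concise or correct, which is still the reviewer's job.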
So that explains how we’d communicate issues. How do we tell if issues are going to arise in the first place though? Enter Jenkins (again).
“Accelerated nightly builds”
I’ve mentioned before how Solaris is built into biweekly “milestone” builds, where we take packages built by every engineering group that contributes to the operating system, and produce a single image which can be installed or upgraded to.
For a long time, because of the complexity of the tools required to create such an image, this was the main way we built the OS – every two weeks, there’d be a new image and that was what got tested.
When Solaris 11 came along, with the new packaging system, and tools like Distribution Constructor, we made sure that the tools we used to build the OS were available to users too. This made it simple for engineers to produce their own entire OS images, though we found that not many actually did. Partly this was because it was just too inconvenient and time-consuming for all users to hunt down the latest packages from all engineering groups and hand-assemble their own image. Similarly, IPS allowed users to upgrade their systems from different sources at once, but again, it was inconvenient for users to configure their systems this way.
To make that easier, a group of us got talking to Solaris Release Engineering, and agreed to start assembling images nightly, with the very latest software from each engineering group – and thus “Accelerated Nightly” images came about. The terminology there is a little odd – we called it “accelerated” because Solaris RE already had “nightly” images, which they’d build for a few days leading up to their biweekly milestone build.
How does that help developer productivity? Well, in the ON consolidation, we created a Jenkins job to take these nightly images, install them on kernel zones and attempt to build our source tree. The most common breakage this found came from the Userland Consolidation, the engineering group which packages FOSS software for inclusion in Solaris: an upgrade to a component that the ON build depended on would cause the build to fail.
When we spotted breakage like this, we’d be able to quickly fix the build, integrate the fix (making sure that it also builds on non-accelerated-nightly systems) and send a heads-up mail to developers. Since most engineers would be building on machines running the most recent biweekly milestone build, this allowed us to document the potential future breakage, and the build would remain stable.
I know this doesn’t sound like rocket science, but it’s sometimes the simple things that make the most difference to a developer’s day – knowing that the build should be stable and compile cleanly all the time lets them concentrate on their changes, rather than worrying about what might go wrong the next time their build machine gets upgraded.
The accelerated nightly builds also contribute to the “Running the bits you build, and building the bits you run” philosophy, and several engineers (myself included!) update their desktops daily to these bits, and enjoy playing with newly developed software without having to wait a whole two weeks for the next biweekly milestone to be released.