A few years before my appointment to the tech lead job, several of us in the org had been complaining about the tools used to build the ON source tree.
During his time as ON gatekeeper, James McPherson had become frustrated about the Makefiles used in the build, which were gradually growing out of control. New Makefiles would get created by copy/pasting older ones, and as a result, errors in poorly written Makefiles would propagate across the source tree, which was obviously a problem.
The ON build at the time also used to deposit built objects within the source tree, rather than to a separate build directory.
While that was convenient for developers, it meant some weird Makefile practices were being used to allow concurrent builds on NFS-mounted workspaces (to prevent x86 and sparc builds from writing to the same filesystem location) and any generated sources (from code generators such as lex/yacc) could accidentally be committed to the repository. So, another goal of the project was to change the build so that it wouldn’t attempt to modify the source tree.
Along with that, we had some pretty creaky shell scripts which drove the build (primarily nightly.sh), producing a single giant log file.
Worse still, the build was completely monolithic – if one phase of the build failed, you had to restart the entire thing. Also, we were running out of alphabet for the command line flags – seeing this getopts argument made my heart sink:
Something had to be done.
So, “Project Lullaby” was formed – to put “nightly.sh” to sleep!
James and Mark Nelson started work on the Makefiles, and I set to work writing build(1).
My targets were nightly.sh, and its configuration file, often called developer.sh, along with a script to produce an interactive build environment, bldenv.sh.
We chose Python as an implementation language, deciding that while we probably could have written another set of shell scripts, a higher-level language was likely required, and that turned out to be a great decision.
This work was begun as essentially a port of nightly.sh, but as the tool progressed, we found lots of ways to make the developer experience better.
First of all, the shell scripts which defined the build environment had to go. “developer.sh” essentially just set a series of UNIX environment variables, but it didn’t do anything to attempt to clean up the existing environment – this meant that two builds of the same source tree by different engineers could produce different results, and we ran into some nasty bugs that way.
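To make the clean-up idea concrete, here is a minimal Python sketch of building a child environment from an allowlist rather than inheriting everything from the caller. It is not the build(1) code: the names `PASSTHROUGH` and `sanitized_env`, and the particular variables listed, are assumptions chosen for illustration.

```python
import os

# Hypothetical allowlist: variables assumed harmless to inherit from the caller.
PASSTHROUGH = ("HOME", "LOGNAME", "TERM", "DISPLAY")


def sanitized_env(config_vars, caller_env=None, passthrough=PASSTHROUGH):
    """Return a fresh environment: allowlisted caller variables plus config values."""
    caller_env = os.environ if caller_env is None else caller_env
    # Start from nothing and copy across only the allowlisted variables.
    env = {name: caller_env[name] for name in passthrough if name in caller_env}
    # Values from the build configuration always win over inherited ones.
    env.update(config_vars)
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run`, so that two engineers with very different login environments end up running the same build.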
Not being able to easily audit the contents of the developer.sh script was also bad: since the configuration file was essentially executable code, it wasn’t possible to determine what the exact build environment would be without executing it, and that meant that it was difficult to determine exactly what sort of build would be produced by a given configuration file.
We replaced developer.sh with a Python ConfigParser file, bldrc, and made build(1) responsible for generating the config file. This meant that the version of build(1) in the workspace could always produce its own config file, so we’d never have mismatched tools, where we’d build the workspace with the wrong version of the tools.
Since all bldrc files would be generated the same way, it was easy to compare two files to see how the builds they would produce would differ, and easy to determine whether a build was ready to integrate (that is, all build checks had been run, and the build was clean).
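As a rough sketch of why tool-generated, declarative config files are easy to compare, the following uses Python's `configparser` to generate a file from defaults and to report the settings on which two files differ. The section names, keys and default values are invented for the example; the real bldrc contents are not shown in this post.

```python
import configparser
import io

# Hypothetical defaults; the real bldrc sections and keys are not shown in the post.
DEFAULTS = {
    "build": {"arch": "i386", "debug": "true"},
    "checks": {"lint": "true", "parfait": "false"},
}


def generate(overrides=None):
    """Build a ConfigParser from DEFAULTS, applying 'section.key' overrides."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    for dotted, value in (overrides or {}).items():
        section, key = dotted.split(".", 1)
        parser.set(section, key, value)
    return parser


def dumps(parser):
    """Serialise a parser to the text that would be written to disk."""
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


def diff(a, b):
    """Return {(section, key): (value_in_a, value_in_b)} for settings that differ."""
    keys = {(s, k) for p in (a, b) for s in p.sections() for k in p[s]}
    changes = {}
    for section, key in sorted(keys):
        left = a.get(section, key, fallback=None)
        right = b.get(section, key, fallback=None)
        if left != right:
            changes[(section, key)] = (left, right)
    return changes
```

Because nothing in the file is executable, comparing two configurations is just a matter of comparing key/value pairs.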
Early on in the invocation of build(1) we would also verify that the build machine itself was valid: that the correct compilers were being used, that we had sufficient swap configured, etc. Of course we also have packages to pull in build-time dependencies that ought to be installed on all build machines, but a belt-and-braces approach resulted in fewer surprises – nothing’s worse than getting a few hours into a build only to discover that we’re using the wrong compiler!
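A fail-fast check of this kind can be sketched as a list of named checks that all run before any real work starts. This is only an illustration: the function names and the two example checks are assumptions, and a real list would also cover compiler versions, swap and so on.

```python
import shutil


def tool_available(name):
    """True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def preflight(checks):
    """Run (description, callable) pairs; return the descriptions that failed."""
    return [description for description, check in checks if not check()]


# Hypothetical checks, for illustration only.
CHECKS = [
    ("C compiler 'cc' is on PATH", lambda: tool_available("cc")),
    ("make tool 'dmake' is on PATH", lambda: tool_available("dmake")),
]
```

Calling `preflight(CHECKS)` at start-up and refusing to continue if the returned list is non-empty turns a failure several hours in into one that shows up within seconds.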
Furthermore, we made sure that we’d complain about config files with values we didn’t recognise, and also removed almost all comments from the generated file, instead implementing a build explain command to document what each variable did.
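One way to get both behaviours from a single source of truth is to keep a table mapping each known setting to its documentation, then use it both to reject unknown settings and to answer "explain" queries. The sketch below assumes such a table; `KNOWN`, its keys and the function names are hypothetical.

```python
import configparser

# Hypothetical schema: each known 'section.key' maps to its documentation.
KNOWN = {
    "build.arch": "Target architecture for the build.",
    "checks.lint": "Whether to run lint checks.",
}


def unknown_settings(parser, known=KNOWN):
    """Return the 'section.key' names in `parser` that are not in the schema."""
    return sorted(
        f"{section}.{key}"
        for section in parser.sections()
        for key in parser[section]
        if f"{section}.{key}" not in known
    )


def explain(name, known=KNOWN):
    """Return the documentation for one setting, or raise ValueError."""
    if name not in known:
        raise ValueError(f"unknown variable: {name}")
    return f"{name}: {known[name]}"
```

Keeping the documentation in the tool rather than in comments means it cannot drift out of date in old copies of the config file.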
Finally, we included a build regenerate command, to allow a new version of build to generate a new config file from any older one, allowing us to upgrade from older versions of the tool without necessarily needing to version the config file format itself.
For the interactive build environment, we wrote build shell (aliased to build -i), which produced exactly the same UNIX shell environment used by the rest of the build tool (before, nightly.sh and bldenv.sh could end up using different environments!) We made sure to properly sanitize the calling environment, passing through certain harmless, but important variables such as
Having taken care of the build environment and config files, most of the rest of build(1) defined a series of build ‘tasks’ – some of which are composite tasks, so “build nightly” does “build setup”, “build install”, “build check”, etc. (this was just using the Composite design pattern).
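For readers who haven't met the Composite pattern, here is a small Python sketch of how a composite task can be run exactly like a leaf task. The class names and the stop-at-first-failure policy are assumptions made for the example (the real tool also had non-fatal failures, which this ignores).

```python
class Task:
    """A leaf build step; run() returns True on success."""

    def __init__(self, name, action=None):
        self.name = name
        # Default action succeeds, which keeps the example self-contained.
        self.action = action or (lambda: True)

    def run(self, log):
        ok = bool(self.action())
        log.append((self.name, "pass" if ok else "fail"))
        return ok


class CompositeTask(Task):
    """A task made of sub-tasks, run in order and stopped at the first failure."""

    def __init__(self, name, children):
        super().__init__(name)
        self.children = list(children)

    def run(self, log):
        # all() short-circuits, so later children are skipped after a failure.
        ok = all(child.run(log) for child in self.children)
        log.append((self.name, "pass" if ok else "fail"))
        return ok
```

A caller can invoke `run()` on "nightly" or on "install" alone without caring which of them is a composite, which is what lets individual phases be re-run after a failure.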
Each build task writes its own log file, and we used Python’s logging framework to produce both plaintext and syntax-highlighted HTML log files, each with useful timestamps, and the latter with href anchors so you could easily point at specific build failures.
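As a sketch of how the standard `logging` module can write the same records to two differently formatted files, the example below attaches a plain-text handler and an HTML handler to a per-task logger. The HTML markup, the anchor naming scheme (`l1`, `l2`, ...) and the function names are illustrative assumptions, not the format build(1) actually produced.

```python
import html
import logging

PLAIN_FORMAT = "%(asctime)s %(levelname)s %(message)s"


class HTMLLineFormatter(logging.Formatter):
    """Format each record as an HTML line carrying a numbered anchor."""

    def __init__(self):
        super().__init__(PLAIN_FORMAT)
        self.count = 0

    def format(self, record):
        self.count += 1
        text = html.escape(super().format(record))
        # The CSS class is the level name, so a stylesheet can highlight errors.
        return (f'<a id="l{self.count}" href="#l{self.count}" '
                f'class="{record.levelname.lower()}">{text}</a><br>')


def task_logger(task, text_path, html_path):
    """Return a logger for one task that writes plain-text and HTML log files."""
    logger = logging.getLogger(f"build.{task}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    for old in list(logger.handlers):
        logger.removeHandler(old)
    plain = logging.FileHandler(text_path, mode="w")
    plain.setFormatter(logging.Formatter(PLAIN_FORMAT))
    fancy = logging.FileHandler(html_path, mode="w")
    fancy.setFormatter(HTMLLineFormatter())
    logger.addHandler(plain)
    logger.addHandler(fancy)
    return logger
```

With per-line anchors, a URL ending in something like `#l2` points a colleague straight at the failing line.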
To avoid overloading developers, we made sure that, with few exceptions, all build tasks took the same command line arguments, to reduce the amount of cognitive load on developers trying to learn how to build the source tree. Instead of adding arguments for slightly different flavours of a given command, we preferred to write a new build task (of course, using class-based inheritance under the hood).
Finally, we had a few “party tricks” that we were able to add in – build tasks which didn’t produce build artifacts, but instead provided useful features that improve ON developers’ lives – for example ‘build serve’ starts a simple ephemeral HTTP server pointing to the build logs in a workspace allowing you to share logs with other engineers who might be able to fix a problem you’re seeing.
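A log server along these lines needs very little code with Python's standard library. The sketch below is an assumption about the general shape, not the real ‘build serve’: it binds to port 0 so the operating system picks a free port, and defaults to localhost (you would bind to a routable address to share logs with colleagues).

```python
import functools
import http.server
import threading


def serve_logs(directory, host="127.0.0.1"):
    """Serve `directory` over HTTP on an OS-chosen port; return (server, url)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the OS for any free port, avoiding clashes between workspaces.
    server = http.server.ThreadingHTTPServer((host, 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    return server, f"http://{host}:{port}/"
```

Calling `server.shutdown()` followed by `server.server_close()` stops it again.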
Similarly, we have a ‘build pkgserve’ task, which starts up an IPS repository server allowing you to easily install test machines over HTTP with the artifacts from your build.
“build pid” returned the process ID of the build command itself, and since all dmake(1S) invocations were run within a Solaris project(5) we were able to install a signal handler such that stopping an entire build was as easy as:
$ kill -TERM `build pid`
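The general mechanism can be sketched in Python as a pid file plus a SIGTERM handler that takes the child processes down with it. Note the substitution: Solaris project(5) is not portable, so this example puts each child in its own POSIX process group instead, which is a different technique with a similar effect. The class and method names are hypothetical.

```python
import os
import signal
import subprocess


class BuildProcess:
    """Tracks child processes so that one SIGTERM can stop the whole build."""

    def __init__(self, pidfile):
        self.pidfile = pidfile
        self.children = []

    def write_pidfile(self):
        # A 'pid' subcommand could simply print the contents of this file.
        with open(self.pidfile, "w") as f:
            f.write(str(os.getpid()))

    def spawn(self, argv):
        # start_new_session makes the child the leader of a new process group,
        # so anything it forks can be signalled together with it.
        child = subprocess.Popen(argv, start_new_session=True)
        self.children.append(child)
        return child

    def stop(self):
        for child in self.children:
            if child.poll() is None:
                try:
                    os.killpg(child.pid, signal.SIGTERM)
                except ProcessLookupError:
                    pass  # the group exited between poll() and killpg()

    def install_handler(self):
        def on_term(signum, frame):
            self.stop()
            raise SystemExit(1)
        signal.signal(signal.SIGTERM, on_term)
```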
Finally, we added ZFS integration, such that before each build task was executed, we’d snapshot the workspace, allowing us to quickly roll back to the previous state of the workspace and fix any problems. This turned out not to be terribly useful by the time we’d shaken the bugs out of build(1) itself, but was incredibly helpful during its development.
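A thin wrapper over the `zfs snapshot` and `zfs rollback` commands is enough to sketch the idea. The snapshot naming scheme (`pre-<task>`), the function names and the injectable `run` parameter are assumptions made for the example; it also assumes the workspace lives on its own ZFS dataset.

```python
import subprocess


def snapshot_name(dataset, task):
    """Name of the snapshot taken before `task` runs (hypothetical scheme)."""
    return f"{dataset}@pre-{task}"


def snapshot_before(dataset, task, run=subprocess.run):
    """Snapshot the workspace dataset before a task starts."""
    name = snapshot_name(dataset, task)
    run(["zfs", "snapshot", name], check=True)
    return name


def rollback_to(dataset, task, run=subprocess.run):
    """Return the dataset to its state before `task` ran.

    The -r flag also destroys any snapshots newer than the one named.
    """
    name = snapshot_name(dataset, task)
    run(["zfs", "rollback", "-r", name], check=True)
    return name
```

Taking `run` as a parameter keeps the command construction testable on machines without ZFS.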
One more important artifact was the mail notification developers get when the build completes. We spent time improving its format so that it was easier to determine which part of the build failed, and excerpted relevant messages from the build logs so users could tell at a glance where the issue was.
Here’s a mail_msg sample:
    Build summary for server i386 nightly build of /builds/ongk/workspace/nightly.build.server

    Build status    : pass (RTI issues)
    Started         : 2017-08-09 01:02:34
    Completed       : 2017-08-09 03:37:57
    Total time      : 2:35:22
    Build config    : /builds/ongk/workspace/nightly.build.server/bldrc
    Build version   : nightly.build.server-228:2017-08-09
    Build pkgvers   : 126.96.36.199.0.3.37799
    . changeset     : 5897c3a8526b (server) tip
    usr/closed cs   : 61201571e908 tip
    usr/man cs      : 5725f2ff08b3 tip
    usr/fish cs     : [repository not present]
    Start timestamp : 1502265754.6

    Build machine details:
    Identity        : hopper (i386)
    WOS pkg         : pkg://firstname.lastname@example.org,5.11-188.8.131.52.0.1.0:20170724T150254Z
    core-os pkg     : pkg://email@example.com,5.11-184.108.40.206.0.2.37783:20170805T092540Z
    osnet pkg       : pkg://firstname.lastname@example.org,5.12-220.127.116.11.0.125.0:20170530T145839Z

    === Task summary ===
    server-tstamp         : pass
    server-check.rti_ready: fail (non-fatal)
    server-clobber        : pass
    server-tools          : pass
    server-setup          : pass
    server-install        : pass
    server-packages       : pass
    server-setup-nd       : pass
    server-install-nd     : pass
    server-packages-nd    : pass
    server-generate_tpl   : pass
    server-closed_tarball : pass
    server-pkgmerge       : pass
    server-check          : pass
    server-check.protocmp : pass
    server-check.elf      : pass
    server-check.ctf      : pass
    server-check.lint     : pass
    server-check.cstyle   : pass
    server-check.cores    : pass
    server-check.findunref: pass
    server-check.paths    : pass
    server-check.pmodes   : pass
    server-check.uncommitted: pass
    server-check.install-noise: fail (non-fatal)
    server-check.wsdiff   : pass
    server-check.lint-noise: pass
    server-update_parent  : pass
    check.parfait         : disabled

    === Task details ===
    --- server-check.rti_ready ---
    (Check that this build config can be submitted for RTI)
    Starting task 'server-check.rti_ready'
    bldrc file: /builds/ongk/workspace/nightly.build.server/bldrc
    check.parfait was disabled
    One or more bldrc file settings were found that suggest this build is not ready for RTI
    Finishing task 'server-check.rti_ready'
    'server-check.rti_ready' failed (non-fatal) and took 0:00:07.
At the time of writing, here are all of the build(1) tasks we implemented:
    timf@whero build help -v
    Usage: build [subcommand] [options]
           build -i [options] [commands]
           build --help [-v]

    Subcommands:

      all_tests             (a synonym for 'check.all_tests')
      archive               Archive build products
      check                 Run a series of checks on the source and proto trees
      check.all_tests       Run all tests known to the workspace
      check.cores           Look for core files dumped by build processes
      check.cstyle          Do cstyle and hdrchck across the source tree
      check.ctf             Check CTF data in the non-debug proto area
      check.elf             Run a series of checks on built ELF objects
      check.elfsigncmp      Determines whether elfsigncmp is used to sign binaries
      check.findunref       Find unused files in the source tree
      check.fish            Do checks across the fish subrepo
      check.install-noise   Looks for noise in the install.log file
      check.lint            Do a 'dmake lint' on the source tree
      check.lint-noise      Looks for noise in the check.lint.log file
      check.parfait         Run parfait analysis on a built workspace
      check.paths           Run checkpaths(1) on a built workspace
      check.pmodes          Run a pmodes check on a built workspace
      check.protocmp        Run protolist and protocmp on a built workspace
      check.rti_ready       Check that this build config can be submitted for RTI
      check.splice          Compare splice build repositories to baseline
      check.tests           Run tests for sources changed since the last build
      check.uncommitted     Look for untracked files in the workspace
      check.wsdiff          Run wsdiff(1) to compare this and the previous proto area
      clobber               Do a workspace clobber
      closed_tarball        Generates tarballs containing closed binaries
      cscope                (a synonym for 'xref')
      explain               Print documentation about any configuration variable
      fish                  Build only the fish subrepo
      fish.ai_iso           Build 'nas' Fish AI iso images only
      fish.conf             Verify mkak options
      fish.destroy_dc       Remove datasets tagged 'onbld:dataset' at/under 'dc_dataset'
      fish.gk_images        Build Fish images appropriate for gk builds
      fish.images           Build 'nas' Fish upgrade images only
      fish.install          Build Fish sources, writing to the fish proto area
      fish.jds_ai_iso       Build 'jds' Fish AI iso images only
      fish.jds_all_images   Build all 'jds' Fish images
      fish.jds_gk_images    Build 'jds' Fish images appropriate for gk builds
      fish.jds_images       Build 'jds' Fish upgrade images only
      fish.jds_txt_iso      Build 'jds' Fish text iso images only
      fish.nas_all_images   Build all 'nas' Fish images
      fish.packages         Build Fish IPS package archives
      fish.re_build         Runs AK image construction tasks for Release Engineering
      fish.save_artifacts   Save all build artifacts from the 'dc_dataset' directory
      fish.txt_iso          Build 'nas' Fish text iso images only
      generate              Produce a default bldrc configuration file
      generate_tpl          Generate THIRDPARTYLICENSE files
      help                  Print help text about one or all subcommands
      here                  Runs a 'dmake ...' in the current directory
      hgpull                Do a simple hg pull for all repositories in this workspace
      install               Build OS sources, writing to the proto area
      kerneltar             Create a tarball of the kernel from a proto area.
      nightly               Do a build, running several other subcommands
      packages              Publish packages to local pkg(7) repositories
      parfait_remind_db     Generate a database needed by the parfait_remind pbchk
      pid                   Print the PID of the build task executing for this workspace
      pkgdiff               Compare reference and resurfaced package repositories
      pkgmerge              Merge packages from one or more repositories
      pkgserve              Serve packages built from this workspace over HTTP
      pkgsurf               Resurface package repositories
      pull                  Do a hg pull and report new changesets/heads
      qnightly              Do a nightly build only if new hg changesets are available
      regenerate            Regenerate a bldrc using args stored in a given bldrc
      save_packages         Move packages to $PKGSURFARCHIVE as a pkgsurf reference
      serve                 Serve the contents of the log directory over HTTP
      setup                 Do a 'dmake setup', required for 'install' and 'here'
      test                  Runs tests matching test.d/*.cfg file or section names
      tools                 Do a 'dmake bldtools', for 'setup', 'install', and 'here'
      tstamp                Update a build timestamp file
      update_diag_db        Download a new copy of the stackdb diagnostic database
      update_diverge_db     Generate AK/Solaris divergence database
      update_parent         Update a parent ws with data/proto from this workspace
      xref                  Build xref databases for the workspace
I hope to talk about a few of these in more detail in future posts, but feel free to ask if you’ve any questions.
In the end, I’m quite proud of the work we did on Lullaby – the build is significantly easier to use, the results are easier to understand, and since the Lullaby project integrated in 2014, we’ve found it very simple to maintain and extend.
However, after we integrated, I have a feeling the folks looking for a new ON tech lead decided to give me a vested interest in continuing to work on it, and so, “Party Like a Tech Lead” began!