A few years before my appointment to the tech lead job, several of us in the org had been complaining about the tools used to build the ON source tree.

During his time as ON gatekeeper, James McPherson had become frustrated with the Makefiles used in the build, which were gradually growing out of control. New Makefiles would get created by copy/pasting older ones, so errors in poorly written Makefiles propagated across the source tree.

The ON build at the time also deposited built objects within the source tree, rather than in a separate build directory.

While that was convenient for developers, it meant some weird Makefile practices were needed to allow concurrent builds on NFS-mounted workspaces (to stop x86 and SPARC builds from writing to the same filesystem location), and sources generated by tools such as lex/yacc could accidentally be committed to the repository. So another goal of the project was to change the build so that it would never attempt to modify the source tree.

Along with that, we had some pretty creaky shell scripts driving the build (primarily nightly.sh), which produced a single giant log file.

Worse still, the build was completely monolithic – if one phase of the build failed, you had to restart the entire thing. We were also running out of alphabet for the command-line flags – seeing this getopts argument made my heart sink:

+ABCcDdFfGIilMmNnOPpRrS:TtUuWwxz

Something had to be done.

So, “Project Lullaby” was formed – to put “nightly.sh” to sleep!

James and Mark Nelson started work on the Makefiles, and I set to work writing build(1).

My targets were nightly.sh and its configuration file (often called developer.sh), along with bldenv.sh, the script that produced an interactive build environment.

We chose Python as the implementation language: while we probably could have written another set of shell scripts, we decided a higher-level language was required, and that turned out to be a great decision.

This work began as essentially a port of nightly.sh, but as the tool progressed, we found lots of ways to make the developer experience better.

First of all, the shell scripts which defined the build environment had to go. “developer.sh” essentially just set a series of UNIX environment variables, but it did nothing to clean up the existing environment – this meant that two builds of the same source tree by different engineers could produce different results, and we ran into some nasty bugs that way.

Not being able to easily audit the contents of the developer.sh script was also bad: since the configuration file was essentially executable code, there was no way to determine the exact build environment without executing it, which made it difficult to know what sort of build a given configuration file would produce.

I replaced developer.sh with a Python ConfigParser file, bldrc, and made build(1) responsible for generating it. This meant the version of build(1) in the workspace could always produce its own config file, so we would never build the workspace with a mismatched version of the tools.
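
The shape of the thing is roughly this (a minimal sketch using Python 3’s configparser; the section and key names are hypothetical, not build(1)’s actual schema):

import configparser

def generate_bldrc(path, workspace, build_type="nightly"):
    config = configparser.ConfigParser()
    config["build"] = {
        "workspace": workspace,
        "build_type": build_type,
        "tool_version": "228",  # stamped by the build(1) that wrote the file
    }
    with open(path, "w") as f:
        config.write(f)

generate_bldrc("bldrc", "/builds/ongk/workspace")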

Since all bldrc files were generated the same way, it was easy to compare two files to see how the builds they would produce would differ, and easy to determine whether a build was ready to integrate (that is, all build checks had been run, and the build was clean).

Early in the invocation of build(1), we would also verify that the build machine itself was valid: that the correct compilers were being used, that sufficient swap was configured, and so on. Of course we also have packages to pull in build-time dependencies that ought to be installed on all build machines, but a belt-and-braces approach resulted in fewer surprises – nothing’s worse than getting a few hours into a build only to discover that we’re using the wrong compiler!
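
To give a flavour of those pre-flight checks, here is a sketch (the compiler path and the swap threshold are made up for illustration, and this is not build(1)’s actual code):

import shutil
import subprocess
import sys

def check_build_machine():
    # Fail fast if 'cc' on $PATH isn't the compiler we expect
    # (this expected path is hypothetical).
    cc = shutil.which("cc")
    if cc != "/opt/solarisstudio/bin/cc":
        sys.exit("unexpected compiler: %s" % cc)
    # On Solaris, 'swap -s' reports configured swap; insist on a minimum.
    out = subprocess.check_output(["/usr/sbin/swap", "-s"], text=True)
    avail_kb = int(out.split()[-2].rstrip("k"))
    if avail_kb < 8 * 1024 * 1024:  # require 8GB available, say
        sys.exit("insufficient swap configured")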

Furthermore, we made sure we’d complain about config files containing values we didn’t recognise, and removed almost all comments from the generated file, instead implementing a ‘build explain’ command to document what each variable did.
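
Keeping the documentation in the tool rather than in the file might look something like this (a sketch; the keys and their descriptions are invented):

KNOWN_KEYS = {
    "workspace":  "Path to the workspace this bldrc builds.",
    "build_type": "One of 'nightly' or 'incremental'.",
}

def validate(config):
    # Complain loudly about settings we don't recognise.
    for key in config["build"]:
        if key not in KNOWN_KEYS:
            raise ValueError("unknown bldrc setting: %s" % key)

def explain(key):
    # Roughly what 'build explain' would print for a given variable.
    print(KNOWN_KEYS.get(key, "no such setting: %s" % key))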

Finally, we included a ‘build regenerate’ command, allowing a new version of build to generate a fresh config file from any older one, so we could upgrade from older versions of the tool without necessarily needing to version the config file format itself.

For the interactive build environment, we wrote ‘build shell’ (aliased to build -i), which produced exactly the same UNIX shell environment used by the rest of the build tool (before, nightly.sh and bldenv.sh could end up using different environments!). We made sure to properly sanitize the calling environment, passing through only certain harmless but important variables such as $DISPLAY, $HOME, $PWD, $SSH_AGENT_*, etc.
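
The sanitizing boils down to building the environment from scratch with an allowlist, rather than trying to scrub a polluted caller environment. Something like this sketch:

import fnmatch
import os

PASSTHROUGH = ("DISPLAY", "HOME", "PWD", "SSH_AGENT_*")

def clean_env():
    # Start from nothing and copy through only the allowlist.
    env = {}
    for name, value in os.environ.items():
        if any(fnmatch.fnmatch(name, pat) for pat in PASSTHROUGH):
            env[name] = value
    # build(1) then layers the canonical build variables on top.
    return env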

Having taken care of the build environment and config files, most of the rest of build(1) defined a series of build ‘tasks’, some of which were composite: “build nightly” runs “build setup”, “build install”, “build check”, and so on (this was just the Composite design pattern).
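
In sketch form (the class names here are hypothetical, not build(1)’s actual internals):

class Task:
    name = "task"
    def run(self):
        raise NotImplementedError

class InstallTask(Task):
    name = "install"
    def run(self):
        print("running dmake install ...")

class CompositeTask(Task):
    def __init__(self, name, subtasks):
        self.name = name
        self.subtasks = subtasks
    def run(self):
        # 'build nightly' simply runs its children in order.
        for task in self.subtasks:
            task.run()

nightly = CompositeTask("nightly", [InstallTask()])  # plus setup, check, ...
nightly.run()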

Each build task writes its own log file, and we used Python’s logging framework to produce both plaintext and syntax-highlighted HTML log files, each with useful timestamps, and the latter with href anchors so you could easily point at specific build failures.
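
The dual log files fall out of the logging framework naturally: one logger, two handlers. Here is a sketch, with a bare-bones stand-in for the HTML formatter:

import html
import logging

class HTMLFormatter(logging.Formatter):
    # Escape each message and give every line an anchor that an
    # href can point at (the real formatter also syntax-highlights).
    def __init__(self, fmt):
        super().__init__(fmt)
        self.line = 0
    def format(self, record):
        self.line += 1
        text = html.escape(super().format(record))
        return '<a id="l%d"></a>%s<br/>' % (self.line, text)

log = logging.getLogger("server-install")
log.setLevel(logging.INFO)

plain = logging.FileHandler("install.log")
plain.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
log.addHandler(plain)

fancy = logging.FileHandler("install.html")
fancy.setFormatter(HTMLFormatter("%(asctime)s %(message)s"))
log.addHandler(fancy)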

To avoid overloading developers, we made sure that, with few exceptions, all build tasks took the same command-line arguments, reducing the cognitive load on developers learning to build the source tree. Instead of adding arguments for slightly different flavours of a given command, we preferred to write a new build task (using class-based inheritance under the hood).
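
Continuing the hypothetical Task sketch from above, a new flavour is just a subclass that overrides what differs:

class CheckTask(Task):
    name = "check"
    def run(self):
        print("running %s" % self.name)

class RtiReadyCheck(CheckTask):
    # 'build check.rti_ready' takes the same arguments and shares the
    # same machinery as 'build check'; only the behaviour differs.
    name = "check.rti_ready"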

Finally, we had a few “party tricks” we were able to add in – build tasks which don’t produce build artifacts, but instead provide useful features that improve ON developers’ lives. For example, ‘build serve’ starts a simple ephemeral HTTP server pointing at the build logs in a workspace, allowing you to share logs with other engineers who might be able to fix a problem you’re seeing.
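
A sketch of the idea using a modern Python standard library (the log directory and port are made up):

import functools
import http.server

log_dir = "/builds/ongk/workspace/log"  # illustrative path
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=log_dir)
with http.server.ThreadingHTTPServer(("", 8000), handler) as server:
    print("sharing logs on http://%s:%d/" % server.server_address)
    server.serve_forever()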

Similarly, we have a ‘build pkgserve’ task, which starts up an IPS repository server, allowing you to easily install test machines over HTTP with the artifacts from your build.
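
The shape is familiar: point an IPS depot server at the workspace’s package repository. A sketch (the repository path and port are illustrative, and this isn’t necessarily how build(1) does it):

import subprocess

# pkg.depotd is the stock IPS depot server: -d names the repository
# directory to serve and -p the port to listen on.
subprocess.run([
    "/usr/lib/pkg.depotd",
    "-d", "/builds/ongk/workspace/packages/i386/nightly",
    "-p", "10000",
], check=True)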

“build pid” returned the process ID of the build command itself, and since all dmake(1S) invocations were run within a Solaris project(5), we were able to install a signal handler such that stopping an entire build was as easy as:

$ kill -TERM `build pid`
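
The mechanics are simple enough to sketch: record the PID somewhere ‘build pid’ can find it, and install a handler that tears everything down. (The real build(1) signalled every process in the Solaris project; the process group below is a simplification, and the pid file name is made up.)

import os
import signal
import sys

def on_sigterm(signum, frame):
    # Ignore further SIGTERMs, then signal everything we started.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    os.killpg(os.getpgid(0), signal.SIGTERM)
    sys.exit(1)

with open("build.pid", "w") as f:  # what 'build pid' would read back
    f.write(str(os.getpid()))
signal.signal(signal.SIGTERM, on_sigterm)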

Finally, we added ZFS integration: before each build task was executed, we’d snapshot the workspace, allowing us to quickly roll back to the previous state of the workspace and fix any problems. This turned out not to be terribly useful by the time we’d shaken the bugs out of build(1) itself, but was incredibly helpful during its development.
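
The ZFS side needs nothing more exotic than shelling out (the dataset name here is made up):

import subprocess

def snapshot(dataset, task):
    # e.g. 'zfs snapshot builds/ongk/workspace@pre-install'
    snap = "%s@pre-%s" % (dataset, task)
    subprocess.run(["/usr/sbin/zfs", "snapshot", snap], check=True)
    return snap

def rollback(snap):
    # -r also destroys any snapshots taken after this one.
    subprocess.run(["/usr/sbin/zfs", "rollback", "-r", snap], check=True)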

One more important artifact was the mail notification developers get when the build completes. We spent time improving its format so that it was easier to determine which part of the build failed, and excerpted relevant messages from the build logs so users could tell at a glance where the issue was.

Here’s a mail_msg sample:

Build summary for server i386 nightly build of /builds/ongk/workspace/nightly.build.server

Build status    : pass (RTI issues)
Started         : 2017-08-09 01:02:34
Completed       : 2017-08-09 03:37:57
Total time      : 2:35:22
Build config    : /builds/ongk/workspace/nightly.build.server/bldrc
Build version   : nightly.build.server-228:2017-08-09
Build pkgvers   : 11.4.0.0.0.3.37799

. changeset     : 5897c3a8526b (server) tip
usr/closed cs   : 61201571e908 tip
usr/man cs      : 5725f2ff08b3 tip
usr/fish cs     : [repository not present]

Start timestamp : 1502265754.6

Build machine details:
Identity        : hopper (i386)
WOS pkg         : pkg://solaris/entire@11.4,5.11-11.4.0.0.0.1.0:20170724T150254Z
core-os pkg     : pkg://nightly/system/core-os@11.4,5.11-11.4.0.0.0.2.37783:20170805T092540Z
osnet pkg       : pkg://solaris/developer/opensolaris/osnet@5.12,5.12-5.12.0.0.0.125.0:20170530T145839Z


=== Task summary ===
server-tstamp         : pass
server-check.rti_ready: fail (non-fatal)
server-clobber        : pass
server-tools          : pass
server-setup          : pass
server-install        : pass
server-packages       : pass
server-setup-nd       : pass
server-install-nd     : pass
server-packages-nd    : pass
server-generate_tpl   : pass
server-closed_tarball : pass
server-pkgmerge       : pass
server-check          : pass
server-check.protocmp : pass
server-check.elf      : pass
server-check.ctf      : pass
server-check.lint     : pass
server-check.cstyle   : pass
server-check.cores    : pass
server-check.findunref: pass
server-check.paths    : pass
server-check.pmodes   : pass
server-check.uncommitted: pass
server-check.install-noise: fail (non-fatal)
server-check.wsdiff   : pass
server-check.lint-noise: pass
server-update_parent  : pass

check.parfait         : disabled


=== Task details ===
--- server-check.rti_ready ---
(Check that this build config can be submitted for RTI)

Starting task 'server-check.rti_ready'
bldrc file: /builds/ongk/workspace/nightly.build.server/bldrc
check.parfait was disabled
One or more bldrc file settings were found that suggest this build is not ready for RTI
Finishing task 'server-check.rti_ready'
'server-check.rti_ready' failed (non-fatal) and took 0:00:07.

At the time of writing, here are all of the build(1) tasks we implemented:

timf@whero[123] build help -v
Usage: build [subcommand] [options]
       build -i [options] [commands]
       build --help [-v]

Subcommands:

all_tests           (a synonym for 'check.all_tests')
archive             Archive build products
check               Run a series of checks on the source and proto trees
check.all_tests     Run all tests known to the workspace
check.cores         Look for core files dumped by build processes
check.cstyle        Do cstyle and hdrchck across the source tree
check.ctf           Check CTF data in the non-debug proto area
check.elf           Run a series of checks on built ELF objects
check.elfsigncmp    Determines whether elfsigncmp is used to sign binaries
check.findunref     Find unused files in the source tree
check.fish          Do checks across the fish subrepo
check.install-noise Looks for noise in the install.log file
check.lint          Do a 'dmake lint' on the source tree
check.lint-noise    Looks for noise in the check.lint.log file
check.parfait       Run parfait analysis on a built workspace
check.paths         Run checkpaths(1) on a built workspace
check.pmodes        Run a pmodes check on a built workspace
check.protocmp      Run protolist and protocmp on a built workspace
check.rti_ready     Check that this build config can be submitted for RTI
check.splice        Compare splice build repositories to baseline
check.tests         Run tests for sources changed since the last build
check.uncommitted   Look for untracked files in the workspace
check.wsdiff        Run wsdiff(1) to compare this and the previous proto area
clobber             Do a workspace clobber
closed_tarball      Generates tarballs containing closed binaries
cscope              (a synonym for 'xref')
explain             Print documentation about any configuration variable
fish                Build only the fish subrepo
fish.ai_iso         Build 'nas' Fish AI iso images only
fish.conf           Verify mkak options
fish.destroy_dc     Remove datasets tagged 'onbld:dataset' at/under 'dc_dataset'
fish.gk_images      Build Fish images appropriate for gk builds
fish.images         Build 'nas' Fish upgrade images only
fish.install        Build Fish sources, writing to the fish proto area
fish.jds_ai_iso     Build 'jds' Fish AI iso images only
fish.jds_all_images Build all 'jds' Fish images
fish.jds_gk_images  Build 'jds' Fish images appropriate for gk builds
fish.jds_images     Build 'jds' Fish upgrade images only
fish.jds_txt_iso    Build 'jds' Fish text iso images only
fish.nas_all_images Build all 'nas' Fish images
fish.packages       Build Fish IPS package archives
fish.re_build       Runs AK image construction tasks for Release Engineering
fish.save_artifacts Save all build artifacts from the 'dc_dataset' directory
fish.txt_iso        Build 'nas' Fish text iso images only
generate            Produce a default bldrc configuration file
generate_tpl        Generate THIRDPARTYLICENSE files
help                Print help text about one or all subcommands
here                Runs a 'dmake  ...' in the current directory
hgpull              Do a simple hg pull for all repositories in this workspace
install             Build OS sources, writing to the proto area
kerneltar           Create a tarball of the kernel from a proto area.
nightly             Do a build, running several other subcommands
packages            Publish packages to local pkg(7) repositories
parfait_remind_db   Generate a database needed by the parfait_remind pbchk
pid                 Print the PID of the build task executing for this workspace
pkgdiff             Compare reference and resurfaced package repositories
pkgmerge            Merge packages from one or more repositories
pkgserve            Serve packages built from this workspace over HTTP
pkgsurf             Resurface package repositories
pull                Do a hg pull and report new changesets/heads
qnightly            Do a nightly build only if new hg changesets are available
regenerate          Regenerate a bldrc using args stored in a given bldrc
save_packages       Move packages to $PKGSURFARCHIVE as a pkgsurf reference
serve               Serve the contents of the log directory over HTTP
setup               Do a 'dmake setup', required for 'install' and 'here'
test                Runs tests matching test.d/*.cfg file or section names
tools               Do a 'dmake bldtools', for 'setup', 'install', and 'here'
tstamp              Update a build timestamp file
update_diag_db      Download a new copy of the stackdb diagnostic database
update_diverge_db   Generate AK/Solaris divergence database
update_parent       Update a parent ws with data/proto from this workspace
xref                Build xref databases for the workspace

I hope to talk about a few of these in more detail in future posts, but feel free to ask if you’ve any questions.

In the end, I’m quite proud of the work we did on Lullaby – the build is significantly easier to use, the results are easier to understand, and since the Lullaby project integrated in 2014, we’ve found it very simple to maintain and extend.

However, after we integrated, I have a feeling the folks looking for a new ON tech lead decided to give me a vested interest in continuing to work on it, and so “Party Like a Tech Lead” began!
