This is the last of my posts documenting some of the tips and tricks I found useful during my time as Solaris ON technical lead, which I hope might help others thinking of taking up similar positions.

I’d originally planned to talk about project reviews in this post, but since I ended up covering that last time, I’m instead going to talk about a different challenge that you’ll face when doing this sort of job.

My transition out of this role took a little longer than expected: the first engineer we’d intended as my replacement got lured away to a different part of the company. However, one of the questions he asked, while I was explaining the sources of information he’d need to keep track of, was:

“How on earth do you keep up with all of this stuff?”

Here are some of the sources I was talking about:

  • every commit to the source tree, and a rough idea of its content
  • the above, except also for the Userland and IPS consolidations
  • every bug filed
  • every bug that lands on the Release Criteria list
  • every bug that might be considered a Release Stopper bug
  • every new ARC (Architecture Review Committee) case filed, and how the discussion is going
  • every IRC conversation about the gate and problems we discover in Solaris
  • every test report
  • every mailing list post (ON, Userland and IPS-related, along with a host of other Solaris-related lists)
  • every mail addressed specifically to the gate staff
  • every project being submitted for review
  • every build failure

This amount of information is clearly a recipe for insanity, and perhaps one of the reasons why we try to rotate the tech lead position every 18 months or so, but there has to be a way of dealing with it all while keeping your head.

I’ll talk about some of the techniques I used a little later, but first, I want to explain why all of that information is useful.

These sources helped me maintain a general sense of project awareness, so that if someone was talking about a bug they’d just encountered, I’d be able to recall whether that subsystem had changed recently, or whether that problem sounded similar to a bug that had recently been filed. If people were seeing build failures, I’d hopefully be able to recognise them from the ones I’d seen before.

Gathering and keeping track of all that information often helped answer questions that triggered my “that’s funny” response – a “spidey sense” for the gate, if you like. Even if the problem being described didn’t exactly match one I’d seen before, just getting an inkling that a problem sounded similar to another was often a thread we could pull to unravel the cause of the mystery!

I felt that one of the roles of a tech lead is to act as a sort of human information clearing house. I wasn’t expected to know the nitty-gritty details of every subsystem in Solaris, but I was expected to know enough about what each one did, the impact of it breaking, and who to call if that happened.

[ Aside: I am not claiming to be a “Full Stack Engineer” – honestly, the concept seems ludicrous. Yes, I’m sure I could write a device driver given time to learn. I think I could write a half-decent web UI if needed, and sure, I could maybe debug kernel problems or resolve weird Django problems, but I wouldn’t be the best person to do any of those things: humans specialise for a reason. ]

So, with so many sources of information, how do you keep track of them all?

Read

There’s no escaping it – you need to spend a lot of time reading email. As much as I agree with my colleague Bart Smaalders, who maintains that “You will contribute more with mercurial than with thunderbird” – that spending your time arguing on mailing lists eventually becomes counter-productive without code to back it up – reading email is still a necessary evil.

Everyone has their own way of filtering mailing lists – mine was to sort roughly by category (not mailing list), using Thunderbird tags to assign each category a colour and to move mails into subfolders for searching later. I also had Thunderbird mark deleted mails with a strikethrough in my inbox, rather than simply hiding them until the next time I compacted my mail, and each message’s Subject: line would retain the colour I’d assigned to it.

This meant that on a typical morning I could tell at a glance which categories were active, based on the rainbow hues of my inbox. A mostly blue day meant a lot of bug updates, red days were security-related issues, orange meant lots of gatekeeper@ activity, and so on. Most days, had you mixed the colours together, would no doubt have been a muddy brown!

Next, putback mails: these were normally just the standard ‘hg notify’ messages, but we also had a separate mailing list that included the full diffs of each commit – I subscribed to that, and used to scan each commit as it arrived.

I’ll talk a little more about a tool I wrote to keep a sense of gate awareness later, but despite that, having mail notifications for each putback and convenient access to its content was still incredibly useful.

Bugs

My predecessor, Artem Kachitchkine, wrote a wonderful daily-bugs web page, which I’d check each morning to see what issues were being filed, concentrating on higher priority bugs, or those with interesting synopses. I’d keep in mind what recent changes had integrated into the source tree and, for severe problems that were likely to impede testing, I’d often get in touch with the submitter or responsible engineer straight away.

This web UI allowed you to quickly page back to previous days, and that chronological view of bugs was really useful (e.g. “Oh, that bug was filed before this particular change integrated”) without having to formulate bug database queries.

Several years ago, Solaris transitioned from the old Sun “bugster” database to Oracle’s internal bug database. This made us happy and sad at the same time. The downside was that the web UI was (and still is) utterly horrendous; the really big upside was that everyone finally had raw SQL access to the database, something that had been available to only a select few at Sun.

So, wanting to avoid that terrible UI, I wrote a simple mod_wsgi-based REST-like bug browser, allowing me to form simple queries of the form:

http://<server>/report/<category>/<subcategory>

to browse bug categories or view open bugs against a specific subcategory. I added a count=n parameter to limit the number of results returned, an rfes=true|false parameter to optionally hide Enhancement requests (a separate class of bug report), and a special “/all” URI component to show all bugs in that category.

I added a “/tags” URI component that would ‘AND’ subsequent URI components to show bugs with specific tags, e.g.

http://<server>/report/utility/filesystem/tags/slp64

would show bugs against filesystem-related utilities that were marked as being interesting for 64-bit conversion.
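
To give a concrete idea of the shape of it, here’s a rough sketch of the kind of WSGI routing involved. This is not the real tool: the table and column names (bug, bug_tags, bugid, synopsis, cat, subcat, bug_type, tag) and the database path are all invented for illustration.

# Hypothetical sketch of the REST-like bug browser -- not the real tool.
# Table/column names and the database path are invented for illustration.
import json
import sqlite3
from urllib.parse import parse_qs

DB = "/var/tmp/bugdb.sqlite"            # assumed local snapshot of the bug data

def application(environ, start_response):
    # e.g. a request for /report/utility/filesystem/tags/slp64?count=20&rfes=false
    parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    count = int(qs.get("count", ["50"])[0])
    show_rfes = qs.get("rfes", ["true"])[0] == "true"

    sql = "SELECT bugid, synopsis FROM bug WHERE cat = ?"
    args = [parts[1]]                        # parts[0] is "report"
    if len(parts) > 2 and parts[2] not in ("all", "tags"):
        sql += " AND subcat = ?"             # "/all" shows every subcategory
        args.append(parts[2])
    if not show_rfes:
        sql += " AND bug_type != 'RFE'"      # hide enhancement requests
    if "tags" in parts:                      # AND together any trailing tag components
        for tag in parts[parts.index("tags") + 1:]:
            sql += " AND bugid IN (SELECT bugid FROM bug_tags WHERE tag = ?)"
            args.append(tag)
    sql += " ORDER BY bugid DESC LIMIT ?"
    args.append(count)

    rows = sqlite3.connect(DB).execute(sql, args).fetchall()
    body = json.dumps(rows).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]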

Very often, simply being able to quickly find the most recent bugs filed against a specific subsystem was terribly useful, and in cases where a bug could feasibly be filed against several different areas, having fast access to all of them was great.

Chat logs

Keeping local logs of the main ON chatroom was especially useful to me as a non-US developer. Each morning, I’d catch up on the previous night’s conversation to see what problems people were reporting, and would often start investigating or sending mails to see if we could make timezones work to our advantage and have the problem solved before the US folks woke up.

Having those logs in plaintext meant that I could grep for cases where I thought we’d discussed similar problems, possibly years before, and quickly find answers.

Spelunk

To help with my gate awareness, I wrote a tool that let me quickly cross-reference source tree commits with the bug database, and track each commit’s impact on the “proto area” (the set of files and binaries from a build that made up the packages Solaris was constructed from).

In a not-quite textbook example of “naming things is hard”, I unfortunately called this utility “spelunk”, trying to evoke the sense of source-code archaeology I was aiming for (but no, absolutely nothing to do with splunk – sorry).

I mentioned before that everyone had SQL access to the bug database, so this was just a case of writing a small SQLite-based application in Python to suck in the metadata from Mercurial changesets pushed to the gate, together with the results of a very, very debug-oriented compilation of the source tree (so that each compiled object contained DWARF data recording the source files it had been compiled from), along with a few other heuristics to map files from the proto area back to the source tree.
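
As a flavour of what that ingestion looked like, here’s a much-simplified sketch rather than the real tool: the schema just mirrors the tables used in the example queries later in this post, the hg template and the “bugid at the start of the putback comment” convention are assumptions, and the steps that filled in bug categories (from the bug database) and the proto-area mapping (from the DWARF data) are reduced to comments.

# A much-simplified sketch of the spelunk ingestion step -- not the real tool.
# The schema mirrors the tables used in the example queries below; filling in
# cat/subcat (from the bug database) and proto_source (from the DWARF data in
# a debug build) is reduced to comments.
import sqlite3
import subprocess

SCHEMA = """
CREATE TABLE IF NOT EXISTS bug          (bugid TEXT, synopsis TEXT, cat TEXT, subcat TEXT,
                                         author TEXT, date INTEGER, changeset TEXT);
CREATE TABLE IF NOT EXISTS bug_source   (bugid TEXT, path TEXT);
CREATE TABLE IF NOT EXISTS proto_source (proto_path TEXT, source_path TEXT);
CREATE TABLE IF NOT EXISTS tags         (name TEXT, date INTEGER);
"""

def ingest(repo, db_path="spelunk.db"):
    db = sqlite3.connect(db_path)
    db.executescript(SCHEMA)

    # One line per changeset: node, author, unix timestamp, files touched,
    # and the first line of the putback comment, separated by tabs.
    template = ("{node}\t{author|person}\t{date|hgdate}\t"
                "{join(files, ',')}\t{desc|firstline}\n")
    log = subprocess.run(["hg", "-R", repo, "log", "--template", template],
                         capture_output=True, text=True, check=True).stdout

    for line in log.splitlines():
        node, author, hgdate, files, firstline = line.split("\t", 4)
        # Assumes putback comments start with "<bugid> <synopsis>".
        bugid, _, synopsis = firstline.partition(" ")
        db.execute("INSERT INTO bug VALUES (?, ?, NULL, NULL, ?, ?, ?)",
                   (bugid, synopsis, author, int(hgdate.split()[0]), node))
        db.executemany("INSERT INTO bug_source VALUES (?, ?)",
                       [(bugid, path) for path in files.split(",") if path])
        # ...then look up cat/subcat for each bugid in the bug database, and
        # map proto-area files to sources via the DWARF data, to fill in the
        # remaining columns and the proto_source table.
    db.commit()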

Once that was done, I just needed a Jenkins job to update the database periodically and we were in business!

I then wrote a simple shell-script front end to the database that let me execute pre-canned SQL queries (looked up from a set of files in a shared “query.d/” directory), ending up with a useful CLI that looked like this:

$ spelunk --help
usage: spelunk [options] [command]

options:

	-d|--database	Show when the spelunk database was last updated
	-h|--help       Show help
	-l|--list       Show known subcommands
	-p|--plain      Use simple text columns
	-s|--sqlite     Open an sqlite3 shell
	-v|--verbose    Verbose (show SQL)
	-w|--web        Format the output as html

$ spelunk

spelunk: no command listed. Use -h for help.

Commands:

NAME              DESCRIPTION
----              -----------
2day              putbacks from the last two days 
2dayproto         proto files changed in the last two days
2daysources       sources changed in the last two days
blame             who to blame for this source file (needs arg1)
bug               show this bugid (needs arg1)
build             fixes made in a specific build (needs arg1, a gate tag)
changeset         Show the bugs, sources and proto effects for this changeset (needs arg1)
lastbuild         fixes made in the build we just closed
proto             the proto files associated with this source file (needs arg1, a source file)
protobug          bugs and cat/subcat for this proto file (needs arg1, a proto file)
protochangeset    what this changeset does to the proto area
recentfilechanges recent changes to a source file pattern
source            the sources associated with this proto file (needs arg1, a proto file)
sourcebugs        bugs for this source file (needs arg1)
synopsis          show the bugs matching this synopsis (needs arg1)
tags              tags added to $SRC
thisbuild         fixes made so far in this build
today             fixes done in the last 24 hours
todayproto        changes to the proto area in the last 24 hours
todaysources      sources changed in the last 24 hours
week              fixes done in the last 7 days
weekproto         proto files changed in the last 7 days
weeksources       sources changed in the last 7 days
whatbuild         determine what build a bug was fixed in (needs arg1, a bugid)
whatcat           determine what category to file a bug against (needs arg1, a proto path)
yesterday         fixes done yesterday (ish)
yesterdayproto    proto files changed yesterday (ish)
yesterdaysources  sources changed yesterday (ish)
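
The front end itself wasn’t much more than “find the named query file in query.d/ and feed it to the database”. Here’s a hypothetical sketch of that idea in Python (the real tool was a shell script wrapped around sqlite3); the paths and the :arg1 placeholder convention are invented for illustration.

# Hypothetical sketch of the spelunk front end (the real tool was a shell
# script wrapped around sqlite3).  The paths and the ":arg1" placeholder
# convention are invented for illustration.
import sqlite3
import sys
from pathlib import Path

DB = Path("/path/to/spelunk.db")         # assumed location of the database
QUERY_DIR = Path("/path/to/query.d")     # one .sql file per subcommand

def main(argv):
    if not argv:
        print("spelunk: no command listed. Known commands:")
        for query in sorted(QUERY_DIR.glob("*.sql")):
            print("   ", query.stem)
        return 1

    command, *args = argv
    sql = (QUERY_DIR / (command + ".sql")).read_text()
    params = {"arg1": args[0]} if args else {}   # commands that "need arg1"

    for row in sqlite3.connect(str(DB)).execute(sql, params):
        print("\t".join(str(col) for col in row))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))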

This was an excellent way of keeping on top of the ebbs and flows of development in the gate, and I found it really helpful to be able to write structured queries against the source tree.

With some work, Mercurial revsets would have helped with some of these sorts of queries, but they couldn’t answer all of the questions I wanted to ask since that required knowledge of:

  • mercurial metadata
  • bugdb metadata
  • a mapping of source files to compiled binaries

Most of the time, the CLI was all I needed, but having this database on hand ended up being incredibly useful when dealing with other problems we faced during my time in the job.

Here are some sample questions and the resulting database queries:

1. What delivers /etc/motd?

sqlite> .header on
sqlite> SELECT * FROM proto_source WHERE proto_path='etc/motd';

2. What bug category should I use to log a bug against /usr/sbin/svcadm ?

sqlite> SELECT cat, subcat, COUNT(*) AS cat_count FROM bug, bug_source, proto_source WHERE
   ...>     proto_source.source_path=bug_source.path AND
   ...>     bug.bugid=bug_source.bugid AND
   ...>     proto_source.proto_path='usr/sbin/svcadm'
   ...> GROUP BY cat, subcat ORDER BY cat_count DESC LIMIT 1;

3. What recent changes happened that might have caused zpool to core dump?

sqlite> SELECT DISTINCT bug.bugid,changeset,synopsis,date(date, 'unixepoch') FROM bug, proto_source, bug_source WHERE
   ...>     proto_source.source_path=bug_source.path AND
   ...>     bug.bugid=bug_source.bugid AND
   ...>     proto_source.proto_path='usr/sbin/zpool' ORDER BY date DESC LIMIT 10;

4. I’m seeing a VM2 panic – who should I talk to?
(actually, you could probably use the bug database, our developer IRC
channel or a mailing list for this, but still…)

sqlite> SELECT bugid, synopsis, author from bug
   ...>     WHERE synopsis like '%vm2%' ORDER BY date DESC LIMIT 10;

5. Show me some recent changes that affected sfmmu.

sqlite>  SELECT bug.bugid, path, date(date, 'unixepoch'), changeset from bug
   ...>     NATURAL JOIN bug_source WHERE path LIKE '%uts/sfmmu/%' ORDER BY date DESC LIMIT 10;

6. What changes were made to RAD in July 2015?

sqlite> SELECT changeset, DATE(date, 'unixepoch'), path, author FROM bug
   ...>     NATURAL JOIN bug_source WHERE DATE(date, 'unixepoch') BETWEEN date('2015-07-01') AND DATE('2015-07-31') AND
   ...>     path LIKE '%/rad/%';

7. Did we make any changes to the CIFS kernel modules since s12_77?

sqlite> SELECT DISTINCT bug.bugid, date(date, 'unixepoch'), cat, subcat, synopsis, changeset FROM bug
   ...>     NATURAL JOIN bug_source WHERE date > (SELECT date FROM tags WHERE name='on12_77')
   ...>     AND path LIKE '%uts/common/fs/smb%' ORDER BY date DESC;

(Of course, this wasn’t perfect – Mercurial changesets record the changeset-creation date rather than the integration date, so the date-related queries weren’t quite right, but they were often close enough.)

People

It’s not all about tools, of course, which brings me to the last, and most important, way of dealing with the information you need to process when doing this job.

Being able to answer the Ghostbusters question (“who ya gonna call?”) when trying to assess the severity of a bug, or to understand some breakage you’re seeing, will really help you resolve issues quickly. As I’ve said already, one person can’t possibly debug every problem you’re going to face, but having a set of helpful contacts makes life immeasurably easier.

When talking to engineers about problems, be ready to point to an error message, a crash dump, or ideally a machine exhibiting the problem – that will make them much more willing to drop what they’re doing and quickly take a look. Just remember, these folks have day-jobs too, so do try to work out the problem or file a bug yourself first!

Finally, if you’re one of the many people I’ve pestered over the last 3.25 years, thank you so much – you’ve made Solaris a better place to work!
