I now think that blog 'per day' pages with articles are a mistake

https://utcc.utoronto.ca/~cks/space/blog/web/BlogDroppingPerDayPages

Back in 2005 when I wrote DWiki, the engine that is used for Wandering Thoughts, there was an accepted standard structure for blogs that people followed, me included. For instance, it was accepted convention that the front page of your blog showed a number of the most recent articles, and you could page backward to older ones. Part of this structure was the idea that you would have a page for each day and that page would show the article or articles written that day (if any). When I put together DWiki's URL structure for blog-like areas, I followed this, and to this day Wandering Thoughts has these per-day pages.

I now think that these per-day pages are not the right thing to do on the modern web (for most blogs), for three reasons. The first reason is that they don't particularly help real blog usability, especially getting people to explore your blog after they land on a page. Most people make at most one post a day, so exploring day by day doesn't really get you anything more than 'next entry' and 'previous entry' links in each blog entry would (and if those links carry the destination's title, they probably give you more information than a date does).

The second reason is that because they duplicate content from your actual articles, they confuse search engine based navigation. Perhaps the search engine will know that the actual entry is the canonical version and present it in preference to the per-day page where the entry also appears, but perhaps not. And if you do have two entries in one day, putting both of their texts on one page risks disappointing someone who is searching for a combination of terms where one term appears only in one entry and the other only in the second.

The third and weakest reason is a consequence of how on the modern web, everything gets visited. Per-day pages are additional pages in your blog and web crawlers will visit them, driving up your blog's resource consumption in the process. These days my feelings are that you generally want to minimize the number of pages in your blog, not maximize them, something I've written about more in The drawback of having a dynamic site with lots of URLs on today's web. But this is not a very strong reason, if you have a reasonably efficient blog and you serve per-day pages that don't have the full article text.

I can't drop per-day pages here on Wandering Thoughts, because I know that people have links to them and I want those links to keep working as much as possible. The simple thing to do is to stop putting full entries on per-day pages and instead just put in their titles and links to them (just as I already do on per-month and per-year pages); this at least gets rid of the duplication of entry text and makes it far more likely that search engine based navigation will deliver people to the actual entry. The more elaborate thing would be to automatically serve an HTTP redirect to the entry for any per-day page that had only a single entry.

(For relatively obvious reasons you'd want to make this a temporary redirect.)
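
As a minimal sketch of the idea (in Go rather than DWiki's own code, and with hypothetical types and helpers standing in for the blog engine's machinery), the handler logic would look something like this:

package blog

import "net/http"

// Hypothetical types and helpers; a real engine has its own versions of these.
type entry struct{ title, url string }

func entriesForDay(path string) []entry { return nil /* look up the day's entries */ }

func renderDayIndex(w http.ResponseWriter, entries []entry) { /* render titles and links only */ }

// perDayPage redirects single-entry days to the entry itself and otherwise
// renders an index of titles and links instead of full entry text.
func perDayPage(w http.ResponseWriter, r *http.Request) {
	entries := entriesForDay(r.URL.Path)
	if len(entries) == 1 {
		// StatusFound is a temporary (302) redirect, so the per-day URL
		// itself keeps working and isn't treated as permanently moved.
		http.Redirect(w, r, entries[0].url, http.StatusFound)
		return
	}
	renderDayIndex(w, entries)
}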

There's a bit of me that's sad about this shift in blog design and web usage; the per-day, per-month, and per-year organization had a pleasant regularity and intuitive appeal. But I think its time has passed. More and more, we're all tending toward the kind of minimal URL structure typical of static sites, even when we have dynamic sites and so could have all the different URL structures and ways of accessing our pages that we could ask for.

A Go lesson learned: sometimes I don't want to use goroutines if possible

https://utcc.utoronto.ca/~cks/space/blog/programming/GoWhenNotManyGoroutines

We have a heavily NFS based server environment here, with multiple NFS servers and an IMAP server that accesses all mailboxes over NFS. That IMAP server has had ongoing issues with elevated load averages, and what at least seems to be IMAP slowness. However, our current metrics leave a lot of uncertainties about the effects of all of this, because we basically only have a little bit of performance data for a few IMAP operations. One thing I'd like to do is gather some very basic Unix level NFS performance data from our IMAP server and from some other machines, to see if I can see anything.

One very simple metric is how long it takes to read a little file from every NFS filesystem we have mounted on a machine. As it happens, we already have the little files (they're used for another system management purpose), so all I need is a program to open and read each one while timing how long it takes. There's an obvious issue with doing this sequentially, which is that if there's a single slow filesystem, it could delay everything else.

The obvious answer here was Go, goroutines, and some form of goroutine pool. Because the goroutines just do IO (and they're only being used to avoid one bit of IO delaying another separate bit), the natural size of the goroutine pool is fairly large, say 50 to 100 goroutines (we have a lot of NFS filesystems). This is quite easy and obvious to implement in Go, so I put together a little Go program for it and watched the numbers it generated as they jumped around.
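
As a rough illustration (a simplified sketch, not my actual program; the pool size, buffer size, and output format are just examples), the straightforward fixed-size goroutine pool version looks something like this:

package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

type result struct {
	path     string
	duration time.Duration
	err      error
}

// readOne times how long it takes to open and read one little file.
func readOne(path string) result {
	start := time.Now()
	buf := make([]byte, 4096)
	file, err := os.Open(path)
	if err != nil {
		return result{path, time.Since(start), err}
	}
	defer file.Close()
	_, err = file.Read(buf)
	return result{path, time.Since(start), err}
}

func main() {
	paths := os.Args[1:] // one small file per NFS filesystem
	const poolSize = 50  // the 'fairly large' pool discussed above

	work := make(chan string)
	results := make(chan result)
	var wg sync.WaitGroup

	// The worker pool: each goroutine reads targets until the work channel closes.
	for i := 0; i < poolSize; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range work {
				results <- readOne(p)
			}
		}()
	}

	// Feed the work, then close the results channel once all workers finish.
	go func() {
		for _, p := range paths {
			work <- p
		}
		close(work)
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Printf("%-40s %-12v %v\n", r.path, r.duration, r.err)
	}
}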

Then, out of reflexive caution, I tried running the same program with a goroutine pool size of 1, which more or less forced serial execution (the pool goroutine infrastructure was still active but there was only one worker goroutine doing all the reading). To my surprise the 'time to read a file' number for all filesystems was visibly and decidedly lower. I could run the program side by side with the two different goroutine pool sizes and see this clearly.

Some thinking gave me a possible reason why this is so. My core code does essentially the following (minus error checking):

start := time.Now()
file, err := os.Open(target) // first system call: open()
n, err := file.Read(buffer)  // second system call: read()
duration := time.Since(start)

This sequence makes two system calls and each system call is a potential goroutine preemption point. If a goroutine gets preempted during either system call, it can only record the finishing time once it's scheduled again (and finishes the read, if it was preempted in the open). If there are 50 or more goroutines all doing this, some of them could well be preempted and then not scheduled for some time, and that scheduling delay will show up in the final duration. When there aren't multiple goroutines active, there should be very little scheduling delay and the recorded durations (especially the longest durations) will be lower. And the ideal situation for essentially no goroutine contention is of course one goroutine.

(Technically this makes two more system calls to get the time at the start and the end of the sequence, but on modern systems, especially Linux, these don't take long enough to trigger Go's system call preemption and probably don't even enter the kernel itself.)

Because I still worry about individual slow filesystems slowing everything down (or stalls on some filesystems), my solution was a more complicated work pool approach that starts additional worker goroutines only when all of the current ones seem to have stalled for too long. If all goes well (and it generally does in my testing), this runs with only one goroutine.

(My current code has the drawback that once the goroutine worker pool expands, all of them stay active, which means that enough slow filesystems early on in the checks could get me back to the thundering herd problem. I'm still thinking about that issue.)
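
As a hedged sketch of the general shape of that approach (not my actual code; it assumes the readOne function and result type from the earlier sketch, plus the usual time import, and the deadline is an arbitrary example), the dispatch side can hand work over an unbuffered channel and only grow the pool when nobody accepts it in time:

// dispatch starts with a single worker and adds another only when no
// existing worker takes a piece of work within the deadline, which is a
// rough proxy for 'every current worker is stalled on a slow filesystem'.
func dispatch(paths []string, results chan<- result) {
	work := make(chan string)
	startWorker := func() {
		go func() {
			for p := range work {
				results <- readOne(p)
			}
		}()
	}
	startWorker() // the common case stays effectively serial

	for _, p := range paths {
		select {
		case work <- p:
		case <-time.After(500 * time.Millisecond): // everyone looks stalled
			startWorker()
			work <- p
		}
	}
	close(work)
	// As noted above, once started, workers stick around until the work
	// channel closes; the caller reads len(paths) results from 'results'.
}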

How you get multiple TLS certificate chains from a server certificate

https://utcc.utoronto.ca/~cks/space/blog/tech/TLSHowMultipleChains

I've known and read for some time that a single server certificate can have more than one chain to a root certificate that you trust, but I never really thought about the details of how this worked. Then the AddTrust thing happened, I started writing about how Prometheus's TLS checks would have reacted to it, and Guus left a comment on that entry that got me thinking about what else Prometheus could sensibly look at here. So now I want to walk through the mechanics of multiple TLS chains to get this straight in my head.

Your server certificate and the other TLS certificates in a chain are each signed by an issuer; in a verified chain, this chain of issuers eventually reaches a Certificate Authority root certificate that people have some inherent trust in. However, a signed certificate doesn't specifically name and identify the issuer's certificate by, say, its serial number or hash; instead issuers are identified by their X.509 Subject Name and also at least implicitly by their keypair (and sometimes explicitly). By extension, your signed certificate also identifies the key type of the issuer's certificate; if your server certificate is signed by RSA, an intermediate certificate with an ECDSA keypair is clearly not the correct parent certificate.

(Your server certificate implicitly identifies the issuer by keypair because they signed your certificate with it; an intermediate certificate with a different keypair can never validate the signature on your certificate.)

However, several certificates can have the same keypair and X.509 Subject Name, provided that other attributes differ. One such attribute is the issuer that signed them (including whether this is a self-signed CA root certificate). So the first thing is that having more than one certificate for an issuer is generally required to get multiple chains. If you only have one certificate for each issuer, you can pretty much only build a single chain.
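
To make the mechanics concrete, here's a simplified sketch (in Go with crypto/x509; not how any particular library literally does it) of how a chain builder can find candidate issuer certificates. Each additional certificate that matches can start another chain.

package chains

import (
	"bytes"
	"crypto/x509"
)

// candidateParents returns every certificate in pool that could be the
// issuer of child: it has the same Subject Name as child's Issuer and a
// keypair that actually validates child's signature.
func candidateParents(child *x509.Certificate, pool []*x509.Certificate) []*x509.Certificate {
	var parents []*x509.Certificate
	for _, c := range pool {
		if !bytes.Equal(c.RawSubject, child.RawIssuer) {
			continue // different Subject Name, so not this issuer
		}
		if err := child.CheckSignatureFrom(c); err != nil {
			continue // right name but the wrong keypair (or a bad signature)
		}
		parents = append(parents, c)
	}
	return parents
}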

There are three places that these additional certificates for an issuer can come from; they can be sent by the server, they can be built into your certificate store in advance, or they can be cached because you saw them in some other context. The last is especially common with browsers, which often cache intermediate certificates that they see and may use them in preference to the intermediate certificate that a TLS server sends. Other software is generally more static about what it will use. My guess is that we're unlikely to have multiple certificates for a single CA root issuer, at least for modern CAs and modern root certificate sets as used by browsers and so on. This implies that the most likely place to get additional issuer certificates is from intermediate certificates sent by a server.

(In any case, it's fairly difficult to know what root certificate sets clients are using when they talk to your server. If your server sends the CA root certificate you think should be used as part of the certificate chain, a monitoring client (such as Prometheus's checks) can at most detect when it's got an additional certificate for that CA root issuer in its own local trust store.)

One cause of additional issuer certificates is what's called cross-signing a CA's intermediate certificate, as is currently the case with Let's Encrypt's certificates. In cross-signing, a CA generates two versions of its intermediate certificate, using the same X.509 Subject Name and keypair; one is signed by its own CA root certificate and one is signed by another CA root certificate. A CA can also cross-sign its own new root certificate (well, the keypair and issuer) directly, as is the case with the DST Root CA X3 certificate that Let's Encrypt is currently cross-signed with; one certificate for 'DST Root CA X3' is self-signed and likely in your root certificate set, but two others existed that were cross-signed by an older DST CA root certificate.

(As covered in the certificate chain illustrations in Fixing the Breakage from the AddTrust External CA Root Expiration, this was also the case with the expiring AddTrust root CA certificate. The 'USERTrust RSA Certification Authority' issuer was also cross-signed to 'AddTrust External CA Root', a CA root certificate that expired along with that cross-signed intermediate certificate. And this USERTrust root issuer is still cross-signed to another valid root certificate, 'AAA Certificate Services'.)

This gives us some cases for additional issuer certificates:

  • your server's provided chain includes multiple intermediate certificates for the same issuer, for example both Let's Encrypt intermediate certificates. A client can build one certificate chain through each.

  • your server provides an additional cross-signed CA certificate, such as the USERTrust certificate signed by AddTrust. A client can build one certificate chain that stops at the issuer certificate that's in its root CA set, or it can build another chain that's longer, using your extra cross-signed intermediate certificate.

  • the user's browser knows about additional intermediate certificates and will build additional chains using them, even though your server doesn't provide them in its set of certificates. This definitely happens, but browsers are also good about handling multiple chains.
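
In Go's crypto/x509, all of these cases surface the same way: chain verification hands back every chain it could build, one per usable combination of issuer certificates. A minimal sketch (the leaf certificate and the pools are assumed to be loaded already):

package chains

import "crypto/x509"

// verifiedChains returns every chain that can be built from the leaf
// certificate through the intermediates to a trusted root; an extra
// cross-signed or duplicate issuer certificate shows up as an extra chain.
func verifiedChains(leaf *x509.Certificate, intermediates, roots *x509.CertPool) ([][]*x509.Certificate, error) {
	return leaf.Verify(x509.VerifyOptions{
		Intermediates: intermediates, // for example, everything else the server sent
		Roots:         roots,         // nil means 'use the system root set'
	})
}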

In a good world, all intermediate certificates will have an expiration time no later than the best certificate for the issuer that signed them. This was the case with the AddTrust expiration; the cross-signed USERTrust certificate expired at the same time as the AddTrust root certificate. In this case you can detect the problem by noticing that a server provided intermediate certificate is expiring soon. If only a CA root certificate at the end of an older chain is expiring soon and the intermediate certificate signed by it has a later expiration date, you need to check the expiration time of the entire chain.

As a practical matter, monitoring the expiry time of all certificates provided by a TLS server seems very likely to be enough to detect multiple chain problems such as the AddTrust issue. Competent Certificate Authorities shouldn't issue server or intermediate certificates with expiry times later than their root (or intermediate) certificates, so we don't need to try to find and explicitly check those root certificates. This will also alert on expiring certificates that were provided but that can't be used to construct any chain, but you probably want to get rid of those anyway.
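
A minimal sketch of what that monitoring amounts to (the host name here is just an example): connect and look at the expiry of every certificate the server actually sends.

package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"time"
)

func main() {
	conn, err := tls.Dial("tcp", "example.org:443", &tls.Config{})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// PeerCertificates is every certificate the server sent, whether or
	// not it's actually needed to build a verified chain.
	for _, cert := range conn.ConnectionState().PeerCertificates {
		days := int(time.Until(cert.NotAfter).Hours() / 24)
		fmt.Printf("%-45s expires %s (%d days)\n",
			cert.Subject.CommonName, cert.NotAfter.Format("2006-01-02"), days)
	}
}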

Sidebar: Let's Encrypt certificate chains in practice

Because browsers do their own thing, a browser may construct multiple certificate chains for Let's Encrypt certificates today even if your server only provides the LE intermediate certificate that is signed by DST Root CA X3 (the current Let's Encrypt default for the intermediate certificate). For example, if you visit Let's Encrypt's test site for their own CA root, your browser will probably cache the LE intermediate certificate that chains to the LE CA root certificate, and then visiting other sites using Let's Encrypt may cause your browser to ignore their intermediate certificate and chain through the 'better' one it already has cached. This is what currently happens for me on Firefox.

What a TLS self signed certificate is at a mechanical level

https://utcc.utoronto.ca/~cks/space/blog/tech/TLSWhatIsSelfSignedCert

People routinely talk about self signed TLS certificates. You use them in situations where you just need TLS but don't want to set up an internal Certificate Authority and can't get an official TLS certificate, and many CA root certificates are self signed. But until recently I hadn't thought about what a self signed certificate is, mechanically. So here is my best answer.

To simplify a lot, a TLS certificate is a bundle of attributes wrapped around a public key. All TLS certificates are signed by someone; we call this the issuer. The issuer for a certificate is identified by their X.509 Subject Name, and also at least implicitly by the keypair used to sign the certificate (since only an issuer TLS certificate with the right public key can validate the signature).

So this gives us the answer for what a self signed TLS certificate is. It's a certificate that lists its own Subject Name as the issuer and is signed with its own keypair (using some appropriate key signature algorithm, such as SHA256-RSA for RSA keys). It still has all of the usual TLS certificate attributes, especially 'not before' and 'not after' dates, and in many cases they'll be processed normally.

Self signed certificates are not automatically CA certificates for a little private CA. Among other things, the self-signed certificate can explicitly set an 'I am not a CA' marker in itself. Whether software respects this if someone explicitly tells it to trust the self-signed certificate as a CA root certificate is another matter, but at least you tried.
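
To make this concrete, here's a minimal sketch in Go of creating a self-signed certificate (the subject name, serial number, and validity period are arbitrary examples). What makes it self-signed is simply that the template acts as its own parent.

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	// The certificate's own keypair, which will also sign it.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1), // should really be unique and random
		Subject:      pkix.Name{CommonName: "test.internal"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0),
		// The explicit 'I am not a CA' marker mentioned above.
		BasicConstraintsValid: true,
		IsCA:                  false,
	}
	// Passing tmpl as both the template and the parent is what makes it
	// self-signed: the issuer is its own Subject Name and the signature
	// is made with its own keypair.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}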

Self-signed certificates do have a serial number (which should be unique), and a unique cryptographic hash. Browsers that have been told to trust a self-signed certificate are probably using either these or a direct comparison of the entire certificate to determine if you're giving them the same self-signed certificate, instead of following the process used for identifying issuers (of checking the issuer Subject Name and so on). This likely means that if you re-issue a self-signed certificate using the same keypair and Subject Name, browsers may not automatically accept it in place of your first one.

(As far as other software goes, who knows. There are dragons all over those hills, and I suspect that there is at least some code that accepts a matching Subject Name and keypair as good enough.)

Link: Code Only Says What it Does

https://utcc.utoronto.ca/~cks/space/blog/links/CodeOnlySaysWhatItDoes

Marc Brooker's Code Only Says What it Does (via) is about what code doesn't say and why all of those things matter. Because I want you to read the article, I'm going to quote all of the first paragraph:

Code says what it does. That's important for the computer, because code is the way that we ask the computer to do something. It's OK for humans, as long as we never have to modify or debug the code. As soon as we do, we have a problem. Fundamentally, debugging is an exercise in changing what a program does to match what it should do. It requires us to know what a program should do, which isn't captured in the code. Sometimes that's easy: What it does is crash, what it should do is not crash. Outside those trivial cases, discovering intent is harder.

This is not an issue that's exclusive to programming, as I've written about in Configuration management is not documentation, at least not of intentions (procedures and checklists and runbooks aren't documentation either). In computing we love to not write documentation, but not writing down our intentions in some form is just piling up future problems.

The work that's not being done from home is slowly accumulating for us

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/WorkNotDoneFromHome

At the moment, the most recent things I've seen have talked about us not returning to the office before September, and then not all of us at one time. This gives me complicated feelings, including about what work we are doing. From the outside, our current work from home situation probably looks like everything is going pretty well. We've kept the computing lights on, things are being done, and so far all of the ordinary things that people ask of us get done as promptly as usual. A few hardware issues have come up and have been dealt with by people making brief trips into the office. So it all looks healthy; you might even wonder why we need offices.

When I look at the situation from inside, things are a bit different. We may be keeping the normal lights on, but at the same time there's a steadily growing amount of work that is not being done because of our working from home. The most obvious thing is that ordering new servers and other hardware has basically been shut down; not only are we not in the office to work on any hardware, it mostly can't even be delivered to the university right now.

The next obvious thing is the timing of any roll out of Ubuntu 20.04 on our machines. Under normal circumstances, we'd have all of the infrastructure for installing 20.04 machines ready and probably some test machines out there for people to poke at, and we'd be hoping to migrate a number of user-visible machines in August before the fall semester starts. That's looking unlikely, since at this point all we have is an ISO install image that's been tested only in temporary virtual machines. Since we haven't been in the office, we haven't set up any real servers running 20.04 on an ongoing basis. We're in basically the bad case situation I imagined back in early April.

(And of course many of the people we'd like to have poke at 20.04 test machines are busy with their own work from home problems, so even if we had test machines, they would probably get less testing than usual.)

Another sign is that many of our Ubuntu servers have been up without a reboot for what is now an awfully long time for us. Under normal circumstances we might have scheduled a kernel update and reboot by now, but under work from home conditions we only want to take the risk of doing kernel updates and rebooting important servers if there is something critical. If something goes wrong, it's not a walk down to the machine room, it's a trip into the office (and a rather longer downtime).

There's also a slowly accumulating amount of pending physical networking work, where we're asked to change what networks particular rooms or network ports are on because people are moving around. This work traditionally grows as the fall semester approaches and space starts getting sorted out for new graduate students and so on, although that could change drastically this year depending on the university's overall plans for what graduate students will do and where they will work.

(To put it one way, a great deal of graduate student space is not set up for appropriate physical distancing. Nor is a fair amount of other office and lab space.)

One level up from this is that there are a number of projects that need to use some physical servers. We have a bunch of OpenBSD machines on old OpenBSD versions that could do with updates (and refreshes onto new hardware), for example, but we need to build them out in test setups first. Another example is that we have plans to significantly change how we currently use SLURM, but that needs a few machines to set up a little new cluster (on our server network, as part of our full environment).

(A number of these projects need custom network connectivity, such as new test firewalls needing little test networks. Traditionally we build this in some of our 'lab space', with servers just sitting on a table wired together.)

Much of this is inherent in us having and using physical servers. Having physical servers in a machine room means receiving new hardware, racking it, cabling it up, and installing it, all of which we have to do in person (plus at least pulling the cables out of any old hardware that it's replacing). Some of it (such as our reluctance to reboot servers) is because we don't have full remote KVM over IP capabilities on our servers.

PS: We're also lucky that all of this didn't happen in a year when we'd planned to get and deploy a major set of hardware, such as the year when we got the hardware for our current generation of fileservers.

In ZFS, your filesystem layout needs to reflect some of your administrative structure

https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSAdminVsFilesystemLayout

One of the issues we sometimes run into with ZFS is that ZFS essentially requires you to reflect your administrative structure for allocating and reserving space in how you lay out ZFS filesystems and filesystem hierarchies. This is because in ZFS, all space management is handled through the hierarchy of filesystems (and perhaps in having multiple pools). If you want to make two separate amounts of space available to two separate sets of filesystems (or collectively reserved by them), either they must be in different pools or they must be under different dataset hierarchies within the pool.

(These hierarchies don't have to be visible to users, because you can mount ZFS filesystems under whatever names you want, but they exist in the dataset hierarchy in the pool itself and you'll periodically need to know them, because some commands require the full dataset name and don't work when given the mount point.)

That sounds abstract, so let me make it concrete. Simplifying only slightly, our filesystems here are visible to people as /h/NNN (for home directories) and /w/NNN (workdirs, for everything else). They come from some NFS server and live in some ZFS pool there (inside little container filesystems), but the NFS server, and to some extent the pool, are implementation details. Each research group has its own ZFS pool (or, for big ones, more than one pool, because one pool can only be so big), as do some individual professors. However, there are not infrequently cases where a professor in a group pool would like to buy extra space that is only for their students, and where this professor has several different filesystems in the pool (often a mixture of /h/NNN homedir filesystems and /w/NNN workdir ones).

This is theoretically possible in ZFS, but in order to implement it ZFS would force us to put all of a professor's filesystems under a sub-hierarchy in the pool. Instead of the current tank/h/100 and tank/w/200, they would have to be something like tank/prof/h/100 and tank/prof/w/200. The ZFS dataset structure is required to reflect the administrative structure of how people buy space. One of the corollaries of this is that you can basically only have a single administrative structure for how you allocate space, because a dataset can only be in one place in the ZFS hierarchy.

(So if two professors want to buy space separately for their filesystems but there's a filesystem shared between them (and they each want it to share in their space increase), you have a problem.)

If there were sub-groups of people who wanted to buy space collectively, we'd need an even more complicated dataset structure. Such sub-groups are not necessarily decided in advance, so we can't set up such a hierarchy when the filesystems are created; we'd likely wind up having to periodically modify the dataset hierarchy. Fortunately the manpages suggest that 'zfs rename' can be done without disrupting service to the filesystem, provided that the mountpoint doesn't change (which it wouldn't, since we force those to the /h/NNN and /w/NNN forms).
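
As a concrete (and purely hypothetical; the dataset names and sizes are made up) illustration of what this restructuring involves at the command level:

# Give the professor their own sub-hierarchy with its own space limit.
zfs create tank/prof
zfs set quota=2T tank/prof

# Move their existing filesystems under it; the mountpoints stay /h/100
# and /w/200 because we set them explicitly on each filesystem.
zfs rename tank/h/100 tank/prof/h/100
zfs rename tank/w/200 tank/prof/w/200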

While our situation is relatively specific to how we sell space, people operating ZFS can run into the same sort of situation any time they want to allocate or control collective space usage among a group of filesystems. There are plenty of places where you might have projects that get so much space but want multiple filesystems, or groups (and subgroups) that should be given specific allocations or reservations.

PS: One reason not to expose these administrative groupings to users is that they can change. If you expose the administrative grouping in the user visible filesystem name and where a filesystem belongs shifts, everyone gets to change the name they use for it.

The unfortunate limitation in ZFS filesystem quotas and refquota

https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSHierarchyQuotaLack

When ZFS was new, the only option it had for filesystem quotas was the quota property, which I had an issue with and which caused us practical problems in our first generation of ZFS fileservers because it covered the space used by snapshots as well as the regular user accessible filesystem. Later ZFS introduced the refquota property, which did not have that problem but in exchange doesn't apply to any descendant datasets (regardless of whether they're snapshots or regular filesystems). At one level this issue with refquota is fine, because we put quotas on filesystems to limit their maximum size to what our backup system can comfortably handle. At another level, this issue impacts how we operate.
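
For illustration (the dataset name and size are made up), the difference between the two properties looks like this:

# 'quota' caps the filesystem plus all of its descendants, snapshots included.
zfs set quota=400G tank/h/100

# 'refquota' caps only the space the filesystem itself references, so
# snapshots (and other descendants) don't count against it.
zfs set refquota=400G tank/h/100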

All of this stems from a fundamental lack in ZFS quotas, which is that ZFS's general quota system doesn't let you limit space used only by unprivileged operations. Writing into a filesystem is a normal everyday thing that doesn't require any special administrative privileges, while making ZFS snapshots (and clones) requires special administrative privileges (either from being root or from having had them specifically delegated to you). But you can't tell them apart in a hierarchy, because ZFS only offers you the binary choice of ignoring all space used by descendants (regardless of how it occurs) or ignoring none of it, sweeping up specially privileged operations like creating snapshots with ordinary activities like writing files.

This limitation affects our pool space limits, because we use them for two different purposes: restricting people to only the space that they've purchased, and ensuring that pools always have a safety margin of space. Since pools contain many filesystems, we must limit their total space usage using the quota property. But that means that any snapshots we make for administrative purposes consume space that's been purchased, and if we make too many of them we'll run the pool out of space for completely artificial reasons. It would be better to be able to have two quotas, one for the space that the group has purchased (which would limit only regular filesystem activity) and one for our pool safety margin (which would limit snapshots too).

(This wouldn't completely solve the problem, though, since snapshots still consume space and if we made too many of them we'd run a pool that should have free space out of even its safety margin. But it would sometimes make things easier.)

PS: I thought this had more of an impact on our operations and the features we can reasonably offer to people, but the more I think about it the more it doesn't. Partly this is because we don't make much use of snapshots, for various reasons that sort of boil down to 'the natural state of disks is usually full'. But that's for another entry.

How Prometheus Blackbox's TLS certificate metrics would have reacted to AddTrust's root expiry

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxVsAddTrust

The last time around I talked about what Blackbox's TLS certificate expiry metrics are checking, but it was all somewhat abstract. The recent AddTrust root expiry provides a great example to make it concrete. As a quick summary, the Blackbox exporter provides two metrics, probe_ssl_earliest_cert_expiry for the earliest expiring certificate and probe_ssl_last_chain_expiry_timestamp_seconds for the latest expiring verified chain of certificates.

If your TLS server included the expiring AddTrust root certificate as one of the chain certificates it was providing to clients, the probe_ssl_earliest_cert_expiry metric would have counted down and your alarms would have gone off, despite the fact that your server certificate itself wasn't necessarily expiring. This would have happened even if the AddTrust certificate wasn't used any more and its inclusion was just a vestige of past practices (for example if you had a 'standard certificate chain set' that everything served). In this case this would have raised a useful alarm, because the mere presence of the AddTrust certificate in your server's provided chain caused problems in some (or many) TLS libraries and clients.

(Browsers were fine, though.)

Even if your TLS server included the AddTrust certificate in its chain and your server certificate could use it for some verified chains, the probe_ssl_last_chain_expiry_timestamp_seconds metric would not normally have counted down. Most or perhaps all current server certificates could normally be verified through another chain that expired later, which is what matters here. If probe_ssl_last_chain_expiry_timestamp_seconds had counted down too, it would mean that your server certificate could only be verified through the AddTrust certificate for some reason.
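
As a hedged sketch of roughly what this metric measures (an illustration, not the Blackbox exporter's actual code): each verified chain is only good until its earliest 'not after' time, and the metric reports the latest of those per-chain expiries.

package tlscheck

import (
	"crypto/tls"
	"time"
)

// lastChainExpiry computes the expiry of the latest-expiring verified chain:
// each chain expires at its earliest NotAfter, and we take the best chain.
func lastChainExpiry(state tls.ConnectionState) time.Time {
	var last time.Time
	for _, chain := range state.VerifiedChains {
		chainExpiry := chain[0].NotAfter
		for _, cert := range chain[1:] {
			if cert.NotAfter.Before(chainExpiry) {
				chainExpiry = cert.NotAfter
			}
		}
		if chainExpiry.After(last) {
			last = chainExpiry
		}
	}
	return last
}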

Neither metric would have told you if the AddTrust certificate was actually being used by your server certificate through some verified chain of certificates, or if it was now completely unnecessary. Blackbox's TLS metrics don't currently provide any way of knowing that, so if you need to monitor the state of your server certificate chains you'll need another tool.

(There's a third party SSL exporter, but I don't think it does much assessment of chain health, or gives you enough metrics to know if a server provided chain certificate is unnecessary.)

If you weren't serving the AddTrust root certificate and had a verified chain that didn't use it, but some clients required it to verify your server certificate, neither Blackbox metric would have warned you about this. Because you weren't serving the certificate, probe_ssl_earliest_cert_expiry would not have counted down; it includes only TLS certificates you actually serve, not all of the TLS certificates required to verify all of your currently valid certificate chains. And probe_ssl_last_chain_expiry_timestamp_seconds wouldn't have counted down because there was an additional verified chain besides the one that used the AddTrust root certificate.

(In general it's very difficult to know if some client is going to have a problem with your certificate chains, because there are many variables. Including outright programming bugs, which were part of the problem with AddTrust. If you want to be worried, read Ryan Sleevi's Path Building vs Path Verifying: Implementation Showdown.)

Adapting our Django web app to changing requirements by not doing much

https://utcc.utoronto.ca/~cks/space/blog/python/DjangoAppAdaptations

We have a Django web application to handle (Unix) account requests, which is now nine years old. I've called this utility code, but I mentioned recently that over that time there have been some changes in how graduate students were handled that needed some changes in the application. Except not very much change was necessary, in some ways, and in other ways the changes are hacks. So here are some stories of those changes.

When we (I) initially wrote the web application, our model of how new graduate students got Unix accounts was straightforward. All graduate students were doing a thesis (either a Master's or a PhD) and so all of them had a supervising professor. As a long standing matter of policy, that supervisor was their account sponsor, and so approved their account request. Professors can also sponsor accounts for other people associated with them, such as postdocs.

(This model already has a little glitch; some students are co-supervised by more than one professor. Our system requires one to be picked as the account sponsor, instead of somehow recording them as co-sponsored, which has various consequences that no one has complained about so far.)

The first change that showed up was that the department developed a new graduate program, the Master of Science in Applied Computing. Graduate students in the MScAC program don't write a thesis and as a result they don't have a supervising professor. As it happened, we already had a model for solving this, because Unix accounts for administrative and technical staff are not sponsored by professors either; they have special non-professor sponsors. So we added another such special sponsor for MScAC students. This was not sufficient by itself, because the account request system sometimes emails new graduate students and the way those messages were written assumed that the student's sponsor was supervising them.

Rather than develop a general solution to this, we took the brute force solution of an '{% if ... %}' condition in the relevant Django template. Because of how our data is set up, this condition has to both reach through several foreign keys and use a fixed text match against a magic name, instead of checking any sort of flag or status marker (because no such status marker was in the original data model). Fortunately the name it matches against is not exposed to people, because the official name for the program has actually changed over time but our internal name has never been updated (partly because it was burned into the text template). This is a hack, but it works.
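
For illustration, the hack looks something like the following (the variable and field names and the magic string here are invented for this sketch, not our real ones):

{# Hypothetical template fragment; 'acctreq' and the name string are made up. #}
{% if acctreq.sponsor.person.name == "MScAC Graduate Student" %}
  You are part of the MScAC program and so do not have a supervising professor.
{% else %}
  Your sponsor, {{ acctreq.sponsor.person.name }}, is your supervisor.
{% endif %}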

The second change is that while all graduate students must eventually get a specific supervisor, not all of them have one initially when they arrive. In particular, there is one research group that accepts most new graduate students collectively and then sorts out who will supervise them later, once the graduate students know more about the group and their own interests. In the past, this had been solved artificially by assigning nominal sponsors immediately even if they weren't going to be the student's supervisor, but eventually the group got tired of this and asked us to do better. The solution here was similar to the MScAC program (and staff accounts); we invented a synthetic 'supervisor' for them, with a suitable generic name. Unlike with the MScAC program, we didn't customize the Django templates for this new situation, and unfortunately the result does look a little ugly and awkward.

(This is where a general solution would have been useful. If we were templating this from a database table or the like, we could have just added a new entry for this general research group case. Adding another Django '{% if ... %}' to the template would have made it too tangled, so we didn't.)

I don't think we did anything clever in our Django application's code or its data model. A lot of the changes we were able to make were inherent in having a system that was driven by database tables and being able to add relatively arbitrary things to those tables (with some hacks involved). Where our changes start breaking down is exactly where the limitations of that start appearing, such as multiple cases in templates when we didn't design that into the database.

(Could we have added it later? Perhaps. But I've always been too nervous about database migrations to modify our original database tables, partly because I've never done one with Django. This is a silly fear and in some ways it's holding back the evolution of our web application.)

PS: You might think that properly dealing with the co-supervision situation would make the research group situation easy to deal with, by just having new graduate students 'co-sponsored' by the entire research group. It's actually not clear if this is the right answer, because the situations are somewhat different on the Unix side. When you actively work with a supervisor, you normally get added to their Unix group so you can access group-specific things (if there are any), so for co-supervisors you should really get added to the Unix groups for both supervisors. However, it's not clear if people collectively sponsored by a research group should be added to every professor's Unix group in the same way. This implies that the Django application should know the difference between the two cases so that it can signal our Unix account creation process to treat them differently.

Sidebar: Our name hack for account sponsors

When someone goes to our web page to request an account, they have to choose their sponsor from a big <select> list of them. The list is sorted on the sponsor's last name, to make it easier to find. The idea of 'first name' and 'last name' is somewhat tricky (as is their order), and automatically picking them out from a text string is even harder. So we deal with the problem the other way around. Our Django data model has a 'first name' and a 'last name' field, but what they really mean is 'optional first part of the name' and 'last part of the name (that will determine the sort order)'.

As part of this, the synthetic account sponsors generally don't have a 'first name', because we want them to sort in order based on the full description (such as 'MScAC Graduate Student', which sorts in M not G or S).

(Sorting on 'last name' is somewhat arbitrary, but part of it is that we expect people requesting accounts to be more familiar with the last name of their sponsor than the first name.)