How I think I want to drop modern Python packages into a single program

For reasons beyond the scope of this blog entry, I'm considering augmenting our Python program that logs email attachment information for Exim so that it uses oletools to peer inside MS Office files for indications of bad things. Oletools is not packaged by Ubuntu as far as I can see, and in any case it would be an older version, so we would need to add the oletools Python packages ourselves.

The official oletools install instructions talk about using pip. As a general rule, we're very strongly against installing anything system-wide except through Ubuntu's own package management system, and the environment our Python program runs in doesn't really have a home directory to use pip's --user option with, so the obvious and simple pip invocations are out. I've used an approach of installing a large Python package into a specific directory hierarchy in the past (for Django), and it was a big pain, so I'd like not to do it again.

(Nor do we want to learn about how to build and maintain Python virtual environments, and then convert how we run this Python program to use one.)

After some looking at pip's help output I found the 'pip install --target <directory>' option and tested it a bit. This appears to do more or less what I want, in that it installs oletools and all of its dependencies into the target directory. The target directory also gets littered with various metadata, so we probably don't want to make it the directory where the program's normal source code lives. This means we'll need to arrange to run the program with $PYTHONPATH set to the target directory, but that's a solvable problem.

(This 'pip install' invocation does write some additional pip metadata to your $HOME. Fortunately it actually does respect the value of the $HOME environment variable, so I can point that at a junk directory and then delete it afterward. Or I can make $HOME point to my target directory so everything is in one place.)
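To make the $PYTHONPATH arrangement concrete, here's a small sketch of the mechanism, using a made-up stand-in module ('fakepkg' is a name invented for this illustration) instead of a real 'pip install --target' tree:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for a 'pip install --target' directory: a scratch directory
# holding one module (fakepkg is a made-up name for illustration).
target = tempfile.mkdtemp()
with open(os.path.join(target, "fakepkg.py"), "w") as f:
    f.write("VALUE = 42\n")

# Run a child interpreter with PYTHONPATH pointing at the target
# directory, the way you'd arrange to run the real program.
env = dict(os.environ, PYTHONPATH=target)
result = subprocess.run(
    [sys.executable, "-c", "import fakepkg; print(fakepkg.VALUE)"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # 42
```

The same effect can be had by exporting PYTHONPATH in whatever shell script or service definition starts the program.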

All of this is not quite as neat and simple as dropping an oletools directory tree in the program's directory, in the way that I could deal with needing the rarfile module, but then again oletools has a bunch of dependencies and pip handles them all for me. I could manually copy them all into place, but that would actually create a sufficiently cluttered program directory that I prefer a separate directory even if it needs a $PYTHONPATH step.

(Some people will say that setting $PYTHONPATH means that I should go all the way to a virtual environment, but that would be a lot more to learn and it would be more opaque. But looking into this a bit did lead to me learning that Python 3 now has standard support for virtual environments.)

Python 3 venvs don't normally really embed their own copy of Python (on Unix)

Python 3 has standard support for virtual environments. The documentation describes them in general as:

[...] a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.

As a system administrator, a 'self-contained directory tree' that has a particular version of Python is a scary thing to read about, because it implies that the person responsible for keeping that version of Python up to date on security patches, bug fixes, and so on is me, not my Unix distribution. I don't want to have to keep up with Python in that way; I want to delegate it to the fine people at Debian, Canonical, Red Hat, or whoever.

(I also don't want to have to keep careful track of all of the virtual environments we might be running so that I can remember to hunt all of them down to update them when a new version of Python is released.)

Fortunately it turns out that the venv system doesn't actually do this (based on my testing on Fedora 31 with Python 3.7.9, and also a bit on Ubuntu 18.04). Venv does create a <dir>/bin/python for you, but under normal circumstances this is a symlink to whatever version of Python you ran the venv module with. On Linux, by default this will be the system installed version of Python, which means that a normal system package update of it will automatically update all your venvs too.

(As usual, currently running processes will not magically be upgraded; you'll need to restart them.)

This does however mean that you can shoot yourself in the foot by moving a venv around or by upgrading the system significantly enough. The directory tree created contains directories that include the minor version of Python, such as the site-packages directory (normally found as <dir>/lib/python3.<X>/site-packages). If you upgrade the system Python to a new minor version (perhaps by doing a Linux distribution version upgrade, or by replacing the server with a new server running a more current version), or you move the venv between systems with different Python minor versions, your venv probably won't work because it's looking in the wrong place.

(For instance, Ubuntu 18.04 LTS has Python 3.6.9, Fedora 31 has Python 3.7.9, and Ubuntu 20.04 LTS has Python 3.8.2. I deal with all three at work.)
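You can see both behaviors for yourself with a throwaway venv. A sketch (using --without-pip so the demonstration doesn't depend on ensurepip, which some distributions package separately):

```python
import os
import subprocess
import sys
import tempfile

venv_dir = os.path.join(tempfile.mkdtemp(), "demo-venv")
# --without-pip sidesteps ensurepip, which some distributions split out.
subprocess.run([sys.executable, "-m", "venv", "--without-pip", venv_dir],
               check=True)

# On Unix, bin/python is normally a symlink back to the interpreter
# that created the venv, not a copied binary.
venv_python = os.path.join(venv_dir, "bin", "python")
print("symlink:", os.path.islink(venv_python))

# The site-packages path embeds the Python minor version, which is
# exactly what breaks when the system Python moves to a new minor version.
ver = "python%d.%d" % sys.version_info[:2]
site = os.path.join(venv_dir, "lib", ver, "site-packages")
print("versioned site-packages exists:", os.path.isdir(site))
```

(This assumes a Unix layout; Windows venvs arrange things differently.)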

You can easily fix this with 'python3 -m venv --upgrade <DIR>', but you have to remember that you need to do this. The good news is that whatever is trying to use the venv is probably going to fail immediately, so you'll know right away that you need it.

PS: One way to 'move' a venv between systems this way is to have an environment with multiple versions of a Linux distribution in use (as we do), and to build venvs on NFS filesystems that are mounted everywhere.

Why I write recursive descent parsers (despite their issues)

Today I read Laurence Tratt's Which Parsing Approach? (via), which has a decent overview of how parsing computer languages (including little domain specific languages) is not quite the well solved problem we'd like it to be. As part of the article, Tratt discusses how recursive descent parsers have a number of issues in practice and recommends using other things, such as a LR parser generator.

I have a long standing interest in parsing, I'm reasonably well aware of the annoyances of recursive descent parsers (although some of the issues Tratt raised hadn't occurred to me before now), and I've been exposed to parser generators like Yacc. Despite that, my normal approach to parsing any new little language for real is to write a recursive descent parser in whatever language I'm using, and Tratt's article is not going to change that. My choice here is for entirely pragmatic reasons, because to me recursive descent parsers generally have two significant advantages over all other real parsers.

The first advantage is that almost always, a recursive descent parser is the only, or at least the easiest, form of parser you can readily create using only the language's standard library and tooling. In particular, parsing LR, LALR, and similar formal grammars generally requires you to find, select, and install a parser generator tool (or more rarely, an additional package). Very few languages ship their standard environment with a parser generator (or a lexer, which is often required in some form by the parser).

(The closest I know of is C on Unix, where you will almost always find some version of lex and yacc. Not entirely coincidentally, I've used lex and yacc to write a parser in C, although a long time ago.)

By contrast, a recursive descent parser is just code in the language. You can obviously write that in any language, and you can build a little lexer to go along with it that's custom fitted to your particular recursive descent parser and your language's needs. This also leads to the second significant advantage, which is that if you write a recursive descent parser, you don't need to learn a new language, the language of the parser generator, and also learn how to hook that new language to the language of your program, and then debug the result. Your entire recursive descent parser (and your entire lexer) are written in one language, the language you're already working in.
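As an illustration of 'just code in the language', here is a minimal recursive descent parser (with a custom-fitted little lexer) for arithmetic expressions. The grammar is a toy one picked for the example, not from any real program of mine:

```python
import re

# Lexer: numbers and single-character operators, whitespace skipped.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(text):
    tokens = []
    for num, op in TOKEN_RE.findall(text):
        tokens.append(("NUM", int(num)) if num else ("OP", op))
    tokens.append(("EOF", None))
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    # expr -> term (('+' | '-') term)*
    def expr(self):
        val = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.next()
            rhs = self.term()
            val = val + rhs if op == "+" else val - rhs
        return val

    # term -> factor (('*' | '/') factor)*
    def term(self):
        val = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            _, op = self.next()
            rhs = self.factor()
            val = val * rhs if op == "*" else val / rhs
        return val

    # factor -> NUM | '(' expr ')'
    def factor(self):
        kind, value = self.next()
        if kind == "NUM":
            return value
        if (kind, value) == ("OP", "("):
            val = self.expr()
            if self.next() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return val
        raise SyntaxError("unexpected token %r" % (value,))

print(Parser("2 + 3 * (4 - 1)").expr())  # 11
```

Each grammar rule is one method, and precedence falls naturally out of which methods call which, which is part of why the approach is so mechanical.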

If I were routinely working in a language that had a well respected de facto standard parser generator and lexer, and regularly building parsers for little languages for my programs, it would probably be worth mastering these tools. The time and effort required to do so would be more than paid back in the end, and I would probably have a higher quality grammar too (Tratt points out how recursive descent parsers hide ambiguity, for example). But in practice I bounce back and forth between two languages right now (Go and Python, neither of which has such a standard parser ecology), and I don't need to write even a half-baked parser all that often. So writing another recursive descent parser using my standard process for this has been the easiest way to do it every time I needed one.

(I've developed a standard process for writing recursive descent parsers that makes the whole thing pretty mechanical, but that's a discussion for another entry or really a series of them.)

PS: I can't comment about how easy it is to generate good error messages in modern parser generators, because I haven't used any of them. My experience with my own recursive descent parsers is that it's generally straightforward to get decent error messages for the style of languages that I create, and usually simple to tweak the result to give clearer errors in some specific situations (eg, also).

When the Go garbage collector will panic over bad pointer values

For some time, I've vaguely remembered that the Go garbage collector actually checked Go pointer values and would panic if it found that an alleged pointer (including unsafe.Pointer values) didn't point to a valid object. Since the garbage collector may interrupt you at almost random points, this would make it very dangerous to play around with improper unsafe.Pointer values. However, this was just a superstitious memory, so today I decided to find out what the situation is in current Go by reading the relevant runtime source code (for the development version of Go, which is just a bit more recent than Go 1.15 as I write this).

As described in Allocator Wrestling (see also), Go allocates ordinary things (including goroutine stacks) from chunks of memory called spans that are themselves allocated as part of arenas. Arenas (and spans) represent address space that is used as part of the Go heap, but they may not currently have all of their memory allocated from the operating system. A Go program always has at least one arena created as part of its address space.

Based on reading the code, I believe that the Go garbage collector panics if it finds a Go pointer that points inside a created arena but is not within the bounds of a span that is currently in use (including spans used for stacks). The Go garbage collector completely skips checking pointers that don't fall within a created arena. The comment in the source code says '[t]his pointer may be to some mmap'd region, so we allow it', which might lead you to think it's talking about your potential use of mmap(), but the Go runtime itself allocates a number of things outside of arenas in memory it has mmap()'d, and obviously the garbage collector can't panic over pointers to those.

The address space available on 64-bit machines is very large and many Go programs will use only a small portion of it for created arenas. The practical consequence of this is that many random 'pointer' values will not fall within the bounds of your program's arenas and so won't trigger garbage collector panics. You're probably more likely to produce these panics if you start with valid Go pointers and then manipulate them in sufficiently improper ways (but not so improperly that the pointer value flies off too far).

(So my superstitious belief has some grounding in reality but was probably way too broad. It's certainly not safe to put bad values in unsafe.Pointers, but in practice most bad values won't be helpfully diagnosed with panics from the garbage collector; instead you'll get other, much more mysterious issues when you try to use them for real.)

An additional issue is that spans are divided up into objects, not all of which are necessarily allocated at a given time. The current version of the garbage collector doesn't seem to attempt to verify that all pointers point to allocated objects inside spans, so I believe that if you're either lucky or very careful in your unsafe.Pointer manipulation, you can create a non-panicking pointer to a currently free object that will later be allocated and used by someone else.

(It's possible that such a pointer could cause garbage collector panics later on under some circumstances.)

The Go runtime also contains a much simpler pointer validity check (and panic) in the code that handles copying and adjusting goroutine stacks when they have to grow. This simply looks for alleged pointers that have a value that's 'too small' (but larger than 0), where too small is currently 4096. I believe that such bad pointers will pass the garbage collector's check, because they point well outside any created arena.

Both of these panics can be turned off with the same setting in $GODEBUG, as covered in the documentation for the runtime package. As you would expect, the setting you want is 'invalidptr=0'.

People who want to see the code for this should look in runtime/mbitmap.go's findObject(), runtime/mheap.go's spanOf(), and runtime/stack.go's adjustpointers().

I'm now a user of Vim, not classical Vi (partly because of windows)

In the past I've written entries (such as this one) where I said that I was pretty much a Vi user, not really a Vim user, because I almost entirely stuck to Vi features. In a comment on my entry on not using and exploring Vim features, rjc reinforced this, saying that I seemed to be using vi instead of vim (and that there was nothing wrong with this). For a long time I thought this way myself, but these days this is not true any more. These days I really want Vim, not classical Vi.

The clear break point where I became a Vim user instead of a Vi user was when I started internalizing and heavily using Vim's (multi-)window commands (also). I started this as far back as 2016 (as signalled by this entry), but it took a while before the window commands really sank in and using them became a routine habit (like using 'vi -o' on most occasions when I'm editing multiple files). I'm not completely fluid with Vim windows and I certainly haven't mastered all the commands, but at this point I definitely don't want to go back to not having them available.

(In my old vi days, editing multiple files was always a pain point where I would start reaching for another editor. I just really want to see more than one file on a screen at once in my usual editing style. Sometimes I want to see more than one spot in a file at the same time, too, especially when coding.)

I also very much want Vim's unlimited undo and redo, instead of a limited size undo. There are a bunch of reasons for this, but one of them is certainly that the Vi command set makes it rather easy to accidentally do a second edit operation as you're twitching around before you realize that you actually want to undo the first one. This is especially the case if your edit operation was an accident (where you hit the wrong keys by mistake or didn't realize that you weren't in insert mode), or if you've developed the habit of reflexively reflowing your current paragraph any time you pause in writing.

(There are probably other vim features I've become accustomed to without realizing it or without realizing that they're Vim features, not basic Vi features. Everywhere I use 'vi', it's really Vim.)

Although I'm now unapologetically using vim, my vimrc continues to be pretty minimal and is mostly dedicated to turning things off and setting sensible (ie modern) defaults, instead of old vi defaults. I'm unlikely to ever try to turn my vim into a superintelligent editor for reasons beyond the scope of this entry.

(I do use one Vim plugin in some of my vim setups, Aristotle Pagaltzis' vim-buftabline. I would probably be more enthused about it if I edited lots of files at once in my vim sessions, but usually I don't edit more than a couple at once.)

Rolling distribution releases versus periodic releases are a tradeoff

In reaction to my entry on the work involved for me in upgrading Fedora, Ben Cotton wrote a useful entry, What do “rolling release” and “stable” mean in the context of operating systems?. In the entry, Ben Cotton sort of mentioned something in passing that I want to emphasize, which is that the choice between a rolling release and a periodic release is a tradeoff, not an option where there is a clear right answer.

In the Linux world, fundamentally things change because the upstreams of our software change stuff around. Firefox drops support for old style XUL based addons (to people's pain); Gnome moves from Gnome 2 to Gnome 3 (an interface change that I objected to very strongly); Upstart loses out to systemd; Python 2 stops being supported; and so on. As people using a distribution, we cannot avoid these changes for long, and attempting to do so gives you 'zombie' distributions. So the question is when we get these changes inflicted on us and how large they are.

In a rolling release distribution, you get unpredictably spaced changes of unpredictable size but generally not a lot of change at once. Your experience is likely going to be a relatively constant small drumbeat of changes, with periodic bigger ones. Partly this is because large projects don't all change things at the same time (or even do releases at the same time), and partly this is because the distribution itself is not going to want to try to shove too many big changes in at once even if several upstreams all do big releases in close succession.

In a periodic release distribution, you get large blocks of change at predictable points (when a new release is made and you upgrade), but not a lot of change at other times. When you upgrade you may need to do a lot of adjustment at once, but other than that you can sit back. In addition, if something changes in your environment it may be hard to figure out what piece of software caused the change and what you can do to fix it, because so many things changed at the same time.

(In a rolling release distribution, you can often attribute a change in your environment to a specific update of only a few things that you just did.)

Neither of these choices of when and how to absorb changes is 'right'; they are a tradeoff. Some people will prefer one side of the tradeoff, and other people will prefer the other. Neither is wrong (or right), because it is a preference, and people can even change their views of what they want over time or in different circumstances.

(Although you might think that I come down firmly on the side of rolling releases for my desktops, I'm actually not sure that I would in practice. I may put off Fedora releases a lot because of how much I have to do at once, but at the same time I would probably get very irritated if I was frequently having to fiddle with some aspect of my custom, non-standard desktop environment. It's a nice thing that I got everything working at the start of Fedora 31 and haven't had to touch it since.)

Some notes on what Fedora's DNF logs and where

In comments on my entry on why Fedora release upgrades are complicated and painful for me, Ben Cotton wound up asking me to describe my desired experience for DNF's output during a release upgrade. This caused me to go out and look at what DNF actually logs today (as opposed to its console output), so here are some notes. The disclaimers are that this is on my Fedora systems, which I think are reasonably stock but may not be, and that none of this is documented anywhere that I could find in a quick skim of DNF manpages, so I'm probably wrong about parts.

DNF logs to /var/log in three separate files, dnf.log, dnf.rpm.log, and dnf.librepo.log. Of these, dnf.librepo.log appears to be the least interesting, as all my version has is information about what metadata and packages have been downloaded and some debugging information if checksums don't match.

The dnf.log file contains copies of the same information about what package updates will be done and were done as dnf itself prints interactively. It also contains a bunch of debug information about DNF's cache manipulation, usage of deltarpm, the dates of repository metadata, and other things of less interest (at least to me). It looks like it's possible to reconstruct most or all of your DNF command lines from the information here, which could be useful under some circumstances.

Finally, dnf.rpm.log has the really interesting stuff. This seems to be a verbose log of the RPM level activity that DNF does (or did during an upgrade or package install). This includes the actual packages upgraded and removed, verbose information about .rpmnew and .rpmsave files being created and manipulated (which is normally printed by RPM itself), and what seems to be a copy of most output from RPM package scripts, including output that doesn't seem to normally get printed to the terminal by DNF. This is a gold mine if you want to go back through an upgrade to look for RPM package messages that you didn't spot at the time, although you'll have to pick through a lot of debugging output.

(I initially thought that dnf.rpm.log contained all output, but at least during Fedora release upgrades it appears to miss some things that are printed to the terminal, based on my notes and script captures.)

When DNF (or perhaps RPM via DNF) reports upgrades interactively, they happen in two stages: the new version of the package is upgraded (ie installed), which will run its postinstall script, and then later there is a cleanup of the old version of the package, which will run its post-uninstall script (if any). dnf.rpm.log doesn't use quite the same terminology when it logs the second phase. The upgrade phase appears as 'SUBDEBUG Upgrade: ...', but the cleanup phase is reported as 'SUBDEBUG Upgraded: ...'. If you remove something, for example because an old kernel is being removed when you install a new one, it's reported as 'SUBDEBUG Erase:'. When a new package is installed (including a new kernel), it is reported as 'SUBDEBUG Installed:', instead of the 'Install:' that you might expect for symmetry with upgrades.

(I don't know how downgrades or obsoletes are reported; I haven't dug through my DNF logs that much.)
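Because the terminology is irregular, pulling package events out of dnf.rpm.log takes a little pattern matching. Here's a sketch in Python, run against made-up sample lines in the general style of the log rather than a real log file:

```python
import re

# Made-up lines in the general style of dnf.rpm.log entries.
sample_lines = [
    "2020-09-01T12:00:00Z SUBDEBUG Upgrade: bash-5.0.17-1.fc32.x86_64",
    "2020-09-01T12:00:05Z SUBDEBUG Upgraded: bash-5.0.11-2.fc32.x86_64",
    "2020-09-01T12:00:10Z SUBDEBUG Erase: kernel-core-5.7.14-100.fc32.x86_64",
    "2020-09-01T12:00:15Z SUBDEBUG Installed: kernel-core-5.8.4-100.fc32.x86_64",
]

# Map the log's irregular verbs onto what actually happened.
MEANING = {
    "Upgrade": "new version installed",
    "Upgraded": "old version cleaned up",
    "Erase": "package removed",
    "Installed": "new package installed",
}

pat = re.compile(r"SUBDEBUG (Upgrade|Upgraded|Erase|Installed): (\S+)")
events = []
for line in sample_lines:
    m = pat.search(line)
    if m:
        events.append((MEANING[m.group(1)], m.group(2)))

for meaning, pkg in events:
    print(meaning, "->", pkg)
```

The same loop pointed at /var/log/dnf.rpm.log would give you a compact summary of what a given upgrade actually did at the RPM level.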

Unlike interactive DNF, dnf.rpm.log doesn't record the mere fact that scriptlets have been run. If they're run and don't produce any output that the log captures, they're invisible. This is probably not a problem for logging purposes; interactively, it's mostly useful as a hint to why DNF seems to be sitting around not doing anything.

None of these logs are a complete replacement for capturing a DNF session with script, as far as I can tell (although some of my information here is effectively from the Fedora 30 version of DNF, not the Fedora 31 or 32 ones). However they're at least a useful supplement, and skimming them is faster than using 'less -r ...' on a script capture of a DNF session.

My take on permanent versus temporary HTTP redirects in general

When I started digging into the HTTP world (which was around the time I started writing DWiki), the major practical difference between permanent and temporary HTTP redirects was that browsers aggressively cached permanent redirects. This meant that permanent redirects were somewhat of a footgun; if you got something wrong about the redirect or changed your mind later, you had a problem (and other people could create problems for you). While there are ways to clear permanent redirects in browsers, they're generally so intricate that you can't count on visitors to do them (here's one way to do it in Firefox).

(Since permanent redirects fix both the fact that the source URL is being redirected and what the target URL is, they provide not one but two ways for what you thought was permanent and fixed to need to change. In a world where cool URLs change, permanence is a dangerous assumption.)

Also, back then, in theory, syndication feed readers, web search engines, and other things that care about the canonical URLs of things would use a permanent redirect as a sign to update what that was. This worked some of the time in some syndication feed readers for updating feed URLs, but definitely not always; software authors had to go out of their way to support this, and there were things that could go wrong (cf). Even back in those days, I don't know if web search engines paid much attention to it as a signal.

All of this got me to use temporary redirections almost all of the time, even in situations where I thought that the redirection was probably permanent. That Apache and other things made temporary redirections the default also meant that it was somewhat easier to set up my redirects as temporary instead of permanent. Using temporary redirects potentially meant somewhat more requests and a somewhat longer delay before some people with some URLs got the content, but I didn't really care, not when set against the downsides of getting a permanent redirect wrong or needing to change it after all.
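For concreteness, here's a small sketch of the mechanics using Python's http.server, with a made-up target URL; flipping PERMANENT switches the status code between a temporary 302 and a permanent 301:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PERMANENT = False  # flip to True to hand out 301s instead of 302s


class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301 if PERMANENT else 302)
        # Made-up target URL for the demonstration.
        self.send_header("Location", "https://example.com/canonical")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetch without following the redirect, so we see the raw response.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("GET", "/old-url")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
server.shutdown()
```

In Apache terms this is the difference between 'Redirect temp' (the default) and 'Redirect permanent'; the HTTP exchange is the same either way apart from the status code.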

In the modern world, I'm not sure how many people will have permanent HTTP redirects cached in their browsers any more. Many people browse in more constrained environments where browsers are throwing things out on a regular basis (ie phones and tablets), browsers have probably gotten at least a bit tired of people complaining about 'this redirect is stuck', and I'm sure that some people have abused that long term cache of permanent redirects to fingerprint their site visitors. On the one hand, this makes the drawback of permanent redirects less important, but on the other hand this makes their advantages smaller.

Today I still use temporary redirects most of the time, even for theoretically permanent things, but I'm not really systematic about it. Now that I've written this out, maybe I will start to be, and simply default to temporary redirects from now on unless there's a compelling reason to use a permanent one.

(One reason to use a permanent redirect would be if the old URL has to go away entirely at some point. Then I'd want as strong a signal as possible that the content really has migrated, even if only some things will notice. Some is better than none, after all.)

Permanent versus temporary redirects when handling extra query parameters on your URLs

In yesterday's entry on what you should do about extra query parameters on your URLs, I said that you should answer with a HTTP redirect to the canonical URL of the page and that I thought this should be a permanent redirect instead of a temporary one for reasons that didn't fit into the entry. Because Aristotle Pagaltzis asked, here is why I think permanent redirects are the right option.

As far as I know, there are two differences in client behavior (including web spider behavior) between permanent HTTP redirects and temporary ones: clients don't cache temporary redirects, and they don't consider temporary redirects to change the canonical URL of the resource. If you use permanent redirects, you thus probably make it more likely that web search engines will conclude that your canonical URL really is the canonical URL and they don't need to keep re-checking the other one, at the potential downside of having browsers cache the redirect and never re-check it.

So the question is if you'll ever want to change the redirect or otherwise do something else when you get a request with those extra query parameters. My belief is that this is unlikely. To start with, you're probably not going to reuse other people's commonly used extra query parameters for real query parameters of your own, because other people use them and will likely overwrite your values with theirs.

(In related news, if you were previously using a 's=..' query parameter for your own purposes on URLs that people will share around social media, someone out there has just dumped some pain on top of you. Apparently it may be Twitter instead of my initial suspect of Slack, based on a comment on this entry.)

If you change the canonical URL of the page, you're going to need a redirect for the old canonical URL anyway, so people with the 'extra query parameters' redirect cached in their browser will just get another redirect. They can live with that.

The only remaining situation I can think of where a cached permanent redirection would be a problem would be if you want to change your web setup so that you deliberately react to specific extra query parameters (and possibly their values) by changing your redirects or rendering a different version of your page (without a redirect). This strikes me as an unlikely change for most of my readers to want to make (and I'm not sure how common customizing pages to the apparent traffic source is in general).

(Also, browsers don't cache permanent redirects forever, so you could always turn the permanent redirects into temporary ones for a few months, then start doing the special stuff.)

PS: I don't think most clients do anything much about changing the 'canonical URL' of a resource if the initial request gets a permanent redirect. Even things like syndication feed readers don't necessarily update their idea of your feed's URL if you provide permanent redirects, and web browsers are even less likely to change things like a user's bookmarks. These days, even search engines may more or less ignore it, because people do make mistakes with their permanent redirects.

What you should do about extra query parameters on your URLs

My entry on how web server laxness created a de facto requirement to accept arbitrary query parameters on your URLs got a number of good comments, so I want to agree with and magnify the suggestion about what to do about these parameters. First off, you shouldn't reject web page requests with extra query parameters. I also believe that you shouldn't just ignore them and serve the regular version of your web page. Instead, as said by several commentators, you should answer with a HTTP redirect to the canonical URL of the web page, which will be stripped of at least the extra query parameters.

(I think that this should be a permanent HTTP redirect instead of a temporary one for reasons that don't fit within the margins of this entry. Also, this assumes that you're dealing with a GET or a HEAD request.)

Answering with a HTTP redirect instead of the page has two useful or important effects, as pointed out by commentators on that entry. First, any web search engines that are following those altered links won't index duplicate versions of your pages and get confused about which is the canonical one (or downrate you in results for having duplicate content). Second, people who copy and reshare the URL from their browser will be sharing the canonical URL, not the messed up version with tracking identifiers and other gunk. This assumes that you don't care about those tracking identifiers, but I think this is true for most of my readers.

(In addition, you can't count on other people's tracking identifiers to be preserved by third parties when your URLs get re-shared. If you want to track that sort of stuff, you probably need to add your own tracking identifier. You might care about this if, for example, you wanted to see how widely a link posted on Facebook spread.)
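A sketch of the canonical-URL computation itself, with a made-up allowlist standing in for whatever query parameters your application actually knows about:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: the query parameters this application uses.
KNOWN_PARAMS = {"page", "q"}

def canonical_url(url):
    """Drop any query parameters we don't recognize, keeping the rest."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in KNOWN_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

# A request URL with a tracking parameter tacked on (fbclid is one
# real-world example of such gunk).
url = "https://example.com/entry?page=2&fbclid=abc123"
print(canonical_url(url))  # https://example.com/entry?page=2
```

If the canonical URL differs from the requested one, answer with the redirect to it; otherwise serve the page as usual.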

However, this only applies to web pages, not to API endpoints. Your API endpoints (even GET ones) should probably error out on extra query parameters unless there is some plausible reason they would ever be usefully shared through social media. If your API endpoints never respond with useful HTML to bare GETs, this probably doesn't apply. If you see a lot of this happening with your endpoints, you might make them answer with HTTP redirects to your API documentation or something like that instead of some 4xx error status.

(But you probably should also try to figure out why people are sharing the URLs of your API endpoints on social media, and other people are copying them. You may have a documentation issue.)

PS: As you might suspect, this is what DWiki does, at least for the extra query parameters that it specifically recognizes.