Long-Delayed Server Upgrade

Since late 2014, my server has been running CentOS 7. While I’ve appreciated the stability of not doing major upgrades, ten years is a long time to accumulate forgotten or “under-documented” customizations. Like many people, I procrastinated updating until CentOS 7 reached EOL a couple of months ago.

The machine wasn’t nearly as out of date as the version might imply. Between software collections and some strategic local compiling, I was running fairly recent versions of most of the key packages. Even so, keeping all these self-built packages updated was taking up more and more time. The result was that, on the pet versus cattle spectrum, this server had ended up nearly as far over to the pet side as it gets.

I finally rebuilt the server by creating a new AlmaLinux 9 machine and setting it up side by side with the old one until everything matched. The upgrade process was less painful than I expected after such a long gap, but it still took a couple of weeks of testing to make sure everything was working properly. The DevOps people would probably say that I should replace all of this manual testing with an Ansible/Chef/Puppet/Salt setup, but as xkcd points out, the time spent on that would probably never be repaid for such an infrequent occurrence.

Instead, for posterity and my own future edification, here are a few things that went well or could have gone even better.

Document everything you spend time setting up

I had already been keeping a spreadsheet of things the machine was doing that were important to me. I was tracking packages installed, config files changed, and locations of data, along with notes about what goals I was trying to achieve.

Everything in the spreadsheet was a great starting point for making sure some particular feature was restored on the new server (especially notes about things that aren’t captured in obvious config files, like SELinux booleans or policy modifications). However, over the years, the spreadsheet had become somewhat incomplete and outdated. Going forward, I’m going to recommit myself to keeping this documentation updated whenever I install something new.
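
Settings such as SELinux booleans are easy to lose precisely because they live in no obvious config file. One low-effort way to keep the notes honest is to snapshot them now and then and compare the snapshots. The following is a minimal sketch rather than the exact procedure used here: it assumes the output of getsebool -a has been saved to a text file on each machine, and the function names are made up for this example.

```python
def parse_sebool_snapshot(text):
    """Parse saved `getsebool -a` output into {name: "on" or "off"}."""
    booleans = {}
    for line in text.splitlines():
        # Each line looks like: httpd_can_network_connect --> on
        name, sep, value = line.partition("-->")
        if sep:
            booleans[name.strip()] = value.strip()
    return booleans


def diff_sebools(old, new):
    """Return {name: (old_value, new_value)} for booleans that differ.

    A value of None means the boolean does not exist on that machine.
    """
    changed = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed
```

Comparing the old server's snapshot against a freshly installed machine gives a short list to review by hand. Some differences will just be changed distro defaults, so the output is a prompt for the spreadsheet, not a list to apply blindly. On the SELinux side, semanage boolean -l -C lists only the locally customized booleans, which is an even shorter starting point.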

Use the distro package manager

I had already been using RPM to build updates to distro packages and a limited amount of local software. After realizing the advantages of playing nice with the distro package management system, I plan to do this even more in the future:

It tracks dependencies. In a lot of cases, I can easily pull in the new version of whatever library was needed before, or at least I can tell what’s missing.
It keeps track of which config files have been changed. If I hadn’t changed the default config before, there’s a good chance I can take the default with a new version. Conversely, if I had changed the default, I should look carefully at the changes.
It keeps (indirect) track of data locations. If data is stored in the default locations, I know right away which directories or files need to be migrated over.
The spec file in the RPM serves as a record of how to build the software, in case it isn’t as simple as configure; make; make install. Even if it is, the spec file records any special options I might have passed to configure.
Using something like dnf leaves and dnf history userinstalled lets me quickly figure out which things were intentionally added to the system. Other libraries that might no longer be available on the new version are probably irrelevant if they aren’t a dependency of something that was intentionally installed. If local software is part of the package dependency tree, I can quickly reinstall all the packages that were previously important enough to manually add without trying to rebuild every piece of the old system.
Things like dnf autoremove actually work right if locally installed software correctly tells the system which libraries it depends on. Otherwise a locally built piece of software might easily depend on a library that dnf otherwise thinks is unused.
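
The "reinstall what was intentionally added" step can be sketched as a set comparison. This is an illustration under stated assumptions, not the exact commands I ran: it assumes two plain-text lists of package names, one per line, have already been produced (for example with something along the lines of dnf repoquery --userinstalled --qf '%{name}' on the old machine, and a list of available package names on the new one). The function names are hypothetical.

```python
def read_package_names(text):
    """Turn a one-name-per-line listing into a set, skipping blanks and comments."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def plan_reinstall(old_userinstalled, new_available):
    """Split the old machine's intentional installs into two sorted lists.

    installable: still offered by the new distro's repositories.
    missing: needs a rebuild, a replacement, or a decision to drop it.
    """
    installable = sorted(old_userinstalled & new_available)
    missing = sorted(old_userinstalled - new_available)
    return installable, missing
```

The "missing" list is the interesting one: every entry is either a package to rebuild locally or a candidate for the "remove unused stuff" pile. Package renames between releases will show up as false positives, so it still needs a human pass.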

As an example of something that didn’t go smoothly, I had a number of things installed through Python venvs. Because the version of Python available on AlmaLinux 9 is not the same as anything available in CentOS 7, these were all broken after the upgrade. I had to figure out how to recreate them without any explicit documentation of what went into any given venv. If I had built these as RPMs, they would have automatically flagged that I needed to rebuild to get the current version of Python (and they would have automatically included the necessary build steps in the spec file).
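
Short of packaging everything as RPMs, a venv can at least be made self-documenting. The sketch below is one possible approach, not what I did at the time: it uses the standard library's importlib.metadata to read the package metadata that sits on disk in a venv's site-packages directory, so it can be run with any working Python 3 interpreter even when the venv's own interpreter is broken. The function name and the idea of passing in the site-packages path by hand are assumptions for this example.

```python
from importlib import metadata


def freeze_site_packages(site_packages_dir):
    """List "name==version" for every distribution found in one directory.

    This reads the *.dist-info / *.egg-info metadata on disk, so it does not
    need the venv's own (possibly broken) interpreter to run.
    """
    pins = set()
    for dist in metadata.distributions(path=[str(site_packages_dir)]):
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata has no usable name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

Writing the result to a requirements-style file next to each venv records what was in it. The pinned versions may not install on a newer Python, but the list of names is usually enough to rebuild the environment.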

Remove unused stuff

I spent a fair amount of time rebuilding packages that it turned out I didn’t need at all. For example, back in 2014 I had migrated an entire LaTeX environment with a bunch of related printing packages from the previous version of the server. I no longer remember the details, but I suspect I had installed all of this to support writing papers when I was a grad student.

Most of the packages to support LaTeX and printing were no longer available on AlmaLinux 9. I initially started to recompile them all (importing RPMs from Fedora makes this fairly easy, but it is time-consuming), but then I realized that I don’t actually use any of this stuff anymore. If I had removed it when I was done with it in the first place, I would have saved several hours of effort, plus all the updates over the years that I wouldn’t have needed to install on the old server.

Separate local config from vendor config

More and more packages support some kind of drop-in configuration system. This usually looks something like a main /etc/daemon.conf that includes any files from /etc/daemon.conf.d automatically. An alternative version of this in some cases is to ship a default config under /usr and then override that with local versions from /etc. There were already a fair number of packages that worked like this back in 2014 when I set up the server, but I mostly considered it to be wasted effort to separate out my changes instead of just updating the main config file.
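
To make the "defaults under /usr, overrides under /etc" variant concrete, here is a small model of one common convention: a local file shadows a vendor file of the same name, and the surviving files are read in file name order so that later names win. Each program defines its own rules, so this is a sketch of the general idea and not the loader of any particular daemon. The function name and directory arguments are hypothetical.

```python
from pathlib import Path


def effective_dropins(vendor_dir, local_dir, suffix=".conf"):
    """Return the drop-in files that would apply, in load order.

    A file in local_dir shadows a file with the same name in vendor_dir,
    and the survivors are ordered by file name so later names win.
    """
    chosen = {}
    for directory in (Path(vendor_dir), Path(local_dir)):  # local goes last
        if directory.is_dir():
            for path in directory.iterdir():
                if path.is_file() and path.name.endswith(suffix):
                    chosen[path.name] = path
    return [chosen[name] for name in sorted(chosen)]
```

Under this model, everything I changed lives in the local directory, so migrating to a new server means copying that one directory and leaving the vendor files to the package manager.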

During the upgrade, I carefully merged the new default config file with whatever version was on my server. In a lot of cases, this had a bunch of spurious conflicts where things like default comments had changed. In the cases where I had carefully kept my changes in a separate file, it was mostly as easy as copying the file to the same location on the new server and double-checking that all the config options were still valid.
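
Before merging anything, it helps to know which config files were ever changed from the packaged default, since untouched ones can simply take the new default. The package database already knows this: rpm -Va reports files that differ from what the package shipped. The sketch below is one way to filter that report, assuming its output has been saved to a file on the old server. The function name is made up for this example.

```python
import re

# Verify lines look like "S.5....T.  c /etc/chrony.conf": nine result flags,
# an optional one-letter file-type marker ("c" means config file), then the path.
_VERIFY_LINE = re.compile(r"^(?P<flags>\S{9})\s+(?:(?P<kind>[a-z])\s+)?(?P<path>/.*)$")


def changed_config_files(verify_output):
    """Return paths of config files whose contents differ from the packaged version."""
    changed = []
    for line in verify_output.splitlines():
        match = _VERIFY_LINE.match(line)
        if not match:
            continue  # e.g. "missing" lines or dependency messages
        # The third flag is "5" when the file digest no longer matches.
        if match["kind"] == "c" and match["flags"][2] == "5":
            changed.append(match["path"])
    return changed
```

Files that show only a timestamp or permission difference are deliberately skipped here, because their contents still match the package.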

systemd seems to take this to a bit of an extreme with its complicated system of defaults that can be (entirely or partially) overridden by drop-ins in various places with various names. I’m not sure it’s a good idea to take full advantage of this because the resulting setup is very hard to understand. However, I do think the general “conf.d” or “separate local override” schemes have a lot of merit.

Do practice restores from backups

As the saying goes, you don’t really have backups until you’ve tested restoring from them. That was true here, too. I thought I had fairly thorough backups of most of the important data and configuration on the system, but the upgrade turned up a few surprises.

I had planned to transfer some of the services by simply restoring the latest backup onto the new machine. For example, the PostgreSQL internal data formats change between versions, but the SQL created by pg_dump can generally load into the new version just fine. One can’t easily copy the data files, but it’s usually straightforward to just import a backup. I had previously restored databases and done an in-place upgrade on the old server this way, so I assumed the backups contained everything.

It turns out that my backups didn’t include quite everything needed to do a from-scratch restore on a new machine. If I had shut down the old machine before starting the restore, I would have had to manually recreate part of the initial state before I could import the data. Luckily, I caught this while I still had both servers running and was able to fix the backups on the original server.
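
One well-known gap of this kind, offered as an illustration rather than as the specific problem described above: a pg_dump of a single database does not include cluster-wide objects such as roles, which pg_dumpall --globals-only does. A restore onto an empty cluster then fails on ownership and grant statements. The sketch below is a rough check under those assumptions. It scans a plain-SQL database dump for role names and reports any that a separate globals dump would not create. The function name, the regular expressions, and the assumption that a "postgres" superuser already exists on the new cluster are all mine.

```python
import re

# Role names used in a plain-SQL pg_dump script.
_OWNER = re.compile(r'\bOWNER TO "?([A-Za-z_][\w$]*)"?', re.IGNORECASE)
_GRANT = re.compile(r'^\s*GRANT\b[^;]*?\bTO "?([A-Za-z_][\w$]*)"?',
                    re.IGNORECASE | re.MULTILINE)
# Role names that a globals dump creates.
_CREATE = re.compile(r'^\s*CREATE ROLE "?([A-Za-z_][\w$]*)"?',
                     re.IGNORECASE | re.MULTILINE)


def roles_missing_from_backup(database_dump_sql, globals_dump_sql,
                              preexisting=("postgres",)):
    """Return sorted role names the database dump needs but nothing creates.

    preexisting lists roles assumed to exist on a fresh cluster (an assumption;
    the bootstrap superuser's name depends on how the cluster was initialized).
    """
    used = set(_OWNER.findall(database_dump_sql))
    used |= set(_GRANT.findall(database_dump_sql))
    # PUBLIC is a keyword for "everyone", not a role that must be created.
    used = {role for role in used if role.upper() != "PUBLIC"}
    defined = set(_CREATE.findall(globals_dump_sql)) | set(preexisting)
    return sorted(used - defined)
```

A check like this is no substitute for an actual practice restore on a scratch machine, which catches gaps nobody thought to look for, but it is cheap enough to run every time the backup job does.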

Conclusion

Overall, everything went pretty smoothly. The backups mostly worked, the configs mostly merged cleanly, and the packages missing from EPEL and/or AlmaLinux were mostly easy to recreate. I don’t think any of the parts that went particularly well were new ideas, but it was a good reminder that I should keep doing them.

I still don’t plan to go full scripted creation on this machine, but maybe doing a practice upgrade on a test machine every year or two will be worth it to keep my documentation updated.