blastoise.helpmebot.org.uk was rebooted to enable a network configuration change to take effect (addition of a new network adaptor). Estimated downtime was 5 minutes.
However, the reboot was not without issues:
- DNS resolver configuration disappeared
- Puppet failed to restart
- nagios-hmb failed to restart
- Puppetmaster messed up with hostnames/certificates
The only major issue here was the DNS resolver configuration disappearing. What actually happened was the contents of the file /etc/resolv.conf changed from this:
... to this:
Now, that's slightly concerning, because before this reboot, I'd never seen that message before. And, as this config file was now effectively completely empty, I had zero possibility of doing hostname lookups. As all the software was configured to use dbmaster.helpmebot.org.uk and dbslaveX.helpmebot.org.uk as their database config, they couldn’t actually connect, so I shut down Apache to limit the damage.
It turns out that there’s a program called resolvconf, which has a few files in /etc/resolv-conf/ which control the contents of that file. Indeed, most of the original config was in /etc/resolv-conf/resolv.conf.d/original, so it was a simple case of putting the config into the correct place. This seemed to do the trick:
After that, restarting Apache was all that was needed to bring the site back up to where it should be. Total public-facing downtime was about 13 minutes.
Starting nagios-hmb and puppet back up was an easier one to solve, at least until I got some worrying stuff when I tried to run an update of puppet manually...
It turns out that when everything networkish failed earlier, our friend puppetmasterd somehow changed it's hostname or something, and all the certificates became invalid. With a bit of thanks to Wikitech, I managed to re-certify all the stuff, though I also removed /etc/puppet/ssl and restarted everything to do with puppet too.
After that, everything seems fine - if not please let me know ASAP.