The cluster welcomes its third member! Caterpie was created from a need – the need for a Minecraft SMP server to be hosted *somewhere*. My previous plans for the third cluster machine haven’t gone out the window – those plans have just moved almost entirely to metapod :P

Caterpie is an Amazon Web Services Elastic Compute Cloud (EC2) node, based on the “free tier” of services. The free tier doesn’t allow much in the way of memory, so I’m not entirely sure how well this will run.

It may turn out that the whole thing doesn’t actually work, in which case I’ll scrap it, but it’s worth a try. :)

Apologies about the HUGE hole near the spawn point – I thought I’d play with the server commands and TNT… ohai bedrock. :D

Server maintenance 2011/01/17

At 00:00Z the server was rebooted to enable a network configuration change (the addition of a new network adaptor) to take effect. Estimated downtime was 5 minutes.

However, the reboot was not without issues:

  • DNS resolver configuration disappeared
  • Puppet failed to restart
  • nagios-hmb failed to restart
  • Puppetmaster messed up with hostnames/certificates

The only major issue here was the DNS resolver configuration disappearing. What actually happened was the contents of the file /etc/resolv.conf changed from this:

... to this:

Now, that's slightly concerning, because I'd never seen that message before this reboot. And, as this config file was now effectively empty, I had zero possibility of doing hostname lookups. As all the software was configured to use hostnames in its database config, none of it could actually connect, so I shut down Apache to limit the damage.
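A check along these lines would flag a resolv.conf that has lost its nameserver entries. This is only a minimal sketch in Python: the function names and the sample contents are hypothetical, made up for illustration, and are not the real file contents from this incident.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def is_effectively_empty(resolv_conf_text):
    """True when the text contains no nameserver entries at all."""
    return not nameservers(resolv_conf_text)


# Hypothetical sample contents, NOT the real files from this incident.
HEALTHY = "search example.org\nnameserver 192.0.2.53\n"
BROKEN = "# a generated comment, but no nameserver lines\n"
```

To check the live file, something like `is_effectively_empty(open("/etc/resolv.conf").read())` should do.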

It turns out that there’s a program called resolvconf, which has a few files in /etc/resolvconf/ that control the contents of that file. Indeed, most of the original config was in /etc/resolvconf/resolv.conf.d/original, so it was a simple case of putting the config into the correct place. This seemed to do the trick:

After that, restarting Apache was all that was needed to bring the site back up to where it should be. Total public-facing downtime was about 13 minutes.

Starting nagios-hmb and puppet back up was an easier one to solve, at least until I got some worrying stuff when I tried to run an update of puppet manually...

It turns out that when everything networkish failed earlier, our friend puppetmasterd somehow changed its hostname or something, and all the certificates became invalid. With a bit of thanks to Wikitech, I managed to re-certify all the stuff, though I also removed /etc/puppet/ssl and restarted everything to do with puppet too.

After that, everything seems fine - if not, please let me know ASAP.

Server switch

Throughout today, I’m moving this domain over to the new server, blastoise.

However, it’s not just a simple move as many of the other hosted domains have been. I’ve got a bit more work to do for this one – as such it’s going to be a bit up/down for a while as I break stuff.

Sorry for the problems, I’m just trying to make this better for everyone.

Service migration etc

Hi folks
TLDR: scroll down to the next bold bit

As part of the clustering setup I now have nicely running, I’m planning on moving practically everything over bit-by-bit to the new machine, then I’m gonna bring the old machine down for a rebuild.

I’m in the process of migrating DNS to the Linode DNS manager – a much easier and more central place to manage all of our DNS entries than various DynDNS/Domainmonster pages. The migration is under way now, but each domain will take up to four hours to settle after I migrate it.

I’ve pushed the “caskie” domain and helpmebot’s domain to Linode now, and will complete the rest later tonight.

My next step is to perform a database master switch from charmander to blastoise. This is something I’ve never done before, and therefore don’t know how smoothly it will go. I’ve got a few things to do first (set up ALL database access hostnames to use the new roles in DNS rather than specific IP addresses and/or hostnames), but then I’m hoping it will be a simple case of setting the master read-only, waiting for the slave to catch up, stopping replication and starting it the other way, and then setting the new master read-write. At some point I’ll have to update DNS too – but I’m not quite sure of the best point to do that.
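The switch steps above can be sketched as an ordered plan. This is a rough Python sketch under assumptions, not a tested procedure: it assumes the database is MySQL with classic binary-log replication, and the function names, the replication user and the log file/position values are hypothetical placeholders.

```python
def change_master_sql(host, user, password, log_file, log_pos):
    """Build the CHANGE MASTER TO statement that points a server at `host`.

    Values are interpolated without escaping, so this is for illustration only.
    """
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', MASTER_USER='{user}', "
        f"MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}"
    )


def switch_plan(old_master, new_master):
    """Return the switch as ordered (server, statement, note) tuples."""
    return [
        (old_master, "SET GLOBAL read_only = ON",
         "stop ordinary writes on the current master"),
        (new_master, "SHOW SLAVE STATUS",
         "repeat until the slave has applied everything from the master"),
        (new_master, "STOP SLAVE",
         "stop replication in the old direction"),
        (new_master, "SHOW MASTER STATUS",
         "note the binary log file and position"),
        (old_master, "CHANGE MASTER TO ...",
         "build with change_master_sql() from the values just noted"),
        (old_master, "START SLAVE",
         "start replication in the new direction"),
        (new_master, "SET GLOBAL read_only = OFF",
         "the new master now accepts writes"),
    ]


if __name__ == "__main__":
    for server, statement, note in switch_plan("charmander", "blastoise"):
        print(f"{server}: {statement}  ({note})")
```

The plan only prints the steps; it connects to nothing. Where the DNS role update belongs in this order is left open here, as it is in the text above.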

When that’s done, I’ll start slowly moving sites over to the new server, followed by the few straggling services that still run on charmander.

**Stuff you need to be aware of:**

  • Database migration may cause up to 30 minutes downtime/disruption in the near future
  • Site migration may cause up to 30 minutes downtime/disruption in the near future

I’ll let you know exactly when closer to the time, and I’ll try and keep it to a low-use time if possible.

Sorry for any inconvenience – I’m trying to build a better service :)

(Cross-posting to )

Unplanned downtime 25/11/2010

Hi folks

This is just a notification of downtime that occurred today.

During a security improvement cleanup sweep of the file tree, a few files required by the system monitor Nagios were accidentally hidden away from its view. As such, it started reporting errors on several bits of security software at approximately 03:02. In order to protect the system, these “faulty” bits of software were taken down for immediate repair. However, the procedure to disable the security software also disabled the MySQL database server at approximately 03:10.

The loss of the MySQL server had the following effects:

  • Helpmebot lost connectivity and shut itself down.
  • Exim lost connectivity, and shut itself down.
  • Spamassassin was no longer being depended upon, and so was killed.
  • Due to the lack of Fail2Ban, vsFTPd shut itself down.
  • Due to an unfortunate configuration dependency, Apache was also shut down.

This chain of events effectively killed every service that was running on the server. The offending files were moved to a different location, and the majority of services were recovered by 03:21.
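The knock-on shutdowns can be modelled as a walk over a "needs" graph. The Python sketch below is illustrative only: the edges are assumptions read off the list above (in particular, Apache is assumed to hang off MySQL, since the text only mentions a configuration dependency), and Spamassassin is left out because it was killed for the opposite reason, namely that nothing depended on it any more.

```python
from collections import deque

# Assumed "service -> things it needs" edges, for illustration only.
DEPENDS_ON = {
    "helpmebot": ["mysql"],
    "exim": ["mysql"],
    "vsftpd": ["fail2ban"],
    "apache": ["mysql"],  # assumption: the exact dependency is not stated
}


def cascade(failed, depends_on):
    """Return every service that ends up down once `failed` has stopped."""
    down = set(failed)
    queue = deque(failed)
    while queue:
        stopped = queue.popleft()
        # Anything that needs the stopped service goes down as well.
        for service, needs in depends_on.items():
            if stopped in needs and service not in down:
                down.add(service)
                queue.append(service)
    return down


if __name__ == "__main__":
    print(sorted(cascade(["mysql", "fail2ban"], DEPENDS_ON)))
```

Running the same function over a real dependency map before disabling a service would show in advance what else is likely to go with it.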

vsFTPd was not restored until after all apparent services had been re-enabled, and Nagios was happy (~ 03:30).

Final cleanup from this incident finished at approximately 04:16.

Apologies for the outage everyone, I’ll try not to do that again. :D