Linode NextGen

Linode have been amazing as usual. They recently increased network capacity (bandwidth increased 1000%), doubled CPU capacity (from 4-core to 8-core), and in the last few days they’ve doubled the RAM in each plan! This is all part of the free* upgrades they roll out every now and again.

(* They’ve also adjusted the pricing plans to be easier and more friendly. For example, their $19.95 plan has been increased to $20. Technically, it’s not free, but for 5 cents?)

The migrations to the new systems take a while – but I’m planning to do it in the middle of the night. Downtime could be a couple of hours. The bonus is with this reboot, the capacity of the entire cluster will be doubled. :o

In other news, I’m planning on retiring pikachu. It’s caused me nothing but trouble with its disk arrays, and the lag from Roubaix compared to the Linodes in London has been less than desirable. Plus, with the new capacity that the London cluster has, there’s no real advantage to having a dedicated machine out in Roubaix (apart from the BOINC work that it’s been sat doing cos it could).

If you’re interested in trying out Linode, they do a free trial! Just please mention this referral code: c88742a23087c8758c2824221626a8c1226c1736

Updates / upgrades / lots of downtime

Hi folks

I’m planning a series of upgrades which WILL impact services running.

blastoise (affects all)

  • OS upgrade
  • Disk space upgrade – Done!
  • General security fixes and package upgrades – Done!

metapod (affects all)

  • OS upgrade
  • Disk space upgrade – Done!
  • General security fixes and package upgrades – Done!

sos-staffs.co.uk

  • phpBB version upgrade – Done!

stwalkerster.co.uk

  • WordPress version upgrade – Done!

blog.stwalkerster.co.uk

  • WordPress version upgrade – Done!

helpmebot.org.uk

  • MediaWiki upgrade
  • MantisBT upgrade

I haven’t a clue how long that’s all gonna take, but I’m planning to start this afternoon with some stuff, maybe do the OS updates just after tea, hopefully to get most of it done before going to bed.

I ought to do this kind of thing more often so it doesn’t get to this amount of work to do.

Server maintenance 2011/01/17 00:00Z

blastoise.helpmebot.org.uk was rebooted to enable a network configuration change to take effect (addition of a new network adaptor). Estimated downtime was 5 minutes.

However, the reboot was not without issues:

  • DNS resolver configuration disappeared
  • Puppet failed to restart
  • nagios-hmb failed to restart
  • Puppetmaster messed up with hostnames/certificates

The only major issue here was the DNS resolver configuration disappearing. What actually happened was the contents of the file /etc/resolv.conf changed from this:

... to this:

Now, that's slightly concerning, because before this reboot, I'd never seen that message before. And, as this config file was now effectively completely empty, I had zero possibility of doing hostname lookups. As all the software was configured to use dbmaster.helpmebot.org.uk and dbslaveX.helpmebot.org.uk as their database config, they couldn’t actually connect, so I shut down Apache to limit the damage.
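For anyone wanting to catch this sort of thing before the databases drop off, here's a rough sketch in Python of a check for an "effectively empty" resolver config. It isn't something I was running at the time. The function names are made up for illustration, and it only looks at `nameserver` lines, treating lines starting with `#` or `;` as comments.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';' at the start).
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def is_effectively_empty(resolv_conf_text):
    """True when the text lists no nameserver at all."""
    return not nameservers(resolv_conf_text)
```

Feeding it the contents of /etc/resolv.conf (for example `is_effectively_empty(open("/etc/resolv.conf").read())`) would have flagged the post-reboot file straight away, assuming it really had no `nameserver` lines left in it.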

It turns out that there’s a program called resolvconf, which has a few files in /etc/resolvconf/ which control the contents of that file. Indeed, most of the original config was in /etc/resolvconf/resolv.conf.d/original, so it was a simple case of putting the config into the correct place. This seemed to do the trick:

After that, restarting Apache was all that was needed to bring the site back up to where it should be. Total public-facing downtime was about 13 minutes.

Starting nagios-hmb and puppet back up was an easier one to solve, at least until I got some worrying stuff when I tried to run an update of puppet manually...

It turns out that when everything networkish failed earlier, our friend puppetmasterd somehow changed its hostname or something, and all the certificates became invalid. With a bit of thanks to Wikitech, I managed to re-certify all the stuff, though I also removed /etc/puppet/ssl and restarted everything to do with puppet too.

After that, everything seems fine - if not please let me know ASAP.

strategistsunion.co.uk server switch

Throughout today, I’m moving this domain over to the new server, blastoise.

However, it’s not just a simple move as many of the other hosted domains have been. I’ve got a bit more work to do for this one – as such it’s going to be a bit up/down for a while as I break stuff.

Sorry for the problems, I’m just trying to make this better for everyone.

Service migration etc

Hi folks
TLDR: scroll down to the next bold bit

As part of the clustering setup I now have nicely running, I’m planning on moving practically everything over bit-by-bit to the new machine, then I’m gonna bring the old machine down for a rebuild.

I’m in the process of migrating DNS to the Linode DNS manager – a much easier and more central place to manage all of our DNS entries than various DynDNS/Domainmonster pages. This is under way now, but each domain will take up to four hours to settle after I migrate it.

I’ve already pushed the “caskie” domain and helpmebot’s domain to Linode, and will complete the rest later tonight.

My next step is to perform a database master switch from charmander to blastoise. This is something I’ve never done before, and therefore don’t know how smoothly it will go. I’ve got a few things to do first (set up ALL database access hostnames to use the new roles in DNS rather than specific IP addresses and/or hostnames), but then I’m hoping it will be a simple case of setting the master read-only, waiting for the slave to catch up, stopping replication and starting it the other way, and then setting the new master read-write. At some point I’ll have to update DNS too – but I’m not quite sure of the best point to do that.
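To make that plan a bit more concrete, here's a rough sketch in Python of the order of operations. It's untested against the real machines and it assumes the databases are MySQL with ordinary binlog replication. The statements are standard MySQL ones, but the function names and the dict layout are made up for illustration, replication credentials are left out, and the DNS update isn't covered.

```python
def caught_up(master_status, slave_status):
    """True when the slave has executed everything the master has logged.

    master_status: one row of SHOW MASTER STATUS as a dict.
    slave_status:  one row of SHOW SLAVE STATUS as a dict.
    """
    return (
        slave_status["Relay_Master_Log_File"] == master_status["File"]
        and int(slave_status["Exec_Master_Log_Pos"]) == int(master_status["Position"])
    )


def switchover_plan(old_master, new_master, new_master_status):
    """Ordered (host, statement) pairs for the master switch.

    new_master_status is the SHOW MASTER STATUS row taken on the new
    master after its slave thread has been stopped. MASTER_USER and
    MASTER_PASSWORD are deliberately omitted from CHANGE MASTER TO.
    """
    change_master = (
        "CHANGE MASTER TO MASTER_HOST='{host}', "
        "MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}"
    ).format(
        host=new_master,
        log_file=new_master_status["File"],
        log_pos=int(new_master_status["Position"]),
    )
    return [
        (old_master, "SET GLOBAL read_only = ON"),   # stop new writes
        (new_master, "STOP SLAVE"),                  # only once caught_up() is True
        (old_master, change_master),                 # replicate the other way
        (old_master, "START SLAVE"),
        (new_master, "SET GLOBAL read_only = OFF"),  # new master takes writes
    ]
```

The idea would be to poll `SHOW MASTER STATUS` on charmander and `SHOW SLAVE STATUS` on blastoise until `caught_up()` returns True, and only then work through the rest of the list. It also assumes binary logging is already switched on for blastoise, otherwise there's nothing for charmander to replicate from.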

When that’s done, I’ll start slowly moving sites over to the new server, followed by the few straggling services that still run on charmander.

Stuff you need to be aware of:

  • Database migration may cause up to 30 minutes downtime/disruption in the near future
  • Site migration may cause up to 30 minutes downtime/disruption in the near future

I’ll let you know closer to the time when, and I’ll try and keep it to a low-use time if possible.

Sorry for any inconvenience – I’m trying to build a better service :)

(Cross-posting to http://status.helpmebot.org.uk/2010/12/service-migration-etc.html )

Unplanned downtime 25/11/2010

Hi folks

This is just a notification of downtime that occurred today.

During a security improvement cleanup sweep of the file tree, a few files required by the system monitor Nagios were accidentally hidden away from its view. As such, it started reporting errors on several bits of security software at approximately 03:02. In order to protect the system, these “faulty” bits of software were taken down for immediate repair. However, the procedure to disable the security software also disabled the MySQL database server at approximately 03:10.

The loss of the MySQL server had the following effects:

  • Helpmebot lost connectivity and shut itself down.
  • Exim lost connectivity, and shut itself down.
  • Spamassassin was no longer being depended upon, and so was killed.
  • Due to the lack of Fail2Ban, vsFTPd shut itself down.
  • Due to an unfortunate configuration dependency, Apache was also shut down.

This chain of events effectively killed every service that was running on the server. The offending files were moved to a different location, and the majority of services were recovered by 03:21.
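For the curious, the knock-on effect can be modelled as a walk over a dependency map. This is only a sketch in Python, not how the server is actually wired up: the map is hand-written from the list above, the names are made up for illustration, and Apache's dependency on MySQL is a guess, since all I've said above is that it was an "unfortunate configuration dependency". Spamassassin is left out because it was killed for the opposite reason (nothing needed it any more).

```python
from collections import deque

# service -> services it needs in order to keep running
DEPENDS_ON = {
    "helpmebot": ["mysql"],
    "exim": ["mysql"],
    "vsftpd": ["fail2ban"],
    "apache": ["mysql"],  # assumption: the actual dependency isn't stated
}


def cascade(initially_down, depends_on):
    """Return every service that ends up down, including the initial ones."""
    down = set(initially_down)
    queue = deque(initially_down)
    while queue:
        failed = queue.popleft()
        for service, needs in depends_on.items():
            # A running service goes down once anything it needs has failed.
            if service not in down and failed in needs:
                down.add(service)
                queue.append(service)
    return down
```

Calling `cascade(["mysql", "fail2ban"], DEPENDS_ON)` takes out every service in the map, which is pretty much what happened.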

vsFTPd was not restored until after all apparent services had been re-enabled, and Nagios was happy (~ 03:30).

Final cleanup from this incident finished at approximately 04:16.

Apologies for the outage everyone, I’ll try not to do that again. :D