Unplanned downtime 25/11/2010

Hi folks

This is just a notification of downtime that occurred today.

During a security improvement cleanup sweep of the file tree, a few files required by the system monitor Nagios were accidentally hidden away from it’s view. As such, it started reporting errors on several bits of security software at approximately 03:02. In order to protect the system, these “faulty” bits of software were taken down for immediate repair. However, the procedure to disable the security software also disabled the MySQL database server at approximately 03:10.

The loss of the MySQL server had the following effects:

  • Helpmebot lost connectivity and shut itself down.
  • Exim lost connectivity, and shut itself down.
  • Spamassassin was no longer being depended upon, and so was killed.
  • Due to the lack of Fail2Ban, vsFTPd shut itself down.
  • Due to an unfortunate configuration dependency, Apache was also shut down.

This chain of events effectively killed every service that was running on the server. The offending files were moved to a different location, and the majority of services were recovered by 03:21.

vsFTPd was not restored until after all apparent services had been re-enabled, and Nagios was happy (~ 03:30).

Final cleanup from this incident finished at approximately 04:16.

Apologies for the outage everyone, I’ll try not to do that again. :D

2 thoughts on “Unplanned downtime 25/11/2010

  1. Score dude!

    Seriously though, it didn’t cause too much of an issue. Don’t beat yourself up over it.

    Cheers,

    Sven

Comments are closed.