Hard disk failure

About a week or two ago, I discovered that one of the five 2TB hard disks in my home network server had… vanished. The system was recognising it as an approximately 3G disk, with no health status, and generally no idea what it was doing.

Needless to say, the storage pool wasn’t the happiest of things, and generally the entire system was almost unresponsive.

When I finally got into the OS, I discovered that one disk (disk 5; the failed one) was in a state of ¯\_(ツ)_/¯, one disk was failing, and another was showing warnings.

Naturally, the first disk I replaced was disk 5, as I couldn’t get anything out of it at all. Connecting it to my PC to try to figure out what was going on disrupted my SATA controller enough that all the other disks vanished from the UEFI configuration tool, yet the machine would still start booting into Windows, which then hung completely. Removing the disk immediately put the system back to how it was.

This led me to think it was a controller failure – and as the last disk I had fail was also a Seagate of the same model, I tried a quick controller swap to see if that was the culprit. Alas, no: the “new” controller from the other disk exhibited exactly the same symptoms.

Time to open it up!

Dark marks within the seal on the top cover

On first opening the drive, I noticed some dark marks on the lid, indicating that the internals were not the clean environment I would expect inside a drive.

I also noticed some larger debris on the top platter of the disk – less visible in the photo due to the dust; it took me a minute to realise I wanted photos of this and to find my camera. Notably, in contrast to everything else, the debris here consists of the larger particles.

Debris on the platter

One of the first things I noticed was that the little filter pad to the side of the disk was filthy with the same black marks as the lid of the drive. This was quickly followed by my noticing that a read head was missing from the arm – more noticeable when the arm is swung over the platters.

More debris towards the back of the platter – note the missing read head

Obvious missing read head is obvious

It was also at this point that I noticed the surfaces inside the disk were horribly filthy as well, and I realised that this was a fine, likely metallic, powder.

Detail of obviously missing read head

Head parking – also covered in the powder

I began to disassemble the stack of platters, in the hope of finding this missing read head, and possibly more insight into what actually happened.

Removal of the retaining bracket at the top of the platter showed how filthy even the surface of the platter is

It didn’t take long to find out what happened – the underside of the top platter revealed these pretty patterns of dust, visible on the outer half of the platter still in the case, along with a very obvious wide scratch mark around the middle of the surface.

The next platter down revealed even more dust, along with a massive gouge near the centre of the disk – some serious force must have been involved. The entire surface of these platters is also covered in concentric rings of dust which just wipe clean.

Needless to say, I think I have a fairly good idea why the drive failed, but there is still one mystery I’ve not solved – where did that read head go to?

For comparison, here’s a photo of another Seagate drive of the same model that also failed (nowhere near as spectacularly) about six months prior – the top read head is clearly visible.

Another curious point is that upon replacement of this disk, the other disk that was reported as bad is suddenly only showing caution signs:

I suppose it’s possible that the disruption to the SATA bus was causing that disk to go completely haywire. The three disks with problems were also all in the same cage, so it’s possible that one failing disk caused issues with the others.

The cage to the rear, behind the memory modules, is the one which caused issues. The disk in question was at the bottom of this cage.

Heartbleed

Heartbleed logo

A couple of weeks ago, a security announcement (CVE-2014-0160) was made regarding a simple buffer over-read bug in a library.

This library is OpenSSL.

OpenSSL, for those who don’t know, is the library which deals with much of the encryption on the web (so stuff like https). A buffer over-read is something like this – let’s say I have a variable with three letters in it:

Hi!

What you don’t see is that this is stored in the middle of memory, so it’s actually surrounded by a load of other stuff:

sdflkjasHi!kldjlsd

With non-simple variables, such as blocks of text (character strings), the computer doesn’t directly know how big they are. It is up to the programmer to keep track of this when the variable is created. If the programmer gets this wrong, the program can read back more data from memory than is actually stored there. This is almost always a bug, as the data past the end is usually meaningless, and possibly dangerous depending on what is done with it.

If I was to read 5 letters from my 3 letter variable, I’d get this:

Hi!kl

What actually happened with Heartbleed was something very similar. Part of the protocol allows one end to send a “heartbeat” to keep the connection alive, and part of this request lets the sender specify some data to be returned back to them. As the server doesn’t know how long this data is, the sender has to tell it that too – and OpenSSL never checked that the stated length matched the data actually sent. xkcd has a really good explanation of this in comic form:

xkcd: Heartbleed Explanation

Of course, being able to read random bytes from a process’ memory is a bit of an issue for security, as there is usually some data in that memory that you don’t want the user to know. In a multi-user environment, this normally comes in the form of private data about other users. That’s a pretty serious security vulnerability on its own. However, this isn’t just any application, it’s OpenSSL – a widely-used public key encryption library.

Public Key Encryption

For those who don’t know what public key encryption is, here’s a quick primer.

If I want to protect some data when I send it to you, I would encrypt it. With classical encryption, I would create some key which we both have or know, and use that to encrypt the data; then I could be confident that it can’t be read if intercepted en route to you. This does mean that I need a different key for everybody I send data to, and a secure way of exchanging those keys with everyone, which could cause a lot of issues.

Public key encryption solves this by creating a “key pair”, composed of a “private” key and a “public” key. A complicated mathematical relationship between these two keys means that the public key can be used to encrypt data, and then only the private key can decrypt it. That relationship also means the private key cannot feasibly be derived from the public key, so you can freely distribute, publish, and shout your public key from the hilltops. Then, anyone who wants to get in contact with you can take your public key and encrypt their data, and only you can decrypt it.

Public key encryption is the basis for SSL (Secure Sockets Layer), which itself is the basis for HTTPS.

Certificates, and Heartbleed

The public keys used for HTTPS are contained in the certificates that your browser shows. Combined with the server’s private key, communication between the server and the browser can be completely encrypted, protecting the transmitted data.

The Heartbleed bug, however, allows any client to read up to about 64KiB of the server’s memory per request – and since the server needs its private key to perform the encryption, that key is somewhere in its memory. If the client is lucky, they can download the private key (or the set of large prime numbers that make it up) and reconstruct the server’s encryption keys locally.

Consequences of leaked private keys

There are quite a few consequences of a private key being leaked, especially when it’s been signed by a trusted third party (as almost all SSL certificates are signed by a CA). The obvious one is that any traffic encrypted with the matching public key is now readable by anyone holding the private key. Think along the lines of passwords, credit card numbers, etc.

What isn’t immediately obvious is that it’s possible to decrypt this data offline – if an attacker saved your encrypted traffic from years ago, they could now decrypt all of it. Implementing forward secrecy would go some way towards mitigating this.

If the compromised credentials aren’t revoked (or the user doesn’t check revocation lists – the default in Google Chrome), then the attacker can then use those credentials to perform a future man-in-the-middle attack, and intercept any future data you send to a site.

Basically, if someone gets hold of that private key, you’re in trouble. Big trouble.

Recovery

By now, you should have updated to the fixed versions of OpenSSL, and made sure any services using the libraries have been restarted to apply the change.

You should also assume the worst-case scenario – private key data has been fully compromised, and is in the wild. Certificates should be revoked and re-issued BEFORE any other changes of secure credentials, otherwise the new credentials could also be leaked.

Anything that has the potential to cause adverse effects if exposed should be invalidated. This includes credentials like passwords, security keys, API keys, and possibly even things like credit card numbers. From a security standpoint, this is the only way to be completely safe from a leak due to this bug. From a practical standpoint, most people won’t be affected, and things like credit cards don’t need to be replaced – just keep a really close eye on your statements and consult your bank the moment you notice something amiss.

Now would also be a really good time to start using password manager software like KeePass, so you can use strong and unique passwords for every service without worrying about remembering them all. KeePass uses an encrypted container with a single password (or key file, or both) that protects all the data stored within it.

The future

The bug is a simple missing bounds check. It’s the sort of bug that gets introduced thousands of times a day, and fixed thousands of times a day. Sometimes one or two slip through the gaps. Usually, they’re in non-security-critical code, so any exploitation of these doesn’t give an attacker much to work with.

Sometimes, just sometimes, one of these gets its way into a critical bit of code. Sometimes, this then gets released and isn’t spotted for a while. Whose fault is it? Well, it’s really hard for a developer to see mistakes in their own code – that’s why we have code review. Not every bug is going to be caught by everyone at the review stage. Nobody can reasonably assign blame here, and that’s probably not even the right way to think about it.

We should be looking at how we can reduce the frequency of events like this. For starters, this should be added to code review checklists (if it’s not there already!). OpenSSL should also add regression checks to make sure that the same bug isn’t introduced again.

It should be a global learning experience, not a situation where everyone points fingers.

OpenSSL is open-source software – the source code is freely available for anyone to view and edit. Anyone can review the code for security holes, and the developers have a disclaimer which states they supply the software “as-is”, so it’s the responsibility of the users of the software to make sure it works as expected. Open-source software is meant to be more secure because anyone can review it for security holes, but in practice I feel this rarely happens.

Let’s work together – the entire internet community – to make the internet a safer and more secure place. We have done really well with things like SSL, but bugs like this have seriously dented confidence in the security of the internet. Let’s work together to rebuild that trust to a level beyond what it was at before.

This site

This site, and the other sites hosted on this server, were amongst the many thousands of websites hit by this bug. While there’s no indication that the bug was ever exploited here, I cannot guarantee that it wasn’t. Consequently, I’ve done what so many other sites have done: patched the hole, and reissued credentials and certificates as necessary.

Linode NextGen

Linode have been amazing as usual: they recently increased network capacity (bandwidth increased 1000%), doubled CPU capacity (from 4-core to 8-core), and in the last few days they’ve doubled the RAM in each plan! This is all part of the free* upgrades that happen every now and again.

(* They’ve also adjusted the pricing plans to be easier and more friendly, so their (for example) $19.95 plan has become $20. Technically, it’s not free – but for 5 cents?)

The migrations to the new systems take a while – but I’m planning to do it in the middle of the night. Downtime could be a couple of hours. The bonus is with this reboot, the capacity of the entire cluster will be doubled. :o

In other news, I’m planning on retiring pikachu – it’s caused me nothing but trouble with its disk arrays, and the lag from Roubaix compared to the Linodes in London has been less than desirable. Plus, with the new capacity that the London cluster has, there’s no real advantage to having a dedicated machine out in Roubaix (apart from the BOINC work that it’s been sat doing cos it could).

If you’re interested in trying out Linode, they do a free trial! Just please mention this referral code: c88742a23087c8758c2824221626a8c1226c1736

Nagios: ?corewindow=cgi-bin/status.cgi

I’ve just found the solution to an issue which has been annoying me for months.

A bit of background – if you link to Nagios, you either link to the entire tool with its sidebar and main page, or you link to a specific page within it, losing the sidebar. Nagios includes a nice URL parameter which allows you to specify what the “main page” should be, such as the service status view:

http://localhost/nagios/?corewindow=cgi-bin/status.cgi

However, if you try and click on any of the filters, such as “All Problems”, you end up getting an empty list.

It seems that the solution is simply to append ?host=all to the corewindow value, with the ? URL-encoded as %3F:

http://localhost/nagios/?corewindow=cgi-bin/status.cgi%3Fhost=all

This makes it behave as expected. :)

mysql> SHOW TABLE STATUS; — It looks horrible!

TL;DR – the output of MySQL’s SHOW TABLE STATUS is pretty horrible in a small terminal window, and most of the time shows far more than you’ll ever want for a quick summary.

Ivan Cachicatari over at http://en.latindevelopers.com/ivancp/2012/a-better-show-table-status/ came up with a stored procedure which nicely cuts it down to just the useful information about a database, but I decided that most of the time it could be cut down even further:

My version literally just shows the information about the tables, and none of the crap associated with stuff I don’t use (interoperability is a pain), such as stored procedures and triggers; I also rarely use views, so those went too.
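Roughly, the idea is a query against information_schema restricted to the per-table essentials – this is my own sketch of the approach, not the exact procedure:

```sql
-- A rough equivalent (my sketch): just the per-table essentials
-- for the current database, nothing else.
SELECT TABLE_NAME                           AS `Table`,
       ENGINE                               AS `Engine`,
       TABLE_ROWS                           AS `Rows`,
       ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS `Data (MiB)`,
       ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS `Index (MiB)`
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_TYPE = 'BASE TABLE'
 ORDER BY TABLE_NAME;
```

Note that for InnoDB, TABLE_ROWS is an estimate, which is fine for a quick summary.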

Shell-freezing Ctrl-S and other keyboard shortcuts

For a while now I’ve known about the rather annoying Linux shell freeze that happens when you press Ctrl-S. At first it seems like a connection drop-out if you’re on a remote session (I use the Linux shell a lot, via PuTTY on Windows), but it took me a few months to realise that it happened when I pressed a key. A few months later, I learnt it was caused by Ctrl-S – usually when I meant to press Ctrl-D (EOF).

Apparently, this is intended behaviour – a scroll lock of sorts, part of XON/XOFF software flow control. Pressing Ctrl-S again doesn’t solve it, but Ctrl-Q will. (You can also disable this behaviour entirely with stty -ixon.)

So, here’s a few other keyboard shortcuts you may or may not find useful – not all of these may work on every unix/linux variant.

Ctrl-A – Moves the cursor to the beginning of the line. See Ctrl-E.
Ctrl-B – Moves the cursor backward one character. See Ctrl-F.
Ctrl-C – SIGINT – interrupts (cancels) the current process.
Ctrl-D – EOF – the end-of-file character, marking the end of user input. On an interactive shell, it will log you out; on a STDIN stream, it marks the end of the stream so the program stops expecting input from you. (These are essentially different angles on the same thing.)
Ctrl-E – Moves the cursor to the end of the line. See Ctrl-A.
Ctrl-F – Moves the cursor forward one character. See Ctrl-B.
Ctrl-H – Backspace.
Ctrl-L – Clears the screen (like the clear command).
Ctrl-N – Next item in history. See Ctrl-P.
Ctrl-P – Previous item in history. See Ctrl-N.
Ctrl-Q – XON – resumes screen output. See Ctrl-S.
Ctrl-R – Reverse history search.
Ctrl-S – XOFF – suspends screen output. See Ctrl-Q.
Ctrl-U – Deletes the current line.
Ctrl-W – Deletes the last word.
Ctrl-Z – SIGTSTP – suspends the current process; it can be resumed with either fg or bg.

Thanks to http://www.computerhope.com/ushort.htm and http://www.cdrinfo.com/Sections/Reviews/Specific.aspx?ArticleId=19672 for some of the info here.