Recovering from a server hardware failure (or not)

When was the last time your hosting company phoned you to ask why your server went down? This happened at 10pm on a Monday night while I was doing some updates on our dedicated server. After Windows restarted, it just never came back on again.

The culprit was (presumably) the RMM module installed on the motherboard. This meant that the motherboard had to be replaced, which in turn meant that I would be silently screaming for the next couple of hours, because Windows Server 2012 would not boot with the changed hardware. Doing a repair on Windows had a 20% success rate, and I did not have success…

So we had to disconnect one of the RAID drives, instructing the technician in the datacenter what to do via email from 10pm to 2am. They eventually told me that they would only be available again at 6am. I was up at 6am with a mail sent out, but they hadn’t had their coffee yet. All this while our main client had to do monthly reports with 60+ users starting at 8am. It never happened.

Windows had to be re-installed, but this was not as easy as it sounds. The technical guys at the datacenter had to connect an external DVD drive to the machine with the Windows DVD in it. After what seemed like hours of struggling from my side to get the system to boot, and struggles from their side to sort out the boot order (with only 1 or 2 responses from them per hour), we eventually got the damned thing to boot. This was the start of another deep hole.

I’ll spare you all the fine details, but at the end of the day, the removed RAID drive could not be picked up by Windows, so they had to connect it to one of their machines and copy our critical data onto a good old flash drive, which was then connected to our server. We RMM’ed in and copied the data over. All this at 10pm the next evening.

So we lost everything, even though we had a RAID backup and daily backups of our critical data. But our daily backup was one day old, which was not good enough for our client.

Lesson learned: You don’t fully recover from a hardware failure, due to the downtime, physical trauma and lost sleep. But at least you get out with a bit of experience and less trust in our volatile connected world.


