Swapping hard drives across nodes ... baaaad idea...

I lost sleep over this event so I figured I'd put it in writing now that most of it is over...

Prologue.

We recently purchased two new 48-core nodes for our astronomy cluster here at Villanova. I decided to bite the bullet and designate one of the two new nodes to be the server for the whole cluster, and the other to be just a regular node. I installed the basic system, hardened it, installed all the relevant software and then went on to install... gulp... IDL.

The plot thickens.

In order to install IDL, I needed to run the installer from the CD. I'd love to, but it requires graphical libraries, and clusty, while it does have the graphics libraries for remote X sessions, doesn't have any X servers running locally. So in order to install IDL, I had to install the whole X system, and on top of that the whole desktop set of packages. As if that wasn't enough, I had to install the 32-bit compatibility libraries just to be able to install IDL... Sigh...

I bit the bullet and did it. Installing IDL consisted of creating a particular directory and dumping all sorts of stuff into that directory. Curiously enough, nothing except the installer itself depends on the desktop system, so all the stuff that bloated and polluted the main system now sat there unused and useless. So I gave the ITT guys a call and asked whether simply copying the directory with IDL to another machine would work equally well, and the answer was, as you can guess, yes, as long as I copied all the symbolic links to the bin/ directory. While at it, I requested a new license file, since the MAC address of the primary network card had changed by virtue of changing the main node.
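A minimal sketch of the "copy the directory, keep the symbolic links" step, assuming a tar pipe is acceptable for the copy. The function name and both paths are hypothetical; the actual IDL install location is not given here.

```shell
# Copy a directory tree so that symbolic links stay symbolic links
# instead of being replaced by the files they point to.
# copy_tree_with_links is a hypothetical helper name.
copy_tree_with_links() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # tar archives symlinks as links by default (no -h given);
    # -p on extraction keeps the original permissions.
    (cd "$src" && tar -cf - .) | (cd "$dst" && tar -xpf -)
}

# Example with placeholder paths:
#   copy_tree_with_links /path/to/idl /mnt/othernode/path/to/idl
```

The same pipe works across machines by running the extracting half of it through ssh on the target node.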

The ingenious plan.

So now I was stuck with this polluted system and no matter how hard I tried, I couldn't get rid of the whole desktop system. Hundreds of packages were installed, and scripting something to parse the log and remove them seemed too daunting next to what I thought was a great alternative: why don't I simply install the same base system on the other node, copy the IDL directory onto it, install only what I must and be done with it? So that's what I did. In a matter of a couple of extra hours, the new node had all the software and services it needed, and a shiny new IDL directory was there without the bloat of the whole desktop.
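For what it's worth, the log-parsing route might have looked roughly like the sketch below. It assumes the Debian/Ubuntu dpkg log format ("date time install package old-version new-version") and only lists candidates; it removes nothing. The function name is hypothetical, and the default log path is an assumption about a stock Ubuntu system.

```shell
# List packages that the dpkg log records as freshly installed at or
# after a given timestamp. list_installed_since is a hypothetical helper.
list_installed_since() {
    since=$1                       # "YYYY-MM-DD HH:MM:SS"
    log=${2:-/var/log/dpkg.log}    # assumed default location
    awk -v since="$since" '
        $3 == "install" && ($1 " " $2) >= since {
            sub(/:.*/, "", $4)     # drop an architecture suffix such as ":amd64"
            print $4
        }' "$log" | sort -u
}

# Example with a placeholder timestamp:
#   list_installed_since "2010-05-02 00:00:00"
```

The resulting list would still need a manual review before purging anything, since it also catches packages installed on purpose in the same time window.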

The first oops.

Of course, when everything was set up, I ran into the first hurdle. The MAC address of this new node changed again, so I couldn't use the license file from ITT. To make things worse, when I asked for a new one, they made an issue out of it, saying that our support had expired (huh? We bought an official 7.1 version, how can our support expire!? But hey, they're all about making money, not making sense). So instead of trying to reason with ITT, I decided to just spoof the MAC address, and that solved the issue for a while. But in the meantime, the University's DHCP server was expecting the other MAC address to assign a static IP to... Grumble grumble...
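One way to override a MAC address on Linux is with iproute2's "ip link set ... address". The sketch below is an assumption about how that could be done, not a record of the exact method used on clusty. It only prints the commands, so they can be reviewed and then run as root; the function name and the example address are hypothetical.

```shell
# Print (do not run) the iproute2 commands that would override the
# MAC address of an interface. print_mac_override is a hypothetical helper.
print_mac_override() {
    iface=$1
    mac=$2
    # Accept only six colon-separated hex octets.
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        echo "invalid MAC address: $mac" >&2
        return 1
    fi
    echo "ip link set dev $iface down"
    echo "ip link set dev $iface address $mac"
    echo "ip link set dev $iface up"
}

# Example with a placeholder address:
#   print_mac_override eth0 00:11:22:33:44:55
```

An override set this way does not survive a reboot, so it would have to be reapplied from the network configuration at boot.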

The second ingenious plan.

Well, the system is completely installed and functional, PXE and all, and every other node boots off of this node. With next to no tweaking, why don't I simply swap out the hard drives? If I took the drives from the old node and put them in the new node, and put the drives from the new node into the old node, then everything would work, right?

The second oops.

Wrong. And boy, was I careful about it! I changed the DHCP table, I made sure everything was fine... I rebooted and... wham. I got hit by the server trying to mount an NFS root. Sigh. This was my fault. The last initramfs that I had generated was built for the netboot, when I was setting up the client system. So I had to boot into it using the recovery USB, fix the ramdisk and reboot.
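On Ubuntu the ramdisk is built by initramfs-tools, whose initramfs.conf carries a BOOT setting that is either "local" or "nfs". Assuming that setting is what made the server look for an NFS root, a quick check before regenerating the ramdisk could look like this; the function name is hypothetical.

```shell
# Print the BOOT mode ("local" or "nfs") set in an initramfs-tools
# configuration file. initramfs_boot_mode is a hypothetical helper.
initramfs_boot_mode() {
    conf=${1:-/etc/initramfs-tools/initramfs.conf}
    # Use the last uncommented BOOT= assignment and strip any quotes.
    sed -n 's/^[[:space:]]*BOOT=//p' "$conf" | tail -n 1 | tr -d '"'
}

# Example:
#   initramfs_boot_mode
```

If this prints "nfs" on a machine that should boot from its own disks, setting BOOT=local and running "update-initramfs -u" as root rebuilds the ramdisk for a local root.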

The third oops.

This was a major one. It literally drove me crazy. Upon reboot, the system hung on bringing up networking. Wait a minute, I just booted from a USB and the networking got set up perfectly! So why doesn't networking work? The boot-up looked something like this:

rpcbind: cannot open '/var/run/rpcbind/rpcbind.xdr' file for reading, errno 2 (no such file or directory)
Waiting for network configuration
Waiting up to 60 seconds more for network configuration

After that delay, the login prompt appeared and I figured that wouldn't be too tough to fix. I wasn't sure whether the rpcbind error was somehow making the network unavailable, but I googled for it and found this bug report. Essentially, I had to edit portmap.conf and fix the script so that '-w' is not always implied.

Check.

After that, the networking was still failing. I should perhaps mention that this was a live machine with a live web server whose contents are very important and frequently accessed. And here I was, stuck trying to get the network back up.

Logging in, I looked at the syslog, and both ethernet cards were detected and reported as eth0 and eth1. Then I checked lspci, and both ethernet cards were showing up nicely. But when I tried bringing them up, I got the utterly confusing:

eth0: ERROR while getting interface flags: No such device

And the same for eth1. Huh?? But they're right here! Checking syslog again... sure enough, the igb module reports them both alive and well, labeled eth0 and eth1. Long story short, after trying all sorts of things with the networking setup, the BIOS setup, all the craziness, I finally stumbled across this page that suggested the following:

sudo ip link show

This listed my loopback device, and my two network cards, labeled as eth2 and eth3, with eth2 being the old eth1 and eth3 being the old eth0. %-/

Entirely confused, I tried to ifconfig eth3 up and, sure enough, the server was back online in seconds. But how come these were renamed? And where did my eth0 and eth1 disappear to? The final answer to this puzzle came from this page. On install, Ubuntu hardcodes the actual MAC addresses! They reside in /etc/udev/rules.d/70-persistent-net.rules. Obvious, right?

A quick nano of this file indeed showed both previous node NICs listed here, on top of a single new one. So I removed the new one, changed the MACs of the old ones, and a reboot later... voila, eth0 and eth1 are back to where they're supposed to be...
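To see at a glance which name udev has pinned to which MAC address, the rules file can be boiled down to name/MAC pairs. This is a minimal sketch that assumes the usual one-rule-per-line layout, with ATTR{address}=="..." and NAME="..." on the same line; the function name is hypothetical.

```shell
# Print "name mac" for every rule in a persistent-net rules file.
# list_persistent_net_rules is a hypothetical helper.
list_persistent_net_rules() {
    rules=${1:-/etc/udev/rules.d/70-persistent-net.rules}
    # Comment lines carry no ATTR{address} match, so sed skips them.
    sed -n 's/.*ATTR{address}=="\([^"]*\)".*NAME="\([^"]*\)".*/\2 \1/p' "$rules"
}

# Example:
#   list_persistent_net_rules
```

Comparing that output with the addresses in /sys/class/net/*/address shows right away when the file still describes the NICs of a different machine.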

Epilogue.

This whole endeavour took over 2 hours and caused me a few mild heart attacks. All happy that I managed to restore the server, I tried to boot up the nodes. Sigh... They're failing to boot up, getting hung after loading the kernel from NFS. I've had enough frustration for one day, and I have a slight suspicion that that will be material for another story... But I cannot conclude without putting the icing on this cake: after booting up successfully, you'd think IDL now works? Well, think again... Even though ITT said that the license can refer to the MAC address of eth1, IDL is failing with the "invalid license" error, saying it has to be the MAC of eth0. Fun times. Thanks, ITT. If I survive the next series of heart attacks, I may report on how that went...