
Sunday, May 30, 2010

vmguest corruption due to full nfs

Last Tuesday tragedy hit. I felt as if the world had opened up and swallowed me whole. I thought I had lost 8 projects that were stored on the corrupted vm.
As a rule we store our project data in /home on a linux machine, and that /home was supposedly stored on the vmguest.


It's strange how the mind works... I calmed myself down by designing a more thorough backup system with rsync and nfs. Then I calmed myself down even more by redesigning our switch layout to be more server-friendly, optimizing for the number of available gigabit ports on the switches.


When I finally chilled the fuck out, I tried rebooting the vmguest twice more. I finally came to the realization that the vmguest had crossed the river Styx for vmguests. Then I recalled that a while ago I had written an rsync job that pushed backups of /home to another machine. I checked the backup; apparently the last push was the previous Friday. Suddenly I realized there must be some vmdeity watching over me: I had already separated /home from the vmguest. A while ago, I made this vmguest by cloning the real machine with this incantation:

dd if=/dev/sda of=/mnt/{external-drive}/image.raw

I converted the raw image to qemu's qcow2 format and ran the image. Everything went well except for the cloned /home, which was corrupted and unmountable. So I copied /home from the real machine to the vmhost and mounted it on the vmguest as /home over nfs. This ultimately saved my ass.
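For the record, the conversion and the nfs mount look roughly like this. It is a sketch from memory, not my actual shell history: the function names and the DRY_RUN switch are made up for illustration, and every file name, host name and path you pass in is a placeholder.

```shell
#!/bin/sh
# Sketch only. All file names, host names and paths passed to these
# functions are placeholders, not the ones from my actual setup.
# Set DRY_RUN=1 to print a command instead of running it.

# Convert a raw disk image (e.g. one made with dd) to qemu's qcow2 format.
# $1 = raw input image, $2 = qcow2 output image
raw_to_qcow2() {
    ${DRY_RUN:+echo} qemu-img convert -f raw -O qcow2 "$1" "$2"
}

# Mount an nfs export as /home on the guest (needs root).
# $1 = export in host:/path form
mount_nfs_home() {
    ${DRY_RUN:+echo} mount -t nfs "$1" /home
}
```

Usage would be something like `raw_to_qcow2 image.raw image.qcow2` on the vmhost and `mount_nfs_home vmhost:/srv/home` on the vmguest, where `vmhost:/srv/home` is a hypothetical export.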

My project data was completely intact; only my linux kernel/userspace system was ruined. I reincarnated the system in a better functioning body: a fresh Ubuntu install plus pip, with /home mounted over nfs.
I was able to fix up about 60% of the system after working 16 hours that day. My minions helped to fix the rest of my fuckup for me.

What caused it? After I did some basic post-mortem checks of the system, I came up with the following:

Running vmguest images from nfs, smb or any other remote diskmount is a very, very bad idea when the host behind that diskmount has a high probability of hitting 100% disk usage some time soon.

In my case, I made the stupid mistake of scheduling a backup dump onto the same remote diskmount. I had misjudged the available diskspace. So when the diskmount hit 100%, my vmguests locked hard and got corrupted.
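In hindsight, the backup job should at least have refused to run when its target was nearly full. Here is a minimal sketch of such a guard, assuming a POSIX `df` and `awk`. The function names and any threshold you pick are my own invention for illustration, not something I had in place at the time.

```shell
#!/bin/sh
# Print the used percentage (without the % sign) of the filesystem
# that holds path $1. Relies on the one-line-per-filesystem output
# of `df -P`; filesystem names containing spaces would confuse it.
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if used percentage $1 is below limit $2.
below_limit() {
    [ "$1" -lt "$2" ]
}

# Run the command given in $3... only if the filesystem holding $1
# is below $2 percent used; otherwise complain and return non-zero.
guarded_backup() {
    target=$1
    limit=$2
    shift 2
    used=$(used_percent "$target")
    if below_limit "$used" "$limit"; then
        "$@"
    else
        echo "refusing to back up: $target is ${used}% full" >&2
        return 1
    fi
}
```

A cron job could then call something like `guarded_backup /mnt/backup 90 rsync -a /home/ /mnt/backup/home/`, with `/mnt/backup` and the 90% limit as placeholders. Note that this only checks before the dump starts, so a large enough dump can still fill the disk. The real fix is to keep backups off the disk that hosts the vmguest images.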

The end
