ESXi 4.0 – Lessons Learned
I’m not sure how many people were paying attention on a Friday night / Saturday morning, but I decided to do the long-put-off upgrade to ESXi 4.0 (from an old CentOS 4 install running VMware Server v1.0, yuck). While it wasn’t an extremely painful experience, I can say that I wish a few things were more common knowledge on the internet.
To begin, the install was easy; I just had to do a little prep work first. Since my websites were going to be offline (including the ones I host for other people), I wanted to redirect all HTTP traffic to another host that would stay up, so I could display a page about the site being under maintenance. Normally you would use an F5 or something to do this for you, but I certainly don’t have anything that fancy lying around, so I had to settle for something simpler. I borrowed a small ASUS Eee PC from my good friend Brian Yeager. I did a quick install of Windows (could have done Linux too, but I had my Windows CD handy already) and installed XAMPP on it. I threw together a quick index.html and added a .htaccess file to redirect any request back to that index.html (so if you tried to go to download.php it would redirect, etc.). This was easy enough to do. I gave it a spare static IP I had and it was up and running. I had originally planned on doing a Destination NAT on my Mikrotik to redirect all port 80 traffic over to this server, but I ran into a few issues and decided to instead just add the IPs of the VMs to this server as I take them offline. Simple enough.
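The .htaccess itself only needs a few lines. Here’s a minimal sketch, assuming Apache with mod_rewrite enabled (as XAMPP ships it) and that the maintenance page lives at /index.html — the exact rules are my reconstruction, not the original file:

```apache
# Send every request except the maintenance page itself back to index.html.
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/index\.html$
RewriteRule ^ /index.html [R=302,L]
```

The 302 keeps browsers and crawlers from caching the redirect permanently, which matters since the maintenance page is temporary.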
My next step was to back up the current VMs. I already had an external eSATA drive hooked to the server, and had previously made backups to it. So I shut down each VM, added its IP to the redirection server, and then copied the VM to the external drive. The important thing to note here is that the external drive was formatted EXT3, which, since I was running Linux, was a logical choice. Now enters the first issue. I have 16 Gig of RAM in my server. After shutting down VMware completely, with absolutely nothing else running, I was only using 128 Meg of memory. To transfer to the external drive I was using rsync (I never trust a plain copy for such large files). My VMs were set to split into 2 Gig files. When I would transfer using rsync, the I/O wait would naturally go up to ~30%. The amount of free memory would quickly decrease, and you could see that it was being used up by the Linux page cache. The problem was that when free memory dropped below 10 Megs, Linux would try to free up the cached memory, and the I/O wait would skyrocket for some reason, which in turn would cause the load average of the server to skyrocket. During this time it never touched swap (which it shouldn’t have). This caused a huge headache, as the cached memory would grow twice as fast as the data transferred, so if I was copying an 8 Gig VM, it would chew up all 16 Gigs of RAM as cache. To get around this issue I had to run two SSH connections: one doing the transfer, the other running a command to clear the cache when I saw it getting too high.
sync; echo 3 > /proc/sys/vm/drop_caches
Now I believe this issue was probably caused by the combination of the old kernel I was using and the use of an external eSATA drive, but I can’t confirm or deny it. In either case, I got it all transferred.
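If you’d rather not babysit that second SSH session, the cache-clearing step can be automated. Here’s a minimal sketch, assuming a Linux host with /proc/meminfo; the 64 MB threshold and 5-second poll interval are placeholders I picked, not values from the original setup:

```shell
#!/bin/sh
# Poll free memory and drop the page cache whenever it falls below a
# threshold, so the cache can't starve a long rsync run.
THRESHOLD_KB=$((64 * 1024))   # clear caches when free memory < 64 MB (placeholder)

# Report MemFree in kB; reads /proc/meminfo unless another file is given.
free_kb() {
    awk '/^MemFree:/ {print $2}' "${1:-/proc/meminfo}"
}

drop_caches_if_low() {
    if [ "$(free_kb)" -lt "$THRESHOLD_KB" ]; then
        sync
        echo 3 > /proc/sys/vm/drop_caches   # requires root
    fi
}

# Pass "watch" to poll in a loop while the transfer runs in another session.
if [ "$1" = "watch" ]; then
    while true; do
        drop_caches_if_low
        sleep 5
    done
fi
```

Run it as `./cachewatch.sh watch` in one session while rsync runs in the other; it does nothing destructive beyond what the one-liner above already does by hand.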
Next ESXi was installed, and all was well. I wiped the local drives, set up a datastore, tested the networking, and was ready to go. ESXi saw my external drive and knew all about it. The problem? ESXi apparently will not let you mount EXT3 partitions. The GUI, of course, is no help; it can see the drive, but wants to format it as VMFS to use it. Jumping into the command line (SSH again, since I had turned that on) and trying to mount the device was a no-go. You would always get the error “File or Directory not found” no matter what you tried. At this point it was getting late, Greg called me up, we went to dinner to discuss the difficulties, and I grabbed my external drive and headed home. The hope was that I could just pull all the VMs off the drive, reformat it as VMFS, move them back, and mount it as a store the next day. Once I got home (an hour’s drive) I toyed with it a little, and discovered through a few Google searches that going that route might corrupt the VMs, so I abandoned the approach.
The next morning, after a quick breakfast, I was off to the DC again. I had decided on a final (but slower) solution to the problem: I would just plug the drive into my laptop via USB (the drive has both USB and eSATA) and transfer the files over SSH to the server itself. I first made one last-ditch attempt to get the ESXi server to recognize the external drive (both via eSATA and USB) but ran into the same issues. So I grabbed my laptop with the external drive plugged in, installed a copy of EXT2FS so Windows could see the EXT3 partition, and began the slow process of copying the VMs over. I copied the smallest VM first, so I could use it to test and ensure that everything was going to work without issues. I used WinSCP to copy directly from the external drive in Windows to a temporary directory I created in the datastore on the server. With WinSCP I was only averaging 5 MBps, so an 8 Gig VM was taking roughly 30 minutes to transfer. I also tried FastSCP, but since it was attempting to use compression, it was copying even slower, at 3 MBps (I tested turning on compression in WinSCP and got the same result). I could tell this was going to be a long day.
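For anyone doing this from a Linux or Mac laptop instead of WinSCP, the copy step boils down to a single scp. A hedged sketch, where the host name, mount point, and datastore path are all placeholders (the command is printed rather than executed, so you can review it before pointing it at your own host):

```shell
# Placeholder paths: adjust to your drive's mount point and your datastore.
SRC="/mnt/external/OldServer"
DST="root@esxi-host:/vmfs/volumes/datastore1/tmp/OldServer"

# -r copies the whole directory of split 2 Gig VMDK files; -p preserves
# timestamps, which helps when sanity-checking the copy afterwards.
CMD="scp -rp $SRC $DST"
echo "$CMD"
```

Note that scp, like WinSCP, can enable compression (`-C`), but as the numbers above show, compressing already-dense VMDK data can make the transfer slower, not faster.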
While the first VM was copying, I began the process of migrating the VM itself to ESXi. Since I was converting from VMware Server v1.0, I couldn’t just import and go; it was going to take a little work. There are converters out there, but whether they work correctly is usually hit or miss, and honestly the way I did it just seemed a good bit easier. I had gotten lucky in that I hadn’t used any IDE drives in my VMs (all SCSI), so the process was pretty straightforward. I logged into the ESXi server and created a new VM exactly like the old one. I told it to create a new disk; the size doesn’t matter, since we are not going to be using it. After creation, I deleted the disk it created and waited for the VM to finish copying over. Once it was done, I used vmkfstools to clone the old disk into a new one, and then used the GUI to add that new cloned drive to the new VM. The command itself is easy enough: you just point it at the old VMDK that you want to clone, and then at where you want the new VMDK.
vmkfstools -i /vmfs/volumes/datastore1/tmp/OldServer/OldServer.vmdk /vmfs/volumes/datastore1/Server/Server.vmdk
Once that was done, I removed the VM’s IP address from the redirection server, booted up the VM, and re-set up the NIC on the VM where necessary; a few of the VMs didn’t require it, as they saw the new NIC as the old one. A quick test to ensure everything was running perfectly, and I was good to go. Then on to the next one.
So what did I learn? First, the memory management on CentOS 4 is horrid, and as such I will be sure to upgrade my Cacti CD to 5.X fairly soon, just so I never have to touch 4.0 again. Second, if you are going to be backing up to an external drive, format it as VMFS before you do so; otherwise you will have a heck of a time getting the VMs onto your new server. Third, bring something to do. I spent a good amount of time playing Oregon Trail to pass the time. Sorry Brian, you didn’t survive the trip.
Stay tuned for more as I now begin to play with ESXi.