Recover VMs with corrupt snapshots
Consulting throws up many challenges during the design and implementation stages but none more than the actual environment integration. Being at the ‘coal face’ invariably provides a point at which things don’t always go to plan and it’s this real world experience that we at Xtravirt excel at.
In this, my first blog posting, I’m going to discuss VMware snapshots and the possibility that you can recover from corrupted ones.
Particular events can create situations where a VM might start rebooting or shut down completely, and during this unplanned process one or more snapshots for that machine may get corrupted.
A common scenario for this kind of corruption is when:
- A VM starts displaying the message in the console:
“The redo log of <Machine Name>.vmdk is corrupted. Power off the virtual machine. If the problem still persists, discard the redo log.”
- Pressing OK on that message causes it to be displayed again
- Powering off the VM might not be possible, and the attempt may display this message in the console:
“The attempted operation cannot be performed in the current state”
Depending on the type of failure, recovery from such a situation is possible and, at times, with all data intact. That is especially true when a backup solution takes a snapshot as part of its process and the snapshot becomes corrupt just after it is taken, because very little data has changed at that point. A complete recovery in this example is achievable.
I’ve recovered from such scenarios a few times and thought the process should be documented to help others. This blog posting came about as I felt that while different KB articles document the process in parts, I couldn’t find one that guides someone through the whole recovery process.
Some of the assumptions that I am making here are:
- The failure is occurring on VM(s) with one or more snapshots, created either manually or via an automated mechanism, e.g. a backup solution
- The virtual machine is displaying errors about inconsistent, corrupt or invalid snapshots
- The person working through the issue is familiar with VMware operations and can deal with minor variations in the discussed scenario
- The process to force the shutdown of a VM is written for ESXi 5.x hosts (the syntax will differ on other versions, but the process remains the same)
Virtual Machine Restore Process
Step 1: Save Virtual Machine Logs
The first action is to save the logs for this VM; these can be found in the virtual machine folder on the datastore. This avoids losing potentially valuable diagnostic data in the event of a catastrophic failure. Due to the state the virtual machine is in, it might not be possible to save vmware.log, but the other log files should be copied directly from the datastore to a safe location.
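If shell access to the host is already available (see step 6), the copy can also be done from the command line. The following is a minimal sketch rather than an official procedure: save_vm_logs is a hypothetical helper name, and both paths are placeholders to be replaced with your own.

```shell
# Hypothetical helper: copy all *.log files from a VM folder to a safe location.
# $1 = VM folder, $2 = destination folder (both are placeholders).
# Prints the number of files copied.
save_vm_logs() {
    local src="$1" dest="$2" f copied=0
    mkdir -p "$dest" || return 1
    for f in "$src"/*.log; do
        # Skip the unexpanded pattern when no log files exist.
        [ -e "$f" ] || continue
        # A log that cannot be read (e.g. a locked vmware.log) is reported
        # and skipped so that the remaining logs are still copied.
        if cp "$f" "$dest"/; then
            copied=$((copied + 1))
        else
            echo "could not copy: $f" >&2
        fi
    done
    echo "$copied"
}
```

It could be called as save_vm_logs "/vmfs/volumes/<Datastore Name>/<Machine Name>" "/vmfs/volumes/<Datastore Name>/<Machine Name>-Logs".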
Step 2: Shut Down Virtual Machine
Shutting the machine down avoids any further damage to the current snapshots before a copy of the machine is made. It’s possible for vCenter to lose control of the virtual machine in such situations, and power operations might not work from the VI Client. If that happens, refer to the “Force Virtual Machine Shutdown Process” section near the end of this posting for techniques to force the shutdown of the machine.
Step 3: Make a copy of the Virtual Machine folder
Once the virtual machine is shut down, make a copy of the virtual machine folder to another location on the same or another datastore. Name the folder something appropriate, e.g. <Machine Name>-Backup.
Note: A clone is not what is required and it probably won’t work in such a situation.
Step 4: Attempt to fix the snapshots
First check whether the datastore has enough space remaining; snapshots can become corrupted when a datastore runs out of space. As there might be other snapshots growing in the background, estimate generously, and if there isn’t enough space, use Storage vMotion to migrate machines off that datastore until there is a safe level of headroom.
Once there is enough space available, try taking another snapshot, and if successful, try committing it. This operation might fix the snapshot chain and consolidate all data into the disks. If this process fails, then follow the remainder of the process to manually restore the machine from remaining snapshots.
Step 5: Confirmation of existing virtual disk configuration
Go into the VM settings and confirm the number and names of the existing virtual disks. As there are snapshots present, the disk(s) will be pointing to the last-known snapshot(s). Also, make note of the datastore the machine resides on.
Step 6: Command-Line access to ESXi server
Gain shell access to an ESXi server in the cluster which can see the datastore with the virtual machine in question. The ESXi server should also have access to the datastore where the repair will be carried out. As SSH is disabled by default, you may have to start the service manually.
Note: Seek approval (if security policy requires it) before this is done.
Once SSH is enabled, use PuTTY (or a similar tool) to connect and log in using “root” credentials.
Step 7: Confirmation of snapshots present
Once logged in, change directory to the virtual machine folder:
/vmfs/volumes/<Datastore Name>/<Machine Name>
Then run:
ls -lrt *.vmdk
to display all virtual disk components.
Make note of what “Flat” and “Delta” disks are present. While it can vary in certain situations, the virtual machine’s original disks will be named the same as the virtual machine name by default. If there is more than one virtual disk present, the second disk should have “_1” appended to the base name and so on. If there are snapshots present, they will have “-000001” appended to each disk name for the first snapshot, “-000002” for the second and so on, by default. Make note of all this information.
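When a folder holds many files, the default naming convention described above can be expressed as a small shell function, which makes the listing easier to read. This is only a sketch: classify_vmdk is a hypothetical helper, the labels it prints are made up for illustration, and it relies entirely on the default file names, so it will mislabel disks that were renamed.

```shell
# Hypothetical helper: label a .vmdk file name using the default naming scheme.
# Prints one of: ctk, delta, flat, snapshot-descriptor, base-descriptor, other.
classify_vmdk() {
    case "$1" in
        *-ctk.vmdk)   echo "ctk" ;;
        *-delta.vmdk) echo "delta" ;;
        *-flat.vmdk)  echo "flat" ;;
        # Six digits before .vmdk, e.g. -000001: a snapshot's descriptor file.
        *-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk) echo "snapshot-descriptor" ;;
        *.vmdk)       echo "base-descriptor" ;;
        *)            echo "other" ;;
    esac
}
```

From inside the virtual machine folder it could be used as: for f in *.vmdk; do echo "$f: $(classify_vmdk "$f")"; done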
Step 8: Repair of the virtual disks
Start with the highest set of snapshots and for each disk in that set run the following command, where <Source Disk> is the source snapshot:
vmkfstools -i <Source Disk> <Destination Disk>
Please note: <Source Disk> is the snapshot’s descriptor .vmdk, i.e. not the one with -flat, -delta or -ctk in the name. <Destination Disk> is the new disk, where all disk changes need to be consolidated. The new name should be similar to the source but not identical. <Machine Name>-Recovered.vmdk is one example for the first disk. Keep the same naming convention throughout for all disk names, e.g. <Machine Name>-Recovered_1.vmdk, <Machine Name>-Recovered_2.vmdk and so on.
For example:
vmkfstools -i <Machine Name>-000003.vmdk <Machine Name>-Recovered.vmdk
for the first disk from the third snapshot set, and
vmkfstools -i <Machine Name>_1-000003.vmdk <Machine Name>-Recovered_1.vmdk
for the second disk in the same set and so on.
Repeat the process for all disks in the snapshot set identified earlier in step 7. If the process is successful, move on to step 9.
If there is failure on one or more disks in the set, the following error message may be displayed:
Failed to clone disk: Bad File descriptor (589833)
If that error occurs, skip that disk and keep running the process for the other disks, as they might still be useful. However, a partly recovered set is unlikely to be acceptable for production, so the next most recent snapshot set should be tried. Follow the same process until all disks in a snapshot set are successfully consolidated into a new disk set. If this is an investigation into the events leading up to the failure, then additional sets might have to be consolidated in the same way. All of those sets should consolidate successfully.
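When a snapshot set contains several disks, generating the commands instead of typing them can reduce naming mistakes. The sketch below is an illustration under assumptions, not a tested recovery tool: plan_recovery_clones is a hypothetical helper, it assumes the default file naming from step 7, and it only prints the vmkfstools commands so that they can be reviewed before being run by hand.

```shell
# Hypothetical helper: print (do not run) one clone command per disk in a set.
# $1 = VM folder, $2 = machine name, $3 = six-digit snapshot number, e.g. 000003
plan_recovery_clones() {
    local dir="$1" vm="$2" snap="$3" src base suffix
    # The pattern matches descriptor files only; names ending in -delta.vmdk
    # or -ctk.vmdk do not end in "-<snap>.vmdk" and are therefore skipped.
    for src in "$dir"/"$vm"*-"$snap".vmdk; do
        [ -e "$src" ] || continue
        base=${src##*/}                  # strip the folder part
        base=${base%-"$snap".vmdk}       # e.g. web01 or web01_1
        suffix=${base#"$vm"}             # "" for the first disk, "_1" for the second
        echo "vmkfstools -i \"$src\" \"$dir/$vm-Recovered$suffix.vmdk\""
    done
}
```

Printing the commands first is a deliberate choice: on a machine that is already damaged, it is safer to read each source and destination name before anything is written to the datastore.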
Step 9: Restoration of the virtual machine
Using the “Datastore Browser”, create a new folder called “<Machine Name>-Recovered”, either on the same datastore or another. Move the newly-created “Recovered” vmdk file(s) to the new folder. Also, copy <Machine Name>.vmx and <Machine Name>.nvram to the new folder and rename both files to become <Machine Name>-Recovered.*
Download <Machine Name>-Recovered.vmx to the local machine and edit it in WordPad or a similar text editor. Replace all instances of <Machine Name>-00000x (where “x” is the last snapshot the machine’s disks are pointing to) with <Machine Name>-Recovered. Repeat for other disks if present, e.g. <Machine Name>_1-00000x becomes <Machine Name>-Recovered_1, and save the file. This should make the .vmx match all newly-consolidated disks. Rename the unedited vmx file in the new folder to <Machine Name>-Recovered.vmx.bak and upload the edited <Machine Name>-Recovered.vmx into the same location. Once uploaded, go to the “Datastore Browser”, right-click the vmx file and follow the standard process of adding a virtual machine to inventory, possibly naming it “<Machine Name>-Recovered”.
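The same find-and-replace can be done with sed instead of a desktop editor. The following is a rough sketch under stated assumptions: retarget_vmx_disks is a hypothetical helper, the disk entries are assumed to look like scsi0:0.fileName = "<Machine Name>-000003.vmdk", and the machine name is assumed to contain no characters that are special to sed (such as "." or "/").

```shell
# Hypothetical helper: point the disk entries of a .vmx at the recovered disks.
# $1 = input .vmx, $2 = machine name, $3 = six-digit snapshot number, e.g. 000003
# The edited text goes to stdout; the input file itself is not modified.
retarget_vmx_disks() {
    local vmx="$1" vm="$2" snap="$3"
    # First rule: additional disks, <vm>_N-<snap>.vmdk -> <vm>-Recovered_N.vmdk
    # Second rule: first disk,      <vm>-<snap>.vmdk   -> <vm>-Recovered.vmdk
    sed -e "s/${vm}_\([0-9][0-9]*\)-${snap}\.vmdk/${vm}-Recovered_\1.vmdk/g" \
        -e "s/${vm}-${snap}\.vmdk/${vm}-Recovered.vmdk/g" "$vmx"
}
```

Writing to stdout means the result can be redirected to a new file and compared with the original before anything is overwritten.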
Once in the list, edit the VM settings and disconnect the network adapter. It might require connecting to a valid VM network first but the main thing is that the network adapter should be disconnected.
Once done, take a snapshot of the VM and power the machine up. At this point, a “Virtual Machine Question” will come up. Answer it by selecting the “I copied it” answer. If the disk consolidation operation was successful for all disks, the machine will come up successfully. The machine can now be inspected and put into service or investigated for a problem.
Once operation of the machine has been tested and the decision has been made to bring it into service, shut down the virtual machine, reconnect the virtual network adapter to the correct network and power it back up. After boot is complete, log in to the machine to confirm service status, network connectivity, domain membership and other operations. If all operations are as expected, then the restore process is complete and the snapshot can be deleted.
Force Virtual Machine Shutdown Process
First Technique: Using vim-cmd to identify and shutdown the VM
While connected to the ESXi shell and logged in as “root”, run the following command to get a list of all VMs registered on the target host:
vim-cmd vmsvc/getallvms
The command will return all the VMs registered on the host. Note the Vmid of the VM in question. Get the current state of that VM as seen by the host first, by running:
vim-cmd vmsvc/power.getstate <Vmid>
If the VM is still running, try to shut it down gracefully using:
vim-cmd vmsvc/power.shutdown <Vmid>
If the graceful shutdown fails, try the power.off option:
vim-cmd vmsvc/power.off <Vmid>
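On a host with many machines, picking the Vmid out of the listing by eye is error-prone. As a small sketch, the function below filters the output of vim-cmd vmsvc/getallvms for one machine name. find_vmid is a hypothetical helper, and it assumes the listing has the Vmid in the first column and the name in the second, which only holds for machine names without spaces.

```shell
# Hypothetical helper: print the Vmid for an exact machine name.
# Reads the output of `vim-cmd vmsvc/getallvms` on stdin.
# Assumes the name contains no spaces, so field 1 is the Vmid and field 2 the name.
find_vmid() {
    awk -v name="$1" '$2 == name { print $1 }'
}
```

It would be used as vim-cmd vmsvc/getallvms | find_vmid "<Machine Name>", with the result passed to the power.getstate, power.shutdown or power.off commands above.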
Second Technique: Using ps to identify and kill the VM
Warning: Only use the following process as a last resort. Terminating the wrong process could render the host non-responsive.
While connected to the ESXi shell and logged in as “root”, list all processes for target virtual machine on the current host by running:
ps | grep vmx
That will return a number of lines. Identify entries containing vmx-vcpu-0:<Machine Name> and others. Make note of the number in the second column of numbers, which represents the Parent Process ID. For most of the lines returned for that machine, this number should be the same in the second column. One line belonging to “vmx” will contain that number in both first and second columns. That is the ProcessID of the target virtual machine.
Once identified, terminate the process using the following command:
kill <ProcessID>
Wait for a minute or so as it might take some time. If the VM still hasn’t powered off after that, run the following command:
kill -9 <ProcessID>
The method in this section will not result in a graceful shutdown, but it should terminate the machine, allowing the recovery to take place. If the machine still cannot be terminated, further investigation will be required on the host, and the only option left will be to vMotion the other virtual machines off this host and reboot it.
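The escalation used in the second technique (a plain kill, a wait, then kill -9 only if needed) can be written as one function so that the forced kill is never sent too early. This is a minimal sketch: stop_process is a hypothetical helper, the 60-second default is only an assumption based on the “minute or so” mentioned above, and the same warning about choosing the correct process ID applies.

```shell
# Hypothetical helper: ask a process to stop, and force it only as a last resort.
# $1 = process ID, $2 = seconds to wait before escalating (default 60)
stop_process() {
    local pid="$1" wait_s="${2:-60}" i=0
    # Plain kill first; if the signal cannot be sent (process already gone
    # or not permitted) there is nothing more to do here.
    kill "$pid" 2>/dev/null || return 0
    while [ "$i" -lt "$wait_s" ]; do
        # kill -0 sends no signal; it only tests whether the process exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still present after the wait: force termination.
    kill -9 "$pid" 2>/dev/null || true
}
```

It would be called as stop_process <ProcessID>, using the process ID identified from the ps output.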
The beauty of virtualization is that one can test most service scenarios without actually causing impact to service and this process is no exception. For that reason, I would strongly recommend practicing this process in your lab environment so that you are well prepared in case disaster strikes.