So in the land of VMware, and most other hypervisors, we have a great technology called “snapshots”.
You’ve probably landed on this page because when you try to delete a snapshot, it’s taking forever.
If that’s the case, I have to ask… How long have you had this snapshot? And how much data has been written since it was taken?
To understand how snapshots operate, it’s important to understand the composition of your average virtual machine. To be fair, various virtualization architectures exist, but VMware’s is fairly straightforward. Every virtual machine consists of two parts: a *.vmx (the configuration file) and a *.vmdk (the virtual disk). You’ll frequently see other components, but in the end, if you do not have a *.vmx and a *.vmdk, you don’t have a virtual machine. Diving a little deeper, the *.vmdk itself consists of two parts:
1) <File>.vmdk – This, in the jargon, is called the descriptor. It is what it sounds like: a small text file that describes the characteristics of the disk, such as its geometry, its type, and which data file holds the contents. If it’s lost, it can be re-created.
2) <File>-flat.vmdk – This is the actual disk. This is the money file. It is the deal breaker. The buck very definitely stops here. If the data in this file is damaged, do not pass go, do not collect $200, just restore from a backup.
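To make the descriptor/flat split concrete, here’s a minimal Python sketch that reads a descriptor and reports which data file it points at. The sample text is made up for illustration, modelled on the layout of a VMFS descriptor (key=value settings plus an extent line of the form `RW <sectors> VMFS "<file>"`). `parse_descriptor` is a hypothetical helper, and its regex only handles simple extent lines like this one.

```python
import re

# Illustrative descriptor text; the file name, size and CID values are made up.
SAMPLE_DESCRIPTOR = """\
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 41943040 VMFS "myvm-flat.vmdk"
"""

# Matches simple extent lines: <access> <size in 512-byte sectors> <type> "<file>"
EXTENT_RE = re.compile(r'^(RW|RDONLY|NOACCESS)\s+(\d+)\s+(\S+)\s+"([^"]+)"')


def parse_descriptor(text):
    """Hypothetical helper: collect key=value settings and extent lines."""
    info = {"extents": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = EXTENT_RE.match(line)
        if match:
            access, sectors, kind, filename = match.groups()
            info["extents"].append({
                "access": access,
                "size_bytes": int(sectors) * 512,  # sizes are in 512-byte sectors
                "type": kind,
                "file": filename,
            })
        elif "=" in line:
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip().strip('"')
    return info


if __name__ == "__main__":
    info = parse_descriptor(SAMPLE_DESCRIPTOR)
    for extent in info["extents"]:
        print(extent["file"], extent["size_bytes"] // 1024**3, "GiB")
```

Because the descriptor is just a few lines of text like this, it can be rebuilt if lost; the gigabytes of actual data live in the file the extent line names. A snapshot’s delta descriptor typically also carries a `parentFileNameHint` entry pointing back at the base disk’s descriptor.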
I had a case where a customer had taken a snapshot nearly 6 months earlier and was trying to commit it after their datastore had run out of space. They had no idea how this could happen: they didn’t have any other virtual machines on the datastore, and the one they did have was thick provisioned. “Where did all the extra space go?” they asked.
Well, when a snapshot is created, the <File>-flat.vmdk is frozen: no further changes are written to it. Instead, a new -delta.vmdk file is created, and every write made since the snapshot was taken is stored there. When running off a snapshot for an extended period, you can watch this delta file grow larger and larger, and it can eventually approach the size of the base disk itself. If you thick provisioned your VM and didn’t leave headroom on your datastore, you’re going to run out of space in short order.
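Here’s a toy Python model of that behaviour, assuming a disk made of fixed-size blocks. It isn’t how VMware lays out a delta on disk (real deltas are sparse files with their own metadata and allocation granularity); it just shows the logic: after the snapshot the base is read-only, every write lands in the delta, reads check the delta first, and the delta grows with every distinct block the guest touches. `SnapshottedDisk` is a hypothetical name.

```python
class SnapshottedDisk:
    """Toy model of a disk running off one snapshot (hypothetical, not VMware's format)."""

    def __init__(self, base_blocks):
        # Block number -> contents. Treated as frozen once the snapshot is taken.
        self.base = dict(base_blocks)
        # Block number -> contents written since the snapshot.
        self.delta = {}

    def write(self, block, data):
        # New writes never touch the base; they land in the delta.
        # Rewriting a block that is already in the delta reuses its slot.
        self.delta[block] = data

    def read(self, block):
        # The newest copy wins: look in the delta first, then fall back to the base.
        if block in self.delta:
            return self.delta[block]
        return self.base.get(block)

    def delta_bytes(self, block_size):
        # The delta grows with the number of distinct blocks written, not total writes.
        return len(self.delta) * block_size


if __name__ == "__main__":
    disk = SnapshottedDisk({n: b"original" for n in range(1000)})
    for n in range(250):               # the guest rewrites a quarter of the disk
        disk.write(n, b"changed")
    disk.write(0, b"changed again")    # overwriting a block adds nothing new
    print(disk.delta_bytes(4096))      # 250 distinct blocks * 4096 bytes
    print(disk.read(0), disk.read(999), disk.base[0])
```

Run a busy guest off a snapshot for six months and the set of distinct blocks it has touched can cover most of the disk, which is how a delta can end up nearly as large as the thick-provisioned base sitting next to it.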
Now, onto the issue of slowness when deleting a snapshot. When you delete a snapshot (committing the changes from the delta back into the main VMDK file), the blocks from your delta file are written back to the base disk in a largely random pattern. Because in most cases the delta file resides on the same datastore as the base disk, reads from the delta and writes to the base compete for the same physical disks. So don’t expect the speed you see in copy jobs, which work sequentially with large, buffered blocks. And the larger the delta, the longer the commit.
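A back-of-envelope comparison shows why this hurts. The sketch below assumes, as a simplification, that committing costs one random read from the delta plus one random write to the base for each chunk, all served by the same disks. The delta size, I/O size, IOPS and throughput figures are hypothetical, not measurements, and real consolidation behaviour varies with ESXi version and storage.

```python
GIB = 1024 ** 3
KIB = 1024


def estimate_commit_seconds(delta_bytes, io_size_bytes, random_iops):
    """Rough time to fold a delta back into its base using random I/O.

    Assumes (a simplification) one random read from the delta plus one
    random write to the base per chunk, both served by the same disks.
    """
    chunks = delta_bytes / io_size_bytes
    return 2 * chunks / random_iops


def estimate_sequential_copy_seconds(delta_bytes, bytes_per_second):
    """Rough time to stream the same amount of data with large sequential I/O."""
    return delta_bytes / bytes_per_second


if __name__ == "__main__":
    delta = 200 * GIB  # hypothetical: a snapshot left running for months
    commit = estimate_commit_seconds(delta, io_size_bytes=64 * KIB, random_iops=150)
    copy = estimate_sequential_copy_seconds(delta, bytes_per_second=150e6)
    print(f"commit ~{commit / 3600:.1f} h, sequential copy ~{copy / 60:.0f} min")
```

With these made-up numbers the random-I/O commit comes out at around 12 hours, versus about 24 minutes to stream the same amount of data sequentially. Both estimates scale linearly with the delta size, which is why how much data has been written since the snapshot matters so much.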
I see a lot of people using snapshots as backups. If you’re one of them, STOP! A snapshot depends on its base disk, so if the base is lost, the snapshot goes with it. Save yourself the headache that’s coming, and put an actual backup/restore procedure in place.
We engineers/software nerds commonly use a concept called version control. Code lives in a main branch, or trunk. Write a new feature or a bug fix, and check the new code into the “build”. If the bug fix doesn’t work out, back it out and go back to the build from before the fix. But if the bug fix DOES work out, it, in essence, BECOMES the new build.
Emulate this kind of thinking when you snapshot a virtual machine. Use snapshots not as backups for your VMs, but as a form of version control. Snapshots are intended for short-term use only. Got an OS patch coming for a critical VM? Take a snapshot and wait a couple of days, perhaps a week. If the patch misbehaves, revert to the snapshot. Once you’re certain the patch is viable and won’t cause excessive disruption, delete the snapshot so its changes are committed to the base disk!
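Keeping snapshots short-term is easier if something nags you about old ones. Here’s a minimal Python sketch of a snapshot-age check, assuming you’ve already pulled a list of snapshots (VM name, snapshot name, creation time) from your management tooling. The `SnapshotInfo` type, the `stale_snapshots` helper, the sample VM names and the seven-day default (taken from the “perhaps a week” above) are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SnapshotInfo:
    vm: str            # hypothetical VM name
    name: str          # snapshot name as shown in your inventory
    created: datetime  # timezone-aware creation time


def stale_snapshots(snapshots, now, max_age=timedelta(days=7)):
    """Return snapshots older than max_age, oldest first."""
    stale = [s for s in snapshots if now - s.created > max_age]
    return sorted(stale, key=lambda s: s.created)


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    inventory = [
        SnapshotInfo("db01", "pre-patch", now - timedelta(days=2)),
        SnapshotInfo("web01", "before upgrade", now - timedelta(days=180)),
        SnapshotInfo("app01", "test", now - timedelta(days=10)),
    ]
    for snap in stale_snapshots(inventory, now):
        age = (now - snap.created).days
        print(f"{snap.vm}: '{snap.name}' is {age} days old, commit or revert it")
```

Run on a schedule, a check like this turns “I’ll delete it next week” into a reminder rather than a six-month-old delta.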
Responsible snapshot usage will save you time, money, and trips to the bar to dull the headache you get from fixing out-of-control snapshot sizes.