Know the performance impact of snapshots used for backup!

A good understanding of VMware snapshots is essential when using backup tools that rely on snapshot technology.

Snapshots can hurt the performance of virtual machines. They can even lead to freezes or unresponsiveness lasting many minutes.

A traditional backup tool that is not aware a server is running as a virtual machine will back up the files inside the virtual machine. This loads the vCPU and the network, as all data is transferred over the network to the backup server. A much better approach is a backup tool that can make image-level backups. Many such tools are available on the market. They copy the virtual disks and also save the characteristics of the virtual machine, such as the number of vCPUs and the amount of memory. This is very important when a full recovery of one or more virtual machines is needed: you do not want to manually re-create virtual machines, boot an operating system and start recovering files.

To make a crash-consistent backup of a VMware virtual machine, backup software uses VMware's snapshot technology. It simply sends an API call to vCenter Server or the ESX(i) host with a request to create a snapshot. Once the snapshot exists, data of the virtual machine is no longer written to the VMDK files of the virtual machine. Instead, data is written to a temporary snapshot file. This enables a consistent backup of the virtual machine because the VMDK does not change while the backup runs.
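To make the redirect of writes a bit more tangible, here is a minimal sketch in Python. It is a toy model only, not VMware code: the class `SnapshottedDisk` and its methods are made up for illustration, a dictionary stands in for the VMDK and a second dictionary stands in for the snapshot file.

```python
class SnapshottedDisk:
    """Toy model of a virtual disk that can have one open snapshot."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # stands in for the VMDK
        self.delta = None         # stands in for the snapshot file

    def create_snapshot(self):
        # From now on, writes are redirected to the delta.
        self.delta = {}

    def write(self, block_no, data):
        target = self.base if self.delta is None else self.delta
        target[block_no] = data

    def read(self, block_no):
        # The guest sees the newest copy: delta first, then base.
        if self.delta is not None and block_no in self.delta:
            return self.delta[block_no]
        return self.base.get(block_no)

    def commit(self):
        # Merge everything from the delta back into the base.
        if self.delta is not None:
            self.base.update(self.delta)
            self.delta = None


disk = SnapshottedDisk({0: "boot", 1: "data-v1"})
disk.create_snapshot()
disk.write(1, "data-v2")          # guest keeps writing during the backup

backup_copy = dict(disk.base)     # what the backup tool reads
guest_view = disk.read(1)         # what the guest sees

disk.commit()                     # backup done, snapshot removed
after_commit = dict(disk.base)
```

In this model the backup tool still reads the old contents of block 1 while the guest already sees the new contents, and only the commit brings the two together again.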

Mind that the backup is crash consistent, which means the virtual machine will boot up when recovered from backup! This does not guarantee that applications using databases will start up fine as well. To make sure databases can be recovered, a VSS snapshot needs to be made as well when creating snapshots.

When the backup is done, the backup software deletes the snapshot file. Deleting or committing the snapshot file means all data that has temporarily been written to the snapshot file is merged into the VMDK files. While this snapshot commit is running, a second snapshot file is created to store the data written by the VM during the commit phase.

When a VM is busy with disk writes during a snapshot-based backup (for example a heavily used Exchange Server or database), the commit will take some time and use a relatively large amount of I/O. Obviously lots of data needs to be read and written. VMware ESX(i) needs to read the VMDK to see whether blocks of data exist that are also present in the snapshot file. If data does not exist in the VMDK and does exist in the snapshot file, that data is written to the VMDK.

At the end of the commit the VM needs to be frozen for a short while to be able to process the last writes without having to create another snapshot file. This is called stun/unstun.
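Why a busy VM makes the commit so much longer can be shown with a rough back-of-the-envelope model. The sketch below is my own simplification and not how ESX(i) actually schedules the work: it assumes the commit runs in passes, that each pass merges the current snapshot file at a fixed rate while the guest keeps writing at a fixed rate into the next snapshot file, and that the VM is stunned once the remaining data drops below a threshold. The function name, the rates and the 16 MB threshold are all hypothetical.

```python
def estimate_commit(delta_mb, merge_mb_s, write_mb_s,
                    stun_threshold_mb=16.0, max_passes=50):
    """Return (passes, total_seconds, stun_seconds) for a simplified commit.

    Each pass merges the current delta while new guest writes pile up
    in a fresh delta. The final, small delta is merged during the stun.
    """
    if merge_mb_s <= 0:
        raise ValueError("merge rate must be positive")
    remaining = float(delta_mb)
    total = 0.0
    passes = 0
    while remaining > stun_threshold_mb and passes < max_passes:
        duration = remaining / merge_mb_s
        total += duration
        remaining = write_mb_s * duration  # data written during this pass
        passes += 1
    stun = remaining / merge_mb_s          # VM is frozen for this long
    return passes, total + stun, stun


# Same 2 GB snapshot file, same storage, different guest write load.
quiet_passes, quiet_total, quiet_stun = estimate_commit(2048, 100, 1)
busy_passes, busy_total, busy_stun = estimate_commit(2048, 100, 80)
```

In this model the quiet VM is done after two passes, while the busy VM needs many more passes and several times the total commit time. If the guest writes as fast as the storage can merge, the remaining data never shrinks and the loop only stops at `max_passes`, which leaves a large amount of data for the stun.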

The whole process is described in an excellent article by Erik Zandboer titled Performance impact when using VMware snapshots.

The commit process can have a negative effect on the performance of virtual machines during backup. I have seen cases where users could not log in to servers because a snapshot commit was in progress while the backup job was running. vCenter Server showed the task ‘Remove snapshot’ with status 95%, and the process seemed to be stuck at 95% for a long time. As soon as the backup job was aborted the performance issues were over.

The effect of the snapshot commit is one of the reasons Veeam Backup & Replication processes virtual machine backups serially. A backup job with 10 VMs starts with the backup of the first VM, continues with the next VM when that one is ready, and so on. If parallel processing is used there is a chance the jobs will saturate the SAN: first because the data read from the SAN is sent to the backup storage, and secondly because the snapshot commits are done when the backups have finished. Another disadvantage of parallel processing in Veeam is that multiple backup jobs need to be created, which reduces the efficiency of the de-duplication done by Veeam B&R.

The effect is even more noticeable when virtual machines are stored on NFS storage. See the VMware KB article titled Virtual machine freezes temporarily during snapshot removal on an NFS datastore in a ESX/ESXi 4.1 host.

Here are some links to postings on the forum about timeouts and freezes of virtual machines during snapshot removal. Again, these issues are not related to Veeam B&R. Timeouts can occur with any backup tool that uses snapshots. They are caused by the way VMware handles snapshots.

VM timeouts during snapshot commit
Guest VM halts during replication snapshot
Snapshot removal issues of a large VM

Besides being able to back up a virtual machine, Veeam B&R is also able to replicate virtual machines. This enables a low RTO and RPO, a bit like Site Recovery Manager does. Mind, however, that the replication function uses snapshots! If your replication schedule is set to every 15 minutes during working hours on a server that is busy writing, there is a risk of performance issues!
Make sure to test the effect of replication jobs on the performance of virtual machines. If the effect is negative, an alternative solution should be considered. While Veeam does replication at the virtual machine disk level, Microsoft DPM does replication at the application data level. DPM does not create a snapshot or use Changed Block Tracking. Instead it transfers the changed data blocks of files, so there are no snapshot commits.

To avoid performance issues on virtual machines caused by snapshots:

  • Use snapshots only to save the state of a virtual machine and delete them as soon as possible. Do not keep a snapshot file if it is not needed anymore. The snapshot file will grow fast on servers with lots of writes, and committing the snapshot will have an impact on performance. Forgetting to delete the snapshot can lead to disk full situations. Use tools like RVtools to quickly see if a VM has live snapshots.
  • Make sure the performance of the storage area network is known before a full rollout of backup jobs.
  • Run backup jobs preferably during off hours, when the load on servers is low.
  • If replication is used, make sure you know the impact before applying it in production.
  • To reduce the effect of the snapshot commit, store the snapshot file on SSD drives or spread the I/O over the available LUNs. See this article for more info.
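Checking for forgotten snapshots is easy to automate once you have an inventory of them. The Python sketch below assumes you already have that inventory as a list of records, for example converted from an export of a reporting tool. The field names (`vm`, `name`, `created`, `size_gb`), the sample data and the thresholds of 24 hours and 20 GB are all hypothetical, so adjust them to your own environment.

```python
from datetime import datetime, timedelta


def stale_snapshots(snapshots, now,
                    max_age=timedelta(hours=24), max_size_gb=20.0):
    """Return snapshots that are too old or too large, oldest first."""
    flagged = []
    for snap in snapshots:
        age = now - snap["created"]
        if age > max_age or snap["size_gb"] > max_size_gb:
            flagged.append((snap["vm"], snap["name"], age, snap["size_gb"]))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical inventory, hard-coded so the example runs on its own.
now = datetime(2012, 3, 1, 9, 0)
inventory = [
    {"vm": "exch01", "name": "before-patch",
     "created": datetime(2012, 2, 20, 18, 0), "size_gb": 35.0},
    {"vm": "web01", "name": "backup-temp",
     "created": datetime(2012, 3, 1, 8, 30), "size_gb": 0.4},
    {"vm": "sql01", "name": "backup-temp",
     "created": datetime(2012, 3, 1, 2, 0), "size_gb": 42.0},
]

report = stale_snapshots(inventory, now)
for vm, name, age, size_gb in report:
    print(f"{vm}: snapshot '{name}' is {age.days}d old and {size_gb:.1f} GB")
```

Here `exch01` is flagged because the snapshot is days old, and `sql01` because the snapshot file has grown large within a few hours, which is exactly the case where the commit will hurt most.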

More info
Timothy Dewin has written a great blog post explaining how snapshots work and why performance can be negatively affected during creation and deletion of snapshots. Recommended read!
