<update June 4>
Jason Gill posted the Root Cause Analysis that VMware performed on the VSAN issue described below. The issue was indeed caused by the Dell PERC H310 controller, which has a very low queue depth. A quote:
While this controller was certified and is in our Hardware Compatibility List, its use means that your VSAN cluster was unable to cope with both a rebuild activity and running production workloads. While VSAN will throttle back rebuild activity if needed, it will insist on minimum progress, as the user is exposed to the possibility of another error while unprotected. This minimum rebuild rate saturated the majority of resources in your IO controller. Once the IO controller was saturated, VSAN first throttled the rebuild, and — when that was not successful — began to throttle production workloads.
Read the full Root Cause Analysis here at Reddit
Another interesting observation while reading the thread on Reddit is that the Dell PERC H310 actually is an OEM version of the LSI 2008 card. John Nicholson wrote a very interesting blog about the H310 here.
Dell seems to ship the H310 with old firmware. With the latest firmware, the queue depth of the Dell PERC H310 can be increased to 600!
We went from 270 write IOPS at 30 ms of write latency to 3000 write iops at .2ms write latency just by upgrading to the new firmware that took queue depth from 25 to 600
This article explains how to flash a Dell PERC H310 with newer firmware. I am not sure whether a flashed PERC H310 is supported by VMware. As an HBA with better specs is not that expensive, I advise flashing a Dell PERC H310 only when it is used in non-production environments.
June 02, 2014
An interesting post appeared on Reddit. The post, titled My VSAN nightmare, describes a serious issue in a VSAN cluster. When one of the three storage nodes failed with a purple screen, initially all seemed fine. VMware HA kicked in and restarted the VMs on the surviving nodes (two compute and two storage nodes). The customer was worried about redundancy, as storage was now located on just two nodes. So SSD and HDD storage was added to one of the compute nodes. This node did not have local storage before.
However, exactly 60 minutes after the new storage was added, DRS started to move VMs to other hosts, a lot of I/O was seen, and all (about 77) VMs became unresponsive and died. VSAN Observer showed that I/O latency had jumped to 15-30 seconds (up from just a few milliseconds on a normal day).
VMware support could not solve the situation and basically said to the customer: “wait till this I/O storm is over”. About 7 hours later the critical VMs were running again. No data was lost.
At the moment VMware support is analyzing what went wrong to be able to make a Root Cause Analysis.
Issues on VSAN like the one documented on Reddit are very rare. This post takes a look under the covers of VSAN. I hope it helps you understand what is going on under the hood of VSAN, and it might prevent this situation from happening to you as well.
Let’s have a closer look at the VSAN hardware configuration of the customer who wrote about his experiences on Reddit.
VSAN hardware configuration
The customer was using 5 nodes in a VSAN cluster: 2x compute nodes (no local storage) and 3x storage nodes, each with 6x magnetic disks and 2x SSDs, split into two disk groups each.
Two 10 Gb NICs were used for VSAN traffic. A Dell PERC H310 controller was used, which has a queue depth of only 25. The HDDs were Western Digital WD2000FYYZ drives: 2 TB, 7200 rpm, SATA. The SSDs were Intel DC S3700 200 GB.
The Dell PERC H310 is interesting, as Duncan Epping states in his post here:
Generally speaking it is recommended to use a disk controller with a queue depth > 256 when used for VSAN or “host local caching” solutions
VMware VSAN Hardware Guidance also states:
The most important performance factor regarding storage controllers in a Virtual SAN solution is the supported queue depth. VMware recommends storage controllers with a queue depth of greater than 256 for optimal Virtual SAN performance. For optimal performance of storage controllers in RAID 0 mode, disable the write cache, disable read-ahead, and enable direct I/Os.
Dell states about the Dell PERC H310:
Our entry-level controller card provides moderate performance.
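The guidance quoted above boils down to a simple threshold check. The following is a minimal sketch, not a VMware tool: the function name is made up for illustration, and the two queue depth values are the ones reported for the PERC H310 in this post (25 with stock firmware, 600 after flashing newer firmware).

```python
# Threshold taken from the VMware guidance quoted above: "greater than 256".
RECOMMENDED_MIN_QUEUE_DEPTH = 256


def meets_guidance(queue_depth: int,
                   minimum: int = RECOMMENDED_MIN_QUEUE_DEPTH) -> bool:
    """Return True when the queue depth is strictly greater than the minimum."""
    return queue_depth > minimum


# Queue depths as reported in this post for the same controller.
controllers = {
    "Dell PERC H310 (stock firmware)": 25,
    "Dell PERC H310 (newer firmware)": 600,
}

for name, depth in controllers.items():
    verdict = "meets" if meets_guidance(depth) else "falls short of"
    print(f"{name}: queue depth {depth} {verdict} "
          f"the > {RECOMMENDED_MIN_QUEUE_DEPTH} guidance")
```

Note that passing this check says nothing about whether a flashed controller is supported; it only compares a number against the published recommendation.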
Before we dive into the possible cause of this issue, let’s first cover some basics of VMware VSAN. Both Duncan Epping and Cormac Hogan of VMware wrote some great posts about VSAN. Recommended reads! See the links at the end of this post.
There are two ways to install a new VSAN server:
- assemble one yourself using components listed in the VSAN Hardware Compatibility Guide
- use one of the VSAN Ready Nodes which can be purchased. 16 models are available now from various vendors like Dell and Supermicro.
Dell has 8 different servers listed as VSAN Ready Nodes. One of them is the PowerEdge R720-XD, which is the same server type used by the customer describing his VSAN nightmare. However, the Dell VSAN Ready Node has 1 TB NL-SAS HDDs while the Reddit case used 2 TB SATA drives. So he was most likely using servers he assembled himself.
Interestingly, 4 out of the 8 Dell VSAN Ready Node servers use the Dell PERC H310 controller. Again, VMware advises a controller with a queue depth greater than 256, while the PERC H310 has 25.
VSAN storage policies
For each virtual machine or virtual disk active in a VSAN cluster, an administrator can set ‘virtual machine storage policies’. One of the available storage policies is named ‘number of failures to tolerate’. When set to 1, virtual machines to which this policy is applied will survive a failure of a single disk controller, host or NIC.
VSAN provides this redundancy by creating one or more replicas of the VMDK files and storing them on different storage nodes in the VSAN cluster.
If a replica is lost, VSAN will initiate a rebuild. A rebuild recreates the lost replica of the VMDKs.
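The relation between ‘number of failures to tolerate’ and the hardware needed can be written down as two small formulas. This is a sketch under the assumption that VSAN mirrors data as described above: tolerating n failures takes n+1 replicas and, counting witness components, at least 2n+1 hosts contributing storage (see the Cormac Hogan post on hosts needed, listed at the end). The function names are my own.

```python
def replicas_needed(failures_to_tolerate: int) -> int:
    """Number of full data copies kept for a given failures-to-tolerate value."""
    return failures_to_tolerate + 1


def min_storage_hosts(failures_to_tolerate: int) -> int:
    """Minimum hosts contributing storage, assuming the 2n+1 rule."""
    return 2 * failures_to_tolerate + 1


for ftt in (1, 2):
    print(f"failures to tolerate = {ftt}: "
          f"{replicas_needed(ftt)} replicas, "
          f"at least {min_storage_hosts(ftt)} storage hosts")
```

With three storage nodes, the cluster in the Reddit case sat exactly at the minimum for tolerating one failure, so losing a node left no spare host on which to rebuild.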
VSAN response to a failure.
VSAN’s response to a failure depends on the type of failure.
A failure of an SSD, HDD or the disk controller results in an immediate rebuild. VSAN understands that this is a permanent failure which is not caused by, for example, planned maintenance.
A failure of the network or a host results in a rebuild which is initiated after a delay of 60 minutes. This is the default wait. The wait exists because the absence of a host or network could be temporary (maintenance, for example), and waiting prevents wasting resources. Duncan Epping explains the details in his post How VSAN handles a disk or host failure.
The image below was taken from this blog.
If the failed component returns within 60 minutes only a data sync will take place. Here only the data changed during the absence will be copied over to the replica(s).
A rebuild, however, means that a new replica will be created for all VMDK files that are not compliant. This is also referred to as a ‘full data migration’.
To change the delay time, see the VMware KB article Changing the default repair delay time for a host failure in VMware Virtual SAN (2075456).
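The response paths described in this section can be summarized as a small decision function. This is an illustrative sketch only: the enum, the function name and the returned strings are hypothetical and do not correspond to any VSAN API. It simply encodes the rules stated above (permanent failures rebuild immediately, host or network failures wait for the repair delay, and a component that returns within the delay is only resynced).

```python
from enum import Enum
from typing import Optional

DEFAULT_REPAIR_DELAY_MIN = 60  # default wait described above


class Failure(Enum):
    SSD = "SSD"
    HDD = "HDD"
    DISK_CONTROLLER = "disk controller"
    HOST = "host"
    NETWORK = "network"


# Failures the text describes as permanent: the rebuild starts right away.
PERMANENT = {Failure.SSD, Failure.HDD, Failure.DISK_CONTROLLER}


def vsan_response(failure: Failure,
                  returned_after_min: Optional[float] = None,
                  repair_delay_min: float = DEFAULT_REPAIR_DELAY_MIN) -> str:
    """Describe the expected response to a failure.

    returned_after_min is None when the failed component has not come back.
    """
    if failure in PERMANENT:
        return "immediate rebuild"
    if returned_after_min is not None and returned_after_min < repair_delay_min:
        return "resync changed data only"
    return f"rebuild after {repair_delay_min} minutes"


print(vsan_response(Failure.HDD))
print(vsan_response(Failure.HOST, returned_after_min=20))
print(vsan_response(Failure.HOST))
```

In the Reddit case the failed host did not return within the delay, which corresponds to the last call: a full rebuild once the 60 minutes had passed.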
Control and monitor VSAN rebuild progress
At the moment VMware does not provide a way to control and monitor the progress of the rebuild process. In the case described on Reddit, VMware basically advised ‘wait and it will be alright’. There was no way to predict for how long the performance of all VMs stored on VSAN would be badly affected by the rebuild. The only way to see the status of a VM is to click on the VM in the vSphere web client, select its storage policies tab, then click on each of its virtual disks and check the list. It will tell you “Active”, “Reconfiguring”, or “Absent”.
For monitoring, VSAN Observer provides insight into what is happening.
Looking at clomd.log could also give an indication of what is going on. This is the logfile of the Cluster Level Object Manager (CLOM).
It is also possible to use command line tools for administration, monitoring and troubleshooting. VSAN is managed from the command line through the Ruby vSphere Console (RVC). Florian Grehl wrote a few blog posts about managing VSAN using RVC:
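Since the status has to be checked disk by disk, it can help to tally what you find. The sketch below assumes you have noted the statuses by hand from the web client; the VM and disk names are invented for illustration, and nothing here queries VSAN itself.

```python
from collections import Counter

# Hypothetical, manually collected statuses (names are made up).
observed = {
    "vm-a / Hard disk 1": "Active",
    "vm-a / Hard disk 2": "Reconfiguring",
    "vm-b / Hard disk 1": "Absent",
    "vm-c / Hard disk 1": "Active",
}


def summarize(statuses: dict) -> Counter:
    """Count how many virtual disks are in each reported state."""
    return Counter(statuses.values())


def needs_attention(statuses: dict) -> list:
    """Return the disks that are not reported as Active, sorted by name."""
    return sorted(name for name, state in statuses.items() if state != "Active")


print(summarize(observed))
print(needs_attention(observed))
```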
- Part 1 – Basic Configuration Tasks
- Part 2 – VSAN Cluster Administration
- Part 3 – Object Management
- Part 4 – Troubleshooting
- Part 5 – VSAN Observer
The VMware VSAN Quick Troubleshooting and Monitoring Reference Guide has many details as well.
It looks like the VSAN rebuild process, which started exactly 60 minutes after the extra storage had been added, initiated the I/O storm. VSAN was correcting a noncompliant storage profile and started to recreate replicas of VMDK objects.
A possible cause of this I/O storm could be that the rebuild of almost all VMDK files in the cluster was executed in parallel. However, according to Dinesh Nambisan, who works on the VMware VSAN product team:
“VSAN does have an inbuilt throttling mechanism for rebuild traffic.”
VSAN seems to use a Quality of Service system for throttling back replication traffic. How this exactly works, and whether it can be controlled by customers, is unclear. I am sure we will soon learn more about this, as it seems key to solving future issues with low-end controllers and HDDs combined with a limited number of storage nodes.
While the root cause has yet to be determined, a combination of configuration choices could have caused this:
1. Only three servers in the VSAN cluster were used for storage. When one failed, only two were left. Both were busy rebuilding replicas for about 77 virtual machines at the same time.
2. Using SATA 7200 rpm drives as the HDD persistent storage layer. This is fine for normal operations when SSD is used for cache. During a rebuild, however, these are not the most powerful drives, and they have low queue depths.
3. Using an entry-level Dell PERC H310 disk controller. The queue depth of this controller is only 25, while the advice is to use a controller with a queue depth greater than 256.
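To get a feel for how small a queue depth of 25 is, consider how many queue slots each device gets. This is a rough back-of-the-envelope sketch: it assumes all 6 HDDs and 2 SSDs of a storage node sit behind a single controller and that the controller queue is shared evenly, which is a simplification of how a real controller behaves.

```python
def slots_per_device(controller_queue_depth: int, devices: int) -> float:
    """Average controller queue slots per device, assuming even sharing."""
    return controller_queue_depth / devices


DEVICES_PER_NODE = 6 + 2  # 6 HDDs and 2 SSDs per storage node, as described above

for label, depth in (("stock firmware", 25), ("newer firmware", 600)):
    share = slots_per_device(depth, DEVICES_PER_NODE)
    print(f"PERC H310 ({label}): queue depth {depth}, "
          f"about {share:.1f} slots per device")
```

Under these assumptions the stock controller leaves roughly 3 outstanding I/Os per device for production and rebuild traffic combined, against 75 with the newer firmware.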
Recommendations
1. Just to be on the safe side, use controllers with a queue depth greater than 256.
2. For production workloads, use N+2 redundancy.
3. Use NL-SAS drives or better HDDs. These have much higher queue depths (256) compared to SATA HDDs (32).
4. In case of a failure of a VSAN storage node: try to fix the server by swapping memory/components to prevent rebuilds. A sync is always better than a rebuild.
5. It would be helpful if VMware added more control over the rebuild process. When N+2 is used, rebuilds could be scheduled to run only during non-business hours. Some sort of control over the priority in which replicas are rebuilt would also be nice. Something like this:
In case of N+1: tier 1 VMs rebuild after 60 minutes. Tier 2 and 3 rebuild during non-business hours.
In case of N+2: all rebuilds only during non-business hours. Tier 1 VMs first, then tier 2, then tier 3, etc.
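To make the wish above concrete, here is a sketch of what such a tier-aware rebuild plan could look like. To be clear, this is not an existing VSAN feature: the `VM` class, the `rebuild_plan` function and the tier numbers are all hypothetical and only illustrate the proposal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VM:
    name: str
    tier: int  # 1 = most critical


def rebuild_plan(vms: List[VM], redundancy: int) -> List[Tuple[str, str]]:
    """Order rebuilds by tier and assign a window to each VM.

    redundancy is 1 for N+1 and 2 for N+2, following the proposal above.
    """
    plan = []
    # sorted() is stable, so VMs within the same tier keep their input order.
    for vm in sorted(vms, key=lambda v: v.tier):
        if redundancy == 1 and vm.tier == 1:
            window = "after 60 minutes"
        else:
            window = "non-business hours"
        plan.append((vm.name, window))
    return plan


vms = [VM("file-server", 2), VM("database", 1), VM("test-box", 3)]
print(rebuild_plan(vms, redundancy=1))
print(rebuild_plan(vms, redundancy=2))
```

With N+1 only the tier 1 VM is rebuilt right after the delay; with N+2 everything waits for non-business hours, but tier 1 still goes first.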
Some other blogs about this particular case
Jeramiah Dooley Hardware is Boring–The HCL Corollary
Hans De Leenheer VSAN: THE PERFORMANCE IMPACT OF EXTRA NODES VERSUS FAILURE
Some useful links providing insights into VSAN
Jason Langer : Notes from the Field: VSAN Design–Networking
Duncan Epping and others wrote many posts about VSAN. Here is a complete overview.
A selection of those blog posts which are interesting for this case:
Duncan Epping How long will VSAN rebuilding take with large drives?
Duncan Epping 4 is the minimum number of hosts for VSAN if you ask me
Duncan Epping How VSAN handles a disk or host failure
Duncan Epping Disk Controller features and Queue Depth?
Cormac Hogan VSAN Part 25 – How many hosts needed to tolerate failures?
Cormac Hogan Components and objects and What is a witness disk