Disk failure testing on LSI Based Virtual SAN RAID0 controllers

Disk failure testing can be an integral part of a Proof of Concept, you want to ensure that Virtual SAN behaves in the correct way right?? In this post I shall talk about how to successfully perform disk failure testing on LSI Based RAID0 controllers in order to help you validate the behaviour expected from Virtual SAN.? Before I do that, I have seen many instances where people are attempting to simulate a disk failure by pulling a disk out of the backplane slot and not seeing what they expect to be expected behaviour, pulling a disk is not the same as an actual disk failure, Virtual SAN will mark a pulled disk as absent and will wait for the value of ClomRepairDelay (which the default is 60 minutes) before starting a rebuild of components residing on that disk.? With a failed disk Virtual SAN will rebuild components residing on that disk immediately, so you can now see the expected behaviour in the two different scenarios.

So how do we perform disk failure testing on an LSI Based RAID 0 controller?? Let’s first of all look at the different options we have

 

Health UI Plugin

In Virtual SAN? 6.0 there is the option to deploy the Health UI Plugin, this plugin for vCenter has a secondary routine that will deploy out components onto the ESXi host to allow them to report back to the the plugin on any issues that are seen, another part of this is the Error Injection routine that allows the user to inject a temporary or permanent device failure and also clear the injected error, for details on how to use the error injection, please refer to my post on Using the Error Injection Tool in the Virtual SAN 6.0 Health UI plugin to simulate a disk failure, for VSAN 5.5 users this feature is not available.

 

LSI based ESXi command line tools

LSI have a utility that is deployable on an ESXi host th manipulate and make changes to their controllers without having to go into the controller BIOS, the utility is very powerful and allows you to perform functions such as:

  • Create or Destroy a RAID Virtual Disk
  • Change cache policies
  • Change the status of Virtual Disks
  • Block access to Virtual Disks
  • Import Foreign Config

Please note: For the latest version of StorCli please visit the LSI Website for MegaCli please visit your server vendor website

Above are just a few examples of things that you can do using the command line tools.? Depending on who the OEM is for your LSI Based card, you would either use StorCLI or MegaCLI, we will cover both command sets for performing the same operation, for the cross references, I used this document from LSI, So which command option do we use to simulate a disk failure?

The obvious choice would be to set a RAID virtual disk or a Physical disk into an Offline state, before I seen the behaviour of doing this first hand I would have agreed 100%, however after seeing what happened after doing this made me look further at other options of achieving the same thing, but the question you may be asking is why would this not work?

When placing a RAID disk or Physical disk into an Offline state, what I observed in the ESXi logs is that there was no SCSI sense codes being received by the host from the controller, I could see a Host Side NMP Error of H:0x1 which equates to a No-Connect and the action from that is to failover, the disk was marked as Absent in the Virtual SAN UI but not failed, so obviously this is not what we wanted to achieve.? After a bit more digging and testing I finally stumbled across an option which after testing a few times resulted in the correct behaviour, so what was the option?

> storcli /c0 /v2 set accesspolicy=blocked

In the above command I am selecting Controller 0 with the /c0 option, and Virtual Disk 2 with the /v2 option, what this did is block access to the RAID Virtual Disk and send SCSI Sense Code information to the ESXi host which was:

  • Device Side NMP Error D:0x2 = Check Condition
  • SCSI Sense Code 0x5 = Illegal Request
  • ASC 0x24 = Invalid Field in CDB
  • ASC 0x25 = Logical Unit not Supported

This resulted in the disk being reported as Failed which was the expected behaviour, the desired command for MegaCli is as follows:

> MegaCli -LDSetProp -Blocked -L2 -a0

Where -a0 is the adapter number and -L2 is the Logical Drive number (Virtual Disk)

After the disk failure testing has been completed, in order to remove the blocked access to the Virtual Disk you issue the following StorCli command:

> storcli /c0 /v2 set accesspolicy=RW

And in MegaCli

> MegaCli -LDSetProp -RW -L2 -a0

After you have re-allowed access to the Virtual disk you will need to go into the Virtual SAN UI and remove the affected disk from the disk group and re-add it back in, the same applies if you was performing the failure test on the SSD that heads a disk group, only this time you will have to remove the whole disk group and re-add, after all you have just in effect replaced a failed disk 🙂

Leave a Reply