Day-2 Operations – Health Monitoring

Health monitoring of an infrastructure is a key element of day-to-day operations. Knowing whether something is healthy or unhealthy can make the difference between suffering business impact and taking remediation steps in time to prevent it.

There are three ways you can monitor the health of vSAN: the native Health Service, which is built into the vSphere UI; vRealize Operations (vROPS); and the API. Each has advantages and disadvantages over the others. For the first part we are going to cover the Health Service UI, which is incorporated into the vSphere UI.

vSAN Health Service
In the current release of vSAN (6.6.1) there are two aspects to the Health Service: an Offline version and an Online version. The Offline version is embedded in the vCenter UI code, and new features are added to it when patches, releases or updates for vSphere are released. The Online portion of the Health UI is more dynamic: newer health checks are added as part of the Customer Experience Improvement Program (CEIP). A major advantage of the Online version is that critical patch releases for vSAN can be flagged through the Health UI, which is a really cool feature. It also dynamically adds new alarms to vCenter as part of the health reporting, so as VMware gets feedback on how customers are using vSAN, it can create alarms dynamically to alert you to, or help you avoid, situations that are a cause for concern.

In order to use the Online portion of the Health Service you need to opt in to CEIP, which is as simple as ensuring your vCenter Server has internet connectivity and providing your My VMware account credentials. A lot of customers are concerned about allowing their vCenter Server to connect out to the internet. As a workaround, I recommend restricting vCenter's outbound access so that it can only reach specific destinations, for example through a proxy server or a whitelist.

The Health Service is designed to report on all aspects of vSAN health and to trigger alarms in vCenter to alert you to anything you should pay attention to. In the previous screenshot you will notice that I have a warning against the cluster. This is because the cluster disks are not evenly balanced, a result of me placing two hosts into maintenance mode to perform firmware updates. As you can see from the screenshot on the right, this has also triggered an alarm in vCenter.

A really cool feature of the Health Service is the “Ask VMware” button. Simply highlight an issue and click the button, and it will load a VMware Knowledge Base article telling you what the issue means, why it has occurred and the steps to remediate it. As many of you know, I come from a support background and spent a good few years in VMware Support, so in my opinion the ability to self-help and be given the right information straight away at the click of a button can be a huge time saver. There is a KB article for every section of the Health Service and, as you can see from the screenshot on the left for my disk balance warning, there is quite a lot of detail in each KB article and the resolution steps are well documented and easy to follow. If your issue still persists after you have followed the steps in the KB, remember to mention in your support request that you have followed the KB article so you are not asked to run through it again as part of troubleshooting.

When you have deployments such as a 2-Node ROBO or a Stretched Cluster, you do not have to tell the Health Service about this. It will automatically detect the deployment type and populate the appropriate health checks, such as site-to-site latency and witness connectivity.

Critical Patch Updates – The Online Health Service can also make you aware of a critical patch release as part of the “Build Recommendations” element of the Health Service. When a critical patch is released, the Online Health Service dynamically adds entries to the UI to alert you to the release, which in my opinion is so much better than waiting for an email notification. The benefit of this is that it can be tailored to your environment, so you do not receive notifications for updates that are not applicable to you.

A question I get quite often is how frequently the Health Service reports. The simple answer is that by default it checks the health every 60 minutes; in my environment I have this set to the lowest value, which is 15 minutes. However, if there is a critical issue, for example a host failure, a network connectivity issue or a disk failure, the Health Service will update with this information almost immediately rather than waiting for the next refresh cycle, and again you will see vCenter alarms triggered for the events.
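
The refresh behaviour described above can be pictured with a small toy model. This is only a sketch of the logic as I have described it (a periodic check, with critical events reported straight away); the function and constant names are made up for illustration and are not part of any vSAN API.

```python
from datetime import datetime, timedelta

DEFAULT_INTERVAL_MIN = 60  # default check interval mentioned in the text
MIN_INTERVAL_MIN = 15      # lowest interval mentioned in the text


def next_health_update(last_check, interval_min=DEFAULT_INTERVAL_MIN,
                       critical_event_at=None):
    """Return when the health view would next refresh in this toy model."""
    if interval_min < MIN_INTERVAL_MIN:
        raise ValueError("interval is below the 15 minute minimum")
    scheduled = last_check + timedelta(minutes=interval_min)
    # A critical event (host, network or disk failure) is reported right away
    # instead of waiting for the next scheduled cycle.
    if critical_event_at is not None and last_check <= critical_event_at < scheduled:
        return critical_event_at
    return scheduled


if __name__ == "__main__":
    start = datetime(2017, 9, 1, 12, 0)
    print(next_health_update(start))                     # periodic refresh
    print(next_health_update(start, 15))                 # shortest interval
    print(next_health_update(start, 60, start + timedelta(minutes=5)))
```

The point of the model is simply that the interval only governs the routine checks; anything critical short-circuits the schedule.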

vRealize Operations
Now I have to admit, I am not a vROPS specialist in any way, shape or form, but I have deployed vROPS in my demo environment and got the vSAN dashboards operational pretty easily without any challenges. To be clear, I am using vROPS 6.6.1, which has native vSAN dashboards built in. These were not present in earlier versions, which required you to use the vSAN Storage Management Pack to capture vSAN metrics. Below is a screenshot showing the default metrics reported in the vSAN Operations Overview dashboard.

One immediate advantage that vROPS has over the vSphere UI is that it can display a holistic view of all your vSAN clusters, whereas the vSphere UI only shows you the status of the cluster that is in focus. If I had multiple vSAN clusters deployed, they would all be listed here in this single dashboard, which makes operational life that little bit easier.

You can see there’s a wealth of information at your fingertips from an operational perspective. You can immediately see how your clusters are performing, as well as any potential issues that have triggered alarms. That leads us to the Troubleshooting dashboard, where you can immediately see the reason for my 8 “Red” alerts:

As you can see, just like the Operations Overview dashboard, the Troubleshooting dashboard has a lot of information. It is designed to let you drill down into specific areas and components within vSAN, and it provides heatmaps for areas such as disks, where a red heatmap will draw your attention. For example, if I double-click on one of the green squares in section 9, which is labelled “Is the write buffer full on diskgroups”, it will take me to that specific cache disk:

This takes me to a dashboard specific to that disk group, which provides the following metrics:

From the above screenshot I can see various important pieces of information about my disk group. If the heatmap were red for this disk group, I would easily be able to see why from the metrics presented; in my case the disk group is healthy.

There is another dashboard in vROPS called Capacity Overview, which I will cover in another Day-2 Operations post based around capacity reporting, so watch out for that one.

So as you can see, there are advantages and disadvantages to using the Health Service over vROPS and vice versa. In my opinion both tools are important in day-to-day operations, and being able to use both gives you the toolset to manage your environments more effectively and more easily.


Why HCI Matters in the Datacenter and Beyond

Technology is changing and evolving at an ever-increasing pace. Whether you are a consumer of electronics or the CEO of a large organisation with a large IT infrastructure, the changes in technology affect us all in different ways. CPUs and Flash Storage are an example of this: we are now in an era of constantly increasing CPU core densities, and Flash Storage is becoming bigger and faster. These technology transformations are changing the way we operate not only as human beings in our own personal IT bubbles at home, but also within organisations.

As organisations large and small take on business transformation, a key element of that transformation is their IT. For the last 15 years IT was IT centric, focused on traditional applications and the wide adoption of the internet. The next 15 years pose some challenges as IT becomes more business centric, along with cloud applications and the Internet of Everything.

A key enabler of the whole IT transformation is the Software Defined Data Center. Many of you will have heard me talk about the Software Defined Data Center not as an object, but more as an Operating System that runs your IT infrastructure. If you are asked what three things are required to run an operating system, you’ll find yourself answering storage, compute and networking for connectivity, which are essentially the three key elements that make up the Software Defined Data Center.

Hyper-Converged Infrastructure (HCI) allows you to deliver the capabilities that underpin the whole Software Defined Data Center on a standard x86 architecture, and it offers a building block approach. It also brings the storage closer to the CPU and memory, which is highly beneficial in a virtualised environment, and it is VM centric rather than storage centric.

So why is HCI being adopted by the masses?

There are a number of reasons for this. We’ve already outlined the fact that having the storage closer to the compute delivers a much more efficient platform, but beyond that there is a Hardware Evolution driving the changes in infrastructure, much like an Infrastructure Revolution.

Higher CPU core densities mean you can run much denser workloads. In conjunction with this, RAM has become commoditized, more affordable and available in larger capacities. From a storage aspect, Flash has evolved in such a way that it has enabled the delivery of high capacity, high performing devices: performance that only a few years ago would have taken a whole refrigerator-sized array to produce can now be delivered by a device that you can hold in the palm of your hand. Another aspect on the storage side is that traditional storage is unable to keep up with the demands of applications and IT. This resulted in a new approach to storage and infrastructure: HCI.

What is required from your storage platform?

I have met many customers at various meetings and events, and depending on who you talk to in the organisation, you will get a different answer to that question:

  • Application Owner – Performance and Scalability
    They need to deliver an application that performs well and also scales, so the storage has to be able to offer both.
  • Infrastructure Owner – Simplicity and Reliability
    They need the platform to be simple to deploy and simple to manage, but also reliable. They don’t want to be getting calls in the middle of the night from the Application Owner!
  • CFO / Finance Team – Lower Cost and Operational Efficiency
    There’s always somebody looking at the numbers, and it’s usually this side of the organisation. Reducing TCO, CAPEX and OPEX and making IT more cost effective is the biggest driver here.

Everyone is aiming for that sweet spot where all three circles converge. The only problem is that with traditional infrastructure you can never satisfy all three of the above requirements. There’s usually one requirement that has to be sacrificed, and it’s usually the Finance Team or CFO who has to back down in order to deliver the requirements of the Application Owner and the Infrastructure Owner. This is where HCI is different: it brings everyone to that central convergence and meets all of the requirements, so now everyone is happy. Let’s take a closer look at how HCI powered by vSAN meets these requirements.

vSAN HCI delivers an architecture that not only delivers on performance but also scales simply by adding more nodes or more storage, and it allows for linear scaling of performance. This means that as your IT or business applications scale and demand more capacity or performance, it is easily delivered in whatever increments meet the requirements at that point in time.

vSAN HCI allows the infrastructure team to deploy and manage environments through a simple management plane in a single interface. No separate management tools are required, which means no extensive retraining of staff. Reliability and resiliency are built in, with the ability to protect from the disk level all the way up to the site level.

We’ve already talked about how HCI offers a building block approach. This means environments can be built to meet your requirements now and grown as and when required. Because the management plane is much simpler, operational efficiencies come into play as well, offering a more streamlined approach to IT.

At this point we have met all of the criteria set by the three key stakeholders, but the benefits of HCI don’t stop there. There are other positive impacts that HCI brings to your organisation:

vSAN HCI offers a much wider choice of hardware, with different hardware vendors to choose from, as well as a range of deployment options. This gives organisations a lot more flexibility in how they adopt HCI, and puts newer hardware technology at their fingertips. The hardware choices include:

  • vSAN Ready nodes from all major server OEM vendors to suit all performance and capacity requirements
  • Turnkey appliance solution from Dell EMC, which is VxRail
  • VMware Cloud Foundation which incorporates a full SDDC Stack

For deployment options, vSAN HCI offers the following:

  • Standard clusters up to 64 Nodes
  • Remote Office / Branch Office (ROBO) Solutions for customers with multiple sites
  • Stretched Cluster Solutions
  • Disaster Recovery Solutions
  • Rack Level Solutions
  • Same Site “Server Room” configurations

vSAN HCI allows organisations to become more agile by allowing faster deployments and faster procurement, and by giving more control back to the business, which in a competitive world is a key enabler of success.

As you can see, no matter what the size of your IT infrastructure, HCI brings a wealth of benefits. From large scale data center deployments to multi-site ROBO deployments, there’s a perfect fit for HCI.

Sizing for your workloads

When sizing a vSAN environment there are many considerations to take into account. With the launch of the new vSAN Sizing tool, I thought I would take the time to write up the questions I commonly ask people in order to understand what they want to run on vSAN, as well as the scope of requirements for that workload.

Obviously capacity is going to be our baseline for any sizing activity. No matter what we achieve with the other requirements, we have to meet a usable capacity. Remember that we should always work from a usable capacity for any sizing: a RAW capacity does not take into account the Failure Tolerance Method, Erasure Coding or Dedupe/Compression, which is something we will cover a bit later in this article.

Capacity should also include the required Swap File space for each of the VMs that the environment is being scoped for.

I have been involved in many discussions where the performance requirements are totally unknown. So many times I have been told “We want the fastest performance possible” without being told what the current IOPS requirement is. To put it into context, what is the point in buying a 200mph sports car when the requirement is to drive at 70mph max?

The IOPS requirement plays a key part in determining what level of vSAN Ready Node specification is required. For example, if the total requirement is 300,000 IOPS from a 10 Node cluster, which works out at 30,000 IOPS per node, is there much point spending more money on an All-Flash configuration that delivers 150,000 IOPS per node? Simple answer… No! You could opt for a lower vSAN All-Flash Ready Node config that meets the requirements a lot more closely and still offers room for expansion in the future.
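
To make the arithmetic in that example explicit, here is a minimal sketch. The function names are hypothetical, and the calculation deliberately ignores things a real sizing would include, such as headroom for a host failure or maintenance mode.

```python
def required_iops_per_node(total_iops, nodes):
    """Spread the total IOPS requirement evenly across the cluster."""
    if nodes <= 0:
        raise ValueError("a cluster needs at least one node")
    return total_iops / nodes


def overprovision_factor(node_capability_iops, total_iops, nodes):
    """How many times over a node spec exceeds what each node must deliver."""
    return node_capability_iops / required_iops_per_node(total_iops, nodes)


if __name__ == "__main__":
    per_node = required_iops_per_node(300_000, 10)
    factor = overprovision_factor(150_000, 300_000, 10)
    print(f"Required per node: {per_node:,.0f} IOPS")
    print(f"A 150,000 IOPS node is {factor:.1f}x what each node needs")
```

A factor well above 1 still leaves room for expansion; a factor of several times over is the 200mph sports car.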

Workload Type
This is a pretty important requirement. For example, if your workload is write intensive, this would change the cache requirements, and it may also call for a more write intensive flash technology such as NVMe. If you have different workload types going onto the same cluster, it would be worthwhile categorizing those workloads into four categories:

  • 70/30 Read/Write
  • 80/20 Read/Write
  • 90/10 Read/Write
  • 50/50 Read/Write

Having the VMs in categories will allow you to specify the workload types in the sizing tool (in the advanced options).
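
If you already know the measured read percentage of each VM, placing it into one of the four categories above is easy to script. This is a rough sketch: the nearest-bucket rule and the function name are my own assumptions, not how the sizing tool classifies workloads.

```python
# The four read/write categories from the list above, keyed by read percentage.
CATEGORIES = {50: "50/50", 70: "70/30", 80: "80/20", 90: "90/10"}


def categorise(read_pct):
    """Map a measured read percentage to the nearest of the four categories.

    Ties resolve to the lower read percentage, i.e. the more write heavy
    category, which is the more conservative choice for cache sizing.
    """
    if not 0 <= read_pct <= 100:
        raise ValueError("read percentage must be between 0 and 100")
    nearest = min(CATEGORIES, key=lambda reads: abs(reads - read_pct))
    return CATEGORIES[nearest] + " Read/Write"


if __name__ == "__main__":
    for vm, reads in [("sql01", 62), ("web01", 88), ("vdi01", 45)]:
        print(vm, "->", categorise(reads))
```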

vCPU to Physical Core count
This is something that gets overlooked, not from a requirement perspective, but because people are so used to sizing on a “VMs per host” basis, a model that no longer fits with increasing CPU core counts. Even the new sizing tool bases its calculations on the vCPU to Physical Core ratio, which makes things a lot easier. Most customers I talk to who are refreshing servers that have 12 or 14 core processors can lower the number of servers required by increasing the core count on the new servers, thus running more vCPUs on a single host.
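
A quick way to see the effect of core count on host count is sketched below. The 2,000 vCPU total, the 4:1 ratio, the dual-socket hosts and the 20 core replacement processor are illustrative assumptions, and the result covers CPU only, with no allowance for failure or maintenance headroom.

```python
import math


def hosts_required(total_vcpus, vcpu_per_core, cores_per_socket, sockets_per_host=2):
    """Number of hosts needed to satisfy a vCPU to physical core ratio."""
    cores_needed = math.ceil(total_vcpus / vcpu_per_core)
    cores_per_host = cores_per_socket * sockets_per_host
    return math.ceil(cores_needed / cores_per_host)


if __name__ == "__main__":
    for cores in (12, 14, 20):
        hosts = hosts_required(total_vcpus=2000, vcpu_per_core=4, cores_per_socket=cores)
        print(f"{cores} cores per socket -> {hosts} hosts")
```

With these example numbers, moving from 12 core to 20 core processors takes the host count from 21 down to 13 for the same vCPU demand.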

List of questions for requirements for each workload type

  • Average VMDK Size per VM
  • Average number of VMDKs per VM
  • Average number of vCPU per VM
  • Average vRAM per VM
  • Average IOPS Requirement per VM
  • Number of VMs
  • vCPU to Physical Core Ratio
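
The answers to those questions can be captured in a small structure and rolled up into the totals a sizing exercise starts from. This is a sketch under stated assumptions: the class and field names are hypothetical, and swap is taken to be equal to vRAM, which only holds when the VMs have no memory reservation.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Answers to the sizing questions for one workload type."""
    name: str
    vm_count: int
    vmdks_per_vm: float
    vmdk_size_gb: float
    vcpus_per_vm: float
    vram_gb_per_vm: float
    iops_per_vm: float
    vcpu_per_core: float

    def usable_capacity_gb(self):
        # VMDK space plus swap; swap assumed equal to vRAM (no reservation).
        per_vm = self.vmdks_per_vm * self.vmdk_size_gb + self.vram_gb_per_vm
        return self.vm_count * per_vm

    def total_iops(self):
        return self.vm_count * self.iops_per_vm

    def physical_cores(self):
        return self.vm_count * self.vcpus_per_vm / self.vcpu_per_core


if __name__ == "__main__":
    general = WorkloadProfile("general", 100, 2, 100, 4, 16, 150, 4)
    print(general.usable_capacity_gb(), "GB usable (including swap)")
    print(general.total_iops(), "IOPS")
    print(general.physical_cores(), "physical cores")
```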

RAW Capacity versus Usable Capacity, how much do I actually need?
The new sizing tool takes all your requirements into account, even the RAID levels, Dedupe/Compression ratios and so on, and returns a RAW Capacity requirement based on the data you enter. If you are like me and prefer to do it quick and dirty, below is a table showing how to work it out based on a requirement of 100TB usable (including Swap File space). For a standard cluster with no stretched capabilities it looks like this:

FTT Level | FTT Method | Min number of hosts | Multiplication Factor | RAW Capacity Based on 100TB Usable
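
The quick and dirty calculation is just usable capacity times a multiplication factor. The sketch below uses the commonly documented vSAN overheads (RAID-1 keeps FTT+1 full copies, RAID-5 is 3+1, RAID-6 is 4+2) and the matching minimum host counts; treat these as my assumptions rather than a reproduction of the table, and note that dedupe/compression savings and slack space are not included.

```python
# (FTT level, FTT method, minimum hosts, multiplication factor)
POLICIES = [
    (1, "RAID-1", 3, 2.0),
    (1, "RAID-5", 4, 4 / 3),
    (2, "RAID-1", 5, 3.0),
    (2, "RAID-6", 6, 1.5),
    (3, "RAID-1", 7, 4.0),
]


def raw_capacity_tb(usable_tb, multiplier):
    """RAW capacity needed to provide the given usable capacity."""
    return usable_tb * multiplier


if __name__ == "__main__":
    for ftt, method, min_hosts, mult in POLICIES:
        raw = raw_capacity_tb(100, mult)
        print(f"FTT={ftt} {method}: min hosts {min_hosts}, x{mult:.2f}, {raw:.0f} TB RAW")
```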

In vSAN 6.6 VMware introduced localized protection (Secondary FTT) and the ability to include or exclude specific objects from the stretched cluster (Primary FTT). Below is a table showing the RAW Capacity requirements based on the two FTT levels:

Primary FTT Level | Secondary FTT Level | Secondary FTT Method | Min Number of hosts per site | RAW Capacity Based on 100TB Usable
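
For a stretched cluster the same idea applies twice: the Primary FTT decides whether the data lives on one site or both, and the Secondary FTT multiplier applies within each site. This is a sketch assuming the commonly documented overheads and per-site minimum host counts; the names are hypothetical and the figures are not taken from the original table.

```python
# Secondary (local) protection: label -> (min hosts per site, multiplier)
SECONDARY = {
    "FTT=0": (1, 1.0),
    "FTT=1 RAID-1": (3, 2.0),
    "FTT=1 RAID-5": (4, 4 / 3),
    "FTT=2 RAID-1": (5, 3.0),
    "FTT=2 RAID-6": (6, 1.5),
}


def stretched_raw_tb(usable_tb, primary_ftt, secondary_multiplier):
    """RAW capacity across the whole stretched cluster."""
    if primary_ftt not in (0, 1):
        raise ValueError("Primary FTT is 0 (one site) or 1 (mirrored across sites)")
    site_copies = primary_ftt + 1
    return usable_tb * site_copies * secondary_multiplier


if __name__ == "__main__":
    for pftt in (0, 1):
        for label, (min_hosts, mult) in SECONDARY.items():
            raw = stretched_raw_tb(100, pftt, mult)
            print(f"PFTT={pftt}, secondary {label}: "
                  f"min {min_hosts} hosts per site, {raw:.0f} TB RAW")
```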

Mixed FTT levels and FTT Methods
Because vSAN is truly a Software Defined Storage platform, you can have a mixture of VMs/objects with varying levels of protection and FTT Methods. For example, for read intensive workloads you may choose RAID 5 in the storage policy, and for more write intensive workloads a RAID 1 policy. They can all co-exist perfectly well on the same vSAN Cluster/Datastore, and the new sizing tool allows you to specify different protection levels and methods for each workload type.
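
When policies are mixed, the RAW requirement is simply the sum of each workload's usable capacity times its own multiplier. A minimal sketch, with made-up workload sizes and the commonly documented overhead factors as assumptions:

```python
# (FTT method, FTT level) -> multiplication factor
MULTIPLIER = {
    ("RAID-1", 1): 2.0,
    ("RAID-5", 1): 4 / 3,
    ("RAID-1", 2): 3.0,
    ("RAID-6", 2): 1.5,
}


def total_raw_tb(workloads):
    """Sum RAW capacity over (name, usable TB, method, FTT) workloads."""
    return sum(usable * MULTIPLIER[(method, ftt)]
               for _name, usable, method, ftt in workloads)


if __name__ == "__main__":
    mixed = [
        ("read intensive", 60, "RAID-5", 1),
        ("write intensive", 40, "RAID-1", 1),
    ]
    print(f"{total_raw_tb(mixed):.0f} TB RAW for 100 TB usable")
```

In this example the mixed cluster needs noticeably less RAW capacity than putting everything on RAID 1, which is the practical benefit of setting protection per workload.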

