Tag Archives: Performance

Day-2 Operations – vSphere built in vROPS dashboards

At VMworld I ran a few sessions on Day-2 Operations in which I also covered the new built-in dashboards for vROPS, which were introduced with the fully baked HTML5 client in vSAN/vSphere 6.7.  Many people were not aware of the dashboards, and even fewer were aware that the vSAN-specific dashboards continue to work even after the 90-day trial period has expired.  Not only that, but VMware has done a great job of automating the deployment of vROPS if you do not already have an appliance deployed.  So let’s take a look at these new dashboards in a bit more detail.

Firstly, there are the three vSphere operations dashboards. The default one that loads is an overview dashboard of all your clusters:

Then there is the Cluster Level View where you pick a specific cluster:

And finally there is the Alerts view… Great timing, I have a physical disk failure in the vSAN Cluster:

Now, the above three dashboards will no longer be available after the 90-day trial period has expired, and the link to the vRealize Operations appliance will not be functional either. However, the following three vSAN dashboards will still be fully functional and available after the trial has expired, so let’s look at those in a bit more detail:

The vSAN Overview dashboard, like the vSphere overview dashboard, displays information at a holistic level for all of your vSAN clusters within this particular vCenter Server. You will see that the dashboard provides information on, for example, how many clusters are running dedupe/compression or how many of the clusters are Stretched Clusters.  The dashboard also shows if you need to investigate any current alerts (yes, I cleaned up the failed disk before grabbing this screenshot).

In the next dashboard we choose a specific cluster to look at in more detail:
In this dashboard we see information pertaining to a specific cluster. We can see that I have 6 critical alerts, which we will take a look at next, but there are some key metrics here that are pretty important from a day-2 operations standpoint:

  • Remaining Capacity
  • Component Limit
  • IOPS, Throughput and Latency statistics
  • Read versus Write latency

The last dashboard is Alerts:
Here we can see the current alerts that have been triggered for each cluster and which may need to be addressed. The critical alerts previously highlighted in the cluster view were all related to network redundancy being lost while I was troubleshooting packet loss on the physical switch.

So as you can see, there’s a really good amount of detail from vROPS in the vSphere UI, making day-2 operations a lot easier to perform.


Day-2 Operations – Performance Monitoring in the vSAN UI

Performance reporting, or performance monitoring, is something of a must in any storage environment today. I remember many times when I was in VMware Support facing a customer storage performance issue where the metrics were not there to capture the event, because most storage performance tools required enabling. Obviously this meant that the issue had to be occurring at the time the performance metrics were grabbed. vSAN in the early days was no different: vSAN Observer, whilst being a detailed tool that provided a lot of information, was not a historical tool; it was enabled to troubleshoot a performance issue that was happening at that particular time.

In the later releases of vSAN the UI came equipped with more performance metrics than you could shake a stick at, which from a performance troubleshooting and monitoring perspective is the dog’s danglies, but what does this mean from an everyday perspective?  Before we take a look at the UI, there are three levels at which vSAN performance metrics can be displayed:

  • Cluster Level – These are the performance metrics aggregated for the whole cluster, giving you a high-level view of how your cluster is performing as a whole.
  • Host Level – This allows you to look at how vSAN is performing on a host-by-host basis and contains further information, drilling down through things like Disk Groups, Physical Disks, Network Controllers and VMkernel interfaces.
  • VM Level – This focuses on a specific virtual machine and the objects associated with it.

So what information do we have exactly for a given observation level?  Well, let’s first of all take a look at the cluster level performance information. There are three options under the performance tab for vSAN:

  • vSAN – Virtual Machine Consumption
  • vSAN – Backend
  • vSAN – iSCSI

I have met many customers who immediately notice there is a big difference between the Virtual Machine Consumption and the Backend graphs, so before we go any further let’s talk about what each of these specific areas means.

Virtual Machine Consumption
These graphs represent the values that objects residing on the vSAN Datastore are seeing. Remember, everything that exists on vSAN is an object, so the consumers counted in these graphs are Virtual Machines, Stats Objects, etc.

Backend
These graphs represent the backend disks associated with vSAN, both cache and capacity.

Both sets of graphs cover the statistics for:

  • IOPS
  • Throughput
  • Latency
  • Congestion
  • Outstanding I/O
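One simplified way to picture why the Backend numbers can be higher than the Virtual Machine Consumption numbers: with a mirrored (RAID-1) storage policy, every guest write has to be written to each replica. The Python sketch below assumes a policy with two replicas and ignores everything else that generates backend IO (resync traffic, for example), so treat it as an illustration only and not as how vSAN calculates its graphs. The function name and the numbers are hypothetical.

```python
def mirrored_backend_write_iops(guest_write_iops, replicas=2):
    """Estimate backend write IOPS for a mirrored object.

    Assumption: each guest write is sent to every replica, and no
    other backend activity (resync, metadata, etc.) is counted.
    """
    return guest_write_iops * replicas


# Hypothetical example: guests issue 1,000 write IOPS against
# objects that are mirrored across two replicas.
guest_writes = 1000
backend_writes = mirrored_backend_write_iops(guest_writes, replicas=2)

print(f"Guest write IOPS:   {guest_writes}")
print(f"Backend write IOPS: {backend_writes}")
```

Under these assumptions the backend sees twice the write IOPS the guests generate, which is one reason the two sets of graphs do not line up.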

iSCSI
The iSCSI performance graphs contain all the graphs above with the exception of Congestion. These graphs relate to each iSCSI Target/LUN created, and each one is selected in turn to review its associated performance graphs.

If we move our focus to the host level, we have a number of options in addition to the three we also see at the cluster level. There are also some additional metrics at the host level for Virtual Machine Consumption and Backend. In the Virtual Machine Consumption graphs we have Local Client Cache Hit IOPS and Local Client Cache Hit Rate.

And under Backend we also have some additional graphs for Resync IOPS, Resync Throughput and Resync Latency.

The resync metrics are extremely important if vSAN is recovering from a failure of some sort and performing a resync of degraded components. They are also important if you are performing a proactive rebalance, a policy change or a full data migration during host or disk/disk group evacuation.

The other options listed under host vSAN performance are:

  • Disk Group – Shows the performance graphs for the disk groups. I will cover this below, as this is one of the most interesting sets of metrics with a lot of detail.
  • Disk – Shows the physical disks in the host, reporting on IOPS, Throughput and Latency
  • Physical Adapters – Shows the network stats for each vmnic associated with vSAN. Stats include Packet Loss Rate, which is good for troubleshooting networking issues
  • VMkernel Adapters – Shows the statistics for each VMkernel interface configured for vSAN. This also includes a Packet Loss Rate, which you can then use to troubleshoot the software network stack
  • VMkernel Adapters Aggregation – This is an aggregation of all VMkernel interfaces being used for vSAN on the host

Now let’s go back to the Disk Group performance graphs. As I said earlier, this is a very interesting group of metrics to explore, so what do we have in this group?  The first section is all about Frontend (Guest) IOPS, Throughput and Latency.

The frontend statistics, as you may have guessed already, relate to vSAN object I/O being generated by guests running within the vSAN cluster.  If we scroll down a little further we can see statistics relating to Overhead IO, Read Cache Hit Rate (for hybrid) and Evictions:

Further down we have statistics relating to the vSAN Write Buffer and De-stage rate, clearly showing how much of the write buffer is free and also how quickly data is being de-staged from cache to capacity. We also have resync metrics under disk groups, but these differ slightly from the cluster-wide Backend statistics. In the disk group graphs we actually have values that represent various aspects of the resync operations, and the graph differentiates between:

  • Policy Changes
  • Repairs
  • Rebalance

So you can easily distinguish what resync operations are happening by the statistics within the disk group stats.

Collection Interval
The vSAN performance metrics are sampled every five minutes, and each sample is an average over that five-minute period. If your cluster is hardly doing anything (like my cluster for the screenshots) then this can throw out some of the latency numbers; in my own cluster I have noticed it shows higher latency when doing practically nothing than it does when I start putting load on the cluster.  I have spoken to many customers about this and it is no cause for concern. It just means that during the collection interval a few “large” IO operations returned a higher latency, and because of the low number of samples this skews the average.
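To see why a handful of slow operations can dominate the average on a nearly idle cluster, here is a minimal Python sketch. The latency values and IO counts are made-up illustrative numbers, not measurements from vSAN, and `average_latency_ms` is a hypothetical helper rather than anything from the vSAN tooling.

```python
def average_latency_ms(latencies_ms):
    """Return the plain arithmetic mean of a list of per-IO latencies."""
    return sum(latencies_ms) / len(latencies_ms)


# Hypothetical nearly idle interval: only 10 IOs completed,
# 8 fast ones at 1 ms and 2 "large" ones at 50 ms.
idle_interval = [1.0] * 8 + [50.0] * 2

# Hypothetical busy interval: 10,000 IOs completed, with the
# same 2 slow IOs hidden among 9,998 fast ones.
busy_interval = [1.0] * 9998 + [50.0] * 2

idle_avg = average_latency_ms(idle_interval)
busy_avg = average_latency_ms(busy_interval)

print(f"Idle interval average latency: {idle_avg:.2f} ms")
print(f"Busy interval average latency: {busy_avg:.2f} ms")
```

With these example numbers the idle interval averages 10.8 ms while the busy interval averages roughly 1.01 ms, even though both contain exactly the same two slow operations.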

Up next will be Day-2 Operations, Performance Monitoring with vROPS, I just have to write it first 🙂

Why HCI Matters in the Datacenter and Beyond

Technology is changing and evolving at an ever-increasing pace. Whether you are a consumer of electronics or the CEO of a large organisation with a large IT infrastructure, the changes in technology affect us all in different ways.  Examples of this are CPUs and Flash Storage: we’re now in an era of constantly increasing CPU core densities, and Flash Storage is becoming bigger and faster. These technology transformations are changing the way we operate not only as human beings in our own personal IT bubbles at home, but also within organisations.

As organisations large and small take on the whole business transformation, a key element of that transformation is their IT. Whilst the last 15 years were more IT centric, focused around traditional applications and the wide adoption of the internet, the next 15 years pose some challenges as IT becomes more business centric, along with cloud applications and the Internet of Everything.

A key enabler of the whole IT transformation is the Software Defined Data Center. Many of you will have heard me talk about the Software Defined Data Center not as an object, but more as an Operating System that runs your IT infrastructure. If you are asked which three things are required to run an operating system, you’ll find yourself answering storage, compute and networking for connectivity, which are essentially the three key elements that make up the Software Defined Data Center.

Hyper-Converged Infrastructure allows you to deliver capabilities that underpin the whole Software Defined Data Center based on a standard x86 architecture, and it offers a building block approach. It also brings the storage closer to the CPU and memory, which in a virtualised environment is highly beneficial, and it is VM centric rather than storage centric.

So why is HCI being adopted by the masses?

There are a number of reasons for this. We’ve already outlined the fact that having the storage closer to the compute delivers a much more efficient platform, but outside of that there is a Hardware Evolution which is driving the changes in infrastructure, very much like an Infrastructure Revolution.

Higher CPU core densities mean you can run much denser workloads. In conjunction with this, RAM has become much more commoditized, affordable and available in larger capacities.  From a storage aspect, Flash has evolved in such a way that it has enabled the delivery of high-capacity, high-performing devices: what only a few years ago would have taken a whole refrigerator-sized array to produce can now be delivered by a device that you can hold in the palm of your hand.  Another aspect from the storage side of things is that traditional storage is unable to keep up with the demands of applications and IT, which resulted in a new approach to storage and infrastructure… HCI.


What is required from your storage platform?

I have met with many customers in various meetings and at events, and depending on who you talk to in the organisation you will get a different answer to that question:

  • Application Owner – Performance and Scalability
    They need to deliver an application that performs well and also scales, so the storage has to be able to offer this.
  • Infrastructure Owner – Simplicity and Reliability
    They need the platform to be simple to deploy and simple to manage, but also reliable. They don’t want to be getting calls in the middle of the night from the Application Owner!
  • CFO / Finance Team – Lower Cost and Operational Efficiency
    There’s always somebody looking at the numbers, and it’s usually this side of the organisation. Reducing TCO, CAPEX and OPEX and making IT more cost effective is the biggest driver here.

Everyone is aiming for that sweet spot where all three circles converge. The only problem is that with traditional infrastructure you can never satisfy all three of the above requirements. There’s usually one requirement that has to be sacrificed, and it’s usually the Finance Team or CFO that has to back down in order to deliver the requirements of the Application Owner and the Infrastructure Owner.  This is where HCI is different: HCI brings everyone to that central convergence and meets all of the requirements, so now everyone is happy. Let’s take a closer look at how HCI powered by vSAN meets these requirements.


vSAN HCI provides an architecture that not only delivers on performance but is also scalable, simply by adding more nodes or by adding more storage, and it allows for linear scaling of performance.  This means that as your IT or business applications scale and demand more capacity or performance, this is easily delivered in whatever increments meet the requirements at that point in time.


vSAN HCI allows the infrastructure team to deploy and manage environments from a simple management plane in a single interface. No separate management tools are required, which means there’s no extensive retraining of staff.  Reliability and resiliency are built in, with the ability to protect from disk level all the way up to site level.


We’ve already talked about how HCI offers a building block approach. This means environments can be built to meet your requirements now and be grown as and when required.   Because there’s a much simpler management plane, operational efficiencies come into play as well, offering a more streamlined approach to IT.

At this point we have met all of the criteria set by the three key stakeholders, but the benefits of HCI don’t stop there. There are other positive impacts that HCI brings to your organisation:


vSAN HCI offers a much wider choice in the hardware that can be used, along with different hardware vendors to choose from. There is also a range of different deployment options, which gives organisations a lot more flexibility in how they adopt HCI, as well as having choices for newer hardware technology at their fingertips. This includes:

  • vSAN Ready nodes from all major server OEM vendors to suit all performance and capacity requirements
  • Turnkey appliance solution from Dell EMC, which is VxRail
  • VMware Cloud Foundation which incorporates a full SDDC Stack

For deployment options, vSAN HCI offers the following:

  • Standard clusters up to 64 Nodes
  • Remote Office / Branch Office (ROBO) Solutions for customers with multiple sites
  • Stretched Cluster Solutions
  • Disaster Recovery Solutions
  • Rack Level Solutions
  • Same Site “Server Room” configurations


vSAN HCI allows organisations to become more agile by allowing faster deployments and faster procurement, and by giving more control back to the business, which in a competitive world is a key enabler of success.

As you can see, no matter what the size of your IT infrastructure is, HCI brings a wealth of benefits. From large-scale data center deployments to multi-site ROBO deployments, there’s a perfect fit for HCI.