In the last blog post I showed you what object-level metrics are and how they can be used to quickly get an idea of how a VM is behaving. In this one I’m going to focus on which metrics can be used to get a better understanding of how a VM is behaving.
One of the hardest parts of performance troubleshooting is knowing when you have good performance and when you have bad. Is 100% CPU usage for 2 hours a week bad? Is 10 ms storage latency bad? Maybe not. What if the block size is 1 MB? Is it good then? As always, the devil is in the details. In the end vRops is only software; it doesn’t know what demands your organization places on a specific VM.
- Part 1: vRops Capability
- Part 2: Summary and Symptoms
- Part 3: Object Level Metric
- Part 4: Standard Deviation
Good and bad metrics
So there is no such thing as a bad metric! But some are more useful than others, and it is more about how a metric is used than about which metric is used. If you know the demands of each and every VM in your datacenter, then good for you, just skip this part. I personally have never seen a datacenter, small or large, where there was a Service Level Agreement (SLA) for each VM describing what kind of service (CPU, memory, storage, etc.) the VM needs. Without an SLA, how would you know if it has good or bad performance?
There is one way to get an idea. It’s called predictability or consistency: how consistent is the workload? A common measure of this is the standard deviation, which tells you how much a workload deviates from its mean (average) value. I’m going to use it together with the max value, or peak value.
Standard deviation and max values
I’m not going to go in and explain standard deviation; for that there’s Google. What you should know is that it measures the typical deviation from the mean (average) value. If the mean is ten and the standard deviation is one, a typical sample sits about one away from the mean, so roughly between nine and eleven, although individual samples can land well outside that range. The standard deviation also shows us something else that is interesting, no matter what the mean value is: how consistent a workload is. Take the example above: a standard deviation of one tells us that the workload is very consistent. Of course you have to look at the context.
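To make the "mean ten, standard deviation one" example concrete, here is a minimal sketch in Python. The samples are made up for illustration and are not taken from vRops; the point is only to show that the band of mean plus or minus one standard deviation describes the typical spread, while the peak can still sit outside it.

```python
from statistics import fmean, pstdev

# Made-up CPU usage samples in percent (illustrative only, not from vRops).
samples = [8, 9, 10, 10, 10, 10, 10, 10, 11, 12]

avg = fmean(samples)   # arithmetic mean
sd = pstdev(samples)   # population standard deviation
low, high = avg - sd, avg + sd

# Share of samples that fall within one standard deviation of the mean.
inside = sum(low <= s <= high for s in samples) / len(samples)

print(f"mean={avg:.1f} stddev={sd:.1f} band={low:.1f}..{high:.1f}")
print(f"min={min(samples)} max={max(samples)} share inside band={inside:.0%}")
```

For these samples the mean is 10 and the standard deviation is 1, yet the samples range from 8 to 12. That is why I pair the standard deviation with the max value instead of reading it on its own.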
Just to give an example, I saw a cluster that had a datastore max latency of 5,700,000+ microseconds (more than 5 seconds), with a standard deviation of 200 ms. That’s a lot no matter how you look at it.
Let’s look at the table below, and I will give a few more examples of how to read these values when troubleshooting. Again, remember that I’m currently focusing on VMs here, but there’s nothing that says this can’t be done on datastores or datastore clusters, for example to see which datastores are heavy hitters.
I will focus on three VMs in the view below: win7-01a, PVMWEB_2 and PVMAPP_0. First a word about the data. I have selected CPU usage % twice and changed the “transformation” from “Last” to “Maximum” and “Standard Deviation”. The same has been done for CPU usage in MHz. Other metrics will follow; just remember it’s the transformation that changes the output.
Starting with win7-01a, this is not an easy one. Low peak and low standard deviation both point to a lightly utilized CPU. Besides that, there is not much to conclude about the VM’s CPU usage. If I’m being brave, I would say that the mean value is low as well and that the workload is very consistent (68.8 MHz). It might be a spiky workload, but that is not something you can really conclude on the basis of the data at hand here.
Next up is PVMWEB_2. Given the high max value and a standard deviation in the low end, this workload is very likely CPU bound at times, but not all of the time. For it to be CPU bound all of the time, the standard deviation would have to be lower.
Which brings me to PVMAPP_0. This one is clearly CPU bound and must be running at 100% most of the time, if not all the time.
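The way I read the three VMs above can be sketched as a small rule of thumb. This is only a rough sketch: the function name `read_cpu` and all thresholds (`high_max`, `low_max`, `tight_sd`) are hypothetical values I picked for illustration, not anything vRops defines, and you would have to tune them for your own environment.

```python
def read_cpu(max_pct, stddev_pct, high_max=90.0, low_max=30.0, tight_sd=2.0):
    """Rough reading of CPU usage % from its maximum and standard deviation.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if max_pct >= high_max and stddev_pct <= tight_sd:
        # High peak and almost no variation: probably pegged most of the time.
        return "likely CPU bound most of the time"
    if max_pct >= high_max:
        # High peak but some variation: hits the ceiling now and then.
        return "likely CPU bound at times"
    if max_pct <= low_max and stddev_pct <= tight_sd:
        return "low and consistent CPU usage"
    return "inconclusive, look at the raw chart"


print(read_cpu(100.0, 1.0))
print(read_cpu(98.0, 12.0))
print(read_cpu(8.0, 1.5))
```

One caveat: a single short spike in a long and otherwise idle window also gives a high max with a low standard deviation, so a reading like this is a starting point that still needs to be checked against the actual chart.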
So in order to validate what I just deduced from the numbers above, I pulled the CPU stats on the three VMs. I will let you decide how right or wrong I was…
Same as above, only now I have moved on to memory. I have used the metric Guest demand, which is a calculated metric; this one should be used instead of active memory when looking to rightsize. I’m not going to go into these numbers in detail: the bottom ones are very consistent, the top ones not so much. PVMWEB_1 looks to be memory bound.
Looking at storage latency, two things become immediately clear. One, this is a lab environment, and the bottom VMs are not really doing any IOPS (if latency had been sub-millisecond, vRops would have changed the numbers to microseconds and some data would have been visible). Two, write latency isn’t good. Read latency is fairly low and consistent across the VMs. For write latency this is not the case: very high max values and very high standard deviation values. Seeing that this is across VMs, it points to an underlying problem. It might be that all VMs are on the same datastore, or it might just be that the storage array can’t keep up.
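The "across VMs" reasoning can be sketched in a few lines of Python. Everything here is an assumption for illustration: the VM names, the latency numbers, the limits and the helper names `noisy_vms` and `looks_shared` are all made up, and the values would normally come from an export of the view.

```python
# Hypothetical per-VM write latency summary in ms: (maximum, standard deviation).
write_latency = {
    "vm-a": (180.0, 45.0),
    "vm-b": (220.0, 60.0),
    "vm-c": (15.0, 2.0),
    "vm-d": (205.0, 52.0),
}


def noisy_vms(stats, max_limit_ms=100.0, sd_limit_ms=20.0):
    """Return the VMs whose max and standard deviation both exceed the limits."""
    return sorted(
        vm for vm, (mx, sd) in stats.items()
        if mx > max_limit_ms and sd > sd_limit_ms
    )


def looks_shared(stats, min_share=0.5, **limits):
    """True when at least min_share of the VMs are noisy, hinting at a common cause."""
    return len(noisy_vms(stats, **limits)) / len(stats) >= min_share


print(noisy_vms(write_latency))
print(looks_shared(write_latency))
```

If only one VM shows up as noisy, I would look at that VM first. If most of them do, the datastore or the array is the more likely suspect.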
The data that I have shown you is not real-world data, but that doesn’t change the fact that using max value and standard deviation shows you very quickly how a VM is behaving on the given metric. I think we have all been asked the question “how well is this VM running?” or “does this VM have any problems?” or my favorite, “the VM is performing poorly, can you give it 16 CPUs?”
This simple view will help you answer the question without having to go in and look at a lot of metrics.
Before I end, just a side note. I haven’t used average values in these examples, because I believe they will cloud the picture more than they will help you when doing quick assessments like the view above.
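One way to see why the average on its own can mislead is to compare two workloads with the same mean. The latency samples below are made up for illustration; the sketch only shows that the mean is identical while max and standard deviation tell the two apart.

```python
from statistics import fmean, pstdev

# Made-up latency samples in ms (illustrative only).
steady = [10] * 10        # flat at 10 ms
spiky = [1] * 9 + [91]    # mostly 1 ms with a single 91 ms spike

for name, series in (("steady", steady), ("spiky", spiky)):
    print(
        f"{name}: mean={fmean(series):.1f} "
        f"max={max(series)} stddev={pstdev(series):.1f}"
    )
```

Both series average 10 ms, but only the second one has a peak of 91 ms and a standard deviation of 27 ms, which is exactly the kind of behavior a quick assessment should catch.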
I’m going to close this blog post with the full version of the view, shown below.
That is all for now. Let’s see if I find the time to continue this blog series another day…