Troubleshooting with vRops part 2: Summary and Symptoms

In the first part of this series on vRops, I talked about how vRops capabilities compare across the vRops editions. In this part I will start to dig into how you should approach a problem the vRops way. I myself have fallen victim to jumping all too quickly over to the “All Metrics” tab and throwing graphs up on screen to see if I can find the cause of the issue. The problem with that method is that it is highly inefficient. Like looking for a needle in a haystack, it can take quite some time, and the result may not get you any closer to finding the problem… Why are there haystacks with needles in them, anyway?

Continue reading to find out how summary and symptoms tabs can help you identify problems faster.

Summary

The summary tab is the first tab you see whenever you click on an object in vRops, be it vSphere World as seen below, a VM as I will show later, or any other object for that matter. Summary is where we start.

Above is a screenshot of the summary tab of vSphere World. The first thing to note is that all is green. That doesn’t mean there aren’t problems; it just means that vSphere World overall, with its subcomponents (VMs, datastores etc.), is mostly good and has no major errors that impact the environment as a whole. Also note that there are no alerts for the object itself. At the bottom of the picture, all the errors and faults of the subcomponents are listed. So you see just how quickly you can get a better understanding of the issues that may affect how the customer feels their VMs are running.

On a side note, these alerts are not only reactive. Some of them are proactive, like the warning that a datastore is running out of disk space. Catch the issues before they turn into incidents or problems.
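To make the idea of a proactive warning a bit more concrete, here is a minimal sketch of how a "running out of disk space" estimate could be produced from daily usage samples. This is only an illustration under my own assumptions: a plain straight-line fit, made-up sample values, and a hypothetical function name. It is not how vRops calculates its projections, which I would expect to be more sophisticated.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until a datastore is full, given one usage sample per day.

    Fits a least-squares line through the samples and extends it to capacity.
    Returns None when usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope_den = sum((x - mean_x) ** 2 for x in range(n))
    growth_per_day = slope_num / slope_den
    if growth_per_day <= 0:
        return None
    return max(0.0, (capacity_gb - used_gb[-1]) / growth_per_day)


# Made-up samples: 10 GB of growth per day on a 1000 GB datastore.
samples = [900.0, 910.0, 920.0, 930.0, 940.0]
print(days_until_full(samples, capacity_gb=1000.0))  # prints 6.0
```

With a number like that in hand, a warning can fire while there are still days left to act, which is the whole point of a proactive alert.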

Symptoms

Where the summary tab shows alerts, which may or may not be useful to you, the symptoms tab shows the symptoms on which alerts can be triggered. Therefore the symptoms tab is a very powerful place to start troubleshooting a given issue. I believe I have dragged you along for long enough now. Better start troubleshooting, then.

Customer A has complained about a VM that doesn’t run well and where the OS GUI isn’t responding well either. All in all it’s a bit sluggish, the customer notes.

Let’s jump in. First, look at the summary tab, like I just told you was a good idea.

What can be seen here is that there is a problem with health and risk, but no alerts for some reason. Also note that just because a badge isn’t red doesn’t mean there can’t be a problem, just that the problem isn’t having that large an impact on the object.

As this part is about symptoms, let’s jump to the symptoms tab and see what is going on there.

As the list is quite large for this VM, I have split it up so I can better discuss the different parts. Looking at the screenshot above, it should be clear that there are quite a few errors regarding CPU and memory, and a single warning that a disk might run full in the near future.

Scrolling below the errors, we get to the informational entries listed above. The first thing to note here is the first three lines concerning CPU. These lines basically tell you that the problem is not IO related, not swap related, and that the host doesn’t have a problem scheduling this VM’s demand for CPU. Meaning we just ruled out IO, swap and host related issues.
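The ruling-out step can be sketched as a small piece of elimination logic. The cause names below are hypothetical labels I made up for this example; they are not vRops symptom or metric identifiers, and the True/False values simply mirror what I read off the symptoms list.

```python
from typing import Dict, List


def remaining_suspects(ruled_out: Dict[str, bool]) -> List[str]:
    """Return the candidate causes that the symptoms have not ruled out."""
    return [cause for cause, cleared in ruled_out.items() if not cleared]


# Hypothetical labels; True means the symptoms tab cleared that cause.
vm_symptoms = {
    "io_wait": True,           # CPU is not waiting on IO
    "swap_wait": True,         # CPU is not waiting on swap
    "host_contention": True,   # host schedules the VM's CPU demand fine
    "vm_cpu_demand": False,    # CPU errors are still unexplained
    "vm_memory_demand": False, # memory errors are still unexplained
}
print(remaining_suspects(vm_symptoms))  # prints ['vm_cpu_demand', 'vm_memory_demand']
```

Nothing clever going on here, but it shows why those three informational lines matter: every cause they clear is one less place to go digging for graphs.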

I’m just going to skip the recommendations for now and will get back to them in a moment. The CPU and memory “time remaining” lines indicate that these resources are running low, which brings me back to the recommendation lines (skipping the “more than 1 vCPU” line altogether).

As can be seen above, there are a few more columns to look at; the last one is the one I am going to focus on. Note the two lines with the recommended CPU and memory. They clearly state how much vCPU and memory the VM should have. Since we have already learned that it is a CPU/memory demand issue and the host doesn’t seem to be to blame, we have a solution: add the needed resources and the VM should start behaving again.
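Turning the recommendation into a change request is simple arithmetic, roughly as sketched here. The `Sizing` type, the function name and the numbers are placeholders of my own; they are not the values from the screenshot and not anything vRops exposes under these names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    vcpus: int
    memory_gb: float


def resources_to_add(configured: Sizing, recommended: Sizing) -> Sizing:
    """How much to add to reach the recommendation; never negative."""
    return Sizing(
        vcpus=max(0, recommended.vcpus - configured.vcpus),
        memory_gb=max(0.0, recommended.memory_gb - configured.memory_gb),
    )


# Placeholder numbers, not the values from the screenshot.
delta = resources_to_add(Sizing(vcpus=2, memory_gb=4.0), Sizing(vcpus=4, memory_gb=6.0))
print(delta)  # prints Sizing(vcpus=2, memory_gb=2.0)
```

The clamp at zero is deliberate: this sketch only covers adding resources, so a VM that is already above the recommendation gets a delta of zero rather than a suggestion to shrink it.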

So there you have it – We solved the problem. Just to make sure I’m right – Let’s verify the numbers.

So I pulled some more numbers (only on CPU). Run through them, but I’m sure they agree with me: vRops sees a need for speed.

Summary and Symptoms wrap up

Summary is the quick overview, and symptoms is where you find the needle in the haystack. Most times summary should give you a hint as to what problems the VM is having, but that might not always be the case, or an alert might not be triggered on the particular issue. vRops is a big step forward and has a lot of alerts and warnings that vCops didn’t come with. That said, symptoms is another great way to dig a bit deeper to find the root cause of the issue you are looking to solve.

Next up is Object Level Metric – A fast way to get an overview of how an object is behaving.

Next: Part 3: Object Level Metric

3 thoughts on “Troubleshooting with vRops part 2: Summary and Symptoms”

Michael Monberg says:

20 November, 2015 at 09:54

Nice approach! You are so right about always diving straight into All metrics…
I will surely use this in my next presentation.

Michael Ryom