
Alternative to VDS Health Check

If you have a large environment or a lot of VLANs, you know that running with VDS Health Check enabled is not an option. This is due to the way VDS Health Check works: it sends out one packet per uplink per VLAN. If you have two uplinks and 20 VLANs, it sends out a total of 40 packets for that one host. That is not so bad, but now scale it up to the kind of organizations that usually have and need the distributed switch, where the number of VLANs is likely to be triple digits or more. Let's say they have 1,000 VLANs times two NICs. Again, it does not sound that bad, but it is! It also means that 2,000 MAC addresses get added to the switch, which means that your switch MAC address table is probably full by now. That in turn means that the MAC addresses of your production workloads (VMs) get purged, so ARP requests have to be made, and frames flooded, before the switch finds out where to send the packets again. This leads to a lot of flooding on the network. Which is bad.
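The arithmetic above is easy to sketch. The snippet below only illustrates the one-packet-per-uplink-per-VLAN rule described here. The function name and the `hosts` parameter are made up for this example, and multiplying by `hosts` assumes every host has the same uplinks and VLANs.

```python
def health_check_probe_count(uplinks: int, vlans: int, hosts: int = 1) -> int:
    """Estimate the probe packets (and MAC table entries) VDS Health Check adds.

    Assumes one probe per uplink per VLAN per host, as described in the text.
    The hosts multiplier is an assumption of this sketch: it treats every
    host as having identical uplinks and VLANs.
    """
    if min(uplinks, vlans, hosts) < 0:
        raise ValueError("counts must be non-negative")
    return uplinks * vlans * hosts


if __name__ == "__main__":
    print(health_check_probe_count(uplinks=2, vlans=20))    # 40
    print(health_check_probe_count(uplinks=2, vlans=1000))  # 2000
```

Compare the result with the MAC address table size of your physical switches to get a feel for how quickly the table fills up.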

A real-world example of this is a company that had issues getting access to its internal test labs. The first time they hit the address of the environment, the query simply timed out. They waited 10 seconds and suddenly the service was available. Simply put, the MAC address of the destination was not in the switch MAC address table, and the switch had to flood to find it. This is not something you want to have!

So we do not want full MAC tables or flooding, but is there an alternative?

 

Passive collection

The aforementioned VDS Health Check is an active probing solution. It sends out packets on every configured VLAN over every available uplink to check whether there is network access. This is why you see the problem described above when enabling VDS Health Check and not in a normal production environment.

There is another solution, which passively monitors which VLANs have been used over a period of time. The namespace is “esxcli network nic vlan stats”. This namespace can be used to “set” or “get” data about VLAN usage on a per-host basis. Because the data is collected from actual traffic on each VLAN, this method is not as accurate as active probing and never will be. That said, it can give you an idea of whether or not the VLANs are accessible on your hosts. It is just highly dependent on how much network chatter there is on each VLAN.

In the worst case, it is a guideline that helps you determine where there might be issues worth looking into.

 

Before I go into detail about this command: I had a chat with VMware GSS to hear more about the possibilities of using this method for collecting data. They warned that this feature should not be left on for all eternity, but only for brief periods. There seemed to be no real reason for this advice, other than that it had not been tested, so it could potentially lead to memory leaks or instability. I have not noticed any issues whatsoever with using this method, but now you have been warned!

 

Alternative to VDS Health Check in action

I have created this script using PowerShell, which is one of my preferred programming languages. The equivalent of esxcli in PowerCLI terms is Get-EsxCli, which takes a host as a parameter and exposes the esxcli namespaces as properties on the object it returns. In this case it is “Get-EsxCli.(network.nic.vlan.stats.get)” or “Get-EsxCli.(network.nic.vlan.stats.set)” that is being used. The script itself is not much to talk about. It is rather simple, but with a few tests to check that the statistics settings have been turned off or on.

Also, to help reduce the time the statistics settings are enabled, there is a built-in timer. I have set it to 24 hours by default. You can test to see whether other settings work better for you.
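The idea of a bounded collection window can be sketched roughly as follows. This is not the script itself, just a minimal illustration in Python: `enable`, `disable` and `read` are hypothetical placeholders for whatever wraps the “set” and “get” calls, and reading the data before switching collection off is an assumption of this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def collect_for(
    enable: Callable[[], None],
    disable: Callable[[], None],
    read: Callable[[], T],
    seconds: float = 24 * 60 * 60,  # 24 hours, the default mentioned above
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Enable statistics, wait, read the result, and always disable again.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    enable()
    try:
        sleep(seconds)
        return read()
    finally:
        # Runs even if the wait or the read raises, so collection
        # is never left switched on by accident.
        disable()
```

The `try`/`finally` is the point of the sketch: given the warning about not leaving the feature on, switching it off should not depend on the rest of the run succeeding.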

There are comments in the script itself that explain what is going on or what went wrong. I will not go over these. In the end, the script outputs a list of VLANs that are in the distributed switch but were not found in the collected data. These are the VLANs and hosts that you can now investigate manually, together with the network team, to determine whether the uplinks are correctly configured.
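The final comparison boils down to a set difference per host. The snippet below is a minimal sketch of that idea in Python, not the script's actual code. The function name, the input shapes and the sample host names are all made up for illustration.

```python
from typing import Dict, Iterable, List, Set


def find_unseen_vlans(
    configured_vlans: Iterable[int],
    seen_by_host: Dict[str, Set[int]],
) -> Dict[str, List[int]]:
    """Return, per host, the configured VLAN IDs that never showed traffic."""
    configured = set(configured_vlans)
    report: Dict[str, List[int]] = {}
    for host, seen in seen_by_host.items():
        missing = sorted(configured - set(seen))
        if missing:  # only list hosts that have something to investigate
            report[host] = missing
    return report


if __name__ == "__main__":
    # Hypothetical sample data, not real output from any host.
    configured = [10, 20, 30]
    seen = {
        "esx01.example.local": {10, 20, 30},
        "esx02.example.local": {10, 30},
    }
    print(find_unseen_vlans(configured, seen))
    # {'esx02.example.local': [20]}
```

A VLAN showing up here only means no traffic was observed during the collection window. A quiet VLAN and a misconfigured uplink look the same, so treat the list as a starting point.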

 

The script is, as always, available on GitHub and can be seen below. Thank you and take care.

 
