
How I stopped worrying and love vShield Endpoint

Antivirus gone bad with a little help from vShield Endpoint

This was all the rage some years ago: offloading antivirus operation from the OS to the hypervisor with a little help from vShield Endpoint. To be honest, I really like the idea, but after what I went through I would think twice before doing it again. I’m not saying it’s a bad solution, but the way the vendor implemented their own product raises the question: why on earth? Don’t they even know their own solution? The story goes…

Like I said, offloading antivirus with vShield Endpoint and scanning all VMs at the host level, instead of having one antivirus per VM and having the dreaded AV storm occur, was all the rage at that time. Only one or two vendors offered integration with vShield Endpoint back then. A vendor was picked, let’s call them vendor A, and the vendor was tasked with the implementation. After a short test on a few hosts, vendor A rolled the AV solution out to the rest of the hosts. As I recall (it’s now over three years ago), the environment had around 40 hosts and 800 VMs.

Some time passed, a month or so, and suddenly hosts started disconnecting from vCenter. This was about the time I was hired, so getting it fixed became one of my first tasks. It wasn’t the only problem at the time, but it was the one that was prioritized.

GSS was called and a case was opened. GSS came back with the answer that a heap was full. This was the one used by the AV solution.

As time passed the problem got worse, and soon SSH and the DCUI weren’t available either.

Next up was opening a case with vendor A, who was responsible for the AV product and the solution as such. They came back fairly quickly with some settings that should have been changed in order to support more than 25 VMs on a host.

That didn’t really help us any further. I persistently pushed them for a way out of this issue, but all they came back with was: reboot the host (as most hosts were disconnected and SSH and the DCUI were dead, the only option was to power the host off and on again), reinstall the host, and clean up the vmx files to remove the VMs from vShield Endpoint’s protection and therefore also from the antivirus. As this meant downtime for lots of VMs, it was hardly a viable option.

But as this was our only option at the moment, I started looking into a way to clean up the vmx files. When a VM is protected by vShield Endpoint you will see it in the vmx file, where there will be multiple references to ‘vfile’. These lines needed to be removed.
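
As a rough illustration, a case-insensitive grep is enough to see whether a given vmx file carries those references. This is only a sketch: the sample file and its keys are made up for the demonstration and are not the real entries vShield Endpoint writes, and count_vfile_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count and list the lines in a .vmx file that mention 'vfile'.
# The sample .vmx content below is invented for illustration only.

count_vfile_lines() {
    # Prints the number of lines containing 'vfile' (case-insensitive).
    grep -ci 'vfile' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
displayName = "web01"
example.VFILE.option = "made-up value"
memsize = "4096"
example.filters = "vfile"
EOF

echo "vfile lines: $(count_vfile_lines "$sample")"
# Show the matching lines with their line numbers.
grep -in 'vfile' "$sample"
```

A count of zero means the file has nothing to clean.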

I ended up creating the shell script below, which I ran from the ESXi console.

!!!This script and this technique are not supported, use them at your own risk!!!

There is little error handling or checking in this script, which means that things did break when using it. Usually the vmx file got overwritten with an empty data stream. I used KB1023880 to recreate the vmx files. The common problem was locks on the vmx file. If the vmx file was locked when it was read, the variable would be populated with an empty data stream, which was then manipulated and written back to the vmx file, whose lock had by then been removed and which was ready to be populated with new data. In that case the vmx file would end up blank. Recovery to the rescue. At other times the lock on the vmx file wouldn’t be removed, meaning that the file could neither be read from nor written to.
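
One way to guard against the blank-file outcome is to filter into a temporary file and only overwrite the original once the filtered copy is known to have content. The following is a minimal sketch of that idea, not what I ran at the time. strip_vfile is a hypothetical helper, the sample vmx content is invented, and it has not been tested against locked files on ESXi.

```shell
#!/bin/sh
# Sketch: remove 'vfile' lines from a .vmx file without risking an empty result.
# strip_vfile is a hypothetical helper, not part of any VMware tooling.

strip_vfile() {
    vmx=$1
    backup=$2
    tmp="$vmx.tmp.$$"

    # Refuse to touch a file that cannot be read or is already empty.
    [ -r "$vmx" ] && [ -s "$vmx" ] || return 1

    # Keep a backup, then filter the backup into a temporary file.
    cp "$vmx" "$backup" || return 1
    grep -iv 'vfile' "$backup" > "$tmp"

    # Only overwrite the original if the filtered copy has content.
    if [ -s "$tmp" ]; then
        cat "$tmp" > "$vmx" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demonstration on a throwaway file with invented content.
work=$(mktemp -d)
cat > "$work/test.vmx" <<'EOF'
displayName = "web01"
example.vfile.option = "made-up value"
memsize = "4096"
EOF

if strip_vfile "$work/test.vmx" "$work/test.vmx.bak"; then
    echo "cleaned:"
    cat "$work/test.vmx"
else
    echo "left untouched: $work/test.vmx"
fi
```

If the read comes back empty, the helper returns a non-zero status and leaves the original alone, so the VM can simply be retried later.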

The Script

I’m not going to go into detail about what this script does, as there are comments in the script. First it uses “vim-cmd vmsvc/getallvms” to get a list of all VMs registered on the host, then it goes through the list one VM at a time, saves a copy of the vmx file, and overwrites the original vmx file with a version without the lines containing ‘vfile’.

#!/bin/sh
#DOCUMENTATION AND CORRECTION FOR SERVERS WITH VFILE PROBLEM
#EDIT $path TO WHERE YOU WANT TO SAVE THE VMX FILES AND THE SERVERLIST CSV
path=/tmp/VFILE
vmxpath=$path/vmx
date=$(date +%d-%m-%y-%H:%M:%S)
list=$path/serverlist_${date}.csv
#IF THE DIRS DO NOT EXIST, CREATE THEM
mkdir -p "$vmxpath"
#BUILD THE FULL PATH TO EACH VMX FILE FROM "[datastore] folder/name.vmx"
#ONLY DATASTORES WHOSE NAME STARTS WITH "L1" ARE KEPT; ADJUST THIS TO YOUR ENVIRONMENT
vms=$(vim-cmd vmsvc/getallvms | sed -e 's/.*\[/\/vmfs\/volumes\//; s/Vmid.*//; s/\]./\//; s/\.vmx.*/.vmx/; /^\/vmfs\/volumes\/L1.*/!s/.*//; /^$/d;')
#MAIN LOOP, ONE VMX PATH PER LINE
printf '%s\n' "$vms" | while IFS= read -r vm
do
	[ -n "$vm" ] || continue
	if [ ! -r "$vm" ] || [ ! -s "$vm" ]
	then
		#VMX FILE IS UNREADABLE OR EMPTY (FOR EXAMPLE LOCKED): DOCUMENT IT AND SKIP
		echo "$vm",ERROR >> "$list"
		continue
	fi
	#GET THE REAL VMNAME (NOT THE VMX/NAME ON THE DATASTORE)
	name=$(grep -i 'displayname' "$vm" | sed -e 's/displayName.=."//;s/"$//')
	if grep -qi 'vfile' "$vm"
	then
		echo name:"$name"
		#GET THE VMX FILE NAME
		vmx=$(echo "$vm" | sed -e 's/\/vmfs\/volumes\/.*\///')
		#SAVE A COPY OF THE VMX FILE IN A NEW LOCATION AND CHECK THAT IT HAS CONTENT
		if cat "$vm" > "$vmxpath/$vmx" && [ -s "$vmxpath/$vmx" ]
		then
			#DOCUMENT REAL SERVERNAME AND PATH TO VMX FILE
			echo "$name","$vm" >> "$list"
			#OVERWRITE THE VMX FILE WITH THE COPY MINUS THE VFILE LINES
			#READING FROM THE COPY AVOIDS TRUNCATING THE FILE WHILE IT IS STILL BEING READ
			grep -i -v 'vfile' "$vmxpath/$vmx" > "$vm"
		else
			echo "$name",ERROR >> "$list"
		fi
	else
		#NO VFILE LINES: DOCUMENT REAL SERVERNAME AND STATUS
		echo "$name",OKAY >> "$list"
	fi
done

#ECHO WHERE THE FILES WERE SAVED
echo "Path to files"
echo "$path"
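
The sed step that turns the “vim-cmd vmsvc/getallvms” listing into full vmx paths can be tried away from a host. The sketch below runs a simplified version of that expression against sample text. The sample lines are invented and only approximate the column layout of real output, vmx_paths is a hypothetical helper name, and the datastore name filter from the script is left out.

```shell
#!/bin/sh
# Sketch: convert "[datastore] folder/name.vmx" entries into /vmfs/volumes/... paths.
# The sample listing is invented; it only approximates the getallvms column layout.

vmx_paths() {
    # 1. Replace everything up to and including "[" with the volumes prefix.
    # 2. Replace the "] " after the datastore name with "/".
    # 3. Cut everything after the ".vmx" extension.
    # 4. Drop any line that did not become a path (for example the header).
    sed -e 's|.*\[|/vmfs/volumes/|; s|\] |/|; s|\.vmx.*|.vmx|; /^\/vmfs\/volumes\//!d'
}

sample='Vmid   Name    File                           Guest OS     Version   Annotation
1      web01   [datastore1] web01/web01.vmx   otherGuest   vmx-08
2      db01    [datastore2] db01/db01.vmx     otherGuest   vmx-08'

printf '%s\n' "$sample" | vmx_paths
```

An annotation that itself contains a “[” would confuse this expression, so the resulting list is worth a quick look before anything is overwritten.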

 

I got schooled

All this is no good if there’s no good way of getting off the hosts and onto a clean host without vShield Endpoint. At first we discussed introducing a new clean host into a cluster, then cleaning the VMs and shutting them down in order to power them on again on the clean host. This would mean minimal downtime, but it would still be a huge job to get all the customers to sign off on it; after all, all VMs were working without issues (sort of). A vMotion didn’t look like a viable way either, as doing a vMotion from or to a host with vShield Endpoint forces vShield Endpoint to inject the vfile lines into the vmx file once again, which makes the vMotion fail if it goes from a host with vShield Endpoint to a host without.

At the time we had an SE from VMware onsite to help with some of the issues, and it was this guy who got us on the right track. I remember a meeting with only a few people: a project manager, two of us from the internal VMware team, and the SE from VMware. I went through the ideas I had to fix the issue. Then at some point the SE asked the question, “What happens during a vMotion?” As I recall it, the two of us from the VMware team looked at each other, each waiting for a good answer from the other. As we both held off answering, the SE trumped it with “This is a VCP question”. OUCH! That one hurt.

Then came some more silence before he finally revealed his thoughts on what happens during a vMotion: the vmx file gets reloaded, or simply read again, by the new host. This would mean that if one could clean the vmx file of a VM of vfile references and then do a vMotion to a host which hasn’t got vShield Endpoint protection, the vMotion should work.

 

The Solution

After some testing, it turned out to be our salvation. A host was prepped and added to a cluster. Then, one host at a time, the script was executed, the VMs were vMotioned to the clean host, and lastly the host with vShield Endpoint was reinstalled. This way next to no VMs failed the procedure, and the few that did fail, due to locks on the vmx file, could easily be retried or have their vmx file recreated.
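
Between running the script on a host and starting the vMotions, it can help to confirm that every vmx file really is clean. Here is a minimal sketch of such a check, assuming a list of vmx paths on stdin. check_vmx_clean is a hypothetical helper and the sample files are invented; the vMotion itself is still done from vCenter.

```shell
#!/bin/sh
# Sketch: before migrating, confirm that no .vmx file in a list still mentions 'vfile'.
# check_vmx_clean is a hypothetical helper; paths are read from stdin, one per line.

check_vmx_clean() {
    dirty=0
    while IFS= read -r vmx; do
        [ -n "$vmx" ] || continue
        if [ ! -r "$vmx" ] || [ ! -s "$vmx" ]; then
            # An unreadable or blank file needs a retry or a recreated vmx file.
            echo "UNREADABLE_OR_EMPTY $vmx"
            dirty=1
        elif grep -qi 'vfile' "$vmx"; then
            echo "STILL_HAS_VFILE $vmx"
            dirty=1
        fi
    done
    # Returns 0 only if every listed file was readable and free of 'vfile'.
    return "$dirty"
}

# Demonstration with two throwaway files; their content is invented.
work=$(mktemp -d)
printf 'displayName = "clean01"\nmemsize = "2048"\n' > "$work/clean01.vmx"
printf 'displayName = "dirty01"\nexample.vfile.option = "made-up"\n' > "$work/dirty01.vmx"

if printf '%s\n' "$work/clean01.vmx" "$work/dirty01.vmx" | check_vmx_clean; then
    echo "all clean, ready to migrate"
else
    echo "do not migrate yet"
fi
```

Anything the check reports is a candidate for a retry or for recreating the vmx file before the host is emptied.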

The one thing that I’m sure of is that vShield Endpoint is a good solution, but watch out for third-party integrations and their impact on your vSphere environment.
