Just wanted to do a short post about vmkmgmt_keyval, as I think most admins only use this tool for getting HBA driver/firmware versions, but you get so much more than just that. There are different forms of statistics in there: per target, per LUN, and per block size.
So first, what is vmkmgmt_keyval? vmkmgmt_keyval is an ESXi command which can be executed from ESXi's direct console user interface (DCUI) or via SSH; in both cases you get access to the ESXi Shell.
The way I'm going to use vmkmgmt_keyval is with an "-a" flag, just like in the example here: "/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a". What this does is list all key value instances. If you want help, "-h" can be used instead of "-a".
So besides getting driver/firmware versions of the HBAs, there's some data around the HBA itself, like queue depth, link speed, etc. But this is not what this post is about. I think the IO stats are much more important.
What is it that you can get from this? First, let's look at the data available on a per-target basis.
Tgt00 WWNN 00:00:00:00:00:00:00:00 WWPN 00:00:00:00:00:00:00:00
Target path is ok
IOStat: max 0090 pend 0000 txcnt 8317318393
IOErr: busy 0000 retry 0000 seq_tmo 0000
TMGMT: tgt_rst 0000 lun_rst 0000
ABORT: issue 000000 IOcnt 000000
Events: npr 0001 devloss 0000 no_connect 0000
LCLRJT: nrsc 0000 inv_rpi 0000 lcl_rjt 0000
FRAME: drop 0000 underrun 000 overrun 0000 scsidone 0000
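If you want to keep an eye on these counters without reading the raw dump every time, the "name value" pairs are easy to pull apart with a script. Here is a minimal sketch in Python, assuming the field layout of the sample output above (the parser and the choice of which counters to treat as errors are my own illustration, not anything vmkmgmt_keyval provides):

```python
import re

# Per-target sample output as printed by vmkmgmt_keyval (from the post).
sample = (
    "Tgt00 WWNN 00:00:00:00:00:00:00:00 WWPN 00:00:00:00:00:00:00:00 "
    "Target path is ok "
    "IOStat: max 0090 pend 0000 txcnt 8317318393 "
    "IOErr: busy 0000 retry 0000 seq_tmo 0000"
)

def parse_counters(text):
    """Collect lowercase 'name value' counter pairs, skipping the WWNN/WWPN."""
    return {name: int(value) for name, value in re.findall(r"\b([a-z_]+) (\d+)\b", text)}

stats = parse_counters(sample)

# Flag the target if any of the IOErr counters is non-zero.
errors = {k: stats[k] for k in ("busy", "retry", "seq_tmo") if stats.get(k)}
print("target OK" if not errors else f"target errors: {errors}")
```

Run against the sample, all three error counters are zero, so it reports the target as OK.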
What you see above is an example of a target and its stats. The names of the stats can be a bit cryptic, but with a little storage know-how this shouldn't be too hard. Suddenly we can easily see whether a target is having problems, and which one.
Moving on to the per-LUN stats, we here see queue depth, FCP errors, aborts and LUN resets, which can be used to help troubleshoot.
LUN[0:0] WWNN 00:00:00:00:00:00:00:00 WWPN 00:00:00:00:00:00:00:00
path is ok
qdepth 30 fcperr 0005
abts issue 000000 cnt 000000 lun_rst 0000
tx_cnt 5892684093
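One thing worth doing with the per-LUN numbers is putting the error count in proportion to the total IO count, since an absolute fcperr value means little on its own. A quick sketch using the sample values above (the threshold is an arbitrary illustration, not a VMware recommendation):

```python
# Values taken from the per-LUN sample output above.
fcperr = 5              # FCP errors seen on this LUN
tx_cnt = 5_892_684_093  # total IOs transmitted to this LUN

# Express errors as a rate; with billions of IOs, 5 errors is negligible.
error_rate = fcperr / tx_cnt
print(f"FCP error rate: {error_rate:.2e}")

# Arbitrary illustrative threshold: worry if more than 1 in a million IOs errors.
print("LUN looks healthy" if error_rate < 1e-6 else "LUN needs a closer look")
```

For this LUN the rate works out to roughly one error per billion IOs, so the five FCP errors are nothing to lose sleep over.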
The data we just looked at has one problem: it is aggregated, so we have no idea of when the problems occurred. This is why management tools like vRealize Operations Manager (vRops) and vRealize Log Insight exist. Both are tools which can help analyze these kinds of problems.
This brings me to the last part of the vmkmgmt_keyval output, which is block size and latency. Again, here is an example.
lpfc IOStats Page:
                    Snapshot                     Total
                    --------                     -----
       IOPrd  IOPwr   MBrd  MBwr    IOPrd  IOPwr  MBrd  MBwr
Tgt00 WWNN 00:00:00:00:00:00:00:00 WWPN 00:00:00:00:00:00:00:00
      3426.5   53.1 3353.3  29.3    571.2   58.2 540.0   8.3
size[    512 -     512 ] cnt   20272480 avg 1ms
size[   1024 -    1536 ] cnt   11490892 avg 0ms
size[   2048 -    3584 ] cnt   10001518 avg 0ms
size[   4096 -    7680 ] cnt  843186616 avg 0ms
size[   8192 -   15872 ] cnt 5342245988 avg 0ms
size[  16384 -   32256 ] cnt 1130441743 avg 0ms
size[  32768 -   65024 ] cnt  278804303 avg 1ms
size[  65536 -  130560 ] cnt  200924556 avg 1ms
size[ 131072 -  261632 ] cnt  295706725 avg 1ms
size[ 262144 -  523776 ] cnt  181901841 avg 4ms
size[ 524288 - 1048064 ] cnt   02161818 avg 7ms
size[1048576 - 2096640 ] cnt   00011727 avg 5ms
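To see what this histogram is really telling you, it helps to compute where the IO volume sits and what the count-weighted latency comes out to. A small sketch over the exact numbers printed in the sample above (the analysis is my own, not part of vmkmgmt_keyval):

```python
# Block-size histogram from the lpfc sample: (low, high, count, avg_latency_ms).
buckets = [
    (512, 512, 20272480, 1),
    (1024, 1536, 11490892, 0),
    (2048, 3584, 10001518, 0),
    (4096, 7680, 843186616, 0),
    (8192, 15872, 5342245988, 0),
    (16384, 32256, 1130441743, 0),
    (32768, 65024, 278804303, 1),
    (65536, 130560, 200924556, 1),
    (131072, 261632, 295706725, 1),
    (262144, 523776, 181901841, 4),
    (524288, 1048064, 2161818, 7),
    (1048576, 2096640, 11727, 5),
]

total = sum(cnt for _, _, cnt, _ in buckets)

# Average latency weighted by how many IOs landed in each bucket.
weighted_avg = sum(cnt * avg for _, _, cnt, avg in buckets) / total

# The bucket carrying the most IOs; here the 8K-16K range dominates.
low, high, cnt, _ = max(buckets, key=lambda b: b[2])

print(f"total IOs: {total}")
print(f"count-weighted avg latency: {weighted_avg:.2f} ms")
print(f"busiest bucket: {low}-{high} bytes ({100 * cnt / total:.1f}% of IOs)")
```

For this target, roughly two thirds of all IOs fall in the 8K-16K bucket, and the big 256K+ blocks, while comparatively rare, carry noticeably higher average latency.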
The interesting part here is the histogram: it shows block size, count and avg. latency. So for each block size interval you can see the latency, what this datastore is serving, and the impact that has on latency. Again, this is a good indication of how the storage/datastore is serving the VMs on top of it, but it can't be used for profiling the VMs or knowing which VM might be impacted by "poor" storage performance.
If there's a need to understand storage better, there are better tools than vmkmgmt_keyval. One such tool is PernixData Architect. I'm not going to do a write-up about Architect as others have already done that very well; take a look at Pete Koehler's blog post called Viewing the impact of block sizes with PernixData Architect.
That was all for now. I hope I made myself clear: vmkmgmt_keyval is a tool to get better insight into your storage, but there are even better tools which do this with more precision and more granularity.