Path failures on ESX4 with HP storage

Since we began upgrading our clusters to ESX4, we have been seeing strange “failed physical path” messages in our vmkernel logs. I don’t normally post unless I know the solution to a problem, but in this case I’ll make an exception. Our deployment has been delayed and plagued by the storage issues that I mentioned in an earlier post. Even though we have fixed our major problems, the following types of errors have persisted.

Our errors look like this:

vmkernel: 19:18:05:07.991 cpu6:4284)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005101140) to NMP device "naa.6001438005de88b70000a00002250000" failed on physical path "vmhba0:C0:T0:L12" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
vmkernel: 19:18:05:07.991 cpu6:4284)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6001438005de88b70000a00002250000" state in doubt; requested fast path state update...
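
For anyone trying to decode those lines: the H:/D:/P: triplet is the host, device, and plugin status, and command 0x2a is a SCSI WRITE(10). As I read the standard codes, H:0x2 is a bus-busy style host error, which fits a path dropping out from under the host. Here is a rough cheat sheet in Python; the mappings are my own best-effort notes from the usual SCSI conventions, not anything official from VMware:

# Cheat sheet for decoding the NMP failure messages above.
# Mappings follow the standard SCSI/VMkernel conventions as I
# understand them -- treat this as notes, not an official reference.

SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2a: "WRITE(10)",
}

HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

def decode(opcode, host, device, plugin):
    """Render one message's codes in human-readable form."""
    return "%s: H=%s D=%s P=%s" % (
        SCSI_OPCODES.get(opcode, hex(opcode)),
        HOST_STATUS.get(host, hex(host)),
        "GOOD" if device == 0 else hex(device),
        "GOOD" if plugin == 0 else hex(plugin),
    )

# The message above: command 0x2a, H:0x2 D:0x0 P:0x0
print(decode(0x2a, 0x2, 0x0, 0x0))  # WRITE(10): H=BUS_BUSY D=GOOD P=GOOD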

After several cases with VMware and HP technical support, we are no closer to resolving the issue. VMware support, for its part, has done a good job of explaining what ESX is seeing and reacting to. HP support, on the other hand, has been circling the problem without making much progress on a diagnosis. We have had a case open for several months, and our primary EVA resource at HP has repeatedly examined the EVAperf data and SSSU output we have sent in for analysis. Those have turned up nothing, and yet the messages from VMware continue.

The errors in the log make sense to me – we are losing a path to a data disk (sometimes even a boot-from-SAN ESX OS disk!) – but why HP cannot see anything in our Brocade switches or within the EVA is beyond me. Our ESX hosts, whether blade or rack-mounted hardware, are seeing the problem across the board. The one cluster we waited to upgrade never saw these messages on ESX3.5 but sees them now on ESX4. Perhaps ESX4 is simply more sensitive in how it monitors its storage, but I suspect it’s something else. The messages don’t seem to affect operation on the hosts, but they certainly make troubleshooting difficult when you are trying to determine what is a real problem versus just another failed-path message. Anyone else seeing this?
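
For what it’s worth, one way to separate a genuinely bad path from background noise is to tally the failures per device and path over a day or two of logs. A minimal sketch, assuming the message format shown above (the script name is made up, and on classic ESX the vmkernel log usually lives at /var/log/vmkernel):

#!/usr/bin/env python
# tally_paths.py -- count NMP "failed on physical path" messages per
# (device, path) pair so the worst offenders float to the top.
# Throwaway triage script, not VMware tooling; adjust the regex if
# your message format differs from the excerpt above.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r'NMP device "(?P<device>[^"]+)" failed on physical path '
    r'"(?P<path>[^"]+)"'
)

counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("device"), match.group("path"))] += 1

for (device, path), total in counts.most_common():
    print("%6d  %s  %s" % (total, path, device))

A single path that racks up ten times the failures of its peers is worth chasing; a uniform spread across every path and LUN points back at something fabric- or array-wide.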