Sunday, June 5, 2011

QLogic HBAs and other PCI devices may stop responding in ESX/ESXi 4.1 (vSphere 4.1) when using Interrupt Remapping

This is a very new bug in vSphere 4.1 (ESXi); VMware released the solution only on Apr 21, 2011.

I have had 2 outages since we updated to vSphere 4.1 in May 2011.

Please take note of this.

When using Interrupt Remapping on some servers, you may experience these symptoms on ESX/ESXi 4.1:

HBAs stop responding
Other PCI devices may also stop responding
You see an illegal vector shortly before an HBA stops responding to the driver. For example:

vmkernel: 6:01:34:46.970 cpu0:4120)ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40

The HBA stops responding to commands. For example:

vmkernel: 6:01:42:36.189 cpu15:4274)<6>qla2xxx 0000:1a:00.0: qla2x00_abort_isp: **** FAILED ****
vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: Failed mailbox send register test

The HBA card gets marked offline. For example:

vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: ISP error recovery failed - board disabled

Note: This issue only applies if you see the specific alert: ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40 in the vmkernel/messages log files. If you do not have this message, you are not experiencing this issue.
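As a minimal sketch of the check described in the note, the log can be searched for the exact alert text. The function name `has_apic_alert` is hypothetical, and the log path is left to the caller because the note only says "vmkernel/messages log files"; the path in the usage comment is an assumption to adjust for your host.

```shell
# Hypothetical helper: succeeds (exit 0) if the given log file contains
# the exact alert text quoted in the note above.
has_apic_alert() {
    # -F: match the text literally, -q: no output, only an exit status
    grep -F -q 'ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40' "$1"
}

# Example usage (the path is an assumption, not taken from this post):
# has_apic_alert /var/log/messages && echo "APIC alert found: this issue applies"
```

If the function fails, the alert is not in that file, and according to the note you are not experiencing this issue.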


Resolution
This issue is currently under investigation by VMware engineering.

ESX 4.1 introduces interrupt remapping code that is enabled by default. This code is incompatible with some servers. You can work around this issue by manually disabling interrupt remapping on the affected servers.

To disable interrupt remapping, perform one of these options:

Run the commands:

# esxcfg-advcfg -k TRUE iovDisableIR
# reboot

To check that interrupt remapping is disabled after the reboot, run the command:

# esxcfg-advcfg -j iovDisableIR

iovDisableIR=TRUE

In vSphere Client:
Click Configuration > (Software) Advanced Settings > VMkernel.
Select VMkernel.Boot.iovDisableIR and click OK.
Reboot the ESX host.
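Whichever option you use, the check after the reboot can be scripted. The sketch below assumes the command prints a line in the form shown above (`iovDisableIR=TRUE`); the function name `ir_disabled` is hypothetical, and tolerating spaces around `=` is my own precaution, not something this post confirms.

```shell
# Hypothetical helper: takes the text printed by
# `esxcfg-advcfg -j iovDisableIR` and succeeds only if it reports TRUE.
ir_disabled() {
    # Optional whitespace around "=" is tolerated as a precaution (assumption).
    printf '%s\n' "$1" |
        grep -E -q '^iovDisableIR[[:space:]]*=[[:space:]]*TRUE[[:space:]]*$'
}

# Example usage on the host:
# ir_disabled "$(esxcfg-advcfg -j iovDisableIR)" && echo "interrupt remapping is disabled"
```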

2 comments:

  1. Is this an action that we must take if we receive such error messages?

    1. Hi Fung,

      Sorry for the late reply. Yes, you must apply the steps above to overcome this problem. Otherwise your physical ESXi host may hang and stop responding because it cannot connect to shared storage.
