Apr 29, 2011
 

I normally don’t post these, but there are a few important symptoms I’ve personally seen that look to be resolved by this patch.

Two new patches were released yesterday. You can find the details for the ESXi patch in this KB Article and in this KB Article. There are two very important security fixes in the patches, but they also resolve a number of problem symptoms. Here are some of the symptoms that are fixed (as per the KB articles at the time of this writing):

  • If you configure the port group policies of NIC teaming for parameters such as load balancing, network failover detection, notify switches, or failback, and then restart the ESXi host, the ESXi host might send traffic only through one physical NIC.
  • Virtual machines configured with CPU limits might experience a drop in performance when the CPU limit is reached (%MLMTD greater than 0). For more information, see KB 1030955. (A quick way to check %MLMTD is sketched after this list.)
  • The CPU usage of sfcbd becomes higher than normal, around 40% to 60%. The /var/log/sdr_content.raw and /var/log/sel.raw log files might contain the efefefefefefefefefefefef text, and /var/log might contain an SDR response buffer was wrong size message. This issue occurs because IpmiProvider might use CPU for a long time to process meaningless text such as efefef.
    This issue is seen particularly on Fujitsu PRIMERGY servers, but might occur on any other system.
  • ESXi hosts that are installed with NetXen NX 2031 devices might not show up under ESXi 4.1. When you view the NICs in the vSphere Client for the ESXi host, NetXen devices might not appear.
  • The Broadcom 5709 NIC that uses the bnx2 driver with MSI-X might stop processing interrupts and terminate network connectivity under heavy workloads. This issue occurs because the write by the kernel to unmask the MSI-X vector is dropped when the GRC timeout value for such writes is too short.
    This patch increases the GRC timeout value, preventing unexpected termination of network connectivity when the NIC experiences a heavy workload.
  • If the NFS volume hosting a virtual machine encounters errors, the NVRAM file of the virtual machine might become corrupted and grow in size from the default 8K up to a few gigabytes. At such a time, if you perform a vMotion or a suspend operation, the virtual machine fails with an error message similar to the following:
    unrecoverable memory allocation failures at bora/lib/snapshot/snapshotUtil.c:856
  • Linux virtual machines with VMXNET2 virtual NIC might fail when the virtual machines are using MTU greater than the standard MTU of 1500 bytes (jumbo frames).
  • If you are using a backup application that utilizes Changed Block Tracking (CBT) and the ctkEnabled option for the virtual machine is set to true, the virtual machine becomes unresponsive for up to 30 seconds when you remove snapshots of a virtual machine residing on NFS storage. (A quick way to confirm that ctkEnabled is set is sketched after this list.)
  • An ESXi host connected to an NFS datastore might fail with a purple diagnostic screen due to a corrupted response received from the NFS server for any read operation that you perform on the NFS datastore, displaying error messages similar to the following:
    Saved backtrace from: pcpu 16 SpinLock spin out NMI
    0x4100c00875f8:[0x41801d228ac8]ProcessReply+0x223 stack: 0x4100c008761c
    0x4100c0087648:[0x41801d18163c]vmk_receive_rpc_callback+0x327 stack: 0x4100c0087678
    0x4100c0087678:[0x41801d228141]RPCReceiveCallback+0x60 stack: 0x4100a00ac940
    0x4100c00876b8:[0x41801d174b93]sowakeup+0x10e stack: 0x4100a004b510
    0x4100c00877d8:[0x41801d167be6]tcp_input+0x24b1 stack: 0x1
    0x4100c00878d8:[0x41801d16097d]ip_input+0xb24 stack: 0x4100a05b9e00
    0x4100c0087918:[0x41801d14bd56]ether_demux+0x25d stack: 0x4100a05b9e00
    0x4100c0087948:[0x41801d14c0e7]ether_input+0x2a6 stack: 0x2336
    0x4100c0087978:[0x41801d17df3d]recv_callback+0xe8 stack: 0x4100c0087a58
    0x4100c0087a08:[0x41801d141abc]TcpipRxDataCB+0x2d7 stack: 0x41000f03ae80
    0x4100c0087a28:[0x41801d13fcc1]TcpipRxDispatch+0x20 stack: 0x4100c0087a58
  • An ESXi host might stop responding if one of the mirrored installation drives that is connected to an LSI SAS controller is unexpectedly removed from the server.
  • If there are read-only LUNs with valid VMFS metadata, rescanning of VMFS volumes might take a long time to complete because the ESXi server keeps trying to mount the read-only LUNs until the mount operation times out.
  • When you simultaneously start several hundred virtual machines that are configured to use the e1000 virtual NIC, ESXi hosts might stop responding and display a purple diagnostic screen.
  • When you migrate a virtual machine or restore a snapshot, you might notice a loss of application monitoring heartbeats. This issue occurs due to internal timing and synchronization issues. As a consequence, you might see a red application monitoring event warning, followed by the immediate reset of the virtual machine if the application monitoring sensitivity is set to High. In addition, application monitoring events that are triggered during the migration might contain outdated host information.
  • When USB devices connected to EHCI controllers get reset, memory corruption might sometimes cause the ESXi host to fail with a purple screen and display error messages similar to the following:
    #GP Exception 13 in world 4634:vmm0:sofsqle @ 0x418015eab7a1
    86:01:33:10.118 cpu6:4634)Code start: 0x418015a00000 VMK uptime: 86:01:33:10.118
    86:01:33:10.119 cpu6:4634)0x417f810d77e8:[0x418015eab7a1]ehci_urb_done@esx:nover+0x34 stack: 0xffffff8d
    86:01:33:10.119 cpu6:4634)0x417f810d7858:[0x418015eabfd7]qh_completions@esx:nover+0x406 stack: 0x10
    86:01:33:10.119 cpu6:4634)0x417f810d78d8:[0x418015ead4b1]ehci_work@esx:nover+0xb8 stack: 0x1
    86:01:33:10.120 cpu6:4634)0x417f810d7928:[0x418015eafaa4]ehci_irq@esx:nover+0xeb stack: 0x417f810d79d8
    86:01:33:10.120 cpu6:4634)0x417f810d7948:[0x418015e99f3d]usb_hcd_irq@esx:nover+0x2c stack: 0x5
  • When SCO OpenServer 5.0.7 virtual machines with multiple vCPUs are installed on a virtual SMP enabled ESXi Server, the SCO OpenServer 5.0.7 virtual machines start with only one vCPU instead of starting with all the vCPUs.
  • When a Storage Virtual Appliance (SVA) presents an iSCSI LUN to the ESXi host on which it runs, occurrence of VMFS lockup might cause I/O timeouts in the SVA, resulting in messages similar to the following in the /var/log/messages file of the SVA virtual machine:
    sva1 kernel: [ 5817.054354] mptscsih: ioc0: attempting task abort! (sc=ffff88001f95b580)
    sva1 kernel: [ 5817.054360] sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 00 15 30 40 00 00 40 00
    sva1 kernel: [ 5817.182134] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88001f95b580)
  • If you configure a primary or secondary private VLAN on vNetwork Distributed Switches while a virtual machine is migrating, the destination ESXi host might stop responding and display a purple diagnostic screen with messages similar to the following:
    #PF Exception 14 in world 4808:hostd-worker IP 0x41801c96964a addr 0x38
    VLAN_PortsetLookupVID@esx:nover+0x59 stack: 0x417f823d77a8
    PVLAN_DVSUpdate@esx:nover+0x5cb stack: 0x0
    DVSPropESSetPVlanMap@esx:nover+0x81 stack: 0x1
    DVSClient_PortsetDataWrite@vmkernel:nover+0x78 stack: 0x417f823d7898
    DVS_PortsetDataSet@vmkernel:nover+0x6c stack: 0x1823d78f8
  • When VMware Fault Tolerance is enabled on a virtual machine and the ESXi host that runs the secondary virtual machines is powered off unexpectedly, the primary virtual machine might become unavailable for about a minute.
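
Two of the items above are easy to sanity-check yourself before you file the change request. For the CPU limit symptom, the counter to watch is %MLMTD in esxtop (or resxtop from the vSphere CLI/vMA). This is just my usual quick check, not something from the KB articles:

    # On the host in Tech Support Mode, or remotely with resxtop --server <host>
    esxtop
    # Press 'c' for the CPU view, find the world/group for the affected VM,
    # and watch the %MLMTD column. Anything consistently above 0 means the
    # VM was ready to run but was held back by its CPU limit.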
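
For the CBT symptom, you can confirm that ctkEnabled is actually set by looking at the VM's .vmx file. The datastore path and VM name below are made up, so substitute your own:

    # From the ESXi Tech Support Mode shell (or over SSH); paths are examples
    grep -i ctkEnabled /vmfs/volumes/datastore1/myvm/myvm.vmx
    # Typical output when Changed Block Tracking is enabled:
    #   ctkEnabled = "TRUE"
    #   scsi0:0.ctkEnabled = "TRUE"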

If you have been experiencing any of these symptoms, it’s time to get your change management form filed and patch at will.
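
For a standalone ESXi 4.1 host, I push these out with the vSphere CLI's vihostupdate utility (Update Manager is the easier route if you have vCenter). Put the host into maintenance mode first; the host name and bundle file name below are placeholders, so use the actual bundle named in the KB article:

    # Install the patch bundle (placeholder host and bundle names)
    vihostupdate.pl --server esxi01.example.com --username root \
        --install --bundle ./ESXi410-xxxxxx.zip
    # Confirm the bulletins show as installed, then reboot the host
    vihostupdate.pl --server esxi01.example.com --username root --query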
