Time Stamp: Restarts at Jan 6 06:10:16 and Jan 13 06:10:30 (local time).
Hostname: esx1.vmware.com
Version: VMware ESX 4.0.0 build-164009
Server Model: Specification Version: 2.0
Vendor: IBM Corp.
Version: -[D6E124AUS-1.01]-
Release Date: 04/30/2009
Service Console Mem (Cfg).............................358 Megabytes
Log Analysis:
1. The vmksummary logs show that both restarts were abrupt, hardware-level reboots: in each case the last heartbeat (hb) entry is followed directly by "Starting system..." with no shutdown event in between. Abrupt reboots of this kind are typically caused by a power outage, a faulty component, or overheating. Engage the hardware vendor to investigate further.
vmksummary logs
Jan 6 05:01:16 esx1 logger: (1325797276) hb: vmk loaded, 4454872.10, 4454865.986, 12, 164009, 164009, 96, vmware-h-95196, sfcbd-19696, sfcbd-13140
Jan 6 06:10:16 esx1 vmkhalt: (1325801416) Starting system...
Jan 6 06:11:04 esx1 logger: (1325801464) loaded VMkernel
Jan 6 07:01:15 esx1 logger: (1325804475) hb: vmk loaded, 3124.71, 3118.588, 0, 164009, 164009, 0, vmware-h-68388, sfcbd-12920, sfcbd-8044
Jan 13 05:01:13 esx1 logger: (1326402073) hb: vmk loaded, 600722.02, 600715.898, 12, 164009, 164009, 84, vmware-h-83112, sfcbd-13112, sfcbd-8212
Jan 13 06:10:30 esx1 vmkhalt: (1326406230) Starting system...
Jan 13 06:11:17 esx1 logger: (1326406277) loaded VMkernel
Jan 13 07:01:14 esx1 logger: (1326409274) hb: vmk loaded, 3109.67, 3103.543, 0, 164009, 164009, 0, vmware-h-68220, sfcbd-12964, sfcbd-8112
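The check described in point 1 can be sketched as a small script. This is only an illustration, not a VMware tool: it assumes that a clean shutdown or reboot writes a "Halting system..." or "Rebooting system..." entry to vmksummary immediately before the "Starting system..." entry, and the function name find_abrupt_boots is hypothetical.

```python
import re

# Matches the "(epoch)" field and the message text of a vmksummary entry.
ENTRY = re.compile(r"\((\d+)\)\s+(.*)$")

# Assumption: these messages are what a clean shutdown or reboot logs.
CLEAN_MARKERS = ("Halting system", "Rebooting system")


def find_abrupt_boots(lines):
    """Return the epochs of boots whose preceding entry is not a clean-shutdown marker."""
    abrupt = []
    previous = None  # message text of the preceding vmksummary entry
    for line in lines:
        match = ENTRY.search(line)
        if not match:
            continue
        epoch, message = int(match.group(1)), match.group(2)
        if message.startswith("Starting system"):
            # A boot with no earlier entry is skipped: nothing can be concluded.
            if previous is not None and not previous.startswith(CLEAN_MARKERS):
                abrupt.append(epoch)
        previous = message
    return abrupt


sample = [
    "Jan 6 05:01:16 esx1 logger: (1325797276) hb: vmk loaded, 4454872.10, 4454865.986, 12, 164009, 164009, 96, vmware-h-95196, sfcbd-19696, sfcbd-13140",
    "Jan 6 06:10:16 esx1 vmkhalt: (1325801416) Starting system...",
    "Jan 6 06:11:04 esx1 logger: (1325801464) loaded VMkernel",
    "Jan 13 05:01:13 esx1 logger: (1326402073) hb: vmk loaded, 600722.02, 600715.898, 12, 164009, 164009, 84, vmware-h-83112, sfcbd-13112, sfcbd-8212",
    "Jan 13 06:10:30 esx1 vmkhalt: (1326406230) Starting system...",
    "Jan 13 06:11:17 esx1 logger: (1326406277) loaded VMkernel",
]

print(find_abrupt_boots(sample))  # [1325801416, 1326406230]
```

Both boots in the excerpt are flagged because each one directly follows a heartbeat entry.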
2. The vmkernel logs contain no entries between about 5:38 AM and 6:10 AM on either day, which suggests that the host was unresponsive at the hardware level during that window.
vmkernel logs
Jan 6 05:38:32 esx1 vmkernel: 51:14:05:02.600 cpu2:20130)ScsiScan: 842: Path 'vmhba36:C0:T1:L0': Type: 0x10, ANSI rev: 3, TPGS: 1 (implicit only)
Jan 6 05:38:32 esx1 vmkernel: 51:14:05:02.600 cpu2:20130)ScsiScan: 105: Path 'vmhba36:C0:T1:L0': Peripheral qualifier 0x1 not supported
Jan 6 06:10:26 esx1 vmkernel: cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x00 apicID=0x01
Jan 6 06:10:26 esx1 vmkernel: TSC: 34637036 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x00 apicID=0x03
Jan 6 06:10:26 esx1 vmkernel: TSC: 34640068 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x00 apicID=0x05
Jan 13 05:39:19 esx1 vmkernel: 6:23:30:02.601 cpu14:4188)ScsiScan: 105: Path 'vmhba36:C0:T1:L0': Peripheral qualifier 0x1 not supported
Jan 13 06:10:40 esx1 vmkernel: ID=0x10
Jan 13 06:10:40 esx1 vmkernel: TSC: 36397712 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x01 apicID=0x12
Jan 13 06:10:40 esx1 vmkernel: TSC: 36400772 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x01 apicID=0x14
Jan 13 06:10:41 esx1 vmkernel: TSC: 36403752 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x01 apicID=0x16
Jan 13 06:10:41 esx1 vmkernel: TSC: 36406760 cpu0:0)ACPI: 912: SRAT proc entry nodeID=0x00 apicID=0x01
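The silent window described in point 2 can be measured with a short script. This is a minimal sketch under two assumptions: the year is 2012 (inferred from the epoch values in the vmksummary entries, since syslog timestamps carry no year), and any silence longer than 10 minutes counts as a gap. The helper names parse_stamp and find_gaps are hypothetical, and the message text in the sample lines is shortened.

```python
from datetime import datetime

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def parse_stamp(line, year):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp of a log line."""
    month, day, clock = line.split()[:3]
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(year, MONTHS[month], int(day), hour, minute, second)


def find_gaps(lines, year, threshold_seconds=600):
    """Return (start, end, seconds) for each silence longer than the threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_stamp(line, year)
        if previous is not None:
            delta = int((current - previous).total_seconds())
            if delta > threshold_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps


# Timestamps taken from the vmkernel excerpt; message bodies are shortened.
jan6 = [
    "Jan 6 05:38:32 esx1 vmkernel: ScsiScan",
    "Jan 6 05:38:32 esx1 vmkernel: ScsiScan",
    "Jan 6 06:10:26 esx1 vmkernel: ACPI",
]
jan13 = [
    "Jan 13 05:39:19 esx1 vmkernel: ScsiScan",
    "Jan 13 06:10:40 esx1 vmkernel: ACPI",
]

for day in (jan6, jan13):
    for start, end, seconds in find_gaps(day, year=2012):
        print(start, "->", end, seconds, "seconds")
```

For the excerpt, the gaps are 1914 seconds on Jan 6 and 1881 seconds on Jan 13, that is, a little over 31 minutes each.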
3. Both restarts happened at 6:10 AM, and in both cases the server became unresponsive at around 5:40 AM. We recommend checking whether any power outage or scheduled maintenance activity occurs at that time.
messages log
Jan 6 05:25:30 esx1 sfcb[4006]: IpmiIfcSelReserve: failed on send to node 0 with code 197
Jan 6 05:39:27 esx1 last message repeated 43 times
Jan 6 05:40:28 esx1 last message repeated 40 times
Jan 6 06:10:23 esx1 syslogd 1.4.1: restart.
Jan 6 06:10:23 esx1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jan 13 03:01:27 esx1 sfcb[29954]: INTERNAL StorelibManager::fireStorelibCommand - caller StorelibManager::getPartitionInfo, ProcessLibCommandCall failed, rval = 0x8023
Jan 13 03:01:28 esx1 cimslp: SLP data collection finished
Jan 13 06:10:37 esx1 syslogd 1.4.1: restart.
Jan 13 06:10:37 esx1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
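The timing pattern in point 3 can be cross-checked with the epoch values from the two "Starting system..." entries in the vmksummary excerpt. The sketch below only does the arithmetic; the function name interval_between is hypothetical.

```python
SECONDS_PER_DAY = 86400


def interval_between(first_epoch, second_epoch):
    """Return the interval between two epochs as (whole days, leftover seconds)."""
    return divmod(second_epoch - first_epoch, SECONDS_PER_DAY)


first_boot = 1325801416   # Jan 6 06:10:16, from the vmksummary excerpt
second_boot = 1326406230  # Jan 13 06:10:30, from the vmksummary excerpt

print(interval_between(first_boot, second_boot))  # (7, 14)
```

The two restarts are 7 days and 14 seconds apart, which is consistent with a weekly scheduled event rather than a random failure, although it does not prove one.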
Recommendations:
- The server BIOS is old (release date 04/30/2009). Upgrade it to the latest available version.
- Upgrade ESX to 4.0 U4; the host is currently running 4.0 GA.
- Increase the service console memory from 358 MB to 800 MB so that hostd does not run short of memory.
- Engage IBM, the hardware vendor, to run a hardware sanity check and determine why the server becomes unresponsive.
- Also check with the vendor whether the firmware of the motherboard or any other component can be upgraded to the latest version.