Intel S1400FP User Manual

Intel S1400FP
141 pages
System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families
Intel order number G90620-002
Revision 1.1
September 2013
Enterprise Platforms and Services Division Marketing

Intel S1400FP Specifications

General
Product Type: Server Motherboard
Form Factor: SSI EEB
Chipset: Intel C602
Supported Memory Types: DDR3
Networking: Intel 82574L Gigabit Ethernet
SATA: 2 x SATA 6Gb/s
Storage Interfaces: SATA
USB Ports: 4 x USB 2.0
Expansion Slots: 2 x PCIe 3.0 x16

Summary

1. Introduction

1.1 Purpose

Lists all possible events generated by the Intel platform, excluding external sources.

1.2 Industry Standard

Details industry standards like IPMI, BMC, and Intel Intelligent Power Node Manager for server management.

2. Basic Decoding of a SEL Record

2.1 Default Values in SEL Records

Explains default values for SEL records, including Record Type, Generator ID, and Event Message Revision.
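As a rough illustration of the fields named above, the following Python sketch splits a 16-byte system event record into its parts, using the record layout from the IPMI specification. In that specification, Record Type 02h marks a system event record and Event Message Revision 04h marks the IPMI 2.0 event format. The function name `decode_sel_record` and the sample bytes are hypothetical and are not taken from the guide.

```python
import struct


def decode_sel_record(raw: bytes) -> dict:
    """Split a 16-byte IPMI system event record into named fields."""
    if len(raw) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    # Little-endian layout: Record ID (2 bytes), Record Type (1),
    # Timestamp (4), Generator ID (2), then seven single-byte fields.
    (record_id, record_type, timestamp, generator_id, evm_rev,
     sensor_type, sensor_number, dir_and_type,
     ed1, ed2, ed3) = struct.unpack("<HBIH7B", raw)
    return {
        "record_id": record_id,
        "record_type": record_type,      # 02h = system event record
        "timestamp": timestamp,          # seconds, as stored in the record
        "generator_id": generator_id,    # e.g. 0020h for the BMC
        "evm_rev": evm_rev,              # 04h = IPMI 2.0 event format
        "sensor_type": sensor_type,
        "sensor_number": sensor_number,
        # Bit 7 is the event direction: 0 = assertion, 1 = deassertion.
        "deassertion": bool(dir_and_type & 0x80),
        # Bits 6:0 are the event/reading type code.
        "event_type": dir_and_type & 0x7F,
        "event_data": (ed1, ed2, ed3),
    }


# Hypothetical sample record, made up for illustration only.
sample = bytes.fromhex("0100 02 00000050 2000 04 01 21 01 57 00 00")
fields = decode_sel_record(sample)
```

Checking `record_type`, `generator_id`, and `evm_rev` first is a quick way to confirm that a hex dump is aligned correctly before interpreting the sensor-specific bytes.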

2.2 Notes on SEL Logs and Collecting SEL Information

Provides guidance on capturing SEL logs, including human-readable and hex versions, and handling OEM-specific data.

2.2.1 Examples of Decoding BIOS Timestamp Events

Illustrates how to decode BIOS timestamp events logged during POST and OS shutdown.

2.2.2 Example of Decoding a PCI Express Correctable Error Event

Demonstrates decoding a PCI Express correctable error event, including bus, device, and function details.
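The bus, device, and function details can be unpacked with a few bit operations. This sketch assumes the bus number arrives in one event data byte and that a second byte packs the device number in bits 7:3 and the function number in bits 2:0, which is the conventional PCI device/function encoding. Which event data bytes actually carry these values is an assumption here and should be confirmed against the guide. The function name `decode_pci_location` is hypothetical.

```python
def decode_pci_location(bus_byte: int, devfn_byte: int) -> str:
    """Format a PCI location as bus:device.function (hex)."""
    bus = bus_byte & 0xFF
    device = (devfn_byte >> 3) & 0x1F   # upper five bits
    function = devfn_byte & 0x07        # lower three bits
    return f"{bus:02x}:{device:02x}.{function:x}"


# Hypothetical values: bus 03h, device/function byte 10h.
location = decode_pci_location(0x03, 0x10)
```

The resulting string uses the same bus:device.function notation that tools such as `lspci` print, which makes it easier to match the event to an installed adapter.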

2.2.3 Example of Decoding a Power Supply Predictive Failure Event

Shows how to decode a power supply predictive failure event, detailing voltage warnings and fault events.

3. Sensor Cross Reference List

3.1 BMC owned Sensors (GID = 0020h)

Provides details for sensors owned by the Baseboard Management Controller (BMC).

3.2 BIOS POST owned Sensors (GID = 0001h)

Lists details for sensors owned by BIOS POST, including memory RAS and POST errors.

3.3 BIOS SMI Handler owned Sensors (GID = 0033h)

Lists details for sensors owned by the BIOS SMI Handler, covering PCI and memory errors.

3.4 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)

Details sensors managed by the Node Manager/Management Engine (ME) firmware.

3.5 Microsoft OS owned Events (GID = 0041)

Lists records generated by the Microsoft Operating System (OS), including boot and shutdown events.

3.6 Linux Kernel Panic Events (GID = 0021)

Details records generated by Linux kernel panics, providing information for troubleshooting.
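The Generator ID values listed in sections 3.1 through 3.6 can be collected into a simple lookup table for a first-pass triage of SEL records. This is a minimal sketch: the names `GENERATOR_OWNERS` and `sensor_owner` are hypothetical, and the values 0041 and 0021 are assumed to be hexadecimal like the others, even though the headings omit the "h" suffix.

```python
# Generator ID (GID) to owner, taken from the section 3 headings.
GENERATOR_OWNERS = {
    0x0020: "BMC",
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler",
    0x002C: "Node Manager / ME Firmware",
    0x602C: "Node Manager / ME Firmware",
    0x0041: "Microsoft OS",          # assumed hexadecimal
    0x0021: "Linux Kernel Panic",    # assumed hexadecimal
}


def sensor_owner(generator_id: int) -> str:
    """Return the owner for a Generator ID, or a marker if unknown."""
    return GENERATOR_OWNERS.get(
        generator_id, f"unknown generator (GID {generator_id:04X}h)"
    )


owner = sensor_owner(0x0033)
```

Events from sources not covered in section 3 fall through to the "unknown generator" string rather than raising an error, so a log containing external events can still be processed in one pass.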

4. Power Subsystems

4.1 Threshold-based Voltage Sensors

Describes BMC monitoring of main voltage sources, including typical characteristics and event triggers.

4.2 Voltage Regulator Watchdog Timer Sensor

Explains the BMC monitoring of board VR controller power sequence and VR Watchdog Timeout.

4.3 Power Unit

Details how the power unit monitors system state and logs changes in the SEL.

4.4 Power Supply

Covers BMC monitoring of power supply subsystem status, power input, current output, and temperature.

4.4.1 Power Supply Status Sensors

Reports the status of power supplies in the system, logging events for failures and configuration errors.

4.4.2 Power Supply Power In Sensors

Logs events when a power supply exceeds its AC power in threshold, indicating potential over-consumption.

4.4.3 Power Supply Current Out % Sensors

Monitors current output of the 12 V rail as a percentage of maximum output, logging events for over-consumption.

4.4.4 Power Supply Temperature Sensors

Describes BMC monitoring of power supply temperatures and event triggers for overheating conditions.

4.4.5 Power Supply Fan Tachometer Sensors

Details BMC polling of power supply fan status to check for failure conditions and log events.

5. Cooling Subsystem

5.1 Fan Sensors

Covers fan speed, presence, and redundancy sensors, detailing their typical characteristics and events.

5.1.1 Fan Tachometer Sensors

Monitors fan RPM signals, logging events when fans spin too slowly or fail.

5.1.2 Fan Presence and Redundancy Sensors

Details fan presence and redundancy sensors, used for hot-swap fans and warning of redundancy loss.

5.2 Temperature Sensors

Covers various temperature sensors: threshold-based, thermal margin, processor control, DTS, discrete, and DIMM trip.

5.2.1 Threshold-based Temperature Sensors

Describes linear sensors reporting actual temperature, used for front panel and baseboard monitoring.

5.2.2 Thermal Margin Sensors

Explains linear sensors reporting offset from critical temperature, used for fan control and DIMM grouping.

5.2.3 Processor Thermal Control Sensors

Monitors processor time constrained by thermal management, indicating potential overheating.

5.2.4 Processor DTS Thermal Margin Sensors

Details DTS-based thermal sensors for accurate thermal solution control, used as input for fan algorithms.

5.2.5 Discrete Thermal Sensors

Reports specific overheating events like VRD Hot or Processor Thermal Trip, indicating system shutdown triggers.

5.2.6 DIMM Thermal Trip Sensors

Monitors DIMM thermal trip events, causing automatic server power down and SEL logging.

5.3 System Air Flow Monitoring Sensor

Reports volumetric system airflow in CFM, calculated from system fan PWM values for data center thermal management.

6. Processor Subsystem

6.1 Processor Status Sensor

Monitors status information for each processor slot, logging events for asserted states until reset.

6.2 Catastrophic Error Sensor

Detects asserted CATERR# signal, indicating serious hardware issues, and logs events for BMC monitoring.

6.3 CPU Missing Sensor

Reports when a processor is not installed, often due to incorrect socket population.

6.4 Quick Path Interconnect Sensors

Monitors the QPI bus interconnect between processors for link width reduction and error conditions.

6.4.1 QPI Link Width Reduced Sensor

Logs events when BIOS POST reduces QPI Link Width due to initialization errors.

6.4.2 QPI Correctable Error Sensor

Logs informational events for corrected QPI errors, indicating acceptable occurrence at low rates.

6.4.3 QPI Fatal Error and Fatal Error #2

Detects and logs QPI fatal or non-recoverable errors, which are critical for system stability.

6.5 Processor ERR2 Timeout Sensor

Monitors CPU's ERR2 signal assertion duration, logging events for unrecoverable fatal errors.

6.6 Processor MSID Mismatch Sensor

Monitors MSID mismatch faults indicating power rating incompatibility between baseboard and processor.

7. Memory Subsystem

7.1 Memory RAS Configuration Status

Logs memory RAS configuration status after AC power-on or during POST for configuration errors.

7.2 Memory RAS Mode Select

Records changes in RAS Mode, logging previous and selected modes for Spare Channel mode status.

7.3 Mirroring Redundancy State

Logs events when Mirroring Mode loses redundancy due to Uncorrectable ECC Errors.

7.4 Sparing Redundancy State

Logs events when Sparing Mode loses redundancy due to Correctable ECC Errors crossing a threshold.

7.5 ECC and Address Parity

Covers memory data errors (correctable/uncorrectable) and address parity errors, which are fatal.

7.5.1 Memory Correctable and Uncorrectable ECC Error

Details ECC errors divided into correctable and uncorrectable types, identifying failing DIMM modules.

7.5.2 Memory Address Parity Error

Logs address parity errors affecting memory addressing, treated similarly to uncorrectable ECC errors.

8. PCI Express* and Legacy PCI Subsystem

8.1 PCI Express* Errors

Defines standard error types for PCI Express (AER) and Legacy PCI (PERR, SERR) logged to SEL.

8.1.1 Legacy PCI Errors

Covers PERR and SERR errors, which are fatal errors for Legacy PCI.

8.1.2 PCI Express* Fatal Errors and Fatal Error #2

Details PCI Express fatal errors reported to BIOS SMI handler, including error format and continuation.

8.1.3 PCI Express* Correctable Errors

Describes PCI Express correctable errors logged by BIOS SMI handler, considered informational.

9. System BIOS Events

9.1 System Events

Covers events occurring during POST or sleep state, including BIOS POST and SMI Handler events.

9.1.1 System Boot

Logs a System Boot Event at the end of POST, marking the transition to OS Loader.

9.1.2 Timestamp Clock Synchronization

Details events for synchronizing time between BIOS and BMC for accurate log timestamps.

9.2 System Firmware Progress (Formerly POST Error)

Logs POST errors to SEL, providing information on what caused the error, which may not be fatal.

10. Chassis Subsystem

10.1 Physical Security

Covers chassis intrusion and LAN leash lost sensors, monitoring physical security and network connection.

10.1.1 Chassis Intrusion

Monitors chassis intrusion on supported chassis, logging events when the chassis lid is opened or closed.

10.1.2 LAN Leash Lost

Logs events when the network port loses physical connection, indicating a LAN leash lost condition.

10.2 FP (NMI) Interrupt

Logs events when a diagnostic interrupt is generated, for example, by the front panel NMI button.

10.3 Button Sensor

Logs front panel power and reset button presses for informational purposes, not indicating errors.

11. Miscellaneous Events

11.1 IPMI Watchdog

Describes IPMI watchdog timer for checking OS responsiveness and BMC actions on timer expiry.

11.2 SMI Timeout

Explains SMI timeout interrupts that can freeze the system, triggering BMC reset after logging.

11.3 System Event Log Cleared

Logs a SEL clear event, indicating manual or factory clearing of the System Event Log.

11.4 System Event – PEF Action

Details Platform Event Filters (PEF) for sending alerts on logged events, requiring user configuration.

11.5 BMC Watchdog Sensor

Reports BMC reset events due to BMC Watchdog feature actions, logging FW stack or CPU resets.

11.6 BMC FW Health Sensor

Tracks BMC sensor health, reporting failures for consecutive sensor errors or HAL errors.

11.7 Firmware Update Status Sensor

Generates SEL events related to BMC, BIOS, and ME firmware updates, only for assertion events.

11.8 Add-In Module Presence Sensor

Indicates whether add-in modules/boards are installed in dedicated slots on server boards.

11.9 Intel Xeon Phi Coprocessor Management Sensors

Details limited manageability of Intel Xeon Phi Coprocessor adapters, covering thermal margin and status sensors.

12. Hot-Swap Controller Backplane Events

12.1 HSC Backplane Temperature Sensor

Measures ambient temperature on the Hot-Swap Backplane, logging events for threshold breaches.

12.2 Hard Disk Drive Monitoring Sensor

Monitors HDD status through disk status sensors owned by the BMC, supporting multiple storage backplanes.

12.3 Hot-Swap Controller Health Sensor

Indicates HSC health, reporting offline or degraded states due to communication issues or firmware.

13. Manageability Engine (ME) Events

13.1 ME Firmware Health Event

Reports ME firmware health, including upgrade and application errors, via Platform Event messages.

13.2 Node Manager Exception Event

Logs events when maintained policy power limit is exceeded over the Correction Time Limit.

13.3 Node Manager Health Event

Provides runtime error indications about Intel Intelligent Power Node Manager's health and services.

13.4 Node Manager Operational Capabilities Change

Indicates changes in Node Manager's operational capabilities, such as policy interface and monitoring.

13.5 Node Manager Alert Threshold Exceeded

Logs events when maintained policy power limit is exceeded over the Correction Time Limit.

14. Microsoft Windows* Records

14.1 Boot up Event Records

Logs boot-up and OEM events when the system boots into Microsoft Windows OS.

14.2 Shutdown Event Records

Records OS Stop/Shutdown events, followed by OEM records for shutdown reason and comment.

14.3 Bug Check / Blue Screen Event Records

Logs Bug Check/Blue Screen events, including OS Stop/Shutdown and OEM code records for failure analysis.

15. Linux* Kernel Panic Records
