Intel® Server System S7000FC4UR TPS BIOS Error Handling
Revision 1.0
The BIOS normally generates an NMI event in response to fatal and uncorrectable errors to
prevent continued system operation with corrupted data. Most operating systems halt the
system in response to an NMI. However, certain Linux releases do not halt the system in
response to an NMI event and therefore do not provide effective containment of data
corruption. BIOS Setup provides an option to either reset the system or assert NMI in
response to a PCI System error. For these Linux releases, this option should be set to reset
the system so that fatal and uncorrectable errors are contained.
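The choice between the two Setup values can be sketched as a simple policy function. This is an illustration only, not BIOS code: the names `ErrorAction` and `fatal_error_action`, and the boolean input describing the operating system, are assumptions made for the example and do not appear in this specification.

```python
from enum import Enum


class ErrorAction(Enum):
    """Hypothetical names for the two responses offered by BIOS Setup."""
    ASSERT_NMI = "assert NMI"
    RESET_SYSTEM = "reset system"


def fatal_error_action(os_halts_on_nmi: bool) -> ErrorAction:
    """Pick the response to a fatal or uncorrectable error.

    NMI only contains data corruption if the operating system halts when it
    receives one; otherwise a system reset is the safer setting.
    """
    if os_halts_on_nmi:
        return ErrorAction.ASSERT_NMI
    return ErrorAction.RESET_SYSTEM
```

For an operating system that halts on NMI, the function keeps the NMI setting; for a Linux release that does not halt, it returns the reset setting.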
19.2.2 Error Sources and Types
A major server management requirement is to handle system errors correctly and consistently.
System errors that can be enabled and disabled individually or as a group fall into the
following categories:
Processor errors
Memory errors
Legacy PCI and PCI-X* errors
PCI Express* errors
Sensor events / errors
19.2.2.1 Processor Errors
The BIOS enables the error correction and detection capabilities of the processors by setting
appropriate bits in the processor Model Specific Register (MSR) set and the appropriate bits
inside the chipset.
In the case of unrecoverable errors on the host processor bus, proper execution of the
asynchronous SMM error handler cannot be guaranteed and the handler cannot be relied upon
to log such conditions. The handler records the error to the system event log only if the system
has not experienced a catastrophic failure that compromises the integrity of the handler.
19.2.2.1.1 Internal Error (IERR) and Thermal Trip
The BIOS contains no runtime handlers for processor IERR or thermal trip events. The system
relies on the BMC to detect and log these errors at runtime. The BIOS subsequently determines
the processor status during POST using the BMC Get Processor State command.
If the BMC reports that an IERR or thermal trip event occurred during the previous boot, the
BIOS displays an error message in the POST Error Manager and continues normal operation.
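The POST-time check described above can be modeled roughly as follows. This is a minimal sketch under stated assumptions: `ProcessorState`, `check_processors_at_post`, the callable standing in for the BMC Get Processor State command, and the message strings are all hypothetical and are not taken from the BIOS or BMC interface definitions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ProcessorState:
    """Hypothetical decoded result of the BMC Get Processor State command."""
    ierr: bool
    thermal_trip: bool


def check_processors_at_post(
    get_processor_state: Callable[[int], ProcessorState],
    sockets: Iterable[int],
) -> List[str]:
    """Return the messages to show in the POST Error Manager.

    POST is not halted here: the caller displays the messages and continues
    booting, matching the "continues normal operation" behavior.
    """
    messages: List[str] = []
    for socket in sockets:
        # Stand-in for issuing the BMC command for one processor socket.
        state = get_processor_state(socket)
        if state.ierr:
            messages.append(f"Processor {socket}: IERR on previous boot")
        if state.thermal_trip:
            messages.append(f"Processor {socket}: thermal trip on previous boot")
    return messages
```

The function only collects messages and never raises or stops, which reflects that the BIOS has no runtime handler for these events and treats them as informational at the next POST.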
To clear a persistent status sensor (such as the Thermal Trip sensor), the user must select
"Processor Retest" on the Advanced | Processor page of the BIOS Setup utility. The BIOS then
instructs the BMC to re-arm its sensors.
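The effect of "Processor Retest" on persistent sensors can be pictured with a small model. It is a sketch only: `PersistentSensor`, `processor_retest`, and the sensor names used in the usage lines are assumptions for illustration, and the real BMC re-arm mechanism is not described in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PersistentSensor:
    """Hypothetical BMC sensor whose asserted state survives a reboot."""
    name: str
    asserted: bool = False

    def rearm(self) -> None:
        # Re-arming clears the latched state so new events can be detected.
        self.asserted = False


def processor_retest(selected: bool, sensors: List[PersistentSensor]) -> List[str]:
    """Re-arm all sensors if the user selected Processor Retest.

    Returns the names of the sensors that were asserted and are now cleared.
    """
    if not selected:
        return []
    cleared = [s.name for s in sensors if s.asserted]
    for s in sensors:
        s.rearm()
    return cleared
```

Without the Setup selection the latched state is left untouched, which is why a thermal trip message keeps appearing on later boots until the user requests the retest.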