4405ch04 Continuous availability and manageability.fm Draft Document for Review September 2, 2008 5:05 pm
102 IBM Power 570 Technical Overview and Introduction
In cases where the data cannot be recovered from another source, a technique called Special
Uncorrectable Error (SUE) handling is used to determine whether the corruption is truly a
threat to the system. If the data is never actually used but is simply overwritten, as is
sometimes the case, the error condition can safely be voided and the system continues
to operate normally.
When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the “standard” ECC is no longer valid. The
Service Processor is then notified and takes appropriate action. If the system is running
AIX V5.2 or later, or Linux¹, and a process attempts to use the data, the OS is informed of
the error and terminates only the specific user program.
Only when the corrupt data is used by the POWER Hypervisor must the entire
system be rebooted, which preserves overall system integrity.
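The deferred handling described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `Memory`, `MemoryWord`, `mark_uncorrectable`, and `run` are hypothetical, the `poisoned` flag stands in for the modified ECC word, and the list `service_log` stands in for the Service Processor notification. It does not represent the actual hardware or firmware interfaces.

```python
from dataclasses import dataclass


class ConsumedCorruptData(Exception):
    """Raised when a consumer actually reads a word marked as corrupt."""


@dataclass
class MemoryWord:
    value: int
    poisoned: bool = False  # hypothetical stand-in for the modified ECC word


class Memory:
    def __init__(self, size):
        self.words = [MemoryWord(0) for _ in range(size)]
        self.service_log = []  # hypothetical stand-in for Service Processor notification

    def mark_uncorrectable(self, addr):
        # On detection, tag the word instead of stopping the system immediately.
        self.words[addr].poisoned = True
        self.service_log.append(addr)

    def write(self, addr, value):
        # Overwriting replaces the bad data, so the error condition is voided.
        self.words[addr] = MemoryWord(value)

    def read(self, addr):
        word = self.words[addr]
        if word.poisoned:
            raise ConsumedCorruptData(addr)
        return word.value


def run(memory, addr, consumer):
    """Return the action taken when consumer ('user' or 'hypervisor') reads addr."""
    try:
        memory.read(addr)
        return "continue"
    except ConsumedCorruptData:
        # Only consumption by the hypervisor forces a full system reboot.
        return "reboot system" if consumer == "hypervisor" else "terminate process"
```

In this model, a word that is marked and then overwritten behaves like any healthy word, while a marked word that is read leads to an action whose scope depends on who consumed it.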
Depending upon system configuration and source of the data, errors encountered during I/O
operations may not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing it from being written to the I/O device. The PHB then enters a freeze mode, halting
normal operations. Depending on the model and type of I/O being used, the freeze may
include the entire PHB chip or only a single bridge. All I/O
operations that use the frozen hardware are lost until a power-on reset of the PHB. The impact on
partitions depends on how the I/O is configured for redundancy. In a server configured for
fail-over availability, redundant adapters spanning multiple PHB chips could enable the
system to recover transparently, without partition loss.
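The freeze behavior and the benefit of redundant paths can be sketched as a toy model. The class `PHB`, the function `send`, and the `corrupt_on` parameter are hypothetical names introduced for illustration. The model assumes a simple ordered failover policy, which the text does not specify.

```python
class PHB:
    """Toy model of a processor host bridge that freezes on bad data."""

    def __init__(self, name):
        self.name = name
        self.frozen = False
        self.delivered = []  # data that actually reached the I/O device

    def transfer(self, data, corrupt=False):
        if self.frozen:
            return False  # frozen hardware performs no I/O
        if corrupt:
            # Reject the data so it is never written, then halt operations.
            self.frozen = True
            return False
        self.delivered.append(data)
        return True

    def power_on_reset(self):
        # In this model, only a power-on reset clears the freeze.
        self.frozen = False


def send(data, paths, corrupt_on=None):
    """Try each redundant path in order; return the name of the PHB that delivered.

    corrupt_on names the PHB on which the data arrives corrupted (for simulation).
    """
    for phb in paths:
        if phb.transfer(data, corrupt=(phb.name == corrupt_on)):
            return phb.name
    return None  # no working path: the partition loses this I/O
```

With two bridges in `paths`, a freeze on the first leaves the second able to carry traffic. With a single bridge, `send` returns `None` until `power_on_reset` is called.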
4.2.3 Cache protection mechanisms
POWER6 processor-based systems are designed with cache protection mechanisms,
including cache line delete in both L2 and L3 arrays, Processor Instruction Retry and
Alternate Processor Recovery protection on L1-I and L1-D, and redundant “Repair” bits in
L1-I, L1-D, and L2 caches, as well as L2 and L3 directories.
L1 instruction and data array protection
The POWER6 processor’s instruction and data caches are protected against temporary
errors using the POWER6 Processor Instruction Retry feature and against solid failures by
Alternate Processor Recovery, both mentioned earlier. In addition, faults in the SLB array are
recoverable by the POWER Hypervisor.
L2 array protection
On a POWER6 processor-based system, the L2 cache is protected by ECC, which provides
single-bit error correction and double-bit error detection. Single-bit errors are corrected before
forwarding to the processor and are subsequently written back to L2. As with the other data caches
and main memory, uncorrectable errors are handled at run time by the Special
Uncorrectable Error handling mechanism. Correctable cache errors are logged, and if the
error count reaches a threshold, a Dynamic Processor Deallocation event is initiated.
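To make "single-bit error correction and double-bit error detection" concrete, the following sketch uses a textbook extended Hamming (8,4) code. This is an assumption chosen for brevity: the text does not state which ECC code the POWER6 L2 cache uses, and real cache ECC words are much wider. The function names `encode` and `decode` are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2  # 0 when the total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: a single bit flipped; the syndrome gives its position
        # (syndrome 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, detect only.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

A result of `"uncorrectable"` corresponds to the case that the text hands to Special Uncorrectable Error handling, while `"corrected"` results are the events that would be logged and counted against a threshold.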
Starting with POWER6 processor-based systems, the L2 cache is further protected by
incorporating a dynamic cache line delete algorithm similar to the feature used in the L3
cache. Up to six L2 cache lines may be automatically deleted. It is not likely that deletion of a
few cache lines will adversely affect server performance. When six cache lines have been
repaired, the L2 is marked for persistent deconfiguration on subsequent system reboots until
it can be replaced.
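The bookkeeping behind the line delete limit can be sketched as follows. Only the limit of six lines comes from the text. The class `L2Cache` and its methods are hypothetical, and the actual algorithm that decides when a line is deleted is not described here.

```python
class L2Cache:
    MAX_DELETED_LINES = 6  # limit stated in the text

    def __init__(self):
        self.deleted = set()
        self.deconfigure_on_reboot = False

    def delete_line(self, line):
        """Take a faulty line out of service; return False if the budget is spent."""
        if line in self.deleted:
            return True
        if len(self.deleted) >= self.MAX_DELETED_LINES:
            return False
        self.deleted.add(line)
        if len(self.deleted) == self.MAX_DELETED_LINES:
            # Budget used up: mark the L2 for deconfiguration on later reboots.
            self.deconfigure_on_reboot = True
        return True

    def usable(self, line):
        return line not in self.deleted
```

In this sketch the cache keeps operating with its remaining lines after the sixth deletion; the flag only records that the L2 should be deconfigured at subsequent reboots until it is replaced.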
¹ SLES 10 SP1 or later, and RHEL 4.5 or later (including RHEL 5.1).