4405ch04 Continuous availability and manageability.fm Draft Document for Review September 2, 2008 5:05 pm

102 IBM Power 570 Technical Overview and Introduction

In cases where the data cannot be recovered from another source, a technique called Special

Uncorrectable Error (SUE) handling is used to determine whether the corruption is truly a

threat to the system. If, as is sometimes the case, the data is never actually used but is

simply overwritten, then the error condition can safely be voided and the system continues

to operate normally.

When an uncorrectable error is detected, the system modifies the associated ECC word,

thereby signaling to the rest of the system that the “standard” ECC is no longer valid. The

Service Processor is then notified and takes appropriate actions. When the system is running

AIX V5.2 or later, or Linux, and a process attempts to use the data, the OS is informed of the

error and terminates only the specific user program.

Only when the corrupt data is used by the POWER Hypervisor must the entire system be

rebooted, thereby preserving overall system integrity.
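
The three outcomes described above can be summarized as a small decision function. This is only an illustrative sketch: the `Consumer` enum and `sue_outcome` function are hypothetical names introduced here, not part of any IBM firmware or operating system interface, and the real handling is performed by hardware, the Service Processor, and the OS.

```python
from enum import Enum


class Consumer(Enum):
    """Hypothetical labels for what eventually happens to SUE-marked data."""
    OVERWRITTEN = "never used, simply overwritten"
    USER_PROCESS = "used by a user process"
    HYPERVISOR = "used by the POWER Hypervisor"


def sue_outcome(consumer: Consumer) -> str:
    """Return the outcome the text describes for each kind of consumer."""
    if consumer is Consumer.OVERWRITTEN:
        # The corrupt data was never consumed, so the error can be voided.
        return "error voided; system continues normally"
    if consumer is Consumer.USER_PROCESS:
        # AIX V5.2 or later, or Linux: only the affected program is terminated.
        return "OS terminates only the affected user program"
    # Corrupt data consumed by the hypervisor: the whole system is rebooted.
    return "entire system is rebooted"


for consumer in Consumer:
    print(f"{consumer.value}: {sue_outcome(consumer)}")
```

The point of the sketch is that the scope of the impact grows only as far as the scope of whatever actually consumes the marked data.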

Depending upon system configuration and source of the data, errors encountered during I/O

operations may not result in a machine check. Instead, the incorrect data is handled by the

processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,

preventing it from being written to the I/O device. The PHB then enters a freeze mode, halting

normal operations. Depending on the model and type of I/O being used, the freeze may

include the entire PHB chip, or simply a single bridge. This results in the loss of all I/O

operations that use the frozen hardware until a power-on reset of the PHB. The impact to

partition(s) depends on how the I/O is configured for redundancy. In a server configured for

fail-over availability, redundant adapters spanning multiple PHB chips could enable the

system to recover transparently, without partition loss.
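
To illustrate why redundancy across PHB chips matters, the following sketch checks which partitions would lose all I/O if a given PHB froze. It assumes a deliberately simplified model in which each partition is described only by the set of PHBs its adapters sit behind. The function name, the partition names, and the PHB names are hypothetical, and the freeze is treated as affecting the entire PHB chip.

```python
from typing import Dict, List, Set


def partitions_losing_io(adapters_by_partition: Dict[str, Set[str]],
                         frozen_phb: str) -> List[str]:
    """List partitions whose adapters all sit behind the frozen PHB."""
    lost = []
    for partition, phbs in adapters_by_partition.items():
        surviving = phbs - {frozen_phb}
        # A partition loses I/O only if it used the frozen PHB
        # and has no adapter behind any other PHB.
        if frozen_phb in phbs and not surviving:
            lost.append(partition)
    return sorted(lost)


# Hypothetical configuration: lpar1 has redundant adapters on two PHBs.
config = {
    "lpar1": {"phb0", "phb1"},
    "lpar2": {"phb0"},
    "lpar3": {"phb1"},
}
print(partitions_losing_io(config, "phb0"))
```

In this model only the partition whose adapters span both PHBs keeps an I/O path when `phb0` freezes. Whether fail-over is actually transparent depends on the adapter and OS configuration, which the sketch does not model.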

4.2.3 Cache protection mechanisms

POWER6 processor-based systems are designed with cache protection mechanisms,

including cache line delete in both L2 and L3 arrays, Processor Instruction Retry and

Alternate Processor Recovery protection on L1-I and L1-D, and redundant “Repair” bits in

L1-I, L1-D, and L2 caches, as well as L2 and L3 directories.

L1 instruction and data array protection

The POWER6 processor’s instruction and data caches are protected against temporary

errors using the POWER6 Processor Instruction Retry feature and against solid failures by

Alternate Processor Recovery, both mentioned earlier. In addition, faults in the SLB array are

recoverable by the POWER Hypervisor.

L2 array protection

On a POWER6 processor-based system, the L2 cache is protected by ECC, which provides

single-bit error correction and double-bit error detection. Single-bit errors are corrected before

forwarding to the processor, and subsequently written back to L2. As with the other data caches

and main memory, uncorrectable errors are handled at run time by the Special

Uncorrectable Error handling mechanism. Correctable cache errors are logged, and if the

error count reaches a threshold, a Dynamic Processor Deallocation event is initiated.
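
The idea of single-bit error correction with double-bit error detection can be shown with a toy extended Hamming code over 4 data bits. This is only a sketch of the general technique: the word width and code layout of the actual POWER6 L2 ECC are not given in this text, and the `encode` and `decode` helpers are hypothetical names.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the overall parity even
    return code


def decode(code):
    """Return (status, data): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Single-bit error; the syndrome is its position (0 = parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [code[3], code[5], code[6], code[7]]


word = encode([1, 0, 1, 1])
word[5] ^= 1                 # inject a single-bit error
print(decode(word))
word[2] ^= 1                 # inject a second error in the same word
print(decode(word))
```

In hardware terms, the "corrected" case corresponds to data being fixed before it is forwarded to the processor, while the "uncorrectable" case is the one handed to Special Uncorrectable Error handling.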

Starting with POWER6 processor-based systems, the L2 cache is further protected by

incorporating a dynamic cache line delete algorithm similar to the feature used in the L3

cache. Up to six L2 cache lines may be automatically deleted. It is not likely that deletion of a

few cache lines will adversely affect server performance. When six cache lines have been

repaired, the L2 is marked for persistent deconfiguration on subsequent system reboots until

it can be replaced.
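
The six-line limit can be pictured as a small bookkeeping structure. The sketch below is a simplified model only: the `L2LineDeleteTracker` class is a hypothetical name, and the condition that triggers a line delete is not specified in this text, so the caller simply reports which line to delete.

```python
class L2LineDeleteTracker:
    """Toy model of the L2 cache line delete budget described in the text."""

    MAX_DELETED_LINES = 6  # "Up to six L2 cache lines may be automatically deleted"

    def __init__(self):
        self.deleted_lines = set()
        self.deconfigure_on_reboot = False

    def delete_line(self, line_index: int) -> bool:
        """Try to delete a line; return False if the budget is already used up."""
        if line_index in self.deleted_lines:
            return True  # already deleted, nothing more to do
        if len(self.deleted_lines) >= self.MAX_DELETED_LINES:
            return False
        self.deleted_lines.add(line_index)
        if len(self.deleted_lines) == self.MAX_DELETED_LINES:
            # Budget exhausted: mark the L2 for persistent deconfiguration
            # on subsequent reboots until it can be replaced.
            self.deconfigure_on_reboot = True
        return True


tracker = L2LineDeleteTracker()
for line in (10, 42, 77, 130, 256, 511):
    tracker.delete_line(line)
print(tracker.deconfigure_on_reboot)
```

Until the sixth delete, the model keeps the cache in service with slightly reduced capacity, which matches the statement that losing a few lines is unlikely to affect performance.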

Note on Linux support: SLES 10 SP1 or later, and RHEL 4.5 or later (including RHEL 5.1).
