Chapter 4. Continuous availability and manageability 105
CRC
The bus that transfers data between the processor and the memory uses CRC error
detection with a retry mechanism for failed operations and the ability to retune bus
parameters dynamically when a fault occurs. In addition, the memory bus has spare
capacity, so a spare data bit-line can be substituted for one that is determined to be faulty.
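The detect-and-retry behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which CRC the POWER7 memory bus uses, so CRC-32 from the standard zlib module stands in for it, and the names transfer_with_retry and make_flaky_channel are hypothetical helpers, not part of any IBM interface.

```python
import zlib

CRC_BYTES = 4  # assumption: CRC-32 stands in for the (unspecified) bus CRC


def add_crc(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 (big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, trailer = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    return zlib.crc32(payload).to_bytes(CRC_BYTES, "big") == trailer


def transfer_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send the payload over `channel`, retrying while the CRC check fails.

    Returns (payload, attempts_used); raises OSError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        frame = channel(add_crc(payload))
        if crc_ok(frame):
            return frame[:-CRC_BYTES], attempt
    raise OSError("transfer failed after %d attempts" % max_attempts)


def make_flaky_channel(faults: int):
    """Hypothetical channel that flips one bit on its first `faults` transfers."""
    remaining = [faults]

    def channel(frame: bytes) -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            corrupted = bytearray(frame)
            corrupted[0] ^= 0x01  # single-bit error on the wire
            return bytes(corrupted)
        return frame

    return channel


if __name__ == "__main__":
    data, attempts = transfer_with_retry(b"cache line", make_flaky_channel(1))
    print(data, attempts)
```

In this sketch a transient fault costs one extra transfer, while a fault that persists across every attempt surfaces as an error. In the real hardware, a persistent fault is the point at which retuning of bus parameters or bit-line sparing would come into play.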
Chipkill
Chipkill is an enhancement that enables a system to sustain the failure of an entire DRAM
chip. Chipkill spreads the bit lines from a DRAM over multiple ECC words, so that a
catastrophic DRAM failure affects at most one bit in each word, which ECC can correct.
The system can continue indefinitely in this state with no performance degradation until
the failed DIMM can be replaced, assuming no additional single-bit errors occur.
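The interleaving idea can be illustrated with a toy model. The sketch below assumes seven DRAM chips with four bit lines each and protects each of four ECC words with a Hamming(7,4) single-error-correcting code. These sizes and the choice of code are assumptions made for brevity (production Chipkill uses wider words and stronger codes), and all function names are hypothetical.

```python
CHIPS = 7   # assumption: one chip per codeword bit position
WORDS = 4   # assumption: each chip has four bit lines, one per ECC word


def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits: list) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions locates the error
    if syndrome:
        code[syndrome] ^= 1
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))


def store(nibbles: list) -> list:
    """Spread the codewords so chip c holds bit c of every word."""
    words = [hamming74_encode(n) for n in nibbles]
    return [[words[w][c] for w in range(WORDS)] for c in range(CHIPS)]


def kill_chip(chips: list, failed: int) -> None:
    """Model a catastrophic failure: every bit line of one chip reads 0."""
    chips[failed] = [0] * WORDS


def load(chips: list) -> list:
    """Reassemble each word from one bit per chip and run ECC on it."""
    return [hamming74_decode([chips[c][w] for c in range(CHIPS)])
            for w in range(WORDS)]


if __name__ == "__main__":
    data = [0x3, 0xA, 0xF, 0x0]
    chips = store(data)
    kill_chip(chips, failed=2)
    print(load(chips) == data)
```

Because each word takes only one bit from any given chip, losing a whole chip looks like a single-bit error to every word, and the per-word ECC corrects it. A second error in the same word would exceed what this toy code can correct, which mirrors the "no additional single-bit errors" condition in the text.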
POWER7 memory subsystem
The POWER7 chip contains two memory controllers with four channels per memory
controller. The implementation on the PS701 and PS702 uses a single memory controller per
processor chip and four advanced memory buffer chips. Each memory buffer chip connects to
four memory DIMMs, for a total of 16 DIMMs per processor chip. The PS700 is similar, but it
uses only two memory buffer chips, which connect to a total of eight DIMMs.
The bus that transfers data between the processor and the memory uses CRC error detection
with a retry mechanism for failed operations and the ability to retune bus parameters
dynamically when a fault occurs. In addition, the memory bus has spare capacity, so a spare
data bit-line can be substituted for one that is determined to be faulty.
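Spare bit-line substitution can be pictured as a remapping of logical data bits onto the physical lanes that are still healthy. The following sketch assumes eight data lanes plus one spare and models a fault as a lane stuck at zero. The lane counts, the fault model, and the SparedBus class are hypothetical and are not taken from the POWER7 design.

```python
class SparedBus:
    """Toy model of a data bus with spare bit-lines (all sizes are assumptions)."""

    def __init__(self, data_lanes: int = 8, spare_lanes: int = 1):
        self.data_lanes = data_lanes
        self.spare_lanes = spare_lanes
        self.total_lanes = data_lanes + spare_lanes
        self.stuck_at_zero = set()  # physical faults present in the hardware
        self.retired = set()        # lanes the controller has steered around

    def retire(self, lane: int) -> None:
        """Stop using a lane that was determined to be faulty."""
        if lane not in self.retired and len(self.retired) >= self.spare_lanes:
            raise RuntimeError("no spare lane left")
        self.retired.add(lane)

    def active_lanes(self) -> list:
        """Physical lanes that currently carry the logical data bits, in order."""
        good = [l for l in range(self.total_lanes) if l not in self.retired]
        return good[:self.data_lanes]

    def transfer(self, bits: list) -> list:
        """Drive the bits onto the active lanes and return what the receiver sees."""
        if len(bits) != self.data_lanes:
            raise ValueError("expected %d bits" % self.data_lanes)
        lanes = self.active_lanes()
        wires = [0] * self.total_lanes
        for bit, lane in zip(bits, lanes):
            wires[lane] = 0 if lane in self.stuck_at_zero else bit
        return [wires[lane] for lane in lanes]


if __name__ == "__main__":
    bus = SparedBus()
    word = [1] * 8
    bus.stuck_at_zero.add(2)            # lane 2 develops a hard fault
    print(bus.transfer(word) == word)   # corrupted while lane 2 is in use
    bus.retire(2)                       # substitute the spare for lane 2
    print(bus.transfer(word) == word)   # intact once the spare takes over
```

In this model the spare absorbs exactly one hard lane fault. A further retirement request fails, which corresponds to the point at which the hardware would have no remaining spare capacity.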