Testing and Troubleshooting
Early Warning/Fault Tolerance
The computer provides early warning of several probable failures. These warnings enable the user
to schedule maintenance at his convenience, reducing downtime due to unexpected failure. Early
warning of machine failure is provided for overtemperature conditions and memory bit errors.
A battery assembly on the keyboard processor board protects the contents of the real-time clock
and non-volatile memory (RTC/NVM). The battery assembly is fault tolerant in that four batteries
comprise the assembly but the circuit requires only three batteries to maintain RTC/NVM data.
Overtemperature
The computer contains three dc box fans. One fan is in the I/O card cage and operates at a single
speed whenever power is applied. The other two fans have three speeds and are associated with
the power supply and the processor stack.
The power supply and the processor stack CPU contain temperature sensors. The power supply
temperature sensor controls the two three-speed fans. When the temperature in the power supply
rises above 39°C, the power supply steps both fans from low to medium speed. When the
temperature rises above 51°C, the fans are stepped from medium to high speed. When high speed
is required for proper cooling, a message is issued to the user, providing notice that shutdown
is imminent if the temperature increase continues.
If the temperature at the power supply sensor exceeds 97°C or the processor stack CPU sensor
senses a temperature greater than 100°C, the power supply shuts down and one of the
overtemperature LEDs on the power supply lights (STACK TEMP or SEC BOARD). These LEDs are
visible when the front cover is removed.
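The thresholds described above can be summarized in a short sketch. This is only an illustration of the stated limits, not the power supply's actual control logic: the names `FanSpeed`, `fan_speed`, and `evaluate` are hypothetical, and the sketch assumes no hysteresis when the temperature falls back, since the text does not describe the step-down behavior.

```python
from enum import Enum


class FanSpeed(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Limits taken from the text, in degrees Celsius.
MEDIUM_ABOVE_C = 39        # power supply sensor: low -> medium
HIGH_ABOVE_C = 51          # power supply sensor: medium -> high
PSU_SHUTDOWN_ABOVE_C = 97  # power supply sensor: shutdown
CPU_SHUTDOWN_ABOVE_C = 100 # processor stack CPU sensor: shutdown


def fan_speed(psu_temp_c):
    """Speed of the two three-speed fans, set by the power supply sensor only."""
    if psu_temp_c > HIGH_ABOVE_C:
        return FanSpeed.HIGH
    if psu_temp_c > MEDIUM_ABOVE_C:
        return FanSpeed.MEDIUM
    return FanSpeed.LOW


def evaluate(psu_temp_c, cpu_temp_c):
    """Return (fan speed, warn user, shut down) for one pair of readings."""
    shutdown = (psu_temp_c > PSU_SHUTDOWN_ABOVE_C
                or cpu_temp_c > CPU_SHUTDOWN_ABOVE_C)
    speed = fan_speed(psu_temp_c)
    # The user is warned once high speed is needed and shutdown has not yet occurred.
    warn = speed is FanSpeed.HIGH and not shutdown
    return speed, warn, shutdown
```

For example, `evaluate(60, 50)` reports high fan speed with a warning and no shutdown, while `evaluate(30, 101)` reports a shutdown caused by the CPU sensor even though the fans are still at low speed.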
Memory Errors
The processor stack memory controller chip detects all single and double-bit RAM failures and
corrects single-bit failures. These detection and correction procedures are done at run time.
When a double-bit or greater failure is detected, the CPU is notified and the entire system
halts. A message is issued indicating which memory finstrate has failed.
When a single-bit failure is detected, the failure can be corrected and healed by pointing future
accesses of that location to a location in the healer RAM of the memory controller chip. Each
memory controller has 32 locations reserved for healing of RAM.
When all 32 locations have been used, the CPU is notified that the healer is full. The operating
system then tests each of the healed locations to determine if that location is still faulty or
if a soft error caused the failure. Overflowed healer CAMs can be cleared and reused by the
operating system.
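The healing scheme can be modeled roughly as a small remap table. The sketch below is a software analogy under stated assumptions, not the memory controller's implementation: the class `Healer`, the exception `HealerFull`, and the callback `still_faulty` are hypothetical names, and a Python dictionary stands in for the healer CAM.

```python
HEALER_SLOTS = 32  # locations reserved per memory controller, per the text


class HealerFull(Exception):
    """Stands in for the notification sent to the CPU when the healer is full."""


class Healer:
    def __init__(self, slots=HEALER_SLOTS):
        self.slots = slots
        self.remap = {}  # faulty RAM address -> healer RAM slot

    def heal(self, address):
        """Point future accesses of a failed location at a healer RAM slot."""
        if address in self.remap:
            return self.remap[address]
        if len(self.remap) >= self.slots:
            raise HealerFull("all healer locations are in use")
        used = set(self.remap.values())
        slot = next(i for i in range(self.slots) if i not in used)
        self.remap[address] = slot
        return slot

    def release_soft_errors(self, still_faulty):
        """Retest each healed location and free the slots of those that pass.

        still_faulty is a caller-supplied test that returns True when the
        location fails again (a hard fault) and False for a soft error.
        """
        released = [a for a in self.remap if not still_faulty(a)]
        for address in released:
            del self.remap[address]
        return released
```

In this model, the operating system would catch `HealerFull`, call `release_soft_errors` with its memory test, and then retry `heal` for the new failure. Slots that were occupied only because of soft errors become available again.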