Recovery
Issue 1 May 2002
3-7555-233-143
Watchdog’s Hardware Timer
The Watchdog’s HiMonitor resets the timer on the hardware Watchdog circuitry
via the Hardware-Sanity device driver. If the Watchdog is unable to perform that
task, then the timer’s value eventually decrements to 0, and the processor is
reset.
Hardware-Sanity Device Driver
The Hardware-Sanity device driver (a loadable module) is a modified Linux driver
for the hardware Watchdog. A Sanity thread periodically writes to the
Hardware-Sanity driver, which resets the timer on the hardware Watchdog. If the
Sanity thread does not write to the Hardware-Sanity driver, the:
■ Driver does not reset the timer on the hardware Watchdog
■ Timer expires
■ Hardware Watchdog reboots Linux
The driver has three capabilities: setting the time-out interval to a
configurable value, resetting the timer to the time-out interval, and rebooting
Linux.
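The interaction between the Sanity thread and the timer can be illustrated with a small simulation. This is only a sketch and not the driver's actual code: the class name, the method names, and the notion of a "tick" are hypothetical, chosen to mirror the three capabilities listed above.

```python
class SimulatedHardwareWatchdog:
    """Toy model (hypothetical) of a countdown timer that reboots at 0."""

    def __init__(self, timeout_ticks):
        self.rebooted = False
        self.set_timeout(timeout_ticks)

    def set_timeout(self, timeout_ticks):
        # Capability 1: set the time-out interval to a configurable value.
        if timeout_ticks <= 0:
            raise ValueError("timeout must be positive")
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def reset_timer(self):
        # Capability 2: reset the timer to the time-out interval.
        self.remaining = self.timeout_ticks

    def tick(self):
        # One unit of time passes; nothing more happens after a reboot.
        if self.rebooted:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            # Capability 3: the timer expired, so Linux is rebooted.
            self.rebooted = True


def run(watchdog, ticks, sanity_alive_until):
    """Advance time; the Sanity thread writes once per tick while alive."""
    for t in range(ticks):
        if t < sanity_alive_until:
            watchdog.reset_timer()
        watchdog.tick()
    return watchdog.rebooted
```

In this model, a Sanity thread that keeps writing never lets the timer expire, while one that stops writing leads to a reboot once the time-out interval has elapsed.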
Rolling Reboots
A reboot does not always correct the problem that triggered it. When it does
not, the server reboots continually, and the repeated rebooting makes the
problem harder to diagnose. The Watchdog handles this with the “MaxReboots” and
“MaxRebootInterval” parameters in the watchd.conf file. (The default values are
currently 3 reboots within 60 minutes.) If Watchdog detects that the software
is rebooting too quickly, it logs a message to syslog and does not start any
processes. When running in this mode, Watchdog’s sole purpose is to reset the
timer on the hardware Watchdog.
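The rolling-reboot check can be sketched as a count of recent reboots within a time window. This is an illustration under stated assumptions, not Watchdog's implementation: the function names, the mode labels, and the choice to treat "MaxReboots reboots within MaxRebootInterval minutes" as an inclusive threshold are all hypothetical, since the exact comparison is not specified here.

```python
def is_rolling_reboot(reboot_times_min, now_min,
                      max_reboots=3, max_reboot_interval_min=60):
    """Return True if at least max_reboots reboots fall in the recent window.

    reboot_times_min: reboot timestamps, in minutes (hypothetical input).
    Defaults mirror the documented 3 reboots within 60 minutes.
    """
    window_start = now_min - max_reboot_interval_min
    recent = [t for t in reboot_times_min if window_start <= t <= now_min]
    return len(recent) >= max_reboots


def choose_mode(reboot_times_min, now_min):
    """Pick a startup mode (labels are hypothetical)."""
    if is_rolling_reboot(reboot_times_min, now_min):
        # Log and start no processes; only keep the hardware timer reset.
        return "sanity-only"
    return "normal"
```

For example, reboots at minutes 0, 10, and 20 would select the restricted mode at minute 20, whereas reboots spaced 100 minutes apart would not.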
Restarts
The term “restart” is a traditional Avaya term for a system restart of lesser
severity than a full recreation. Restarts were accomplished by retaining the
memory state of certain processes.
The Watchdog process is not restartable, nor can it invoke restarts in
MultiVantage. In addition, none of the other Watchdog-started applications can
restart. (They are reloaded, as previously described.) If the Watchdog itself
dies, the parent Watchdog process restarts it. If it repeatedly dies (10 times
in 2 minutes), init logs a message to syslog for the GMM to process. GMM lowers
the SOH, which causes a server interchange. Eventually, the hardware Watchdog
resets the processor because Watchdog is no longer resetting the timer on the
hardware Watchdog.