Watcher - Abstract

jtruitt@dw3f.ess.harris.com
Fri, 05 Aug 94 11:49:39 -0400

                  Keeping watch over the flocks
                       at night (and day)


                         Kenneth Ingham
            University of New Mexico Computing Center
                    Distributed Systems Group
                         2701 Campus NE
                      Albuquerque, NM 87131
                         (505) 277-8044
                      ingham@charon.unm.edu
                   ucbvax!unmvax!charon!ingham

     Topic Areas: Applications, System management, Utilities



The computing facilities offered by the University of New  Mexico
Computing  Center include three microvaxen, five large vaxen (780
or bigger), and a Sequent B8000.  In addition to  these  Unix/VMS
machines,  the  UNMCC  Distributed Systems Group (DSG) monitors a
number of the various microvaxen and Sun  workstations  scattered
across  campus.  This duty falls to the DSG Programmer designated
as "DOC", or "DSG On Call", who receives his beeper  based  on  a
monthly rotation schedule.

In the past, shell scripts running every six hours reported vari-
ous  system  statistics  to  DOC, who then scanned the output for
signs of possible trouble.  As the number  of  machines  and  the
number  of  potential problems grew, the mound of output that DOC
had to process, most of  which  merely  indicated  normal  system
operation,  became  overwhelming.   Now, with several machines to
monitor and only one person acting  in  this  capacity,  DOC  can
often  waste  a  tremendous  amount of time wading through system
status reports, time which can be better  spent  actually  fixing
system problems.

In response to this situation, the author developed a tool  which
introduces  some  intelligence into the machine's self-reporting,
letting the machine filter out messages indicating normal  opera-
tion  and  forwarding  to DOC only those messages which point out
trouble areas.  The result of these efforts is  Watcher,  a  very
general  and  extensible system self-monitor.  Running more often
than the set of shell scripts, Watcher keeps closer tabs  on  the
system;  since  it delivers only a summary of potential problems,
however, this extra monitoring produces no corresponding  increase
in  the  demand on the system manager.  No problems slip by unno-
ticed in the more concise output, leading to  an  improvement  in
overall  system availability as well as the more effective utili-
zation of the system manager's time.

Watcher was designed to be almost as flexible as DOC in  deciding
what constitutes a problem with the system.  Running at intervals
specified in crontab, Watcher issues a number  of  user-specified
commands  (each  of which delivers its output in a different for-
mat), parsing all or part of the output from either the  left  or
the  right.   It  compares this to the last such output obtained,
checking for indications of a  system  abnormality.   Such  signs
might  take  the  form  of a too abrupt change in a certain value
(e.g. a process which suddenly begins gobbling  vast  amounts  of
cpu time), a value which exceeds the allowable maximum or minimum
(such as an overly full file system), or an  unacceptable  change
in  a  string value (e.g. when "up" changes to "down").  For com-
mands such as "ps" whose output  varies  considerably  with  each
run,  specific  parts  of  the output can be designated as a key;
successive runs of Watcher will home in on these  key  areas  for
their comparisons.
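
The comparison step described above can be illustrated with a short
Python sketch.  It is not Watcher's implementation or configuration
format: the Rule class, its field names, and the helper functions
are all hypothetical, and the commands' output is assumed to be
whitespace-separated columns, one of which serves as the key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical per-field check; not Watcher's real configuration format."""
    field: int                          # column index; negative counts from the right
    max_value: Optional[float] = None   # flag values above this
    min_value: Optional[float] = None   # flag values below this
    max_change: Optional[float] = None  # flag larger absolute changes between runs
    no_change: bool = False             # flag any change in the string value


def pick(fields, index):
    """Return fields[index], or None when the line is too short."""
    try:
        return fields[index]
    except IndexError:
        return None


def to_number(text):
    """Return text as a float, or None when it is missing or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse(output, key_field):
    """Index each line's whitespace-separated fields by its key column."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        key = pick(fields, key_field)
        if key is not None:
            table[key] = fields
    return table


def check(previous, current, key_field, rules):
    """Compare two runs of one command; an empty result means normal operation."""
    old = parse(previous, key_field)
    new = parse(current, key_field)
    problems = []
    for key, fields in new.items():
        for rule in rules:
            now = pick(fields, rule.field)
            before = pick(old.get(key, []), rule.field)
            if now is None:
                continue
            if rule.no_change and before is not None and now != before:
                problems.append(f"{key}: changed from {before!r} to {now!r}")
            value, last = to_number(now), to_number(before)
            if value is None:
                continue  # non-numeric field: only the string rule applies
            if rule.max_value is not None and value > rule.max_value:
                problems.append(f"{key}: {value:g} is above {rule.max_value:g}")
            if rule.min_value is not None and value < rule.min_value:
                problems.append(f"{key}: {value:g} is below {rule.min_value:g}")
            if (rule.max_change is not None and last is not None
                    and abs(value - last) > rule.max_change):
                problems.append(f"{key}: jumped from {last:g} to {value:g}")
    return problems


# Made-up sample output: file system name, then percent full.
disk_rules = [Rule(field=1, max_value=95, max_change=20)]
disk_report = check("/usr 70\n/tmp 20\n", "/usr 96\n/tmp 21\n", 0, disk_rules)

# Made-up sample output: interface name, then state.
link_rules = [Rule(field=-1, no_change=True)]
link_report = check("net0 up\n", "net0 down\n", 0, link_rules)
```

Only the lines that break a rule produce a message, so a run in
which nothing is wrong yields an empty report.  Keying each line on
one column is what lets output such as that of "ps", whose line
order varies between runs, still be compared entry by entry.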

Since the user specifies not only the commands Watcher will  exe-
cute  and  the  time  lapse between successive runs, but also the
aforementioned parameters which indicate system anomalies, Watch-
er can easily be seen as a very flexible, general system monitor.
Its use at UNM has provided a marked increase in the productivity
of  the  system manager, which has led in turn to an  increase  in
the reliability and availability of the systems at UNMCC.