Watcher - Paper

jtruitt@dw3f.ess.harris.com
Fri, 05 Aug 94 11:50:41 -0400

Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu or ucbvax!unmvax!charon!ingham

Over the last several years, the number of machines maintained by
the University of New Mexico Computing Center has increased
rapidly, yet the number of system managers monitoring these
systems has remained static. Consequently, the system managers
were faced with the task of watching more and more machines; since
only one system manager is on call at any time (known
affectionately as "DOC"), this soon proved to be an unacceptable
situation. Shell scripts running every six hours gave some
assistance; this was offset by the fact that the scripts generated
a great deal of output indicating normal system operation, which
the system manager still had to scan carefully for signs of
trouble. This paper describes watcher, a flexible system monitor
which watches the system more closely than the human system
manager while generating less output for him to examine.

Running more often than the above-mentioned set of shell scripts,
watcher is able to keep closer tabs on the system; since it
delivers only a list of potential problems, however, this extra
monitoring produces no corresponding increase in the demand on
DOC. No problems slip by unnoticed in the more concise output,
leading to an improvement in overall system availability as well
as more effective utilization of the system manager's time.

I would like to thank Leslie Gorsline for her assistance in the
writing of this paper. Without her, this paper might not have
been. Also thanks to the UNMCC distributed systems group for
their comments that helped improve watcher.

The computing facilities offered by the University of New Mexico
Computing Center (UNMCC) include three microvaxen, five large
vaxen (780 or bigger), and a Sequent B8000. In addition to these
Unix/VMS machines, the UNMCC Distributed Systems Group (DSG)
monitors a number of the various microvaxen and Sun workstations
scattered across campus. This duty falls to the DSG Programmer
designated as "DOC", or "DSG On Call", who receives his beeper
based on a monthly rotation schedule.

In the past, shell scripts running every six hours reported
various system statistics to DOC, who then scanned the output for
signs of possible trouble. The output of these shell scripts
became overwhelming as the number of machines and potential
problems grew, and the time DOC had to spend reading it grew
along with it. In addition, most of this output merely indicated
normal system operation; potential problems were buried amongst
non-problems. Because of this, DOC could often waste a tremendous
amount of time wading through system status reports, time which
could be better spent actually fixing system problems.

Unix is equipped with many powerful tools for program
development, but none which simply watch the system for signs of
trouble. Programs like ps and df provide information regarding the
current state of the machine, yet it still remains DOC's
responsibility to interpret this information and assess the health
of the system at any given time. This deficiency can be rectified
by providing the system with the capacity to determine its own
state of health, advising DOC when it notices a problem which
requires DOC's intervention.

In designing watcher, the author closely examined just what DOC
does in monitoring the system; just how does DOC spot potential
trouble in the DOC reports? These reports consist of output from
df -i, ruptime, ps -aux | sort, and the tail of cronlog, which
usually only changes in the middle of the night. It was determined
that DOC's task consisted primarily of scanning various numbers in
this output, deciding whether or not they had exceeded an
allowable maximum or minimum, or if the values had changed too
much from the last time the command was run, assuming the last
value is even remembered. Getting a computer to do this is more
complicated than might seem at first glance, due to
inconsistencies in the location of pertinent information between
runs of these commands. For instance, the process occupying the
fifth line of ps -ax might next time appear on the eighth line;
similarly, uptime does not consistently put germane information in
the same place on the line.

While flexibility is certainly a primary design consideration, it
is not the whole story. In order to improve DOC's effectiveness,
the program should run frequently, roughly every two or three
hours, catching problems early (hopefully before they have
affected the users). Because it runs this often, the program
should also be as silent as possible except when it detects a
potential problem; any advantage DOC gains in using watcher would
be eliminated if the program delivered an exceedingly verbose
status report every two hours. watcher's problem reports should be
exact and concise, leading DOC immediately to the trouble.

The problem of reducing the amount of output DOC must process can
be approached in different ways, including the redesign of the
current shell scripts. A simple awk script can watch the output
from df [1]. However, each command would require a custom-tailored
awk script to look at it. This task grows more complicated as the
number of programs running increases. While a program could be
written to generate these awk scripts, this process is needlessly
complex; for only a bit more work, an efficient C program such as
watcher can be developed.

Run at intervals specified in crontab, watcher parses a control
file (./watcherfile by default) with a yacc-generated parser,
building a data structure containing all of the information from
the file. The file contains the list of commands watcher should
run (the pipeline), output specifications for each command (the
output format), and the guidelines used in determining if
something is amiss and should be reported to DOC (the change
format). A sample watcher control file would look something like
this (comment lines begin with a '#'):

    # Here is the pipeline and its alias:
    (df -i | /usr/ucb/tail +2) { df }
    # the output format; this is a column output format:
            $1-9 device%k $41-42 spaceused%d $64-65 inodesused%d:
    # and the change format:
            spaceused 15%;
            spaceused 0 89;
            inodesused 15%;
            inodesused 0 49.

    # another command example:
    (/usr/ucb/ruptime | fgrep -f UnmHosts) { ruptime }
    # this is a relative output format
            2 status%s 1 machine%k 7 loadav%d:
    # and another change format:
            loadav 0 10;
            status "up".

The first entry causes watcher to run the df pipeline listed in
parentheses. When reporting problems, watcher refers to this
command by the alias provided in the braces; if no alias appears,
watcher uses the entire pipeline.

The output format instructs watcher how to parse the output;
column format, indicated in the output format by num-num,
instructs watcher that the output should be parsed by columns,
while relative format, denoted by a single integer, shows that
the output should be broken up by whitespace. Through the
convention name%type, the output format also names each field,
indicating whether the field is numeric, string, or keyword,
specified by d, s, or k respectively. Keyword fields are used to
match up corresponding output lines between runs. Thus

        41-42 spaceused%d

indicates that this field, named spaceused, contains numeric
information in columns 41-42, while

        2 status%s

informs watcher that the second word (group of non-whitespace
characters) on the line is a string field named status. For the
df example given above,

Filesystem    kbytes    used   avail capacity iused   ifree  %iused  Mounted on
/dev/hp1f      52431   39763    7424    84%    6937    9447    42% /develop

device would be /dev/hp1f, spaceused would be 84, and inodesused
would be 42. Similarly, the output from the ruptime example, which
looks like this

    charon         up  26+07:53,  17 users,  load 3.12, 2.90, 2.66

would be broken at the following places:

    charon | up | 26+07:53, | 17 | users, | load | 3.12, | 2.90, | 2.66

assigning "up" to status, and 3.12 to loadav.
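
The two parsing modes described above can be sketched in a few
lines of Python. This is only an illustration, not watcher's
actual C implementation: the function name extract_fields, the
tuple encoding of the output format, and the stripping of trailing
"," and "%" from numeric fields are all assumptions made for the
example. The df line is built with fixed field widths chosen so
that the sample values land in the columns named in the control
file.

```python
def extract_fields(line, fmt):
    """Extract named fields from one line of command output.

    fmt is a list of (position, name, type) tuples (hypothetical
    encoding).  position is either a (start, end) pair of 1-based,
    inclusive columns (column format) or a single 1-based word
    index (relative format).  type is 'd', 's' or 'k'.
    """
    words = line.split()
    fields = {}
    for position, name, ftype in fmt:
        if isinstance(position, tuple):
            start, end = position
            text = line[start - 1:end].strip()   # column format
        else:
            text = words[position - 1]           # relative format
        if ftype == "d":
            # Assumption: drop trailing punctuation such as "3.12,"
            fields[name] = float(text.rstrip(",%"))
        else:
            fields[name] = text
    return fields


# Sample df line; widths are an assumption that matches columns
# 1-9, 41-42 and 64-65 from the sample control file.
DF_LINE = "%-12s%8d%8d%8d%6d%%%8d%8d%6d%% %s" % (
    "/dev/hp1f", 52431, 39763, 7424, 84, 6937, 9447, 42, "/develop")
DF_FORMAT = [((1, 9), "device", "k"),
             ((41, 42), "spaceused", "d"),
             ((64, 65), "inodesused", "d")]

RUPTIME_LINE = "charon         up  26+07:53,  17 users,  load 3.12, 2.90, 2.66"
RUPTIME_FORMAT = [(2, "status", "s"),
                  (1, "machine", "k"),
                  (7, "loadav", "d")]

if __name__ == "__main__":
    print(extract_fields(DF_LINE, DF_FORMAT))
    print(extract_fields(RUPTIME_LINE, RUPTIME_FORMAT))
```

Column format depends on the command printing fixed-width output,
while relative format tolerates varying spacing but breaks if a
word is added or dropped earlier on the line.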

The name field also appears in the change format, designating the
values this field is allowed to have. These values can be
specified as single character strings in the case of string
fields; in the case of numeric fields, the values take the form
of either percentage or absolute changes, or a minimum and
maximum which delineate an acceptable range. Thus

        inodesused 15%;
        inodesused 0 49.

signifies that DOC should be notified if the field named
inodesused increases by more than 15% from the last run, or if it
is outside the range 0 to 49; similarly

        status "up";

tells watcher to notify DOC if the status field contains anything
other than the word "up".
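
As a rough illustration of how these three kinds of rule might be
evaluated, consider the following Python sketch. It is not taken
from watcher itself: the function name check_field, the tuple
encoding of the rules, and the message wording are assumptions,
and the percentage rule is treated here as a change in either
direction, which the paper does not state explicitly.

```python
def check_field(name, current, previous, rules):
    """Return a list of problem descriptions for one field.

    rules is a list of tuples (hypothetical encoding):
      ("percent", 15)   flag a change of more than 15% since last run
      ("range", 0, 49)  flag a value outside 0..49
      ("string", "up")  flag any value other than "up"
    previous is None when no earlier value is available.
    """
    problems = []
    for rule in rules:
        kind = rule[0]
        if kind == "range":
            low, high = rule[1], rule[2]
            if not (low <= current <= high):
                problems.append(
                    "%s = %.2f; valid range %.2f to %.2f"
                    % (name, current, low, high))
        elif kind == "percent":
            # Needs a previous value, so it is skipped on a first run.
            if previous is not None and previous != 0:
                change = abs(current - previous) / abs(previous) * 100.0
                if change > rule[1]:
                    problems.append(
                        "%s changed by more than %d%%; "
                        "previous value %.2f, current value %.2f"
                        % (name, rule[1], previous, current))
        elif kind == "string":
            if current != rule[1]:
                problems.append(
                    '%s is "%s"; expected "%s"' % (name, current, rule[1]))
    return problems


if __name__ == "__main__":
    rules = [("percent", 15), ("range", 0, 49)]
    # 20 -> 26 is a 30% change, so the percent rule fires.
    print(check_field("inodesused", 26.0, 20.0, rules))
```

Every rule is checked for every field, so a single line can
produce several problem descriptions at once.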

As watcher parses the output of a pipeline, it stores the
pertinent parts of the output in a history file (by default,
./watcher.history). The next time watcher runs, it reads this
file to provide comparison values for the command. If a command
is new (i.e. it has no previously-stored output in the history
file), watcher checks the fields which require no previous data,
such as min-max fields, while still storing all of the relevant
information to the history file. Thus, the next time the new
command is run, it will be an old command, and meaningful
between-run comparisons can be made.
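
The history mechanism can be pictured with the short Python
sketch below. The paper does not describe the layout of the
history file, so the JSON encoding, the nesting by command alias
and keyword field, and the function names are all assumptions
made for illustration.

```python
import json
import os


def load_history(path):
    """Return the stored history, or an empty dict on a first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def update_history(history, alias, key, fields):
    """Record fields for one output line and return the old values.

    alias is the command alias (e.g. "df"); key is the value of the
    line's keyword field (e.g. "/dev/hp1f"), which is what matches
    a line in this run to the same line in the previous run.
    Returns None if the line has not been seen before.
    """
    previous = history.setdefault(alias, {}).get(key)
    history[alias][key] = fields
    return previous


def save_history(path, history):
    """Write the history back out for the next run."""
    with open(path, "w") as f:
        json.dump(history, f)


if __name__ == "__main__":
    history = {}
    # First run: nothing stored yet, so only min-max checks apply.
    print(update_history(history, "df", "/dev/hp1f", {"spaceused": 84.0}))
    # Second run: the earlier values come back for comparison.
    print(update_history(history, "df", "/dev/hp1f", {"spaceused": 91.0}))
```

Keying on the keyword field rather than on line position is what
lets a line be compared with its earlier value even when it moves
within the output.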

When watcher detects no problems with the system, DOC receives an
empty mail message with the subject "hostname had no problems at
date"; this is to ensure that mail is running correctly. When it
notices a problem which should be brought to DOC's attention, it
mails the system problem report in a concise format, explaining
what is wrong and why. Thus, rather than the megabytes of shell
script output that DOC used to receive and have to read, he
merely sees this when he reads his mail:

Mail version 5.2 6/21/85.  Type ? for help.
"/usr/spool/mail/ingham": 5 messages 5 new
 N  1 root@charon.unm Sat Apr 11 16:00   8/212  "charon had no problems at Sat"
 N  2 root@ariel.unm  Sat Apr 11 16:00   8/208  "ariel had no problems at Sat "
 N  3 root@geinah.unm Sat Apr 11 16:00  11/417  "System problem report for gei"
 N  4 root@izar.unm   Sat Apr 11 16:00   8/204  "izar had no problems at Sat A"
 N  5 root@deimos.unm Sat Apr 11 16:00   8/212  "deimos had no problems at Sat"

The letters indicating no problems can be immediately deleted, and
DOC can turn his attention to the letter indicating a system
problem. A sample problem report would look something like this:

df has a max/min value out of range:
/dev/hp0h     140488  111195   15244    91%   10145   28767    26% /usr
where spaceused = 91.00; valid range 0.00 to 89.00.
Also it had inodesused change by more than 10%.
Previous value 20.00; current value 26.00.

Note that if a line has more than one indication of a problem, all
anomalies are included in the report. This provides DOC with as
much information as possible, allowing him to determine the
problem quickly and devise a rapid fix (hopefully before users
know something is amiss).

watcher's primary advantage lies in the reduction of DOC's work
load. It has taken over the more menial aspects of monitoring a
system, tasks like reading and comparing numbers, giving DOC more
time to concentrate on bugs of a nature which watcher isn't set
up to monitor, such as problems in the accounting system. DOC is
apprised of potential problems quickly, and in some cases can
repair them in less time than simply reading the shell script
output would have taken.

The ability to monitor changes between runs has also helped bring
to our attention some problems which were missed in the DOC
reports. For example, disk space on /u2 on one of our machines
jumped by more than 15%. Since this jump did not force the total
space used above 90%, at which point DOC would have investigated
the filesystem, it is unlikely that DOC would have even noticed
this sudden change. The facility to watch for relative changes
between runs enables DOC to catch problems in their infancy, and
fix problems such as filesystems filling up too rapidly before
they inconvenience the users.

Since the system manager specifies not only the commands watcher
will execute and the time lapse between successive runs, but also
the parameters which indicate system anomalies, watcher can
easily be seen as a very flexible, general system monitor. Its use
at UNM has provided an increase in the productivity of the system
manager, which has led in turn to an increase in the reliability
and availability of the systems at UNMCC. watcher will be sent
to the moderator of mod.sources after the conference is over.

[1] Monitoring Free Disk Space, Rik Farrow, Wizard's Grabbag,
     Unix World, Vol. IV, no. 3, pp. 86-87.