Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu or ucbvax!unmvax!charon!ingham

Over the last several years, the number of machines maintained by the University of New Mexico Computing Center has increased rapidly, yet the number of system managers monitoring these systems has remained static. Consequently, the system managers were faced with the task of watching more and more machines; since only one system manager is on call at any time (known affectionately as "DOC"), this soon proved to be an unacceptable situation. Shell scripts running every six hours gave some assistance; this was offset by the fact that the scripts generated a great deal of output indicating normal system operation, which the system manager still had to scan carefully for signs of trouble.

This paper describes watcher, a flexible system monitor which watches the system more closely than the human system manager while generating less output for him to examine. Running more often than the above mentioned set of shell scripts, watcher is able to keep closer tabs on the system; since it delivers only a list of potential problems, however, this extra monitoring produces no corresponding increase in the demand on DOC. No problems slip by unnoticed in the more concise output, leading to an improvement in overall system availability as well as the more effective utilization of the system manager's time.

I would like to thank Leslie Gorsline for her assistance in the writing of this paper. Without her, this paper might not have been. Also thanks to the UNMCC distributed systems group for their comments that helped improve watcher.

The computing facilities offered by the University of New Mexico Computing Center (UNMCC) include three microvaxen, five large vaxen (780 or bigger), and a Sequent B8000.
In addition to these Unix/VMS machines, the UNMCC Distributed Systems Group (DSG) monitors a number of the various microvaxen and Sun workstations scattered across campus. This duty falls to the DSG programmer designated as "DOC", or "DSG On Call", who receives his beeper based on a monthly rotation schedule.

In the past, shell scripts running every six hours reported various system statistics to DOC, who then scanned the output for signs of possible trouble. The output of these shell scripts became overwhelming as the number of machines and potential problems grew; corresponding to this increase in output was an increase in the amount of time that DOC had to spend reading this output. In addition, most of this output merely indicated normal system operation; potential problems were buried amongst non-problems. Because of this, DOC could often waste a tremendous amount of time wading through system status reports, time which could be better spent actually fixing system problems.

Unix is equipped with many powerful tools for program development, but none which simply watch the system for signs of trouble. Programs like ps and df provide information regarding the current state of the machine, yet it still remains DOC's responsibility to interpret this information and assess the health of the system at any given time. This deficiency can be rectified by providing the system with the capacity to determine its own state of health, advising DOC when it notices a problem which requires DOC's intervention.

In designing watcher, the author closely examined just what DOC does in monitoring the system; just how does DOC spot potential trouble in the DOC reports? These reports consist of output from df -i, ruptime, ps -aux | sort, and the tail of cronlog, which usually only changes in the middle of the night.
It was determined that DOC's task consisted primarily of scanning various numbers in this output and deciding whether they had exceeded an allowable maximum or minimum, or whether the values had changed too much since the last time the command was run (assuming the last value was even remembered). Getting a computer to do this is more complicated than it might seem at first glance, due to inconsistencies in the location of pertinent information between runs of these commands. For instance, the process occupying the fifth line of ps -ax might next time appear on the eighth line; similarly, uptime does not consistently put germane information in the same place on the line.

While flexibility is certainly a primary design consideration, it is not the whole story. In order to improve DOC's effectiveness, the program should run frequently, roughly every two or three hours, catching problems early (hopefully before they have affected the users). The program should therefore also be as silent as possible except when it detects a potential problem; any advantage DOC gains in using watcher would be eliminated if the program delivered an exceedingly verbose status report every two hours. watcher's problem reports should be exact and concise, leading DOC immediately to the trouble.

The problem of reducing the amount of output DOC must process can be approached in different ways, including the redesign of the current shell scripts. A simple awk script can watch the output from df [1]. However, each command would require a custom tailored awk script to look at it. This task grows more complicated as the number of programs running increases. While a program could be written to generate these awk scripts, this process is needlessly complex; for only a bit more work, an efficient C program such as watcher can be developed.
Run at intervals specified in crontab, watcher parses a control file (./watcherfile by default) with a yacc-generated parser, building a data structure containing all of the information from the file. The file contains the list of commands watcher should run (the pipeline), output specifications for each command (the output format), and the guidelines used in determining if something is amiss and should be reported to DOC (the change format).

A sample watcher control file would look something like this (comment lines begin with a '#'):

    # Here is the pipeline and its alias:
    (df -i | /usr/ucb/tail +2) { df }
    # the output format; this is a column output format:
    $1-9 device%k $41-42 spaceused%d $64-65 inodesused%d:
    # and the change format:
    spaceused 15%; spaceused 0 89; inodesused 15%; inodesused 0 49.

    # another command example:
    (/usr/ucb/ruptime | fgrep -f UnmHosts) { ruptime }
    # this is a relative output format
    2 status%s 1 machine%k 7 loadav%d:
    # and another change format:
    loadav 0 10; status "up".

The first entry causes watcher to run the df pipeline listed in parentheses. When reporting problems, watcher refers to this command by the alias provided in the braces; if no alias appears, watcher uses the entire pipeline.

The output format instructs watcher how to parse the output. Column format, indicated in the output format by num-num, instructs watcher that the output should be parsed by columns, while relative format, denoted by a single integer, shows that the output should be broken up at whitespace. Through the convention name%type, the output format also names each field and indicates whether the field is numeric, string, or keyword, specified by d, s, or k respectively. Keyword fields are used to match up corresponding output lines between runs.
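
The two output-format styles described above can be sketched in a few lines. This is only an illustration in Python, not watcher's C implementation: the tuple layouts, the function names, and the rule for reading a number out of a field with trailing punctuation are all assumptions made for the example. The df line is padded by the example itself so that the fields fall in the columns named in the sample control file.

```python
import re

# Hypothetical in-memory form of the two output formats from the sample
# control file; watcher's actual C data structures are not described here.
# Column format: (first_column, last_column, name, type), 1-based, inclusive.
DF_FORMAT = [(1, 9, "device", "k"),
             (41, 42, "spaceused", "d"),
             (64, 65, "inodesused", "d")]
# Relative format: (word_number, name, type), 1-based.
RUPTIME_FORMAT = [(2, "status", "s"), (1, "machine", "k"), (7, "loadav", "d")]


def convert(text, ftype):
    """Convert a raw field: 'd' is numeric, 's' and 'k' stay as strings."""
    text = text.strip()
    if ftype == "d":
        # Assumption: keep the leading number and drop trailing punctuation,
        # so that "3.12," is read as 3.12.
        match = re.match(r"[-+]?\d*\.?\d+", text)
        return float(match.group()) if match else None
    return text


def parse_columns(line, fmt):
    """Column format: cut each field out of fixed character positions."""
    return {name: convert(line[first - 1:last], ftype)
            for first, last, name, ftype in fmt}


def parse_relative(line, fmt):
    """Relative format: pick each field by its whitespace-separated position."""
    words = line.split()
    return {name: convert(words[number - 1], ftype)
            for number, name, ftype in fmt}


# A df-style line padded so the fields fall in the columns named above
# (the padding widths are an assumption made for this example).
df_line = "%-12s%8d%8d%8d%6d%%%8d%8d%6d%%   %s" % (
    "/dev/hp1f", 52431, 39763, 7424, 84, 6937, 9447, 42, "/develop")
ruptime_line = "charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66"

df_fields = parse_columns(df_line, DF_FORMAT)
ruptime_fields = parse_relative(ruptime_line, RUPTIME_FORMAT)
print(df_fields)
print(ruptime_fields)
```

Column format is the natural choice when a command pads its output to fixed widths, as df does; relative format copes with commands whose fields vary in width but always appear in the same order.
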
For example, 41-42 spaceused%d indicates that this field, named spaceused, contains numeric information in columns 41-42, while 2 status%s informs watcher that the second word (group of non-whitespace characters) on the line is a string field named status. For the df example given above,

    Filesystem    kbytes    used   avail capacity iused   ifree  %iused  Mounted on
    /dev/hp1f      52431   39763    7424    84%    6937    9447    42%   /develop

device would be /dev/hp1f, spaceused would be 84, and inodesused would be 42. Similarly, the output from the ruptime example, which looks like this

    charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66

would be broken at the following places:

    charon | up | 26+07:53, | 17 | users, | load | 3.12, | 2.90, | 2.66

assigning "up" to status and 3.12 to loadav.

The name field also appears in the change format, designating the allowable values for the field. These values can be specified as a single character string in the case of string fields; in the case of numeric fields, the values take the form of either percentage or absolute changes, or a minimum and maximum which delineate an acceptable range. Thus

    inodesused 15%; inodesused 0 49.

signifies that DOC should be notified if the field named inodesused increases by more than 15% from the last run, or if it is outside the range 0 to 49; similarly

    status "up";

informs watcher to notify DOC if the status field contains anything other than the word "up".

As watcher parses the output of a pipeline, it stores the pertinent parts of the output in a history file (by default, ./watcher.history). The next time watcher runs, it reads this file to provide comparison values for the command. If a command is new (i.e. it has no previously stored output in the history file), watcher checks the fields which require no previous data, such as min-max fields, while still storing all of the relevant information to the history file.
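
The interplay between the change format and the history file can be sketched as follows. This is a hedged illustration in Python rather than watcher's own code: the rule tuples, the function names, the use of a dictionary in place of the history file, and the message wording (loosely modeled on watcher's problem reports) are assumptions. The percentage rule here compares the magnitude of the relative change against the limit, and skips the comparison when the previous value is zero; both choices are assumptions. The /u2 values are hypothetical.

```python
# Hypothetical rule forms mirroring the sample change formats:
#   ("pct", name, limit)      relative change since the last run, in percent
#   ("range", name, lo, hi)   acceptable minimum and maximum
#   ("string", name, value)   the only acceptable string value
DF_RULES = [("pct", "spaceused", 15), ("range", "spaceused", 0, 89),
            ("pct", "inodesused", 15), ("range", "inodesused", 0, 49)]
RUPTIME_RULES = [("range", "loadav", 0, 10), ("string", "status", "up")]


def check(fields, rules, previous):
    """Return problem messages for one parsed output line.

    previous holds the fields stored for the same keyword on the last run,
    or None when the line is new; rules needing old data are then skipped.
    """
    problems = []
    for rule in rules:
        kind, name = rule[0], rule[1]
        value = fields[name]
        if kind == "range":
            lo, hi = rule[2], rule[3]
            if not lo <= value <= hi:
                problems.append("%s = %.2f; valid range %.2f to %.2f."
                                % (name, value, lo, hi))
        elif kind == "string":
            if value != rule[2]:
                problems.append('%s is "%s"; expected "%s".'
                                % (name, value, rule[2]))
        elif kind == "pct" and previous is not None and previous.get(name):
            old = previous[name]
            change = abs(value - old) / abs(old) * 100
            if change > rule[2]:
                problems.append("%s changed by more than %g%%. Previous "
                                "value %.2f; current value %.2f."
                                % (name, rule[2], old, value))
    return problems


def run_once(lines, keyword, rules, history):
    """Check each parsed line against its stored values, then store all fields."""
    report = []
    for fields in lines:
        key = fields[keyword]          # the keyword field matches lines across runs
        report.extend(check(fields, rules, history.get(key)))
        history[key] = dict(fields)    # stored even when nothing was wrong
    return report


history = {}  # stands in for the history file
first = run_once([{"device": "/u2", "spaceused": 60.0, "inodesused": 10.0}],
                 "device", DF_RULES, history)
second = run_once([{"device": "/u2", "spaceused": 72.0, "inodesused": 10.0}],
                  "device", DF_RULES, history)
print(first)
print(second)
```

On the first run the line is new, so only the range rules apply and nothing is reported; on the second run the stored values make the percentage rule usable, and the jump in spaceused is reported even though it stays inside the allowed range.
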
Thus, the next time the new command is run, it will be an old command, and meaningful between-run comparisons can be made.

When watcher detects no problems with the system, DOC receives an empty mail message with the subject "hostname had no problems at date"; this is to ensure that mail is running correctly. When it notices a problem which should be brought to DOC's attention, it mails the system problem report in a concise format, explaining what is wrong and why. Thus, rather than the megabytes of shell script output that DOC used to receive and have to read, he merely sees this when he reads his mail:

    Mail version 5.2 6/21/85.  Type ? for help.
    "/usr/spool/mail/ingham": 5 messages 5 new
    N  1 root@charon.unm  Sat Apr 11 16:00   8/212 "charon had no problems at Sat"
    N  2 root@ariel.unm   Sat Apr 11 16:00   8/208 "ariel had no problems at Sat "
    N  3 root@geinah.unm  Sat Apr 11 16:00  11/417 "System problem report for gei"
    N  4 root@izar.unm    Sat Apr 11 16:00   8/204 "izar had no problems at Sat A"
    N  5 root@deimos.unm  Sat Apr 11 16:00   8/212 "deimos had no problems at Sat"

The letters indicating no problems can be immediately deleted, and DOC can turn his attention to the letter indicating a system problem. A sample problem report would look something like this:

    df has a max/min value out of range:
    /dev/hp0h 140488 111195 15244 91% 10145 28767 26% /usr
    where spaceused = 91.00; valid range 0.00 to 89.00.
    Also it had inodesused change by more than 10%.
    Previous value 20.00; current value 26.00.

Note that if a line has more than one indication of a problem, all anomalies are included in the report. This provides DOC with as much information as possible, allowing him to determine the problem quickly and devise a rapid fix (hopefully before users know something is amiss).

watcher's primary advantage lies in the reduction of DOC's work load.
It has taken over the more menial aspects of monitoring a system, tasks like reading and comparing numbers, giving DOC more time to concentrate on bugs of a nature which watcher isn't set up to monitor, such as problems in the accounting system. DOC is apprised of potential problems quickly, and in some cases can repair them in less time than simply reading the shell script output would have taken.

The ability to monitor changes between runs has also helped bring to our attention some problems which were missed in the DOC reports. For example, disk space on /u2 on one of our machines jumped by more than 15%. Since this jump did not force the total space used above 90%, at which point DOC would have investigated the filesystem, it is unlikely that DOC would have even noticed this sudden change. The facility to watch for relative changes between runs enables DOC to catch problems in their infancy, and to fix problems such as filesystems filling up too rapidly before they inconvenience the users.

Since the system manager specifies not only the commands watcher will execute and the time lapse between successive runs, but also the parameters which indicate system anomalies, watcher can easily be seen as a very flexible, general system monitor. Its use at UNM has provided an increase in the productivity of the system manager, which has led in turn to an increase in the reliability and availability of the systems at UNMCC.

watcher will be sent to the moderator of mod.sources after the conference is over.

[1] Rik Farrow, "Monitoring Free Disk Space," Wizard's Grabbag, Unix World, Vol. IV, no. 3, pp. 86-87.