Keeping watch over the flocks at night (and day)

Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu
ucbvax!unmvax!charon!ingham

Topic Areas: Applications, System management, Utilities

The computing facilities offered by the University of New Mexico Computing Center include three microvaxen, five large vaxen (780 or bigger), and a Sequent B8000. In addition to these Unix/VMS machines, the UNMCC Distributed Systems Group (DSG) monitors a number of the microvaxen and Sun workstations scattered across campus. This duty falls to the DSG programmer designated as "DOC", or "DSG On Call", who receives the beeper based on a monthly rotation schedule.

In the past, shell scripts running every six hours reported various system statistics to DOC, who then scanned the output for signs of possible trouble. As the number of machines and the number of potential problems grew, the mound of output that DOC had to process, most of which merely indicated normal system operation, became overwhelming. With many machines to monitor and only one person acting in this capacity, DOC can waste a tremendous amount of time wading through system status reports, time which would be better spent actually fixing system problems.

In response to this situation, the author developed a tool which introduces some intelligence into the machine's self-reporting, letting the machine filter out messages indicating normal operation and forwarding to DOC only those messages which point out trouble areas.

The result of these efforts is Watcher, a very general and extensible system self-monitor. Running more often than the set of shell scripts, Watcher keeps closer tabs on the system; since it delivers only a summary of potential problems, however, this extra monitoring produces no corresponding increase in the demand on the system manager.
No problems slip by unnoticed in the more concise output, leading to an improvement in overall system availability as well as more effective utilization of the system manager's time.

Watcher was designed to be almost as flexible as DOC in deciding what constitutes a problem with the system. Running at intervals specified in crontab, Watcher issues a number of user-specified commands (each of which delivers its output in a different format), parsing all or part of the output from either the left or the right. It compares this to the last such output obtained, checking for indications of a system abnormality. Such signs might take the form of an overly abrupt change in a certain value (e.g. a process which suddenly begins gobbling vast amounts of CPU time), a value which exceeds the allowable maximum or falls below the allowable minimum (such as an overly full file system), or an unacceptable change in a string value (e.g. when "up" changes to "down"). For commands such as "ps" whose output varies considerably with each run, specific parts of the output can be designated as a key; successive runs of Watcher will home in on these key areas for their comparisons.

Since the user specifies not only the commands Watcher will execute and the time lapse between successive runs, but also the aforementioned parameters which indicate system anomalies, Watcher serves as a very flexible, general system monitor. Its use at UNM has provided a marked increase in the productivity of the system manager, which has led in turn to an increase in the reliability and availability of the systems at UNMCC.
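
The three kinds of checks described above (a value outside its allowed maximum or minimum, an overly abrupt change since the previous run, and a changed string value), together with keyed comparison of successive outputs, can be illustrated with a short Python sketch. This is not Watcher's implementation or its configuration syntax, neither of which is given here; the names Rule, parse_by_key, pick, to_number, and check are hypothetical, and whitespace-separated columns are assumed as the output format. A negative field index counts from the right, a positive one from the left.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    """Hypothetical description of what counts as a problem for one field."""
    field: int                              # column index; negative counts from the right
    max_value: Optional[float] = None       # report values above this
    min_value: Optional[float] = None       # report values below this
    max_change_pct: Optional[float] = None  # report changes larger than this percentage
    watch_string: bool = False              # report any change in the string value


def pick(cols: List[str], index: int) -> Optional[str]:
    """Return the column at index, or None if the line is too short."""
    try:
        return cols[index]
    except IndexError:
        return None


def parse_by_key(output: str, key_field: int = 0) -> Dict[str, List[str]]:
    """Index each output line by the column designated as its key."""
    rows: Dict[str, List[str]] = {}
    for line in output.splitlines():
        cols = line.split()
        key = pick(cols, key_field)
        if key is not None:
            rows[key] = cols
    return rows


def to_number(text: str) -> Optional[float]:
    """Convert a field such as '97' or '97%' to a float; None if not numeric."""
    try:
        return float(text.rstrip("%"))
    except ValueError:
        return None


def check(prev: Dict[str, List[str]], curr: Dict[str, List[str]],
          rule: Rule) -> List[str]:
    """Compare the current run against the previous one; return problem messages only."""
    problems: List[str] = []
    for key, cols in curr.items():
        value = pick(cols, rule.field)
        if value is None:
            continue
        old = pick(prev.get(key, []), rule.field)
        if rule.watch_string:
            if old is not None and old != value:
                problems.append(f"{key}: changed from {old!r} to {value!r}")
            continue
        num = to_number(value)
        if num is None:
            continue
        if rule.max_value is not None and num > rule.max_value:
            problems.append(f"{key}: {num} exceeds maximum {rule.max_value}")
        if rule.min_value is not None and num < rule.min_value:
            problems.append(f"{key}: {num} is below minimum {rule.min_value}")
        old_num = to_number(old) if old is not None else None
        if rule.max_change_pct is not None and old_num:
            change = abs(num - old_num) / abs(old_num) * 100
            if change > rule.max_change_pct:
                problems.append(f"{key}: changed by {change:.0f}% since last run")
    return problems


if __name__ == "__main__":
    # Made-up sample output in the style of a file system usage report.
    previous = parse_by_key("/dev/ra0a 45%\n/dev/ra0g 60%\n")
    current = parse_by_key("/dev/ra0a 47%\n/dev/ra0g 97%\n")
    for message in check(previous, current,
                         Rule(field=-1, max_value=90, max_change_pct=50)):
        print(message)
```

Keying rows on a chosen column, rather than on line position, is what lets output that reorders between runs still be compared entry by entry; lines that raise no problem produce no output at all, which mirrors the filtering behavior described above.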