Statistically Speaking, It's not Probable
From: Geoff Collyer
Subject: heater explosion -> flood -> air conditioner failure -> disk failure
Organization: Statistics, U. of Toronto
Date: Mon, 26 Mar 90 02:31:46 EST

It's been a day out of the National Enquirer or the Old Testament:
flood, fire, famine, ruin, everything but locusts. This account is
based on reports from Ruth and my own observations.

In the early morning of Saturday (the 24th), a water heater in the
non-existent 7th floor of Sid Smith, which would hold a lot of Phys
Plant equipment if it existed, either disintegrated or exploded,
unleashing a flood of water down the staircase that you enter from SS
6058. *Somehow* Phys Plant actually noticed this and sent a repair
crew over, who fixed or replaced the heater after turning off the water
supply (not just hot water, *all* water). This made our water-cooled
air conditioner unable to cool, so our machine room overheated, again
pegging the thermometer. Our poor Eagles; I dinna think they kin take
much more o' this, cap'n. Indeed, this time the Eagles didn't just
report spurious i/o errors; they took a serious dislike to /usr and
/usrb (/usrb contains our blit support, stats packages and news
articles) and scrambled them up real good. At about 0637, utstat hung
or crashed or something similar, while reasonably busy (load average of
4.4).

Ruth arrived at about 0900 Saturday, noticed stat down, investigated,
found the machine room broiling the disks, shut the disks down, got the
air conditioner running again, let the room cool, tried to boot stat and
got "UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY" and phoned me at home
at about noon. I came in, ran some "fsck -n"'s to see the extent and
type of the damage. Most file systems had minimal or no damage. It
looked like there was substantial damage to the file system structures
on /usr and /usrb (hundreds or thousands of duplicate blocks, some in
files with no names; ordinary files that had become directories) and
there were hints that the damage extended to data blocks too. I ran
"fsck -y" on /usr and copied off all files changed since the last full
dump on Wednesday. "fsck -y" on /usrb didn't terminate after 20 CPU
minutes and over a megabyte of complaints, so I interrupted fsck,
mounted /usrb read-only and copied off the mailboxes and a couple recent
hacknews articles; nothing else had changed since the last /usrb dump.

I newfs'ed /usr and /usrb, started restoring /usr off gpu's tape
drive and /usrb off artsci's tape drive; during the second reel of
each, my window manager (wm) wedged unrecoverably and, this being
single-user mode, I had to abort the restores, newfs the partitions
again and start over. This time, I just restored /usr, which took most
of 4 hours, and put off restoring /usrb until Sunday evening.
Recently-changed /usr files have been copied back on top of the
restored /usr. mail is temporarily back on /usr.

utstat was back up (minus /usrb) at 0219 Sunday, but the backlogs of
client network requests and incoming mail and news kept stat's load
above 10 for 15-20 minutes. Furthermore, the clients seemed to be
confused and battering stat with requests, so I rebooted all of them
and the load returned to normal.

/usrb was restored Sunday evening and everything should now be back to
normal (aside from the location of /usr/spool/mail).
--
Geoff Collyer utzoo!utstat!geoff, geoff@utstat.toronto.edu