The Trials of Sobeco
From: Jean-Francois Lamy
Date: Sun, 14 Jul 1991 12:59:41 -0400
sobeco.sobeco.com has been down since July 4th. MIPS Canada has gone as far
as shipping us a replacement system last Wednesday after shipping us
replacement CPU and I/O co-processor boards in response to the crash (the CPU
takes up to 3 of those IOCs; we have probably blown a $1.25 UART and
can't talk to the console).
That machine runs the "production" stuff for the actuaries and external
service bureau stuff for the DP applications group, and is therefore
"mission-critical". MIPS sells 6280s as SMD-only and 6260s as SCSI-only. I
hate SMDs and things seemed to run fine with an SMD boot disk and SCSI user
disks, so we had been running an unsupported mixed disk configuration. But we
had had occasional hangs, and in order to get better performance had decided
to tack on an additional VME bus (the machine can have up to 6) and extra disk
controllers and to go SMD only so we'd be fully supported.
In the first weekend of June we had tried to put in 3 more 1.7gig SMDs for a
total of 4. Then we ran into an obscene series of problems with the format
software (resulting in lost or corrupt factory defect lists on 3 of the 4
drives). In 34 hours of work we had only managed to add one extra drive. We
had a 3350 on loan (33MHz R3000 with 2 VME slots + internal SCSI) and Jeff
spent another week fighting the format battle on that machine, to no avail.
So we were stuck with a dead machine and no single machine able to replace it
(the 3350 only had 16 megs). Since the crash we've run with:
- the 3350 impersonating the 6280 with SMDs attached and remote mounts to
a 3240 (25MHz R3000); lots of fun since NFS locks don't work and SVR3
software uses lockf a lot.
- a 3240 impersonating the 6280 with parts of SMD remote mounted (the
replacement IOC made things worse; we started to lose our disks on crashes
after installing it). Losing /var is no fun, and of course we also lost
the partition where a project for the CEO himself was being rushed before
he left this last Friday.
- the same 3240 with all the goo copied to SCSI (with the ensuing night
dealing with screwdrivers -- we now have all disks in one cabinet so
we can "cabinet swap" the CPU).
- a new 6280 (MIPS Canada's own machine -- the president happened to have
the bad luck of scheduling a meeting the day the replacement CPU for
the original machine caused it to crash and take out the two SMDs it
had; guess he had to do something about it). At this point I was running
the 6280 as a 6260, meaning SCSI only, because I had the feeling we'd have
to revert once more to the 3240.
- the 3240 again, since the new 6280 now hangs with something totally
different: it fails its cache memory tests on reset, but passes them
on cold boot.
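
For readers who have not met the lockf problem mentioned in the first item above, here is a minimal sketch of the kind of advisory locking involved. It is an illustration only, not code from our systems: the function names `try_lock` and `unlock` are made up, and it uses Python's `fcntl.lockf` (a wrapper around POSIX fcntl record locks) rather than the C `lockf` call the SVR3 software would have used.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on path.

    Returns the open file descriptor on success, or None if the lock
    could not be obtained (held by another process, or not supported
    by the underlying filesystem).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def unlock(fd):
    """Release the lock taken by try_lock and close the descriptor."""
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock file in a scratch directory on a local disk.
    lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    fd = try_lock(lock_path)
    print("got lock" if fd is not None else "lock unavailable")
    if fd is not None:
        unlock(fd)
```

On a local disk this normally succeeds. Applications written this way depend on the lock call behaving the same on a remote mount, which is the assumption that broke for us.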
So this week we've installed or re-installed the production environment 4
times on 2 6280s, a 3240 and a 3350, scrounged and installed a dozen disks,
tried to format 4 SMDs on 3 different machines, 2 revs of the OS and 3
different boards, and rearranged things so often that our operators are
completely confused as to which machine is which today ("machine du jour").
I guess we did score a small victory: one of the librarians in the library
automation group thought I was on vacation when I showed up in jeans and
T-shirt, instead of my usual suit and tie, after two 24-hour stints with 5
hours of sleep after each. He had not noticed a thing -- the development
and office automation systems are separate, though we lost one of the latter
as well just to make things interesting.
It is not over yet -- we're bracing for another week on the 3240 since both
the 6280s we have are broken, and even if they get fixed I will not let
either of them go into production before the SMD disks are formatted and the
new configuration is burned in. We're to the point where MIPS' 6280 team
leader knows our call numbers by heart, and may send someone up here.
We have a 64 megabyte board on order for the 3350, as well as an SMD and
additional SCSI controller for it; management has suddenly realized that the
cost of a hot spare (we're mostly I/O bound and I guesstimate that the 3350
will be within 25% of the 6280 for most of the day) compares favorably
with the cost of downtime and service contracts. Large computers
are only easier to manage than smaller ones when they are reliable. Swapping
small computers is certainly much easier, and the point of diminishing returns
is probably at the level of the high-end workstation when there are natural
political boundaries to contend with at the granularity of workgroups. I'd
say large computers are best used as compute servers unless you have a
fault-tolerant setup.
Sign of the times, maybe, but I'm beginning to think that people
who buy from IBM do get something for their service money.
-- Jean-Francois Lamy