Power Off!

When the lights go out, don't be in the dark

From "VS Workshop",  Access to Wang, April 1989
Anyone who has spent much time in operations has seen it. Suddenly the lights flicker, then dim, finally dropping to complete darkness. The quiet in the computer room is ominous; so, too, are the groans of users realizing the impact of what has just happened. The tranquility of the idle equipment is soon replaced by ringing telephones and irate shouts in the hallway.

Power failures are a way of life in some locales, and for mainframe users, uninterruptible power supplies and diesel generators are standard equipment. This is less true in the minicomputer world, even though the price of such units has dropped as the cost of information loss has risen. For most of us, a power failure is another form of disaster - one that has a greater chance of occurring than most.

Like all good administrators, we have a plan for such events. True, it was written five years ago when the system was in a closet and, no, we haven't tested it. . .

Here are some of my thoughts on this topic, sharpened recently by experience. I advocate a three-step plan: stabilize the system, extract and analyze possible damage, and then act to repair or restore information.

Prepare for the worst

Before you need them, you should have a few items ready for a failure. These include several utilities (DISKINIT, LISTVTOC, VERIFY, COPY, BACKUP) and a few miscellaneous items (flashlight, backup system packs, offsite storage records, checklist). A power analyzer, though expensive, may pay for itself by allowing you to determine the extent of the problem and detect when it has stabilized; it can also give electricians and utility companies some idea of what happened.

Get down

Your first move when the lights go out should be to reach calmly into your desk drawer for a flashlight. It's easy to overlook this simple necessity, so take time today to make sure you can see in all conditions.

If the machinery has stopped completely, turn it off now. When electrical service is restored, there is often a surge on the unloaded circuit followed by a sag ("brownout") as large motors in the neighborhood kick in. Don't expose your sensitive system to either.

While power failures typically appear to be complete outages, more often they are extreme sags. The power for a circuit may drop to a third or quarter of its usual output for a second or more, suspending operation of many types of machinery. In many cases it is impossible to tell if there is still power in the circuit unless a power analyzer is present.

Some of the worst damage to disk drives can occur during voltage fluctuations that cannot be detected by eye. For this reason, it is important that a relay or breaker be placed in the power circuits to the computer so that a drop in power beyond a reasonable threshold will cut all power to the equipment. Such a cut-off is necessary to avoid the possibility of a disk head responding with abnormal enthusiasm and engraving the surface of the disk - the legendary head crash.
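
The logic of such a cut-off can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not a design for real protective equipment; the 120-volt nominal supply, the 85 percent trip threshold, and the name UndervoltageRelay are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class UndervoltageRelay:
    """Hypothetical model of a cut-off relay; all figures are illustrative."""
    nominal_volts: float = 120.0   # assumed nominal supply voltage
    trip_fraction: float = 0.85    # assumed threshold, as a fraction of nominal
    tripped: bool = False

    def sample(self, volts: float) -> bool:
        """Feed one voltage reading; return True if the load still has power."""
        if volts < self.nominal_volts * self.trip_fraction:
            self.tripped = True    # latch off on any reading below the threshold
        return not self.tripped

    def reset(self) -> None:
        """Manual reset, done by the operator once power is judged stable."""
        self.tripped = False


relay = UndervoltageRelay()
readings = [119.0, 118.5, 40.0, 117.0, 120.0]   # a sag to a third of nominal
states = [relay.sample(v) for v in readings]
print(states)
```

The point of the sketch is that the relay latches: once a sag trips it, the equipment stays off even after the voltage recovers, until someone resets it by hand.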

The responses of electrical equipment to a brownout will vary. With their sensitive circuitry, computers and digital telephone systems are the first to go; commercial fluorescent fixtures are also early casualties. Motor-driven equipment with non-digital controls and analog telephone systems can ride through many problems.

When the power returns

Your first impulse when the power returns may be to leap into the system and begin the long process of recovery. Don't do it! Wait for the power situation to stabilize before attempting recovery. Users will pressure you, managers will yell, but if the power drops again, you may be in for worse damage than before. If you have one, watch your power monitor for normal activity and wait it out.

How do you tell if the power is stable if you don't have a power analyzer? It is definitely preferable to have one, but some types of fluctuations can be observed in equipment, motors, or fluorescent lights. Sometimes a common volt/ohm meter can be used - provided you don't mind watching it continuously for a half hour or more. Temporarily suspend your optimism and do the best job you can to find a problem.
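
If you do end up logging meter readings by hand, a simple rule can take some of the guesswork out of the decision. The sketch below assumes one reading per minute, a 120-volt nominal supply, and a tolerance of 5 percent either way; none of these figures come from a standard, so substitute whatever your electrician recommends. The function names are invented for the example.

```python
def minutes_stable(readings, nominal=120.0, tolerance=0.05):
    """Count the most recent consecutive readings that fall within tolerance.

    readings: voltages in the order they were taken, assumed one per minute.
    """
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    count = 0
    for volts in reversed(readings):
        if not (low <= volts <= high):
            break                  # an out-of-range reading restarts the wait
        count += 1
    return count


def is_stable(readings, required_minutes=30, **kwargs):
    """True once the last `required_minutes` readings were all in range."""
    return minutes_stable(readings, **kwargs) >= required_minutes


# Ten good minutes, one sag, then thirty good minutes.
log = [120.0] * 10 + [95.0] + [119.0] * 30
print(minutes_stable(log), is_stable(log))
```

Any reading outside the band sends the count back to zero, which matches the advice above: a single dip means you start waiting again.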

Damage control

After you are convinced the power is stable, cautiously bring up the system with logons inhibited. It might be best to start with a single terminal and disk drive (i.e. bypass the normal configuration) so you can verify that the system has not been grossly damaged without endangering most of your data.

If all seems well, you can power up the other drives and restart under the normal configuration. For safety's sake, write protect all but the system drive so no records will be accidentally updated.

The verification process should begin with a check of the physical health of the drives. This is accomplished by running the VERIFY option of DISKINIT. Common errors detected by this process include missing extents; nasty errors may cause the system to hang. If you suspect there is anything wrong with a disk drive, stop using it immediately. In particular, don't mount any other volumes on removable drives with suspicious health; if a head crash has occurred, you'll ruin any other disks that go into that drive.

If DISKINIT runs clean for all packs, verify the catalog of all volumes with LISTVTOC. Again, minor errors may occur; use your judgement whether you must deal with them now or at a later date.

If the physical and catalog areas of the disk are intact, your analysis may proceed to the file level. At this point you have a choice: whether to allow Word Processing users to resume. Since document damage is likely to be light and the system will usually recover automatically, it is not unreasonable to allow WP users back on. At worst, a document reorg with COPYWP is often all that is necessary to continue, even if the document was in the editor during the outage.

Data processing systems with indexed files are another matter. In most cases, indexed files will be left open and require reorganization to correct the record count, file pointers, and other internal data. While Wang's VERIFY utility is the final word on the integrity of indexed files, commercial disk management software is often faster and catches most errors. Since integrated applications usually require all files, a problem with a single file means that the system is unavailable.

Even if all of the files check out, it is still possible for partial transactions to cripple a system. Ask users of integrated data processing systems if there is a way to verify the data from account balances or other measurements.
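
The order of these checks matters more than any single one of them, so it may help to see the sequence written down as a small Python sketch. It does not run any Wang utility; the operator records the outcome of each step by hand, and the program only reports what to do next. The step wording is condensed from this section, and the function name is invented for the example.

```python
RECOVERY_STEPS = [
    "Confirm power is stable",
    "Bring up the system with logons inhibited (single terminal and drive)",
    "Restart under the normal configuration; write protect all but the system drive",
    "Run the VERIFY option of DISKINIT on every pack",
    "Verify the catalog of every volume with LISTVTOC",
    "Check indexed files with VERIFY or commercial disk management software",
    "Ask users to verify account balances or other measurements",
]


def next_action(results):
    """Report the next move given pass/fail results recorded so far.

    results: list of booleans, one per completed step, in checklist order.
    """
    for step, passed in zip(RECOVERY_STEPS, results):
        if not passed:
            # Stop at the first failure rather than pressing on.
            return f"STOP: problem at '{step}'"
    if len(results) < len(RECOVERY_STEPS):
        return f"NEXT: {RECOVERY_STEPS[len(results)]}"
    return "DONE: all checks passed"


print(next_action([]))
print(next_action([True, True, True, False]))
```

Stopping at the first failed step reflects the warning about suspect drives: once a check fails, later steps should wait until the problem is understood.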

Intervention: damage found

Consider it a given that a large percentage of your indexed files will require reorganization. Naturally, this means use of the Wang COPY utility with the REORG option. As mentioned, some commercial products can speed the process of locating errors and may tie to the COPY utility to begin repair.

Disk errors require more care. Errors to the physical surface must be attended to by service personnel. Serious disk catalog errors are frequently unrecoverable. (Some firms claim to be able to recover "lost" data on disks, but require a few days to do it.) Minor catalog errors should be corrected by a full backup and initialization of the disk at the earliest opportunity.

If some or all of the data is unrecoverable, restoration from prior backup may be necessary. Due to the synchronization of data between files in a system, this usually means restoring all files in a system - and a loss of all activity after the time of the backup.
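
The size of that loss is simply the time between the last good backup and the failure. A short Python sketch makes the arithmetic concrete; the dates and times are invented for illustration, and the function name is hypothetical.

```python
from datetime import datetime


def lost_activity(last_backup, failure):
    """Return the span of work that a full restore would discard."""
    if failure < last_backup:
        raise ValueError("the failure cannot precede the backup")
    return failure - last_backup


# Hypothetical example: nightly backup at 10 p.m., outage the next afternoon.
backup_time = datetime(1989, 4, 3, 22, 0)
outage_time = datetime(1989, 4, 4, 14, 30)

window = lost_activity(backup_time, outage_time)
hours_lost = window.total_seconds() / 3600
print(f"{hours_lost:.1f} hours of activity must be re-entered")
```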

The road to recovery

In most cases, several hours will be required to perform the steps outlined above, and those hours are often costly to the organization. Worse, additional fluctuations could bring back similar problems at any time. If your informational needs are critical, it may be time to consider an uninterruptible power supply. At the least, you should be aware of the cost of a power failure.
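
One rough way to put a number on the cost of a power failure is to add the time users sit idle during recovery to the time spent re-entering lost work. The Python sketch below does just that. Every figure in it is an assumption chosen for illustration, the function name is invented, and a real estimate would include items this one ignores, such as lost business or overtime.

```python
def outage_cost(recovery_hours, idle_users, hourly_cost, reentry_hours=0.0):
    """Rough cost of one outage under simple, stated assumptions.

    recovery_hours: time the system is unavailable while checks are run
    idle_users:     number of people who cannot work during that time
    hourly_cost:    assumed loaded cost of one person-hour
    reentry_hours:  total person-hours spent re-keying lost activity
    """
    idle_cost = recovery_hours * idle_users * hourly_cost
    reentry_cost = reentry_hours * hourly_cost
    return idle_cost + reentry_cost


# Hypothetical site: 4-hour recovery, 25 users, $20 per person-hour,
# plus 40 person-hours of re-entry after a restore from backup.
cost = outage_cost(4, 25, 20.0, reentry_hours=40)
print(f"Estimated cost of one outage: ${cost:,.2f}")
```

Comparing a figure like this, multiplied by the number of outages you expect in a year, with a vendor's quote is one way to judge whether protective equipment would pay for itself.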

Recap: a checklist

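Before the power drops:
  - Keep DISKINIT, LISTVTOC, VERIFY, COPY, and BACKUP ready to run.
  - Keep a flashlight, backup system packs, offsite storage records, and this checklist on hand.
  - Have a relay or breaker installed that cuts all power to the equipment when voltage drops beyond a reasonable threshold.
  - Consider a power analyzer.

When the power drops:
  - Get your flashlight.
  - If the machinery has stopped completely, turn it off.

When the power returns:
  - Wait for the power to stabilize before attempting recovery.
  - Bring up the system with logons inhibited, starting with a single terminal and disk drive.
  - Restart under the normal configuration and write protect all but the system drive.
  - Run the VERIFY option of DISKINIT on all packs, then verify the catalog of all volumes with LISTVTOC.
  - Check indexed files, and ask users to verify account balances or other measurements.

For minor damage:
  - Reorganize damaged documents with COPYWP.
  - Reorganize indexed files with the COPY utility and its REORG option.
  - Correct minor catalog errors with a full backup and initialization of the disk at the earliest opportunity.

For serious damage:
  - Stop using any suspect drive immediately, and do not mount other volumes on it.
  - Call service personnel for errors to the physical surface.
  - Restore all files in the affected system from a prior backup, and expect to lose all activity after the time of that backup.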

Glossary

Term: Definition
Analog telephone systems: Private telephone switches that use voltage and relays for their switching activity.
Brownout: Another term for a sag.
Digital telephone systems: Private telephone switches that use digital processors for their switching activity.
Head crash: Damage to a disk surface caused by contact with a disk head.
Power analyzer: Expensive test equipment that detects and records disturbances in electrical power.
Sag: A sudden decrease in voltage or current.
Surge: A sudden increase in voltage or current.
Uninterruptible power supply (UPS): A power source that continues even if normal electrical service stops. Typically consists of a power conditioner and batteries that supply power for a short period, such as fifteen minutes.
Volt/ohm meter: A common and inexpensive multi-purpose meter used to measure voltage and resistance.
Write protection: An option on disk drives that allows any record to be read but prevents any from being written. Set on most disk drives with a front-panel switch.


Copyright © 1989 Dennis S. Barnes
Reprints of this article are permitted without notification if the source of the information is clearly identified