Tuesday, August 14, 2007

Computer: Heal Yourself!

The Autonomic Computing Initiative at IBM tries to do some really interesting things. The goal for IBM is to make server hardware run without much human intervention. IBM breaks the problem down into four different parts:

1. Automatically install and configure software
2. Automatically find and correct hardware faults
3. Automatically tweak software and hardware for optimal performance
4. Automatically defend itself from potentially unknown attacks

This is an ambitious goal, of course, and they don't intend to complete the project right away. #2 is the interesting one from PC-Doctor's point of view. However, I'd like to try to look at it from IBM's point of view. They (unlike PC-Doctor) have a lot of influence on hardware standards. The question they should be asking is "What sensors can be added to existing hardware technologies to enable us to predict faults before they happen?" Fault prediction isn't the whole story, but it's an interesting part of it.

I'd better admit right away that I don't know much about computer hardware. Uh... "That's not my department" sounds like a good excuse. However, I hang out with some experts, so it's possible that a bit has rubbed off. We'll find out from the comments in a week or two! :-)

Hard drives:


This is an easy one. The SMART (http://www.seagate.com/support/kb/disc/smart.html) standard already allows software to look at correctable failures on the hard drives. If you look at these errors over time, you may be able to make a guess about when a hard drive will fail.

This is nice because the hardware already sends the necessary information all the way up to application software running on the computer.
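As a rough illustration of "looking at these errors over time", here is a minimal sketch that fits a straight line to a daily count of some SMART attribute (reallocated sectors, say) and guesses how many days remain until it crosses a limit. The function names, the sample numbers, and the threshold are all made up for the example; a real drive's failure behavior is unlikely to be this linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) or None."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples taken on the same day
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, counts, threshold):
    """Extrapolate the trend; None means no upward trend was found.

    `threshold` is a hypothetical limit chosen by whoever runs the
    monitoring, not a value defined by the SMART standard.
    """
    fit = fit_line(days, counts)
    if fit is None:
        return None
    slope, intercept = fit
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Made-up readings: error count sampled once a day for four days.
remaining = days_until_threshold([0, 1, 2, 3], [0, 2, 4, 6], threshold=20)
```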

Flash memory:


Flash memory can also fail slowly over time. I don't know of any effort to standardize the reporting of this information, but, at least on the lowest level, some information is available. There are two things that could be looked at.

First, blocks of flash memory will fail periodically. This is similar to a hard drive's sector getting marked as bad. Backup blocks will be available to replace them. Some blocks that come out of fabrication defective will also be marked as bad and replaced before the device ends up in a computer. Device manufacturers probably don't want to admit how many blocks were bad from the beginning, but a company like IBM might have a chance to convince them otherwise.

Second, you could count the number of times that you write to the memory. Manufacturers expect a certain number of writes to cause failures in the device, but I don't know how good these measurements would be at predicting failure.
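Both ideas boil down to simple ratios. The sketch below assumes the device exposes a per-block erase count and a count of spare blocks already used, which is my assumption rather than anything a standard guarantees. The names and the rated cycle count are hypothetical.

```python
def wear_report(erase_counts, rated_cycles, spares_used, spares_total):
    """Summarize flash wear as fractions between 0 and 1.

    erase_counts: erase/write cycles seen by each block (hypothetical data)
    rated_cycles: cycles the manufacturer expects a block to survive
    spares_used / spares_total: backup blocks consumed so far
    """
    worst = max(erase_counts)
    average = sum(erase_counts) / len(erase_counts)
    return {
        "worst_block_wear": worst / rated_cycles,
        "mean_block_wear": average / rated_cycles,
        "spares_remaining": (spares_total - spares_used) / spares_total,
    }


# Made-up numbers: three blocks, 10,000 rated cycles, 5 of 20 spares used.
report = wear_report([1000, 2000, 3000], 10000, spares_used=5, spares_total=20)
```

Whether either number actually predicts failure is the open question from the paragraph above; this only shows how cheaply the numbers could be tracked.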

Fans:


A lot of servers these days can tell you when a fan has failed. They might send some email to IT staff about it and turn on a backup fan. It'd be more impressive if you could predict failures.

Bearing failures are one common failure mode for fans. A failing bearing frequently makes noise before the fan fails completely, so a vibration sensor mounted on a fan might be able to predict an imminent failure. You could also look at either the bearing temperature or the current required to maintain fan speed. Both would provide some indication of increased friction in the bearing.
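One simple way to watch for increased friction would be to compare recent fan current against a baseline recorded while the fan was known to be healthy. This is only a sketch: the function name, the milliamp readings, and the three-sigma cutoff are assumptions I picked for illustration.

```python
from statistics import mean, stdev


def friction_alarm(baseline_ma, recent_ma, sigmas=3.0):
    """Return True if recent current draw is well above the healthy baseline.

    baseline_ma: current samples (mA) at a fixed fan speed while healthy
    recent_ma:   current samples (mA) at the same fan speed, taken recently
    sigmas:      how many baseline standard deviations count as "well above"
    """
    limit = mean(baseline_ma) + sigmas * stdev(baseline_ma)
    return mean(recent_ma) > limit


# Made-up readings at a constant fan speed.
healthy = [100, 102, 98, 101, 99]
worn = friction_alarm(healthy, [110, 112, 111])
fine = friction_alarm(healthy, [100, 101, 99])
```

The same comparison would work for bearing temperature, as long as it's corrected for the ambient temperature.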

Network card:


Some Marvell network cards can test the cable that's plugged into them. The idea is to send a pulse down the cable and time the reflections that come back. The Marvell cards look for failures in the cable, but you could do a more sensitive test and measure when someone kinks the cable or even when someone rolls an office chair over it. If you constantly took these measurements and kept track of changes in the reflections, you might get some interesting info on the cable between the switch and the computer.
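The arithmetic behind timing a reflection is short enough to sketch. The pulse travels to the fault and back, so the distance is half the round trip time multiplied by the signal speed in the cable. The velocity factor of 0.65 below is an assumed ballpark for twisted pair and varies between cables. None of this reflects how the Marvell hardware actually reports its results.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_to_reflection(round_trip_s, velocity_factor=0.65):
    """Distance in meters to whatever caused the reflection.

    velocity_factor is the signal speed as a fraction of c (assumed value).
    """
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def changed_reflections(baseline, current, tolerance):
    """Indexes of time bins whose reflected amplitude moved by more than
    `tolerance` since the baseline measurement (hypothetical data format)."""
    return [
        i
        for i, (old, new) in enumerate(zip(baseline, current))
        if abs(old - new) > tolerance
    ]


# A reflection arriving 100 ns after the pulse was sent.
meters = distance_to_reflection(100e-9)

# Made-up amplitude profiles: the third time bin changed noticeably.
moved = changed_reflections([0.0, 0.1, 0.0], [0.0, 0.1, 0.4], tolerance=0.05)
```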

Printed wiring boards:


You could do some similar measurements with the connections on the PWB that the motherboard is printed on. This might help you learn about some problems that develop over time on a PWB, but I have to admit that I have no idea what sorts of problems might be common.

Shock, vibration, and theft:


Can you get some useful information from accelerometers scattered throughout a computer? Notebooks already do. An accelerometer placed anywhere in a notebook can detect if it's in free fall and park the hard drive heads before the notebook lands on the floor.
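Free fall detection works because an accelerometer at rest reads about 1 g (gravity), while one that is falling reads close to 0 g on all axes. A minimal sketch of that check follows. The 0.3 g cutoff and the three-sample requirement are values I made up, and real notebooks presumably do this in firmware rather than in Python.

```python
import math


def in_free_fall(samples_g, threshold_g=0.3, min_consecutive=3):
    """Return True if the acceleration magnitude stays below threshold_g
    for at least min_consecutive samples in a row.

    samples_g: iterable of (x, y, z) readings in units of g
    """
    run = 0
    for x, y, z in samples_g:
        magnitude = math.sqrt(x * x + y * y + z * z)
        if magnitude < threshold_g:
            run += 1
            if run >= min_consecutive:
                return True  # this is where you'd park the heads
        else:
            run = 0
    return False


# Made-up readings: sitting on a desk, then dropped.
resting = [(0.0, 0.0, 1.0)] * 5
dropped = [(0.0, 0.0, 1.0), (0.05, 0.0, 0.1), (0.0, 0.1, 0.05), (0.02, 0.02, 0.02)]
```

Requiring a few consecutive samples keeps a single noisy reading from parking the heads unnecessarily.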

A typical server doesn't enter free fall frequently, though. One thing you could look for is large vibrations. Presumably, large vibrations could, over time, damage a server. Shock would also damage a server, but it's not obvious when that would happen.

Security might be another interesting application of accelerometers. If you can tell that a hard drive has moved, then you could assume that it has been taken out of its server enclosure and disable it. This might be a good defense against someone stealing an unencrypted hard drive to read data off of it. This would require long term battery backup for the accelerometer system. It would also require a pretty good accelerometer.

IBM sounds as though they want to make some progress on this. It would be really nice to be able to measure the health of a server. Most of my suggestions would add some cost to a computer, so it may only be worthwhile for a critical server.

Now, after I've written the whole thing, I'll have to ask around PC-Doctor and see if anyone here knows what IBM is actually doing!

This originally appeared on PC-Doctor's blog.
