Stories for Computation: Why Care Is Needed


Not All Goofs Are Computational

This website is devoted to pictures of accidents. Some pictures are sad; some should be submitted for the Darwin Awards.

Cerro Grande Fire

The Complete Web Site by NPS.

From the Executive Summary. "On May 4, 2000, in the late evening, fire personnel at Bandelier National Monument, National Park Service, ignited a prescribed fire with an approved plan. Firing and line control occurred during the early morning of May 5. Sporadic wind changes caused some spotting within the unit and a slopover on the upper east fireline. Because of the slopover the prescribed fire was declared a wildfire at 1300 hours on May 5. The fire was contained on May 6 and early on May 7; however, at approximately 1100 hours on May 7 winds increased significantly from the west and resulted in major fire activity and ultimately caused the fire to move out of control to the east on the Santa Fe National Forest. The fire was taken over by a Type 1 team on May 8.

"In its most extreme state on May 10, the Cerro Grande Prescribed Fire was carried by very high winds, with embers blowing a mile or more across the fire lines to the north, south, and east, entering Los Alamos Canyon towards Los Alamos, New Mexico. The towns of Los Alamos and White Rock were in the fire's path and more than 18,000 residents were evacuated. By the end of the day on May 10, the fire had burned 18,000 acres, destroying 235 homes, and damaging many other structures. The fire also spread towards the Los Alamos National Laboratory, and although fires spotted onto the facility's lands, all major structures were secured and no releases of radiation occurred. The fire also burned other private lands and portions of San Ildefonso Pueblo and Santa Clara Pueblo. As of May 17 the fire was uncontrolled and approaching over 45,000 acres."

Epilogue. Comments from colleagues at Los Alamos, Sandia, and the Shodor Education Foundation led me (Stevenson) to believe the fire was started in inappropriate conditions due to a weak model and improper interpretations of the limits of the parameters. I asked the National Park Service about this:

"It was my understanding that the Cerro Grande fire came about because there was a model that showed how the decision parameters must be distributed to be a safe burn. .... [The published report] shows the numbers for the decision, but not how the numbers were supported."

That email was answered by Mr. Dick Bahr, Fire Use Specialist at the NPS Fire Management Program Center.

... You are correct about there being a model showing how the "risk" decision parameter were distributed. Unfortunately, at that time we had nothing in policy or procedures that required a justification in the decision process. Sorry, we have no ability to document the support of the numbers that were used in the Frijoles prescribed fire that became the Cerro Grande fire...."

Cassini-Huygens Communications


The Whole Story

Boris Smeds, an ESA veteran, insisted on a thorough test of the Huygens-Cassini communication system. This test revealed a serious error in the computer program aboard Cassini that would have made it impossible for Cassini to cope with the Doppler shift between Cassini and Huygens. As Huygens descended and landed on Titan, Cassini would be making a close fly-by of Titan; the relative velocity of the two spacecraft was too great for the Cassini receiver to handle.
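For a sense of scale (the numbers below are my own illustrative assumptions, not figures from the mission documents), the first-order Doppler shift of a radio link is the carrier frequency times v/c, so a few kilometers per second of relative velocity move a gigahertz-band carrier by tens of kilohertz:

  # Sketch: first-order Doppler shift of a radio link (illustrative values only).
  C = 299_792_458.0                     # speed of light, m/s

  def doppler_shift_hz(f_carrier_hz, v_relative_mps):
      """Shift seen by a receiver closing at v_relative (positive = approaching)."""
      return f_carrier_hz * v_relative_mps / C

  f_carrier = 2.0e9                     # an assumed 2 GHz (S-band-class) carrier
  for v in (1_000.0, 3_000.0, 6_000.0): # assumed relative velocities, m/s
      print(f"v = {v:6.0f} m/s  ->  shift = {doppler_shift_hz(f_carrier, v)/1e3:5.1f} kHz")

A receiver whose tracking loops assume too narrow a range of shift cannot stay locked on the signal; that is the kind of mismatch Smeds's test uncovered.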

Cautionary Tale in Modeling

Donella H. Meadows et al. The Limits to Growth: A Report for the Club of Rome's Project on the Predicament of Mankind. New York: Universe Books, 1972.

This is actually an old tale from the 60s, but it is worth repeating. The Club of Rome is an active, interdisciplinary, international think tank. In April, 1968, the Club of Rome met to consider the future. The most sensational conclusion was that the limits to growth on this earth would be reached in the middle of the twenty-first century, followed by a dramatic, uncontrollable collapse of population, food production, and all the other significant measures of a society's welfare. More can be found in the Review.

The conclusions have been technically discredited, at least partially because of inappropriate modeling and simulation. In many ways, this work set computer simulation back many years. The point is that mathematics is what you use to write down a model of behavior, but it is no substitute for a valid model.

 

Material from Prior To 1 December 2004

It All Starts When We're Young

We get conditioned early to accept that real world math is different from school math. And we get conditioned to not complain about pennies.

3 Oranges for a Buck

When I was a kid, I worked in my father's newsstand. You know, one of those old catch-all stores; it had all sorts of stuff, if you know what I mean. Lots of that stuff had funny prices, something like ``3 for $1''. When the kids would buy one, we were forced to choose between charging $0.33 or $0.34.

  1. If the charge was $0.34, the kids knew they were being ripped off, but not by how much. The immediate answer is ``two-thirds of a cent.'' But there's another way to look at it that erases the magnitude from the answer and instead answers the question, ``what is the relative difference?'' This would be (34 - 33 1/3)/(33 1/3) = 2%. Two-thirds of a cent doesn't sound so bad. Two percent sounds much worse.
  2. The same logic works for charging $0.33. In this case, my dad was getting ripped off for one percent.

This calculation is known as the relative error between an exact value (33 1/3) and the computed value (34).
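In code, the calculation looks like this (a minimal sketch of the definition, using the prices from the story):

  def relative_error(exact, computed):
      """Relative error |exact - computed| / |exact|."""
      return abs(exact - computed) / abs(exact)

  fair_price = 100.0 / 3.0                          # 33 1/3 cents
  print(f"{relative_error(fair_price, 34.0):.1%}")  # 2.0% -- the kid's loss
  print(f"{relative_error(fair_price, 33.0):.1%}")  # 1.0% -- my dad's loss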

But a Bit of a Payback

Taxes were the next thing I learned about the hard way. Where I lived had a five percent sales tax. In the first place, it wasn't five percent. Had it been truly five percent, the first value taxed would have been $0.10; the first amount for which there was a tax was $0.13. We had lots of things in the store priced at $0.12 so that kids could buy them, in effect, tax free. But the State charged us taxes on the total amount. Suppose there are ten items. Now, which way is the error? From my dad's standpoint, he was being overtaxed by five percent. On the other hand, the state could say that they should get more. Had my dad collected a penny on each sale, he would have collected ten cents and so would have had 4/126, or approximately 3.17 percent, in overcharged tax.
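A sketch of the two computations (all figures come from the story above):

  PRICE, N, RATE = 0.12, 10, 0.05

  owed_on_total = round(RATE * N * PRICE, 2)   # state taxes the $1.20 total: $0.06
  per_item_tax = 0.01 * N                      # a penny per sale would collect $0.10
  total_paid = N * PRICE + owed_on_total       # $1.26 out the door
  overcharge = per_item_tax - owed_on_total    # $0.04
  print(f"overcharge: {overcharge:.2f} = {overcharge / total_paid:.2%} of the total")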

These are two really simple examples of the ways in which approximate arithmetic hurts one economically. To a kid in the 1950s, that one percent was a killer; I also doubt any company can afford to waste one percent of its resources. The problem is, computers do arithmetic much worse than you might imagine. We have not yet seen the effects of round-off error. Round-off error isn't an error in the usual meaning of the term; it simply means that the exact and computed versions differ. For the most part, a single computation, like one multiply or one add, does not cause problems. The problem arises when we chain many operations together. The worst case is called catastrophic cancellation of significant digits, which arises when adding many numbers that are about the same size but of differing signs. Like debits and credits when interest has been applied!
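A minimal sketch of cancellation with ordinary double-precision floats:

  # Doubles carry roughly 16 significant decimal digits, so a small difference
  # between two large, nearly equal numbers can vanish entirely:
  print((1e16 + 0.01) - 1e16)         # 0.0 -- the penny disappeared

  # A ledger of large debits and credits shows the same effect in a sum:
  ledger = [1e8 + 0.01, -1e8, 2e8 + 0.02, -2e8]
  print(sum(ledger))                  # close to, but not exactly, 0.03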

Representation of Numbers

Arithmetic problems also begin with the choice of representation. Aren't all numbers represented the same? No, because the computer uses binary representations. For example, the number 10 is represented as 1010 in binary. This is not a problem for integers, since we can convert between bases (that's what 10 and 2 are called) correctly. The number π is not representable in any finite number of digits, regardless of base. But even simple numbers can cause problems. For example, 0.1 (one-tenth) cannot be represented exactly in binary arithmetic. That means there is always an error, called the round-off error. There are two conventional ways to convert numbers like 0.1.

  1. One is the conventional rounding rule we learned in school. In this case, we round up or down to the nearest representable number.
  2. The other way is called chopping or truncating. In this rule, we just throw away the excess information.

So? The worst-case error in chopping is twice that of rounding. IBM (and clone) mainframes use chopping. This means that calculations done on an IBM mainframe probably cannot be duplicated on any other computer! Even worse, some financial calculations, like the coupon values of bonds, are required by convention to be chopped, while other calculations are rounded. Over the course of many calculations, someone is losing lots of money if the principals are large enough and there are enough transactions. Like banks, maybe?
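Both points are easy to demonstrate. Python's floats are IEEE double precision, and its decimal module can play the role of a machine that rounds versus one that chops (a sketch of the two rules, not of IBM's actual hexadecimal format):

  from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

  # What is actually stored when you write 0.1 in binary floating point:
  print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625

  # Rounding vs. chopping to three decimal places (a stand-in for word size):
  three = Decimal("0.001")
  for text in ("2.71828", "2.71868"):
      x = Decimal(text)
      print(x.quantize(three, rounding=ROUND_HALF_EVEN),   # rounds to nearest
            x.quantize(three, rounding=ROUND_DOWN))        # chops: always errs low

Chopping's worst-case error is a full unit in the last place; rounding's is half a unit, hence ``twice that of rounding.''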

The Telephone Companies

The telephone companies have obviously learned the lesson above: the line of reasoning leads to ``rounding up to the next minute.''

A Simple Example

Many of the above problems can be illustrated with the really simple example below; it is a mathematical simplification chosen to focus on where the problem is. Suppose you have a ball with radius R. Tie a string around the ball and then add some extra length, say L. [Units aren't particularly important.] How high must a pole be to stretch the string taut?

The problem can be solved symbolically by elementary trigonometry and analytic geometry methods (nothing past what a high school math student should be able to do). Let R be the radius of the Earth (but aren't we an oblate spheroid?), let h be the height of the tower and let x be the distance to the horizon. We also need the angle subtended by the distance to the horizon, call it z.

(1) x^2+R^2 = (R+h)^2

(2) tan z - z = L/(2R)

Equation (1) has problems when R and h are such that R+h is about the same as R. This happens when the angle in Equation 2 gets really small. The same problems occur in reverse when the angle gets close to 90 degrees. The problem is computationally easy to solve for L/(2R) close to 1. If this ratio gets too small (R huge, like Earth size, and L small, like ten inches) or too large (L huge and R small) the answer becomes harder and harder to compute.
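A sketch of a numerical solution (variable names and tolerances are mine): seed Newton's method on equation (2) with the small-angle series tan z - z ~ z^3/3, then get the height from h = R(1/cos z - 1), rewritten to dodge the very cancellation the text warns about:

  import math

  def pole_height(R, L):
      """Pole height that pulls L extra units of string taut on a ball of radius R."""
      q = L / (2.0 * R)
      z = (3.0 * q) ** (1.0 / 3.0)             # series seed: tan z - z ~ z**3/3
      for _ in range(50):                       # Newton on f(z) = tan z - z - q
          step = (math.tan(z) - z - q) / math.tan(z) ** 2   # f'(z) = tan(z)**2
          z -= step
          if abs(step) <= 1e-9 * z:
              break
      # 1/cos z - 1 loses digits for tiny z; use 1 - cos z = 2 sin^2(z/2) instead.
      return R * 2.0 * math.sin(z / 2.0) ** 2 / math.cos(z)

  print(pole_height(6_378_137.0, 0.254))        # Earth-sized R, 10 inches of slack: ~48.7 m

Even here, tan z - z itself cancels for tiny z; a production version would switch to the series directly in that regime.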

Past as Prologue

What will be the impact of computer and communication technology on society in 25 years? In 50 years? Rob Slade's review of this book (Risks 19.70) raises the interesting issue of how to see the forest for the trees. I don't have the answer. However, I'd like to make some suggestions as to what the question means.

A paper by Peter Drucker that appeared in the Harvard Business Review a few years ago began: "Every few hundred years, throughout Western history, a sharp transformation has occurred. In a matter of decades, society altogether rearranges itself. Its world view, its basic values, its social and political structures, its arts and institutions. Fifty years later, a new world exists. And the people born into that world cannot even imagine the world in which their grandparents lived and into which their own parents were born. "Our age is such a period of transition."

From memory, Drucker suggested that the current transition can be dated from around 1950 and that, in fact, it may take 75 or 100 years before the full implications become apparent. Typically, technology inventions are used at first to provide new solutions to existing problems. The horseless carriage was exactly that. Typically also, a major technology invention fails to reach its potential until other, enabling inventions have occurred. The steam railway had only limited usefulness until the invention of wrought iron permitted rivers and gorges to be affordably bridged. Gutenberg invented the European version of the movable type printing press about 1450, but it was another 400 years before an industrial process was invented for making cheap paper from wood pulp.

The impact of computer and communication technology 25 or 50 years from now might turn out to be further refinement and spread of existing uses. Then again, other enabling inventions may open directions that we cannot guess at. And they won't necessarily be benign. Suppose the advent of quantum computing allows rapid factoring of extremely large numbers. Much of today's security technology would be in immediate disarray. Or suppose widespread deployment of a secure Internet payment system puts control of the world money supply into the hands of a few maverick bankers in hitherto obscure nations.

Tomorrow may be like today, only more so. Then again, it may herald another "Druckeresque" transformation. Any suggestions about already emerging technologies that may signal the latter?

Can You Count on the Computer Manufacturers?

Here are two stories, two decades apart. Same problem, same response, same solution. The third story relates an inconceivable engineering decision, made for speed alone, that makes many calculations incorrect. I am reminded of a definition of insanity: insanity is doing the same thing over and over, expecting a different result.

Intel Pentium Division Error

As you perhaps recall, Intel developed the Pentium chip. While the design was fine, the manufacturing process left the table used by the divide routine only partially initialized. This turned into both a technical and a marketing disaster. Note that Intel's attitude was that no one other than a few ``egg-head'' computer types would ever run into the bug, and they didn't really matter. Pretty soon, though, people who did matter --- like Excel users --- started running into the bug. If you have a high-tech company that relies on correct computer programs, but they are developed by people who ``don't matter'', where is the reliability of your product? In fact, where is the integrity of your product if that's your attitude? Cleve Moler of MathWorks kept accurate track of the whole debacle; Cleve produced a short description of the issues.

Unfortunately, the exact same bug and the exact same response from the manufacturer (IBM) occurred in the 1970s.

Galloping Gertie

We can't blame this one on computers, since it happened in 1940. But would models of the bridge have been able to forecast the destruction? In those days, probably not; the ability to simulate unusual situations was not there. And obviously the engineers did not suspect it could happen.

Lockheed Electra

Here's a situation in which the model was not sufficiently correct to stop the disasters. This story appeared in two parts on the skyjack.com web site. The story is simple: in 1959, wings were falling off the Lockheed Electras. The problem turned out to be relatively simple: the engines were too powerful and set up a vibration mode in the wings that was not damped out. The vibrations literally tore the wings off the plane. The vibration mode was not accounted for in the models and hence untested.

Biblical Woes

While we might think the cubit of Noah's Ark fame is just one length, the truth is that there are many. The definition of the cubit is the length from the elbow to the tip of the middle finger; thus, there are as many ``cubits'' as there are people! The ancients settled on two definitions: the ordinary cubit was 0.42 meters and the royal cubit was 0.542 meters. For your satisfaction, look up the dimensions of the Ark, estimate the size with the two different definitions, then pick a number, say $1,000 per cubic meter, for construction costs. What if you bid with the lower figure and had expenses at the larger? A sketch of this exercise appears below.
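Here is that exercise sketched out, using the traditional 300 by 50 by 30 cubit dimensions (Genesis 6:15) and the $1,000-per-cubic-meter figure suggested above:

  ARK_CUBITS = (300, 50, 30)        # length, width, height (Genesis 6:15)
  COST_PER_M3 = 1000.0              # dollars, the figure suggested above

  def ark_volume_m3(cubit_m):
      l, w, h = (c * cubit_m for c in ARK_CUBITS)
      return l * w * h

  ordinary = ark_volume_m3(0.42)    # about 33,340 cubic meters
  royal = ark_volume_m3(0.542)      # about 71,649 cubic meters
  print(f"bid low, build royal: eat ${(royal - ordinary) * COST_PER_M3:,.0f}")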

Even The Space Frontier Is Not Immune

Below are several stories concerning NASA's problems. At the outset, we should not be too smug: NASA's problems are, unfortunately for them, out where we all can see. If you look in the technical literature of any discipline, you rarely see an article about the failures, even those that were caught.

It's Harder than You Think

The shuttle was designed almost exclusively by simulations. This note is a story that circulated in simulation circles right after flight testing started. Next time you look at the shuttle during a landing, look at how large the vertical stabilizer and rudder (tail) are. The story is that the first test of yaw (flat turning) characteristics showed that the original rudder worked backwards from how it should: if you pushed left, it went right. Not only did the simulation have no significant digits in its answers, it even got the sign wrong. The redesign included a massive rudder, since the original, much smaller design was found totally ineffective once they got the rudder going in the right direction. [This may be a folk legend, as no one the author has contacted at NASA seems to know about the story.]

STS-49: Shuttle Almost Doesn't Rescue Intelsat6

While trying to rescue Intelsat 6 in May, 1992, the shuttle Endeavour had great difficulty approaching the satellite. It was discovered later that the computer had computed very small, but not zero, values, and the program did not handle the situation correctly. This is an elementary problem every scientific programmer should be aware of. [neumann95, p. 24]
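The standard defense is never to compare computed floating-point values to zero exactly, but against a tolerance scaled to the problem; a minimal sketch:

  def effectively_zero(x, scale=1.0, tol=1e-12):
      """True when x is negligible relative to the problem's characteristic scale."""
      return abs(x) <= tol * scale

  residual = 3.2e-15                 # a "should be zero" leftover from a computation
  print(residual == 0.0)             # False -- exact tests almost never fire
  print(effectively_zero(residual))  # True -- small enough to treat as zero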

Apollo 11 Finds Moon's Gravity Repulsive

The Apollo 11 software had a bug that made gravity repulsive instead of attractive. One gets the image of Isaac Newton in Heaven shaking his head, maybe even shedding a tear. [neumann95, p. 26]

Mariner I Lost

On July 22, 1962, the Mariner I mission was destroyed. There were several problems [neumann95, p. 26], but the software one is particularly illustrative. The ground-based radar system had a bug in the code that was supposed to compensate for varying echo delays; the delay in question was 43 milliseconds. The bug came from the faulty transcription of a handwritten formula.

Another Typo

An early moonshot was way off course due to a Fortran FORMAT error. [hatton95].

All of Hubble's Problems Aren't in the Lens

On December 9, 1991, the Hubble Space Telescope shut down because the computer issued a command to redirect the antenna at a rate faster than the limit the software itself imposed.

Mercury Do Loop

This is a fairly technical problem and probably not worth telling in detail. Suffice it to say that the error was a typo that should have been caught by any number of people and things: compilers, engineers, programmers. The syntax error changed a loop from executing an intended 10 times to executing once. [neumann95, p. 27]

Gemini V Lands 100 Miles Off Course

Gemini V landed 100 miles off course because a programmer played fast and loose with physics. The correct elapsed distance should be calculated using the Sun as a fixed reference point, not a point on the Earth. The programmer instead used elapsed time, thinking the reference point on Earth returns to the same position every 24 hours. It does not: the ratio of 24 hours to the sidereal day is 1.00273790935, as reported by the U.S. Naval Observatory. The accumulated time difference between the two calculations put Gemini V off by 100 miles. [neumann95,hatton95]
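A back-of-the-envelope sketch of why that ratio matters (my numbers, for scale only; the actual 100-mile miss came out of the mission's particular reentry calculation):

  SOLAR_DAY_S = 24 * 3600.0               # the programmer's assumption
  RATIO = 1.00273790935                   # 24 hours / sidereal day (USNO)
  sidereal_day_s = SOLAR_DAY_S / RATIO

  error_s_per_day = SOLAR_DAY_S - sidereal_day_s      # ~236 s of rotation per day
  equator_mph = 24_901.0 / 24.0                       # Earth's surface speed, miles/hour
  miles_per_day = error_s_per_day / 3600.0 * equator_mph
  print(f"{error_s_per_day:.0f} s/day of timing error "
        f"~ {miles_per_day:.0f} miles/day of reference-point drift at the equator")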

What's The Probability of Rain?

On October 15, 1987, Michael Fish of the London Weather Center made the following pronouncement: ``Of course there won't be a hurricane in England.'' A few hours later the worst storm since 1703 hit, uprooting 30 percent of the trees in Southern England with winds of around 100 mph (160 kph). A spokesman for the UK Meteorological Office was reported in Computer News, October 22, 1987, as saying, ``The Cyber 205 is not an easy machine to work with but computers are fallible.'' The computer had mis-predicted the storm track by 160 km. Other European agencies correctly predicted the storm track [hatton95].

Go Get 'Em, Longhorns

On October 15, 1989, thousands of fans streamed towards Dallas-Fort Worth Airport (DFW), headed for the Oklahoma-Texas football game. This is one of the busiest days of the year, every year. At 9:45 AM, the antiquated computers at DFW approach control shut down because an operator had entered what should have been a routine command, but from an unauthorized terminal. The alarm rang, and three seconds later the screens in front of the controllers went blank. Needless to say, but worth saying, all hell broke loose. Even though the computers came back up relatively quickly, it took hours of human toil to straighten out the mess at airports all over the region [lee92]. More telling is the simplicity of the 1989 DFW computers versus today's system: the 1989 computers had only sixteen thousand lines of code; today's air traffic control computers have two million [lee92]. Empirical studies of computer programs have shown that the number of bugs in software varies with the logarithm of the number of lines of code. At that rate, the number of bugs in the new computers should be five times greater than in the 1989 code.

You Sure You Want To Fly?

On August 5, 1988, traffic controllers at Boston Center switched to their new computers, only to discover that a software error was causing the system to randomly send critical information on flight parameters to the wrong controller. Engineers attempted to return to the old system, only to find that it was not working properly. For the next two and one-half hours, the controllers struggled to use the new system and keep traffic flowing safely [lee92].

How Close Was Armageddon?

On October 5, 1960, the North American Defense Command (NORAD) went to 99.9% alert when programmers forgot that the Moon rises and would show on radar [hatton95]. On June 3, 1980, and again on June 6, 1980, NORAD went to full alert and threatened to launch everything [hatton95]. This time, it was training tapes loaded on the live system.

Would We Have Been Safe?

The estimates for the Strategic Defense Initiative (SDI or ``Star Wars'') said that there would be 30 million lines of code, all bug free. This is at least three orders of magnitude greater than has ever been achieved [hatton95]. An empirical rule in software development is the logarithm rule: the number of catastrophic bugs in a code is proportional to the log of the number of lines of code. Since the log (base 10) of 30 million is about 7.5, one could expect 7 catastrophic bugs in that code.

Patriot Clock Drift

Near the end of the Persian Gulf war, an Iraqi Scud missile evaded American Patriot missiles and hit a U.S. Army barracks in Dhahran, Saudi Arabia, killing 28. How had this happened? The Patriots had been singularly successful in stopping Scuds. There was an investigation, of course. The General Accounting Office [gao92] released a report which was subsequently commented on in the journal Science [marshall92] and in [neumann95, p. 34] and [skeel92]. Again, the details are really technical, but the problem is easy to state. The software used two different binary versions of the number 0.1. The tiny difference between the two versions, accumulated through the time calculations, was enough for the launched missiles to completely miss the Scuds.
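The numbers in the GAO report are easy to reproduce. Following the usual reconstruction (GAO; Skeel), 0.1 was stored chopped in a 24-bit fixed-point register, an error of about 9.5e-8, and the system clock counted tenths of a second (the 23 fractional bits below are chosen to match the GAO's quoted error):

  import math

  stored_tenth = math.floor(0.1 * 2**23) / 2**23   # 0.1 chopped in binary
  err_per_tick = 0.1 - stored_tenth                # ~9.54e-8 seconds per tick

  uptime_s = 100 * 3600                            # battery had been up ~100 hours
  ticks = uptime_s * 10                            # clock counts tenths of seconds
  drift_s = ticks * err_per_tick                   # ~0.34 s of accumulated error
  scud_speed_mps = 1676.0                          # ~Mach 5, the GAO's figure
  print(f"clock drift {drift_s:.2f} s -> tracking error ~{drift_s * scud_speed_mps:.0f} m")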

The Oversold Promise...

The B-1B and A7E aircraft histories contain a litany of the triumph of marketing over reality. Read Lee's account [lee92].

The Vincennes

On July 3, 1988, the U.S. Aegis cruiser Vincennes, jammed to the gunnels with computers, radars and the fanciest equipment afloat shot down an Iranian airliner that had complied with every restriction placed on a civilian aircraft in the area. The reason: it appears that Captain Will Rogers III's crew panicked and misinterpreted the information presented --- in effect, they drowned in information. The one man who needed the information, Captain Rogers, could not get a clear picture of what was going on because there was no one computer station that had the complete picture. [lee92].

Epilogue. Quoted from W. Rogers et al., Storm Center, The USS Vincennes and Iran Air Flight 655, Naval Institute Press, Annapolis, MD, 1992.

"When the Iran Air Flight 655 incident occurred, the combat information center on the Vincennes had an integrated display of the air picture. At that time, Capt. Rogers (the commander of the Vincennes) thought then that he had all the information that he needed to make informed decisions. The investigation of the incident showed that he really needed real-time information from the other two ships in his group in addition to that from his own ship's sensors. In fact, one of the destroyers accompanying the Vincennes came to the correct conclusion that the approaching aircraft was commercial but they did not relay that information to Rogers. Nevertheless, a display integrating the picture from all the cooperating ships wouldn't have help if the crews don't want to be cooperative."

Why CEOs Always Have Gray Hair

Just so you don't think scientific and engineering software is the only kind bitten by number problems, here are a couple of stories any business person can relate to.

Four Programs that Cannot Agree

In [hatton97], Les Hatton reports on his comparisons of various types of software with respect to accuracy and reliability. This paper should be read by anyone in software management, software engineering, or scientific programming. Briefly: Hatton compared four commercial-grade software systems for dealing with seismic exploration for oil. The four developers hewed to the software engineering principles set forth in the literature. The four programs were given the exact same input data. The results were shown to geologists with the request that they determine whether or not there was oil --- that is, should the company commit money to exploratory drilling. Guess what: the same data led to completely different conclusions depending on the program that was used.

How is this progress?

``Ironically, while businesses rush development of new programs to gain fleeting competitive advantages, software development becomes comparatively slower and slower as the programs become more complex. Although the power and speed of business computers doubles every three years, software productivity rises a mere four percent a year. The average new business software program takes thirty-two thousand workdays to write. It would take a team of thirty-six programmers almost three years.'' [lee92]

Sabre woes....

In September of 1988, American Airlines found a problem that may have cost them upwards of $50 million. The computer system completely messed up discount ticket sales --- it apparently didn't like discount tickets, so it would show flights as sold out when there were plenty of seats [lee92].

Blue Cross Blues

In 1983, Blue Cross and Blue Shield of Wisconsin spent $200 million on a new computer system. One notable result: by the time it was all straightened out, thirty-five thousand policyholders had switched to other companies.

And All Those Millennium Problems

Perhaps the most (currently) famous problem with computers is the ``Year 2000'' problem. This has many manifestations, but it all boils down to the fact that computers have a fixed idea of numbers. Peter Neumann has eight full pages of tales of calendar disasters[neumann95, pp. 85-92].

Company Loyalty Put To The Test

In 1985, an input error to the master inventory control program of Montgomery Ward caused an entire warehouse in Redding, California, to be ignored. It was not visited for three years, but the people continued to be paid [hatton95]. Any bets on there being another such warehouse in someone's business?

It Still Is Happening

[Canada, The Globe and Mail, 2 May 1998, A9.]

Can't bank on chips, CIBC finds By Suzanne Craig, Financial services reporter

Toronto--A computer glitch that has wreaked havoc with Canadian Imperial Bank of Commerce's computer system this week will be fixed by Monday, the bank says. Any deposits, withdrawals or bill payments made through CIBC bank machines or bill payments made by telephone and personal computer between Tuesday afternoon and Thursday morning have been captured by the bank but not recorded in customer accounts, the bank said.

The author makes the following key points:

Programmers will wonder if Mr Kluge's name reflects on the nature of the software problem. [but only if they pronounce it as Hold(g)yr Kloodge instead of Kloo-ge with a hard g, auf deutsch. PGN]

M.E. Kabay, PhD, CISSP (Kirkland, QC), Director of Education, International Computer Security Association (Carlisle, PA) www.icsa.net

BoNY Has $32 Billion Overdraft

In November, 1985, the Bank of New York ended up with a $32 billion overdraft. The cause was an improperly sized number used in the transaction software. All securities transactions had to be halted. The bank had to get an overnight $23.6 billion loan from the New York Federal Reserve to cover; it cost BoNY $5 million in interest. During the crisis, many customer transactions were delayed. [neumann95,lee92] Lee [lee92] quotes the figure of $2 trillion in money and securities transferred between banks and firms per day.
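The commonly reported detail (see Lee's account) is that a 16-bit counter overflowed once transaction volume passed what 16 bits can hold; a sketch of that failure mode, with everything but the counter width my own illustration:

  MASK_16 = 0xFFFF                        # a 16-bit field holds 0..65535

  counter = 0
  for _ in range(70_000):                 # an unusually heavy day of securities trades
      counter = (counter + 1) & MASK_16   # fixed-width arithmetic silently wraps

  print(counter)                          # 4464 == 70000 - 65536: the books no longer balance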

Computerized Trading

Recall that the October, 1987, stock ``crash'' was caused by computer trading. Because they have no judgement, programs get on a ``positive feedback'' or ``greedy'' leg of their algorithms and can't quit. There are severe restrictions on computerized trading today [lee92,hatton95].

Ah, those ATMs

In March, 1990, an ATM in Oslo, Norway, suddenly had a fit of generosity: it handed out ten times more money than customers requested. Fortunately for Kredittkassen Bank, the computer duly noted the erroneous transactions [lee92]. In December, 1988, a bug in the ATMs of two of the largest banks in the UK created chaos among customers by refusing service for hours on end [hatton95].

Exchange Rate Shenanigans

Due to exchange rate errors, an Australian was able to buy Sri Lankan rupees for $104,500 (Australian). The next day, he sold them to another bank (with the right rate) for $440,258. A judge said the Australian acted without fraudulent intent and let him keep the $335,758 profit. [neumann95, p. 169]

Another Philanthropic Computer

On the night of February 25, 1988, the Australian Commonwealth Bank doubled all debits and credits. This prompted the manager to make the now famous-in-folk-lore comment, ``The effects of software errors are limited only by the imagination...'' [hatton95].

Really Friendly Computers

In 1988, a Delta computer gave away hundreds of free tickets that the awardees did not earn [lee92].

AT&T Goes Haywire --- But So Do MCI and Sprint

On Martin Luther King's Birthday, 1990 --- luckily for AT&T, it was a holiday --- the Long Lines system went completely berserk. Basically, the whole system had something close to a nervous breakdown. A signal sent from a computer in New York City to one in New Jersey caused the routing of calls to become impossible. The system lost track of callers or did not route the calls. The 114 regional centers would go on- and off-line randomly. The problem lasted approximately a day, during which approximately thirty percent of AT&T's one hundred million calls a day were lost. The cause: hastily tested software that was hastily installed. Did anyone learn anything? All three major carriers are plagued with outages; AT&T estimates it loses a billion (yes, with a `b') dollars annually [lee92]. Similar (but unique) problems have happened to MCI and Sprint. In July, 1991, US West billed long distance at ten times the proper rate. It took three days before the error was caught. Guess it was trying to make up for Martin Luther King's birthday [hatton95].

Why Your Bags Don't Get There

On May 22, 1988, Continental Airlines opened its new counters at Newark's brand-new Terminal C. The new service included computerized baggage handling using UPC readers. Although flawless in testing, the first days were a disaster because the ticket agents applied the UPC stickers every which way. The computers could not interpret the codes, but didn't bother to tell anyone [lee92].

Flying By Wire

You perhaps do not realize that just about all modern aircraft are fly by wire: there are no mechanical connections between the pilot and the control surfaces; everything is done by electronics and computers. The Boeing 777, for example, has no backup systems and must rely on its three computers doing everything perfectly. In July of 1983, Air Canada Flight 143, a brand-new Boeing 767, made an emergency landing at an abandoned RCAF airfield at Gimli, Manitoba. Their problems began when a microprocessor that monitors fuel supply malfunctioned. This cut off the engines and the electrical power. Boeing engineers thought it would be impossible to lose both engines and therefore electrical power. But 143 did, and as a consequence was lost to the air traffic system since the transponders didn't work. The whole story has a happy ending because the pilot was a glider pilot who could deal with ``primitive flying conditions'' [lee92].

Airbus Crashed Because Pilots Believe Computers Infallible

On June 26, 1988, an Airbus A320 left Charles de Gaulle Airport in Paris for Basel-Mulhouse airport near Basel, Switzerland with VIP passengers destined for an air show. The A320 was to make a low and slow pass of the Basel airport as a demonstration of the computer controlled capabilities of the plane. The long and the short of it is that the pilots put too much faith in software written by non-pilots and operated the plane too close to the extremes of the envelope. The plane crashed, killing three --- including two children. [lee92]

Roller Coaster Rides

One of the failure modes for fly-by-wire aircraft is unexplained, and therefore unpredictable, seizing of control by the computers. This has manifested itself in violent maneuvers from which the pilot cannot regain control --- and this almost always leads to crashes. The military is particularly prone to this problem. The F/A-18 and F-16 (both US first-line craft) and the Swedish Saab Gripen have had many incidents. For local readers, harken back to March 13, 1985, when a Blackhawk helicopter from Fort Bragg, North Carolina, crashed. The final moments of ``Chalk Three'' were observed from two other helicopters: ``...suddenly pitches sharply upwards, then pitches sharply downwards, plummeting until it impacts the earth at a near vertical angle, upside down and backwards'' [lee92].

Korean Air 007

August 31, 1983. We can all remember the news that a Korean Air 747 had been shot down by the then-Soviet Union over Sakhalin Island. The Soviets claimed the jet was 365 miles inside Soviet airspace. One popular theory is that the navigational instructions given to the inertial guidance system were ten degrees in error. This is believable because of the difficult user interface and the difficulty of recovering from keying errors [lee92].

Floods

Severe flooding in 1983 was caused when a dam stored too much water due to a computer's miscalculation of snow melt [hatton95].

Nuclear Power, Anyone

So far as I can determine, Three Mile Island was caused by a combination of computers, people, and organizational propensity to mess up. See [perrow84] for a full accounting. Just think, computers have had a long time to get worse. On November 24, 1991, the UK newspaper The Independent reported that a computer error at the Sellafield nuclear reprocessing plant caused the radiation safety doors to open.

Chaos by Club Activity

A bug in the Digital (DEC) VMS operating system security module allowed the Chaos Computer Club to wreak havoc on NASA's space physics network. See Digital Review, November 23, 1987.

The Therac-25

The Therac-25 was a medical linear accelerator used to kill cancers deep in the body. In May of 1984, Katy Yarbrough was literally drilled by an overdose of energy. She would die in 1988 from a car accident, after years of excruciating pain. The clinic administering the machine never bothered to contact the manufacturer. But this was not the end of the story. [lee92] Time Line. The cause? A long-dormant bug that allowed fast, experienced operators to outrun the machine's safety checks. The bug was exacerbated by a terrible user interface that encouraged operators to ignore warning messages.

A List of Other Undocumented Features

A partial list of problems over the years.

The Disappearing Lake

In 1980, Diamond Crystal Salt and Texaco had an unusual interaction. Texaco was drilling in Lake Peigneur in Louisiana. Suddenly, the drill started bouncing up and down. This is quite a trick since the rig weighs around 40 tons. The Lake was also getting shallower --- I gave it away, the Texaco rig had drilled into a salt mine owned by Diamond Crystal. You have to read the accounts by [gold81,perrow84] to get the full hilarity (no one was killed).

Training Tape

A training tape was inadvertently loaded into a live communications system at NORAD. It portrayed a massive Soviet submarine missile attack. The counterattack never materialized because alert operators cross-checked the information against independent systems.

And how close was Armageddon?

Whenever there is an apparent threat to the United States by air (or missile), the North American Air Defense Command at Cheyenne Mountain, Colorado, gets the call. If the threat is credible, then all the players in the U.S. defense establishment hold a conference by phone. In 1979, NORAD held 1,544 missile display conferences --- a conference called because the system detects a missile firing somewhere. In the first six months of 1980, there were 2,159. [perrow84]

Some Famous Chemical and Nuclear Disasters

Chemical plants are notoriously hard to control. Here are some notable disasters, although most are not directly attributable to computers.

These ``super systems'' come from a class referred to as high integrity protective systems. Such systems have mind-boggling interaction complexity along with tight coupling. What's next?

Closers

The first is the story of an Arab gentleman, reported in Computer Talk, February 1, 1988. It seems the gentleman was driven to near sexual exhaustion due to a bug in his harem organization program [hatton95].

Nancy Leveson has led the way in alerting people to safety issues. Here are two from [leveson93].

Finally, my favorite. Nancy Leveson gave a talk at Clemson. She told the story of an aircraft manufacturer who had installed lots of computers that would make pilots a thing of the past. The engineers touted that the computers would ``fail once in 10 raised to the sixteenth power seconds.'' (That's once in roughly 317 million years; for comparison, the Earth is thought to be 4.5 billion years old.) On the maiden voyage of the prototype, but with a test pilot in the left seat, just in case, the plane was making its final approach when --- yup --- the computers malfunctioned. The pilot looked at his watch, shook his wrist, and said ``My, my, my. Ten to the sixteenth seconds already. How time flies.'' So much for six sigma engineering (``Six Sigma Engineering'' is a play on statistical terminology; it can be taken to mean ``no unplanned deviations'').

Looking at History

This tale comes from the development of the transistor.

"The model of technological innovation to which Shockley subscribed -- in which the grand scientific idea leads and the engineering process follows in a 'trickle down' sequence, working out the gritty details -- could not cope with the conditions of the late 1950s.

"The linear model is not the way this industry developed," argues Gordon Moore. "It's not science becomes technology [that then] becomes products. It's technology that gets the science to come along behind it."

He is only partly right. The applied, mission-oriented scientific research that Bardeen, Brattain, Shockley, and others did at Bell Labs in the postwar years was certainly a response to technological exigency. They tried to understand the detailed behavior of semiconductors in hopes that such knowledge eventually would lead to useful new devices. "Respect for the scientific aspects of practical problems," Shockley often called this attitude. But their work was based on a firm foundation of quantum theory and a broad understanding of atomic and crystal structure -- curiosity-driven research done during the first third of the century, mainly in Europe, with but passing regard for its practical applications. American pragmatism may have fashioned the transistor and microchip, but it did so from a fabric woven across the Atlantic by speculative, philosophical inquiry. --Michael Riordan and Lillian Hoddeson Crystal Fire: The Birth of the Information Age. 1997.

Cell Phone Woes

About 11:30 a.m. on April 29, 1998, a friend of mine who works in the city of Newark, Ohio, called me from his cell phone to ask if I could reach any of his Internet-connected hosts. Newark is very near Columbus. The area around it probably has a "daytime" population of about 75,000. I confirmed that his entire network was unreachable, and even his ISP's connection between Columbus and Newark was down. He then told me that since 10:30 a.m., no one in Newark had any telephone service whatsoever. No one could take credit cards. Some stores' POS terminals would not work. No ATM's were working. Even the digital cellular network in Newark became unusable--probably due to overload, as a result of picking up some of the slack. There was some restoration of local phone service around 2 p.m., but no long distance service, even "close" long distance, such as to Columbus. At 6:45 p.m., service was finally restored. The problem? One communications tower had been taken out of service due to some sort of accident.

Computer glitch turns traffic ticket into sex conviction

BOZEMAN, Mont. (April 29, 1998 1:55 p.m. EDT) -- Cody Johnston is suing a weekly newspaper and the court system for libel after a computer glitch transformed a report of a traffic ticket into a conviction for deviate sexual conduct. Johnston had been fined $195 for a commercial trucking weight violation. But the list given to the newspaper contained the sex charge, which covers homosexual acts and bestiality. [Source: *Nando Times of Japan (www.nando.net), courtesy of Keith Rhodes. PGN Abstracting]

An Invitation to Lose Ships in a Vast Ocean

[Risks Digest 19.76] ANNAPOLIS, Md. (AP) -- The computer has sunk the ancient art of celestial navigation at the Naval Academy. In the new academic year, midshipmen will no longer have to learn the often tedious task of using a wedge-shaped sextant to look at the stars and plot a ship's course. Instead, the academy is adding a few extra lessons on how to navigate by computer. Naval officials said using a sextant, which is accurate to a three-mile radius, is obsolete. A satellite-linked computer can pinpoint a ship within 60 feet. [...]

Schiphol Airport Out of Action

[Risks Digest 19.85] At about 13:00 on 11 July 1998, one of the busiest days of the year for Schiphol, the Amsterdam airport, a computer malfunction stopped just about all air operations. According to the Dutch newspapers, some malfunction in the Triple A (AAA) air-traffic-control system blanked all screens, forcing the airport to put all traffic 'on hold'. It took about 30 minutes to get the system back up, and the rest of the day to clear the resulting mess. According to a spokesperson, "You can't use full capacity at once, you have to build that up." The Triple A system has been in use since 1 June 1998. One more interesting quote appeared in the newspaper: ``Stories about ripped-apart cables are nonsense. The defect has been fixed, and we're not afraid it will happen again.'' [At least two risks in this quote: what do you tell your customers, and what do you mean, you're not afraid it will happen again?]

Air Bags Caused 130 Injuries

[Risks Digest 19.85] After 130 reported injuries due to gratuitous deployment of automobile airbags, General Motors is recalling almost one million cars (1996 and 1997 Chevy Cavaliers and Pontiac Sunfires, and 1996 Cadillac DeVilles, Concours, Sevilles, and Eldorados). The Cavaliers and Sunfires have a sensor calibration problem that enables the air bags to inflate even under normal conditions on paved roads (perhaps an object bouncing up against the underside); the fix involves a little software reprogramming. The Cadillac air bags can deploy when there is moisture on the floor under the driver's seat, where the computer is located. A fix might involve waterproofing the computer box. [Source: AP item, 14 July 1998, PGN Abstracting]

Armageddon Yet Again

[Risks Digest 19.85] A Norwegian weather research rocket was mistaken for an American Trident ballistic missile in 1995. This was due to "the poor state of the [Russian] early warning systems." After the missile was spotted, a ten-minute countdown began toward a retaliatory strike on the US. The Strategic Rocket Forces were commanded to get ready for the next order, which would have been the launch order.

USS Yorktown Adrift Due to NT Error

[Risks Digest 19.88] The Navy's Smart Ship technology is being considered a success, because it has resulted in reduced manpower, workloads, maintenance and costs for sailors aboard the Aegis missile cruiser USS Yorktown. However, in September, 1997, the Yorktown suffered a systems failure during maneuvers off the coast of Cape Charles, VA., apparently as a result of the failure to prevent a divide by zero in a Windows NT application. The zero seems to have been an erroneous data item that was manually entered. Atlantic Fleet officials said the ship was dead in the water for about 2 hours and 45 minutes. A previous loss of propulsion occurred on 2 May 1997, also due to software. Other system collapses are also indicated.

This caused one Risks reader to propose an alternative to world disarmament: "Ideally, the next few generations of operating systems will end up being so incompatible with legacy systems that no country anywhere will be able to wage war." Robin Sheppard

The incident was much discussed on comp.risks.

Crummy Software Loses Company One-Half Billion Dollars

["News Reader" bitslicer@hotmail.com]

Patrick Doyle wrote in message ... In article Nancy Mead wrote:

It is amazing to me that after 30+ years in this business, correctness is not expected or provided in our software, and that serious practitioners feel compelled to justify a focus on correctness. What if some software companies started taking legal responsibility for the correct functioning of their programs? I think that would become popular with users pretty quickly.

It's already happening in other areas of software...

After numerous citywide failures related to software, PrimeCo called Motorola and told them come pick up their $500 million (yes, 1/2 Billion) of cellular infrastructure equipment so it could be replaced with Lucent. From what I heard, cell service would be knocked out for minutes on end in cities due to a software glitch.

So I guess you could say Motorola took responsibility for their actions. Funny, too, because they're some of the world's most vocal CMM proponents. I wonder if the crew that worked on this particular hack of software tended towards the 5.0 side or the 1.0 side? Is there a rung on the ladder that discusses how to graciously refund money and clean up after you've removed equipment from a customer site? :)

I can just see managers saying to their managers, "I don't see why we failed. We followed all the rules."

This, unfortunately, happens all the time. A place I worked at was so heavily pushing TQM (Total Quality Management) that *actual* quality got sacrificed as people were so busy producing quality reports all the time, that the actual useful work suffered. ("I don't have time to investigate that bug because i'm too busy writing a report on why that bug couldn't have happened in the first place. Then I have to go to a meeting and present all sorts of charts and diagrams about the quality procedures i'm following.") Also, managers that hire "bad" programmers who happen to know all the rules, vs. "good" programmers that don't, therefore the "bad" programmer is better. (Another place I am familiar with had a programmer that could "play the game" quite well, but didn't produce more than about 20 lines of code a *year*. I wish that were an exaggeration. Actual output was not one of the metrics used for evaluations. Consequently, good people were "let go" and bad people were kept; based on how well they could "play the game")

- Know Future The avalanche has started, it is too late for the pebbles to vote.

Satellites Give Wrong Data on Global Warming for 20 Years

[Risks Digest 19.91] Despite rampant evidence of global warming, satellite evidence over the past 20 years has been suggesting that the earth's atmosphere is cooling. Frank J. Wentz and Matthias Schabel, scientists at Remote Sensing Systems in Santa Rosa, California, have published a study in Nature that concludes that atmospheric temperatures have in fact increased, and that the previous satellite data was erroneous -- in part because orbiting thermometers lose altitude and in part because of the computers. The argument based on the global warping of orbits continues the global warring among scientists and policy makers as to whether there really is global warming that merits global warning. The Washington Post, 13 Aug 1998.

If you happen to have particular faith in satellite data, please don't forget Bill McGarry's report in RISKS-3.29 about how the very clear early warning of ozone-layer depletion over the South Pole was ignored for many years because the dramatic data values were rejected by the software — because they were so extreme. That case is one of the rare examples of a bounds check that should have been missing (as opposed to all of the missing bounds checks that we report in RISKS as causing security flaws or other problems).

Date: Thu, 3 Sep 1998
From: zowie@urania.nascom.nasa.gov (Craig DeForest)
Subject: Near-loss of SOHO spacecraft attributed to operational errors

RISKS readers will remember that on 24 Jun 1998, the international billion-dollar SOHO satellite lost contact with Earth and began spinning in an uncontrolled fashion. [RISKS-19.87,90] An Investigative Board was established by ESA and NASA to determine the cause of the disruption. That board has now released its final report; it is a very interesting case study of a failure in complex systems management.

The proximal cause of the loss was a misidentification of a faulty gyroscope: two redundant gyroscopes, one of which had been spun down(!), gave conflicting signals about the spacecraft roll rate, and the ops team switched off the functioning gyro. The spun-down gyro became SOHO's only source of information about roll attitude, causing SOHO to spin itself up on the roll axis until the pre-programmed pitch and yaw control laws became unstable. This was the last in a series of glitches in the operational timeline on the 24th of June; the full story is available at the above web site.

There were many other factors leading to the loss. The report reads like a roll call of well-known RISKy behaviors, including a staffing level too low for periods of intensive operations; lack of fully trained personnel due to staffing turnover; an overly ambitious operational schedule; individual procedure changes made without adequate systems-level review; lack of validation and testing of the planned sequence of operations; failure to carefully consider discrepancies in available data; and emphasis on science return at the expense of spacecraft safety. The board "strongly recommends that [ESA and NASA] proceed ... with a comprehensive review of SOHO operations ... prior to the resumption of SOHO normal operations". Contact with SOHO has since been re-established, and -- following thawing of the frozen hydrazine rocket fuel on board -- full attitude control is expected within a couple of weeks, allowing recommissioning and testing of the spacecraft and instruments.

More Nuclear Holocaust Averted

[risks 20.39; 24 Sep 1998] The Daily Express today (24-SEP-1998) reports - taken from Kommersant Vlast magazine - on an event that took place almost 15 years ago, at 21:00 BST, 25-SEP-1983. Computer screens for the early warning system at the Serpukhov-15 base indicated that a Minuteman ICBM was en route to Moscow, followed seconds later by other missiles. If the threat had been confirmed within 10 minutes, and Soviet leader Yuri Andropov informed of this, a counter-strike would almost certainly have been issued. However, Lieutenant-Colonel Stanislav Petrov, "armed with a creaking computer", was responsible for analysing data from the Oko satellite, Kosmos 1382, and knew that it was subject to faulty readings caused by radiation damage. He also knew that the launch was not confirmed by ground-based warning systems, and did not alert the Kremlin. An inquiry commission later came away "terrified" at the appalling dangers created by the defective early warning system. Mark Corcoran, VMS Systems/Site/Security/Comms & Network Manager, Softel Ltd.

More Rocket Disasters

Doomed Titan 4B Milstar launch

[RISKS, 13 May 1999] From: "Keith A Rhodes" rhodesk.aimd@gao.gov. Subject: Faulty software doomed Titan 4B Milstar launch (RISKS-20.36). The 30 Apr 1999 improper Milstar orbit was the result of Lockheed Martin engineers loading flawed software into the Titan/Centaur rocket. The flaw was not detected despite extensive prelaunch "verification". The report will be published next week in Aviation Week and Space Technology. The software was verified at Lockheed Martin Astronautics in Littleton, Colo. [Source: Article by Todd Halvorson, Florida Today, 8 May 1999]

Ariane 5

The Ariane 5 rocket was forced to self-destruct less than one minute into its maiden flight. As it happened, the board appointed by CNES (Centre national des études spatiales) and ESA (the European Space Agency) to investigate the failure was chaired by applied mathematician Jacques-Louis Lions of the Collège de France. The story of the uncovering of the software error is summarized here, based on an English translation of parts of the board's report, which was completed within six weeks of the explosion.

A Look At the Logic of It.
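The board's report traced the failure to an unprotected conversion of a 64-bit floating-point value (horizontal velocity) to a 16-bit signed integer; Ariane 5's faster trajectory produced values that no longer fit, and the resulting Operand Error went unhandled. A sketch of that failure mode (the magnitudes below are illustrative, not flight data):

  def to_int16(x):
      """Convert to a 16-bit signed integer, raising on overflow (as Ada does)."""
      n = int(x)
      if not -32768 <= n <= 32767:
          raise OverflowError(f"{x} does not fit in 16 bits")  # the unhandled event
      return n

  print(to_int16(20_000.0))            # an Ariane 4-sized value: fine
  try:
      print(to_int16(64_000.0))        # an Ariane 5-sized value: overflows
  except OverflowError as e:
      print("unhandled on the maiden flight:", e)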

Mars Climate Orbiter

September 30, 1999. NASA announced that the loss of the Mars Climate Orbiter was due to a mismatch of physical units: one development team worked in English units (pound-force seconds) while another used metric units (newton seconds). News Article
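The fix would have been a single conversion factor; a sketch of the mismatch (the impulse value is illustrative, the factor is exact):

  N_S_PER_LBF_S = 4.448222          # newton-seconds per pound-force-second

  impulse_lbf_s = 100.0             # produced by one team, in pound-force seconds
  read_as_N_s = impulse_lbf_s       # consumed by the other team as newton-seconds: the bug
  correct_N_s = impulse_lbf_s * N_S_PER_LBF_S

  print(f"navigation used {read_as_N_s:.1f} N*s where {correct_N_s:.1f} N*s was meant "
        f"-- every thruster firing understated by ~4.45x")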

Shuttle Columbia Disaster

The Complete Columbia Report

Why Things Are The Way They Are


To: Members of GCFL <gcfl@gcfl.net>
Subject: [GCFL] The Impact of the Roman Empire on Space Shuttle Design
Date: 20 Oct 1999 08:00:14 -0000

A useless fact (with a twist) about technology:

The US standard railroad gauge (distance between the rails) is 4
feet 8.5 inches. That's an exceedingly odd number.

Why was that gauge used? Because that's the way they built them in
England, and English expatriates built the US railroads.

Why did the English build them like that? Because the first rail
lines were built by the same people who built the pre-railroad
tramways, and that's the gauge they used.

Why did 'they' use that gauge then? Because the people who built the
tramways used the same jigs and tools that they used for building
wagons, which used that wheel spacing.

Okay! Why did the wagons have that particular odd wheel spacing?
Well, if they tried to use any other spacing, the wagon wheels would
break on some of the old, long distance roads in England, because
that's the spacing of the wheel ruts.

So who built those old rutted roads? The first long distance roads
in Europe (and England) were built by Imperial Rome for their
legions. The roads have been used ever since. And the ruts? Roman
war chariots first made the initial ruts, which everyone else had to
match for fear of destroying their wagon wheels and wagons. Since
the chariots were made for, or by Imperial Rome, they were all alike
in the matter of wheel spacing.

Thus, we have the answer to the original question. The United States
standard railroad gauge of 4 feet, 8.5 inches derives from the
original specification for an Imperial Roman war chariot.

Specifications and bureaucracies live forever. So, the next time you
are handed a specification and wonder which horse's rear came up
with it, you may be exactly right. Because the Imperial Roman war
chariots were made just wide enough to accommodate the back ends of
two war-horses.

And now, the twist to the story...

There's an interesting extension to the story about railroad gauges
and horses' behinds. When we see a Space Shuttle sitting on its
launch pad, there are two big booster rockets attached to the sides
of the main fuel tank. These are solid rocket boosters, or SRBs.
Thiokol makes the SRBs at their factory at Utah. The engineers who
designed the SRBs might have preferred to make them a bit fatter,
but the SRBs had to be shipped by train from the factory to the
launch site. The railroad line from the factory had to run through a
tunnel in the mountains. The SRBs had to fit through that tunnel.
The tunnel is slightly wider than the railroad track, and the
railroad track is about as wide as two horses behinds.

So, the major design feature of what is arguably the world's most
advanced transportation system was determined by the width of a
Horse's [rear]!

Think about it!

Received from L. Rodney Ford via Doug Taylor.

References

[bentle93] John P. Bentley. An Introduction to Reliability and Quality Engineering. Longman Scientific & Technical (John Wiley). 1993.
[gao92] GAO. Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia. GPO: GAO/IMTEC-92-26. 1992.
[gold81] Michael Gold. Who Pulled the Plug on Lake Peigneur? Science 81. Nov. 1981. 56-63.
[hatton95] Les Hatton. Safer C: Developing Software for High-Integrity and Safety-Critical Systems. McGraw-Hill. 1995.
[hatton97] Les Hatton. The T Experiments: Errors in Scientific Software. IEEE Computational Science & Engineering 4(2). 1997.
[johnson85] Clarence L. ``Kelly'' Johnson and Maggie Smith. Kelly: More Than My Share of It All. Smithsonian Institution Press. Ch. 9. 1985.
[lee92] Leonard Lee. The Day the Phones Stopped. New York: Primus-Donald I. Fine, Inc. 1992.
[leveson93] N. Leveson. Software System Safety. In STAR '93, Darlington, Ontario. 1993.
[marshall92] Eliot Marshall. Fatal Error: How Patriot Overlooked a Scud. Science. March 13, 1992.
[mowen93] John C. Mowen. Judgment Calls: High-Stakes Decisions in a Risky World. NY: Simon & Schuster. 1993.
[neumann95] Peter G. Neumann. Computer-Related Risks. Addison-Wesley. 1995.
[perrow84] Charles Perrow. Normal Accidents. Basic Books. 1984.
[peterson95] Ivars Peterson. Fatal Defect: Chasing Killer Computer Bugs. New York: Times Books. 1995.
[rsre88] Defence Research Agency. Royal Signals and Radar Establishment Report on NATO Software. Ministry of Defence, U.K. 1988.
[skeel92] Robert Skeel. Roundoff Error and the Patriot Missile. SIAM News 25(4). July 1992.