No matter how many people learn it, and how many online schools open up to teach it, working on software is hard. Whether it's a mobile app, a website, or a spacecraft rocketing around the planet, each one needs software to tell it what to do next and to make sure it does so efficiently and safely. That's not always how it goes, however. Sometimes the software tells a satellite to do something wrong, millions of times per second. Sometimes these bugs are just an inconvenience, but other times they can cost hundreds of millions of dollars.
Let's take a look at a few of the worst software blunders in no particular order.
The LightSail
This one just happened! First, a little back story on the LightSail, because it's pretty cool. The LightSail started as a Kickstarter campaign by none other than Bill Nye and The Planetary Society. It's been successfully funded and it's set for an official primary launch in 2016. It's essentially a cube-shaped satellite, or CubeSat, with an expanding sheet of Mylar that will use the momentum from the Sun's photons to propel itself through space. The idea is that we don't need any pesky rocket fuel to get to other planets. Just recently though, a prototype was launched into good old space to test out its capabilities. And it's not doing so well.
Looks like a computer bug has made the LightSail unable to communicate with mission control.
The problem? It looks like the LightSail was launched with an older version of its Linux-based OS. Ouch. It's supposed to send back data as it circles the planet and store that data locally, but the file it writes to grew too big for its own good and the system crashed. It looks like 32MB was the limit here. Unfortunately, you can't just hold the power button on this one. Attempts to restart the system have already been made, but it looks like our sun-faring friend is refusing to accept any commands.
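To make the "file grew too big" idea concrete, here's a minimal sketch of one common defense: cap the file and roll it over before it hits the limit. This is not the LightSail's actual code. The function name `append_capped`, the `.old` suffix, and the file layout are all made up for illustration. Only the 32MB figure comes from the story above.

```python
import os

# Cap borrowed from the 32MB figure above; everything else here is hypothetical.
MAX_BYTES = 32 * 1024 * 1024


def append_capped(path, line, max_bytes=MAX_BYTES):
    """Append one line to `path`, rolling the file over before it exceeds the cap."""
    data = (line + "\n").encode("utf-8")
    size = os.path.getsize(path) if os.path.exists(path) else 0
    if size and size + len(data) > max_bytes:
        # Keep a single previous generation and start a fresh file.
        os.replace(path, path + ".old")
    with open(path, "ab") as f:
        f.write(data)
```

The design choice is simply that the writer checks the size on every append, so the file can never silently grow past what the rest of the system can handle.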
Northeast Blackout Of 2003
This is the world's second-largest blackout in history, which is saying something. Was it a hardware issue? Afraid not. Did a technician accidentally cut through a cable? Not even close. Once again, this could be attributed to a software failure. This blackout affected an estimated 55 million people spread across 8 U.S. states and Ontario, Canada. The main cause was a faulty alarm system that didn't alert operators to redistribute power after overloaded transmission lines hit foliage. So to save face, we can blame the foliage. Pesky things. And the management of the trees was in fact deemed one of the factors once the investigation was completed.
To be more accurate, however, the actual cause was a race condition in the Unix-based system that was being used. And that's all I'm going to say on this, because it turns out that the electrical grid is incredibly complicated.
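If you've never run into a race condition, here's a toy sketch of the general idea in Python. It has nothing to do with the real alarm system's code. The `AlarmQueue` class and the `run` helper are invented for illustration. Two or more threads do a read-then-write on shared state, and unless a lock guards that step, updates can get lost.

```python
import threading


class AlarmQueue:
    """Toy shared counter standing in for shared alarm state (hypothetical)."""

    def __init__(self):
        self.pending = 0
        self._lock = threading.Lock()

    def add_unsafe(self):
        current = self.pending       # read
        self.pending = current + 1   # write; another thread may have written in between

    def add_safe(self):
        with self._lock:             # only one thread at a time does the read and write
            self.pending += 1


def run(method_name, threads=8, per_thread=10_000):
    """Call the named method from several threads and return the final count."""
    queue = AlarmQueue()

    def worker():
        add = getattr(queue, method_name)
        for _ in range(per_thread):
            add()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return queue.pending
```

With `add_safe` the count always comes out to `threads * per_thread`. With `add_unsafe` it may come out lower on some runs and fine on others, which is exactly what makes these bugs so nasty to track down.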
The Mars Climate Orbiter
In 1998 NASA launched the Mars Climate Orbiter to study the Martian climate and surface changes. In 1999 the Orbiter lost communication because it pretty much disintegrated on Mars. Yep, that'll do it. The main culprit, however, was something much simpler. It's a pretty common bug in software development. A piece of software supplied by Lockheed Martin returned results in imperial units, while another system that was supplied by NASA used metric units, which are what the specifications called for. The result was that the orbiter ended up flying much closer to the atmosphere than it was capable of surviving, and thus turned to dust.
The result of all this was a change in management and new policies and procedures to make sure that issues like this get caught somewhere in the development timeline. Which is always a plus, and something that should be in place from day 1.
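One way to guard against this kind of mix-up is to stop passing bare numbers around and carry the unit with the value. Here's a rough sketch, assuming the quantity in question is an impulse measured in pound-force seconds on one side and newton-seconds on the other. The `Impulse` class and `total_impulse` function are hypothetical names, not anything from NASA's or Lockheed Martin's software.

```python
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse that is always stored in newton-seconds (hypothetical type)."""

    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, value):
        # The conversion happens once, at the boundary where imperial data enters.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(readings):
    """Sum a list of Impulse values; a bare float fails loudly with AttributeError."""
    return Impulse(sum(r.newton_seconds for r in readings))
```

The point of the wrapper is that a raw number from the "wrong" system can't slip into the math unnoticed: it either gets converted on the way in or blows up right away.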
The Therac-25
The Therac-25 was a radiation therapy machine that was used during the 1980s. It is believed to have been involved in at least 6 incidents in which it gave patients massive doses of radiation due to faulty programming. Essentially, the high-power electron beam was activated instead of the lower-power one, without a beam spreader in place. Older models had hardware fail-safes in place to prevent such things, but the Therac-25 instead relied on software fail-safes, which were susceptible to a race condition bug. We saw that bug make an appearance above. It's a pretty common bug in the world of software development. There were a ton of problems with this machine though, and they went beyond just software. There were no software reviews in place, the UI was confusing and made little sense to the operators, the operators just ignored the errors and moved forward, and so on.
The Therac-25 case is taught in colleges so that future software engineers can see the possible results and the responsibility that comes with working on such complex systems.
Knight Capital Group
How does losing over $400 million sound? Because Knight Capital Group managed to do that in about 45 minutes in 2012. The problem? Bad code and a bad deployment, by the looks of it. Old code was reused and someone forgot to copy the new code over to one of the 8 servers. Which of course means that this thing is going to run out of control and do the unexpected. And it did. Knight Capital ended up sending millions of child orders for 212 parent orders, resulting in, well, a $400 million loss pretty much. The company managed to stay afloat by getting a last-minute $400 million investment from half a dozen investors.
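A cheap sanity check for this kind of deployment slip is to confirm that every server is running the same build before you flip anything on. Here's a minimal sketch, assuming you can read the deployed code off each machine as bytes. The `fingerprint` and `find_stale_servers` functions and the server names are hypothetical and have nothing to do with Knight Capital's actual setup.

```python
import hashlib


def fingerprint(code_bytes):
    """Return a SHA-256 hex digest identifying one build of the code."""
    return hashlib.sha256(code_bytes).hexdigest()


def find_stale_servers(deployed, expected_code):
    """Return the sorted names of servers whose code differs from the expected build.

    `deployed` maps a server name to the bytes of the code found on that server.
    """
    expected = fingerprint(expected_code)
    return sorted(
        name for name, code in deployed.items() if fingerprint(code) != expected
    )
```

For example, with 8 servers where one still holds the old build, `find_stale_servers` returns just that one name, and the rollout can stop right there instead of going live.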
Here's the thing. All of these bugs could have been avoided, in a perfect world. In our world however, no such luck. Managers don't manage well, and programmers do whatever they can in whatever short deadline they're given. I've seen plenty of bugs in my day. I've seen all of the images on a website vanish overnight never to be seen again, and I've seen reports so incorrect they might as well have belonged to another company. Programming is hard. And it's a serious job. It's one of the few jobs where one mistake can affect millions of people in a second. On that note, it's also fun and bugs are going to happen ^_^ So always triple-check your work.