As a software developer, I’m interested in, and concerned about, the state of our software. Software does not always work, and even critical bugs often take hours or days to fix.
The cost of faulty software to society is huge.
6 hours to fix a 911 outage
In my research on this topic, the first incident I’d like to highlight comes from a 2017 article.
According to The Atlantic’s article “The Coming Software Apocalypse”, it took 6 hours to fix a bug that brought down the 911 system of Washington state, USA. For those of you who are not in the United States, 911 is the emergency number. During this outage, people in emergencies were unable to reach police and ambulance services by dialing it. The fix was changing the value of a variable used as a counter. So, in this case, about as urgent a bug as there can be still took 6 hours to fix.
I wonder about accountability vs. technical complexity here. The failure happened at midnight. Would it have taken as long to fix if it had happened during business hours? Did it take that long because the cause was hard to determine, or simply because the person who knew the system was asleep? The article does not provide those details, but it does discuss complexity.
2 days to fix a diabetes software outage
Over the 2019 Thanksgiving holiday, the company Dexcom experienced a server outage. Dexcom’s software is used by people with diabetes to let their families keep track of their blood glucose. The outage meant that parents of kids with diabetes didn’t receive alerts about hypoglycemia, a low blood glucose event. (Hypoglycemia, left untreated, can cause coma and death.) According to a CNET article, the outage lasted over 2 days: it started on Saturday morning, and on Monday there were still intermittent outages. According to Dexcom, no recent code change had caused the incident, which made it hard to troubleshoot.
- The Coming Software Apocalypse – Discusses software complexity as a cause of the 911 outage.