
Reading 7: Safety

This assignment is due on Sunday, November 27 at 11:59pm.

Exercise 1. Read The Therac-25: 30 Years Later by Leveson, focusing on what she says we should do.
  • When you follow the link above with your browser, you should see Leveson’s article, as well as a button “<” in the upper-right corner. Use the “<” button to expand the annotation sidebar.

  • You may need to log in to Hypothesis, using the account you created in Reading 1: Who can define the bigger number?.

  • **Pay attention to this next step**. This is a common mistake that results in many zeros on the assignment. After logging in, you are not done. You still need to open the drop-down menu at the top of the sidebar on the right, which says “Public” by default, and change it to our course group “211”, listed under “My Groups”. You belong to this group because you used the invite link in Reading 1: Who can define the bigger number?. If you don’t post to this group, other students won’t see your annotations, and you won’t get credit.

Exercise 2. Find a place where Leveson says what to do. Carefully select it, and Annotate it like this:
  • In the future, when …

  • I will …

  • instead of ….

Exercise 3. Find one place in the article where you were confused, uncertain, or curious. Carefully select exactly the relevant passage, and Annotate it with your question and what you have done towards answering it.
  • Make your question clear, descriptive, and specific.

  • Don’t be too brief, terse, or vague. Don’t just say “What’s this?” or “I don’t understand”.

  • Don’t just summarize.

Exercise 4. Once you have added your annotations, respond to another student’s annotation.

Optional: read more about the Therac-25 in Appendix A of Leveson’s book “Safeware: System Safety and Computers”. Here’s one key passage:

Failure to Eliminate Root Causes. One of the lessons to be learned from the Therac-25 experiences is that focusing on particular software design errors is not the way to make a system safe. Virtually all complex software can be made to behave in an unexpected fashion under some conditions: There will always be another software bug. Just as engineers would not rely on a design with a hardware single point of failure that could lead to catastrophe, they should not do so if that single point of failure is software.

The Therac-20 contained the same software error implicated in the Tyler deaths, but this machine included hardware interlocks that mitigated the consequences of the error. Protection against software errors can and should be built into both the system and the software itself. We cannot eliminate all software errors, but we can often protect against their worst effects, and we can recognize their likelihood in our decision making.

One of the serious mistakes that led to the multiple Therac-25 accidents was the tendency to believe that the cause of an accident had been determined (e.g., a microswitch failure in the case of Hamilton) without adequate evidence to come to this conclusion and without looking at all possible contributing factors. Without a thorough investigation, it is not possible to determine whether a sensor provided the wrong information, the software provided an incorrect command, or the actuator had a transient failure and did the wrong thing on its own. In the case of the Hamilton accident, a transient microswitch failure was assumed to be the cause even though the engineers were unable to reproduce the failure or to find anything wrong with the microswitch.

In general, it is a mistake to patch just one causal factor (such as the software) and assume that future accidents will be eliminated. Accidents are unlikely to occur in exactly the same way again. If we patch only the symptoms and ignore the deeper underlying causes, or if we fix only the specific cause of one accident, we are unlikely to have much effect on future accidents. The series of accidents involving the Therac-25 is a good example of exactly this problem: Fixing each individual software flaw as it was found did not solve the safety problems of the device.
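
To make Leveson’s point about not relying on a software single point of failure concrete, here is a minimal sketch in Python of a controller that refuses to treat one software check as the only line of defense. It is purely illustrative and is not from Leveson’s book or from any Therac code: the names `BeamController`, `hardware_interlock_ok`, `MAX_SAFE_DOSE`, and `InterlockError` are invented for this example.

```python
# Illustrative sketch only: a hypothetical controller with two independent
# defenses, so a bug in one check does not by itself lead to an unsafe action.

MAX_SAFE_DOSE = 200  # hypothetical dose limit, in arbitrary units


class InterlockError(Exception):
    """Raised when a safety check vetoes the requested action."""


class BeamController:
    def __init__(self, hardware_interlock_ok):
        # hardware_interlock_ok is a callable standing in for an independent
        # hardware signal; the software does not trust itself alone.
        self._hardware_interlock_ok = hardware_interlock_ok

    def fire(self, requested_dose):
        # Software-level check: reject out-of-range requests outright.
        if not (0 < requested_dose <= MAX_SAFE_DOSE):
            raise InterlockError(f"dose {requested_dose} outside safe range")
        # Independent check: even if the software limit above were buggy,
        # the hardware interlock could still veto the beam.
        if not self._hardware_interlock_ok():
            raise InterlockError("hardware interlock not satisfied")
        return f"firing beam at dose {requested_dose}"


if __name__ == "__main__":
    controller = BeamController(hardware_interlock_ok=lambda: True)
    print(controller.fire(150))      # passes both checks
    try:
        controller.fire(10_000)      # rejected by the software limit
    except InterlockError as err:
        print("blocked:", err)
```

The point of the sketch is the structure, not the specific checks: the requested action must survive two independent vetoes, mirroring the Therac-20’s hardware interlocks that mitigated the same software error that proved fatal on the Therac-25.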