Accident Avoidance Pattern: Improving Knowledge for Safety critical

Scuola Politecnica e delle Scienze di Base
Master's Degree Course in Computer Engineering
Master's Thesis in Computing Systems
Accident Avoidance Pattern: Improving Knowledge
for Safety critical domains
Academic Year 2013/2014
Supervisor
Prof. Domenico Cotroneo
Co-supervisors
Dr. Roberto Pietrantuono
Ing. Fumio Machida, NEC (Japan)
Candidate
Mirko Napolano
Student ID M63000382
To mom and dad,
who have supported me so much
and have made so many sacrifices
to allow me to get this far.
Acknowledgements
First of all, I thank my supervisor, Prof. Domenico Cotroneo, who supported me during the thesis work and gave me the opportunity to carry out an important internship at the laboratories of NEC Corporation in Japan. Above all, he has been a guide and a constant point of reference for me.
I thank my co-supervisor, Ing. Roberto Pietrantuono, who kindly followed me in the preparation of the thesis, making his experience and his time available.
I also thank my supervisor at NEC, Ing. Fumio Machida, who welcomed me into his research group and allowed me to start the thesis work. He has been an example of professionalism and kindness for me.
Thanks to my university classmates, in order Dario, Fabrizio, Gaetano, Giovanni, Mario, Pierluca and Raffaele, with whom I shared joys, anxieties, projects and late nights over Na tazzulell ’e café.
Thanks to my die-hard school friends, who after so many years still make the days more cheerful and light. Special thanks to Luigi, Davide, Giuseppe and Emanuele, because with them it is Tutta ’nata storia.
Thanks to Marika, a trusted friend who understands me and who is always there, for better or worse.
Thanks to my family, who allowed me to pursue my dreams, believing in me.
Thanks to Marialuisa, a pure soul who has accompanied me on this journey and who has always been by my side. About her, Dubbi non ho.
I hope I have lived up to your esteem and that I will continue to do so in the future.
Contents
Introduction
1 Accident knowledge in safety domains
1.1 Safety critical systems
1.2 Accident investigation
1.2.1 NTSB investigation process
1.3 Considerations
2 Assurance case, GSN and patterns
2.1 Safety case
2.2 ISO/IEC 15026 Standard: Systems and Software Assurance
2.2.1 Part 1 & Part 2: Formalization of assurance case concepts and structure
Guidelines
2.2.2 Part 3: System Integrity Levels
2.2.3 Part 4: Assurance in the life cycle
2.3 Goal Structuring Notation (GSN)
2.3.1 The description of the assurance case
2.4 Safety case patterns
2.4.1 Representation through GSN
2.4.2 Documentation
2.5 Related work
2.5.1 Safety case lifecycle
2.5.2 Reuse of safety cases
3 ECFMA and Accident Avoidance Pattern: the methodology
3.1 The methodology
3.2 Event and Causal Factor Mitigation Analysis (ECFMA)
3.2.1 The standard ECFA
3.2.2 The enhanced ECFMA
3.3 Accident Avoidance Pattern
3.3.1 Construction and formalization
4 Case studies
4.1 DART spacecraft collision
4.1.1 Accident and system role context
4.1.2 ECFMA analysis
4.1.3 Assurance case
4.2 Multistate 911 outage
4.2.1 Accident and system role context
4.2.2 ECFMA analysis
4.2.3 Assurance case
4.3 Discussion on the methodology
Future work
Conclusion
A Accident Avoidance Pattern formalization
List of Figures
1.1 ECF chart example
2.1 Example of safety case [12]
2.2 Assessment of integrity levels [8]
2.3 GSN basic elements [12]
2.4 GSN example [12]
2.5 Use of Public Indicator
2.6 Example: Functional Decomposition Pattern [14]
2.7 Safety-case lifecycle [17]
2.8 Case-based reasoning for safety case [19]
3.1 The methodology
3.2 ECF chart elements [4]
3.3 ECF analysis [4]
3.4 ECFMA example
3.5 Example: Hazardous Contribution Software Safety Argument [23]
3.6 Example: Hazard Avoidance Pattern [14]
3.7 Six step process
3.8 Accident Avoidance Pattern
4.1 ECFMA chart of DART collision: bird’s view
4.2 ECFMA chart of DART collision: initial events
4.3 ECFMA chart of DART collision: middle events
4.4 ECFMA chart of DART collision: upper conditions
4.5 ECFMA chart of DART collision: final events
4.6 Assurance case of DART accident: bird’s view
4.7 Assurance case of DART accident: top-level claim
4.8 Assurance case of DART accident: first excerpt
4.9 Assurance case of DART accident: second excerpt
4.10 Assurance case of DART accident: third excerpt
4.11 Assurance case of DART accident: fourth excerpt
4.12 Assurance case of DART accident: last excerpt
4.13 Washington NG911 Transition architecture [28]
4.14 ECFMA chart of 911 outage: bird’s view
4.15 ECFMA chart of 911 outage: initial events
4.16 ECFMA chart of 911 outage: accident
4.17 ECFMA chart of 911 outage: post-accident events
4.18 Assurance case of 911 outage: bird’s view
4.19 Assurance case of 911 outage: top-level claim
4.20 Assurance case of 911 outage: first excerpt
4.21 Assurance case of 911 outage: second excerpt
4.22 Assurance case of 911 outage: third excerpt
List of Tables
2.1 Documentation of a safety case pattern, part 1
2.2 Documentation of a safety case pattern, part 2
3.1 Pattern catalogue taken from reference [20]
4.1 DART’s collision: list of identified hazards
4.2 911 outage: list of identified hazards
4.3 DART’s collision: correspondence with recommendations
4.4 911 outage: correspondence with recommendations
Acronyms
DART Demonstration of Autonomous Rendezvous Technology
ECFA Event and Causal Factor Analysis
ECFMA Event and Causal Factor Mitigation Analysis
GSN Goal Structuring Notation
ISO/IEC 15026 IEEE Standard: Systems and Software Assurance
MIB Mishap Investigation Board
MUBLCOM Multiple Paths, Beyond-Line-of-Sight Communications
NASA National Aeronautics and Space Administration
NG911 Next Generation 911
NTSB National Transportation Safety Board
PSHSB Public Safety and Homeland Security Bureau
SIL Safety Integrity Level
Introduction
In traditional safety critical domains, such as avionics, aerospace, automotive and railway, computer systems are used intensively to perform regular operations and accomplish objectives. Moreover, the use of such systems to monitor and manage
critical functionalities has become important for other kinds of infrastructures, such
as emergency communication networks, gas pipelines and nuclear plants, in which
safety must be guaranteed.
System providers need to reduce the risk of system failures as much as possible, since
such failures can lead to catastrophic consequences, such as infrastructure damage,
injuries and business losses. However, even if engineers follow safety standards
and apply assessed methodologies during system design, accidents can still
happen. Moreover, similar accidents often recur. It is therefore critical to analyze
the events and assess the causes in order to avoid the recurrence of the same accident.
Whenever a relevant accident occurs, the public agencies responsible
for safety in that domain and geographic area investigate the mishap to
reconstruct events and causes. The process lasts many months and involves all the
possible stakeholders. As a result of this work, the investigative body
releases a final report along with a list of safety recommendations. This list contains
guidelines that the involved companies and regulators should apply to mitigate
or eliminate the identified hazards.
Yet, the main problem with such recommendations is that they are released to
the stakeholders in an unstructured manner. Although this information is also useful
for other companies working in the same domain, its unstructured form makes it difficult
for them to effectively improve their knowledge about a similar accident.
The goal of this thesis is to present a methodology to analyze the mishap from
the final reports, extract the causes and provide a structured way to present the
achieved knowledge.
For the accident analysis, we used Event and Causal Factor Analysis
(ECFA), a tool widely used by investigative agencies to describe events and
conditions and identify accident causes. We have introduced an enhancement that
provides a logical relationship between causes and possible solutions, producing
Event and Causal Factor Mitigation Analysis (ECFMA).
After the analysis, we used an “assurance case” to argue that the solutions
are adequate to mitigate the discovered hazards. An assurance case is a structured argumentation supported by a body of evidence intended to justify a system
property. In order to provide an effective, systematic structure, a new assurance
case pattern, namely the Accident Avoidance Pattern, has been created to elucidate the
accident knowledge through arguments and evidence. This approach allows engineers
belonging to another company to reuse this accident knowledge in a more understandable and effective way for improving design and operation. To evaluate
the methodology, it has been applied to two case studies concerning
two different domains, aerospace and communication networks.
Part of the thesis has been developed during a three-month internship at NEC
Laboratory for Analysis of System Dependability (LASD) in Kawasaki City, Japan,
where the author had the possibility to define the methodology.
The thesis is structured as follows:
Chapter 1 provides an overview of safety critical systems and accident investigations as performed by most investigative boards, highlighting the limitations of
this approach in improving the accident knowledge.
Chapter 2 introduces the assurance case and GSN, which are basic preliminary concepts for the proposed methodology. The chapter also describes
the use of safety case patterns as a way to repeatedly instantiate a structured and
successful argumentation about the safety of a system.
Chapter 3 illustrates the proposed methodology. Specifically, it shows how to
perform the analysis through ECFMA and how to improve the accident knowledge
through the use of the Accident Avoidance Pattern.
Chapter 4 describes two case studies used to evaluate the approach. They deal
with different domains, aerospace and communication networks, in order to show how
the methodology can be applied in different contexts. The last part of the
chapter discusses the methodology, listing the advantages and drawbacks
of the approach in light of the results of the case studies.
The discussion closes with a look at future work, listing possible
improvements of the approach and the pattern.
Chapter 1
Accident knowledge in safety domains
1.1 Safety critical systems
Computer systems are increasingly employed in contexts where their possible
failure can have catastrophic consequences. Software and hardware systems control
several critical infrastructures which, because of their extent and complexity, could not
accomplish their tasks without such support.
Traditionally, some critical domains exist in which computer systems have always
played a fundamental role in controlling the process. In domains such as aviation,
aerospace and railway, the computer architecture must be ready to react to unforeseen events and conditions so that a mishap can be avoided. If the system does not
react correctly and/or quickly, it is very likely that an accident will happen.
With the increasing complexity of social infrastructures such as nuclear plants, power
grids and communication networks, the role of IT systems has become crucial. Originally, computers were used only to monitor a process, such as the flow of water
in a hydroelectric power plant or the temperature in a nuclear plant; other
capabilities, such as remote control or automated scheduled procedures, were not
implemented. Nowadays, these kinds of functionalities can also be provided in such infrastructures by integrating IT systems with mechanical or hydraulic
systems.
Since these complex systems have a great impact on both people and business
processes, it is very important to minimize the probability that a mishap occurs in these
environments. Such a system is called a safety-critical system. It is
defined as “a system whose failure or malfunction can cause damage to people (death
or injuries), damage to properties, environmental harm and/or loss of money (direct
or indirect)”, while safety is “a measure of the continuous delivery of service free
from occurrences of catastrophic failures” [1].
Safety is an internal property of the system, but a perfectly safe system can never be guaranteed. However, if the risks of catastrophic failures can be controlled and brought
within acceptable limits, then such a system can be considered safe. Although
several techniques and methodologies are applied to control and limit such risks,
accidents can still occur, and not so rarely. Some famous examples are the 2011
Fukushima Daiichi nuclear disaster, the 2003 Italy blackout and the 2009 Washington
Metro train collision.
1.2 Accident investigation
As soon as an accident occurs, it is very important for system providers to understand
what happened in order to improve the safety knowledge and avoid the occurrence
of similar accidents. If the mishap is serious enough in terms of damage or injuries,
the analysis will be conducted through a detailed investigation.
After a severe accident has happened in a certain geographic area, the independent
public agency responsible for that area starts an
investigation to understand the facts and determine the causes. Examples of public
agencies are the American National Transportation Safety Board (NTSB), the European
Aviation Safety Agency (EASA) and the Japan Transport Safety Board (JTSB). Some
of them, such as the NTSB, investigate several different transport domains (aviation,
highway, intermodal, marine, pipeline, railway), while other organizations analyze
the events that occurred in a specific, related domain (e.g. NASA can carry out an investigation into an accident involving one of its spacecraft). In both cases, however,
a final report is published at the end of the investigation to present the findings.
To support their investigations, these agencies use different tools and
techniques to assess facts and causes. The first step of the investigation process is
usually to understand what happened before the accident. One of the most popular
tools for this purpose is Event and Causal Factor Analysis (ECFA), which is
widely used by investigative agencies to describe events and conditions and represent
accident causes. It is adopted and described by the U.S. Department of Energy (DOE)
in its handbook as the first stage of the accident investigation [4].
Figure 1.1 provides the basic structure of an ECF chart, which is presented in more
detail in Chapter 3.
ECFA represents only the first step of an investigation, in which the different
pieces of evidence are structured and organized. The overall process is quite long and
is composed of several different accident analyses. The role of ECFA in the
investigation is to establish the timeline of events, together with the conditions in place at the
moment of the accident. The ECF chart is then updated by the subsequent accident
analyses, whose goal is to assess the causes. Once the causes have been identified, they
are connected to the elements of the ECF chart.
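The role of ECFA described above can be pictured as a small data structure. The following is a minimal Python sketch, not the DOE notation itself: the class names `Event` and `ECFChart`, their methods and the maintenance scenario are all hypothetical, and serve only to illustrate a timeline of events, the conditions in place, and the causal factors attached by later analyses.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One step of the accident timeline (hypothetical structure)."""
    description: str
    conditions: list[str] = field(default_factory=list)      # states in place when the event occurred
    causal_factors: list[str] = field(default_factory=list)  # filled in by later analyses


@dataclass
class ECFChart:
    """Chronological chain of events ending with the accident."""
    events: list[Event] = field(default_factory=list)

    def add_event(self, description: str, *conditions: str) -> Event:
        event = Event(description, list(conditions))
        self.events.append(event)
        return event

    def attach_cause(self, index: int, cause: str) -> None:
        # Later analyses connect each identified cause to a chart element.
        self.events[index].causal_factors.append(cause)

    def unexplained(self) -> list[str]:
        # Events that still have no causal factor attached.
        return [e.description for e in self.events if not e.causal_factors]


# Invented scenario, used only to exercise the structure.
chart = ECFChart()
chart.add_event("Operator starts maintenance", "Procedure out of date")
chart.add_event("Valve left open")
chart.add_event("Tank overflows (accident)", "Level alarm disabled")
chart.attach_cause(1, "No checklist step for closing the valve")
```

Calling `chart.unexplained()` lists the events that the subsequent analyses still have to explain, which mirrors how the chart is progressively updated during an investigation.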
Figure 1.1: ECF chart example
1.2.1 NTSB investigation process
The investigation process performed by most public agencies and organizations
is basically the same. We therefore describe the process used by the NTSB, which
is considered “the most important independent safety investigative authority in
the world” and “the international standard” for accident investigations [3]. The
Board has investigated approximately 124,000 aviation accidents and 10,000 surface
transportation accidents since its inception in 1967.
From reference [2], “the National Transportation Safety Board was established
in 1967 to conduct independent investigations of all civil aviation accidents in the
United States and major accidents in the other modes of transportation. It is not
part of the U.S. Department of Transportation, nor affiliated with any of DOT’s
modal agencies, including the Federal Aviation Administration (FAA). The Safety
Board investigations focus only on improving transportation safety”.
In the first hours after the notification of a severe accident, the NTSB forms a
Go Team that heads to the accident scene as quickly as possible to begin the
investigation. The Go Team is composed of different specialists, each responsible
for a clearly defined portion of the accident investigation. The NTSB has published the
list of specialties involved in the investigation for the aviation domain [2]:
• Operations: description of the narrative of flight and crew members’ duties.
• Structures: documentation of the airframe wreckage and the accident scene.
• Power plants: examination of engines (and propellers) and engine accessories.
• Systems: study of components of the plane’s hydraulic, electrical, pneumatic
and associated systems, together with instruments and elements of the flight
control system.
• Air Traffic Control: reconstruction of the air traffic services given the plane,
including acquisition of ATC radar data and transcripts of radio transmissions.
• Weather: collection of weather data for a broad area around the accident scene.
• Human Performance: study of crew performance and all before-the-accident
factors that might be involved in human error.
• Survival Factors: documentation of impact forces and injuries, evacuation,
community emergency planning and all crash-fire-rescue efforts.
Under the direction of the Investigator-in-Charge, each of these NTSB investigators
heads a “working group” in one area of expertise. The groups are staffed by representatives of the “parties” to the investigation (specifically, the Federal Aviation
Administration, the airline, the pilots’ and flight attendants’ unions, and the airframe and
engine manufacturers).
The NTSB designates other organizations or corporations as parties to the investigation. Other than the FAA, which by law is automatically designated a party,
the NTSB has complete discretion over which organizations it designates as parties
to the investigation: “only those organizations or corporations that can provide expertise to the investigation are granted party status and only those persons who can
provide the Board with needed technical or specialized expertise are permitted to
serve on the investigation” [2].
The investigation lasts many months and involves all the possible entities (companies, regulators, emergency bodies, etc.) in order to reconstruct the
events and assess the possible causes. At the end of this process, the descriptions of
facts and the analysis are summarized in a draft final report by the Safety Board staff.
Once a major report is adopted, an abstract of that report, containing the Board’s
conclusions, probable cause and safety recommendations, is published.
One of the results of the investigation is the list of safety recommendations:
guidelines and advice that stakeholders should implement in
order to avoid a similar accident. Recommendations “usually address a specific issue
uncovered during an investigation or study and specify how to correct the situation.
Letters containing the recommendations are sent to the organization best able to
address the safety issue, either a public or a private one” [2]. In practice, they usually refer to underlying problems and organizational deficiencies, while technical problems
are only discussed in the analysis without being addressed by a recommendation.
1.3 Considerations
After the investigation, a valuable source of information is available: the accident
knowledge. Such experience is useful not only for the stakeholders but also for
third-party companies, which can make effective use of the lessons learned.
Although the recommendations provided by the report are relevant for third-party
engineers, they need to contextualize such guidelines before they can effectively apply
them to their systems.
Moreover, as stated before, technical problems are not addressed by recommendations, even when they are described in detail or mentioned in the description of
facts or in the post-failure analysis.
For example, this is a recommendation from the final report issued by the Dutch
Safety Board for an aircraft crash that occurred in February 2009 in the Netherlands: “The
FAA (Federal Aviation Administration) and EASA (European Aviation Safety Agency)
should ensure that the undesirable response of the autothrottle and flight management computer caused by incorrect radio altimeter values is evaluated and that the
autothrottle and flight management computer is improved in accordance with the design specifications” [5]. However, the report contains neither descriptions of technical solutions
nor references to them.
Our goal is to reuse all the knowledge from an accident, considering both technical
and operational inadequacies, and to structure the information in an effective way.
For this purpose, we have developed a methodology which combines the use of ECFA,
as a standard way to reconstruct the events and identify the causal factors, and
the assurance case, as a way to provide a structured argumentation for a system’s
property [6]. ECFA has been improved to be used not only as an accident causation
model, but also as a guide to identify possible solutions directly connected to the
causal factors. Regarding the assurance case, a new pattern has been developed
to provide a recurring way to structure the accident knowledge identified in the
previous step.
Chapter 2
Assurance case, GSN and patterns
In this chapter we introduce assurance cases, providing the background needed
for the proposed methodology. The discussion also covers
a graphical notation for representing an assurance case, namely the Goal Structuring
Notation, and a way to reproduce the basic structure of an argument as a template.
2.1 Safety case
The idea behind the assurance case is not new in industry. As the complexity of critical systems increases, it has become important to assess the safety of
such systems. Originally, safety cases were widely used as a way to
demonstrate that a system is acceptably safe. A safety case is a structured argumentation
composed of claims and supported by evidence. The term originates from the HSE (Health
and Safety Executive) in the UK, but it has been widely accepted in different critical
domains as a certification tool.
Figure 2.1 shows an example of a safety case, provided by Origin Consulting
(York) Limited, on behalf of Contributors in [12].
Figure 2.1: Example of safety case [12]
The safety case has long been a de facto standard in industry for the
certification of safety critical systems. Since the structured argumentation supporting the safety case has proved persuasive, the artefact has been generalized
to argue about other system properties. The same concept can be used to
assure that any kind of claim is true, provided there is a convincing argument to support it.
2.2 ISO/IEC 15026 Standard: Systems and Software Assurance
Starting from this point, in 2011 the IEEE Software and Systems Engineering Standards Committee (S2ESC) undertook a long-term program to harmonize its standards with those of ISO/IEC JTC 1/SC 7, the international standards committee
for software and systems engineering. The goal of the committee’s work was to
define and organize a set of concepts and relationships in order to establish a basis
for shared understanding across user communities for assurance. The final result
has been the development of an IEEE Standard, ISO/IEC 15026 - Systems and
Software Engineering - Systems and Software Assurance.
ISO/IEC 15026 standard consists of the following parts:
• Part 1: Concepts and vocabulary [6]
• Part 2: Assurance case [7]
• Part 3: System integrity levels [8]
• Part 4: Assurance in the life cycle [9]
Each document was first issued as a draft and was later accepted and
adopted as an IEEE standard. The final assessment of the documents was completed
in November 2014.
Below we present an overview of the contents of each part.
2.2.1 Part 1 & Part 2: Formalization of assurance case concepts and structure
Part 1 introduces the basic concepts and definitions of assurance-related terms. According to
reference [6], the assurance case is a “reasoned, auditable artifact created which
supports the contention that its top-level claim (or set of claims) is satisfied, including systematic argumentation and its underlying evidence and explicit assumptions
which support the claim(s)”. An assurance case contains the following elements and
their relationships:
• one or more claims about properties;
• arguments that logically relate the evidence and any assumptions to the claim(s);
• a body of evidence and possibly assumptions supporting these arguments for
the claim(s);
• justification of the choice of top-level claim and the method of reasoning
Assurance is defined as “grounds for justified confidence that a claim has
been or will be achieved”.
Part 2 describes the minimum requirements for the structure and contents of an
assurance case, in order to improve the consistency and comparability of assurance cases and to
facilitate stakeholder communications and engineering decisions.
The following elements represent the structure of an argumentation according to
the standard ISO/IEC 15026. The definitions are provided from both Part 1 and
Part 2:
• claim: “a true-false statement about the limitations on the values of an unambiguously defined property - called the claim’s property - and limitations on
the uncertainty of the property’s values falling within these limitations during
the claim’s duration of applicability under stated conditions”
• justification: “a reason why a claim has been chosen” (e.g. result of risk
assessments, result of requirements analysis, explanations)
• assumption: “a proposition without any reason why it is true”
• evidence: “a fact, datum or object” supporting a claim (e.g. documents, test
results, measurement results, process, product)
• argument: “a reason why a claim is true”. An argument is used to show
how the components directly underlying it, such as claims and evidence, are
related to a claim or a set of claims. It can use different methods of reasoning:
– quantitative (deterministic - e.g. formal proof; non deterministic - e.g.
probabilistic, game theoretic, fuzzy sets)
– qualitative (e.g. staff performance evaluation, court judgements, qualitative statements of event causality)
Formally, we can define the assurance case as a quadruple of a claim c, a justification j of c, a set es of evidence and an argument g which assures c using
es.
We can also provide a recursive definition of the assurance case, where $\times$ is the
direct product:

\[
A_0 = C \times \{\, j_0 \in J(c_0) \mid c_0 \in C \,\} \times \phi_f(E) \times \{\, g_0 \in G(c_0, es_0) \mid c_0 \in C,\ es_0 \in \phi_f(E) \,\} \tag{2.1}
\]

Given $A_0$, the set $A$ of assurance cases and the set of evidence $E$ are defined as
follows:

\[
A = \{\, (c, j, es, g) \in A_0 \mid j \in J(c),\ g \in G(c, es) \,\} \tag{2.2}
\]

\[
E = F + D + O + C + A \tag{2.3}
\]

where
• $J(c)$ is the set of all the justifications for a claim $c$;
• $C$ is the set of claims;
• $\phi_f(E)$ is the set of all the finite subsets of $E$;
• $G(c_0, es_0)$ is the set of arguments which assure a claim $c_0$ using a set $es_0$ of evidence;
• $F$ is the set of facts;
• $D$ is the set of data;
• $O$ is the set of objects.
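The quadruple (c, j, es, g) and the membership conditions of equation (2.2) can be illustrated with a short Python sketch. This is only one possible reading of the formalization: the class `AssuranceCase`, the function `is_well_formed`, the toy functions standing in for J and G, and the braking example are all assumptions introduced here for illustration. The nesting of one case inside the evidence of another reflects the terms C and A in equation (2.3), which is what makes the definition recursive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssuranceCase:
    claim: str            # c
    justification: str    # j, a reason why c was chosen
    evidence: frozenset   # es, a finite subset of E
    argument: str         # g, a reason why c holds given es


def is_well_formed(case, justifications_for, arguments_for):
    """Membership test of equation (2.2): j in J(c) and g in G(c, es)."""
    return (case.justification in justifications_for(case.claim)
            and case.argument in arguments_for(case.claim, case.evidence))


# Toy J(c): admissible justifications per claim (invented data).
J = {
    "Braking is acceptably safe": {"Identified as hazard H1 in the risk assessment"},
    "Brake controller passes tests": {"Required by the verification plan"},
}


def justifications_for(claim):
    return J.get(claim, set())


def arguments_for(claim, evidence):
    # Toy G(c, es): an argument exists only when some evidence is supplied.
    return {"Argument over supplied evidence"} if evidence else set()


sub = AssuranceCase("Brake controller passes tests",
                    "Required by the verification plan",
                    frozenset({"test report TR-1"}),
                    "Argument over supplied evidence")
# Because E includes A, an assurance case can itself serve as evidence.
top = AssuranceCase("Braking is acceptably safe",
                    "Identified as hazard H1 in the risk assessment",
                    frozenset({sub}),
                    "Argument over supplied evidence")
unsupported = AssuranceCase("Braking is acceptably safe",
                            "Identified as hazard H1 in the risk assessment",
                            frozenset(),
                            "Argument over supplied evidence")
```

Using a frozen dataclass keeps each case hashable, so it can be placed in the `frozenset` of evidence of another case.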
Guidelines
In order to be accepted as an effective argumentation, the assurance case needs to
follow some rules:
• The components of an assurance case shall be unambiguous, identifiable, and
accessible
• An assurance case shall have one top-level claim that is the ultimate goal of its argumentation
• An argument shall be supported by one or more claims, evidence, or assumptions
• A claim shall be supported either by just one argument, or by one or more
claims, evidence, or assumptions. Therefore, a claim is never a bottom element
of an assurance case
• A claim, evidence, or assumption shall not support itself either directly or
indirectly
• A top-level claim shall have a justification for its choice
• If an assumption is partially warranted or contradicted by evidence, this evidence shall be associated with it
• If an assurance case incorporates another assurance case, the incorporated
assurance case’s top-level claim shall be placed within the original assurance
case’s structure at points where the claim is allowed
• Evidence should be uniquely identified (so that arguments can uniquely
reference the evidence), verifiable and auditable
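Some of the rules above are structural and can be checked mechanically. The sketch below is a hedged illustration in Python, not part of the standard: the function `check_assurance_case`, its graph encoding (a dictionary of element kinds and a dictionary of direct supporters) and the identifiers G1, A1, G2, E1 are assumptions made here. It covers only four rules: the top-level element is a claim with a justification, every claim and argument has some support, and no element supports itself directly or indirectly.

```python
def check_assurance_case(nodes, supports, top, justification):
    """Return a list of violated guidelines (an empty list means none found).

    nodes:    dict mapping element id -> kind
              ("claim", "argument", "evidence" or "assumption")
    supports: dict mapping element id -> list of ids that directly support it
    """
    problems = []
    if nodes.get(top) != "claim":
        problems.append("top-level element must be a claim")
    if not justification:
        problems.append("top-level claim needs a justification")
    for node, kind in nodes.items():
        # Claims and arguments are never bottom elements.
        if kind in ("claim", "argument") and not supports.get(node):
            problems.append(f"{kind} '{node}' has no support")

    # Depth-first search for cycles: nothing may support itself, even indirectly.
    state = {}  # element id -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "visiting":
            return True
        state[node] = "visiting"
        cyclic = any(visit(child) for child in supports.get(node, []))
        state[node] = "done"
        return cyclic

    if any(visit(node) for node in nodes):
        problems.append("an element supports itself")
    return problems


nodes = {"G1": "claim", "A1": "argument", "G2": "claim", "E1": "evidence"}
valid = {"G1": ["A1"], "A1": ["G2"], "G2": ["E1"]}
circular = {"G1": ["A1"], "A1": ["G2"], "G2": ["G1"]}
```

Here `check_assurance_case(nodes, valid, "G1", "Derived from hazard analysis")` reports no problems, while the `circular` structure is flagged because G1 ends up supporting itself through A1 and G2.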
2.2.2 Part 3: System Integrity Levels
Part 3 of the standard specifies the concept of system integrity level as a way to
reach an agreement among stakeholders about the achievement of an objective. The
most common use is to assure that the system or product has property values which
limit related risks during operations.
Integrity levels, and the standards that use them, have a significant history, especially
in safety. An earlier standard on Safety Integrity Levels (SIL) exists: IEC 61508,
“Functional safety of electrical/electronic/programmable electronic safety-related systems”, defines functional safety as the part of the overall safety which depends on a
system or equipment operating correctly in response to its inputs. In particular, this
standard defines 4 SILs according to the average probability of failure, where SIL
1 corresponds to the highest range of probability, namely the least critical level.
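As an illustration of how a SIL follows from a probability band, the sketch below uses the bands commonly quoted for IEC 61508 in low-demand mode (average probability of failure on demand). The function name and the table layout are our own; the normative values should always be taken from the standard itself.

```python
from typing import Optional

# (lower bound inclusive, upper bound exclusive, SIL), low-demand mode bands.
SIL_BANDS = [
    (1e-5, 1e-4, 4),
    (1e-4, 1e-3, 3),
    (1e-3, 1e-2, 2),
    (1e-2, 1e-1, 1),
]


def sil_for_pfd(pfd_avg: float) -> Optional[int]:
    """Return the SIL whose band contains pfd_avg, or None if outside all bands."""
    for low, high, sil in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


print(sil_for_pfd(5e-2))  # highest probability band, least critical level
print(sil_for_pfd(5e-5))  # lowest probability band, most critical level
```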
The ISO/IEC 15026 standard provides the basis for a generalized concept of system
integrity level which can be applied not only to safety but also to other system
properties such as reliability, maintainability and security.
According to the standard, an integrity level is a claim that “includes limitations on a property’s values, the claim’s scope of applicability, and the allowable
uncertainty regarding the claim’s achievement”. An integrity level requirement
is “a set of specified requirements imposed on aspects related to a system, product or
element and associated activities in order to show the achievement of the assigned
integrity level. This includes the evidence to be obtained”.
The assessment of the system integrity level, and of the integrity levels of the elements
composing the system, is based on risk analysis results and system decomposition.
Given that the set of integrity levels is used correctly and that the integrity level
claim concerning the system or product operations is true, the applicable risks are
limited or managed acceptably. Figure 2.2 shows an overview of the process for
determining integrity levels.
Figure 2.2: Assessment of integrity levels [8]
In order to show conformance to the integrity levels defined in this standard, documentation shall exist that is accurate, available as required, controlled, traceable,
and reviewable.
2.2.3 Part 4: Assurance in the life cycle
Part 4 specifies how an assurance case can be used during the system life
cycle. Moreover, it presents a property-independent list of processes, activities and tasks
needed to achieve the claim about the critical system property and to show that the
claim has been achieved.
The three main uses of an assurance case are summarized here:
• for an agreement: a supplier needs to show to an acquirer the achievement
of an assurance claim about the values of a critical property of the system or
software product. The agreement might be either a written contract or a verbal
communication.
• for regulation: an authoritative body can use the assurance case developed by
the provider to verify if some critical system properties have been correctly
and/or accurately implemented. The need for such regulation can arise to
certify a critical property of a system or software product.
• for development: the assurance case can be used as an internal asset by engineers and developers to verify if some objectives have been accomplished at a
certain stage of the system life cycle.
As will be shown in the chapter about the Accident Avoidance Pattern, the
proposed assurance case can be used either in an agreement between acquirer
and provider or during the development of the system.
Part 4 then lists the activities and tasks which require the
use and interpretation of an assurance case when a system property or an integrity
level is to be assured during the life cycle. The processes it covers are the following:
• Acquisition
• Supply
• Project planning
• Decision management
• Risk management
• Configuration management
• Information management
• Stakeholder requirements definition
• Requirements analysis
• Verification
• Operation
• Maintenance
For each process, guidance about the activities and considerations to be performed
is thoroughly described. However, because this part is out of scope for this thesis,
it will not be investigated further.
2.3 Goal Structuring Notation (GSN)
The assurance case provides a structured argument that, if sufficiently convincing,
is an effective way to gain assurance among the stakeholders. However, it is usually
difficult to follow an argumentation just by reading the textual inferences deriving
from claims, arguments and evidence. Readers can be confused by the statements
and may reject the argumentation even if it is actually sound.
For this reason, a graphical notation allows users to see how the
elements of the case are connected, where each drawn connection denotes a logical
relationship.
As described in the previous chapter, safety cases have been widely adopted as
a certification tool as well as an artifact to support the development process. The
need for a graphical notation had already arisen to describe the safety case. Since
the assurance case generalizes the safety case, the
same graphical model can be applied to both.
Goal Structuring Notation (GSN) is a graphical argument notation which
can be used to document explicitly the elements of an argument and the relationship
that exists between these elements. GSN originated at the University of York in
the early 1990s as part of the ASAM-II project [10], and has undergone significant
development and refinement since then. The early development of GSN was
heavily influenced by Toulmin’s work on argumentation [11]. Later, in his PhD
work, Kelly added features to GSN in order to support the reuse of safety case
patterns [14]. With the increasing popularity of GSN for representing safety
cases, industries and organizations created the GSN Community with the aim
of providing clear guidance on the use of the notation. The standard was developed
between 2007 and 2011, when version 1 was published [12].
2.3.1 The description of the assurance case
The purpose of GSN is to highlight how claims are supported by sub-claims and
evidence through the use of argument elements. Using the Goal Structuring Notation, we can find a one-to-one correspondence between assurance case components
and GSN elements. In GSN, the claims of the argument are documented as goals,
the arguments are called strategies and items used as evidence are documented as
solutions. For assumptions and justifications, elements with the same name exist.
Moreover, a context element is used to better explain the context in which the claim
or the argument should be interpreted.
Figure 2.3 illustrates GSN basic elements.
Figure 2.3: GSN basic elements [12]
The GSN standard depicts claims (goals) as rectangles, contexts as rounded rectangles, arguments (strategies) as parallelograms, evidence (solutions) as circles, and
assumptions and justifications as ovals, with a letter to distinguish between them
(A and J, respectively). A connection to a claim, argument or item of evidence is represented by a black arrow, while a connection to an assumption, justification or
context is indicated with a white arrow.
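The correspondence between assurance case components, GSN elements and their shapes can be summarized in a small lookup structure. The sketch below is only an illustrative encoding of the description above; the enumeration and the function name are hypothetical.

```python
from enum import Enum


class GsnElement(Enum):
    # value = (assurance case component, shape used in the diagram)
    GOAL = ("claim", "rectangle")
    STRATEGY = ("argument", "parallelogram")
    SOLUTION = ("evidence", "circle")
    CONTEXT = ("context", "rounded rectangle")
    ASSUMPTION = ("assumption", "oval marked A")
    JUSTIFICATION = ("justification", "oval marked J")


def arrow_to(target: GsnElement) -> str:
    """Colour of the arrow used for a connection pointing at target."""
    contextual = {GsnElement.CONTEXT, GsnElement.ASSUMPTION, GsnElement.JUSTIFICATION}
    return "white" if target in contextual else "black"


for element in GsnElement:
    component, shape = element.value
    print(f"{component}: {element.name.lower()}, {shape}, {arrow_to(element)} arrow")
```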
This is the basic structure provided by GSN. Moreover, it is possible to use
additional features to improve the explanation of the assurance case. Specifically,
these extensions are useful to support argument patterns, as will be shown later.
Figure 2.4 shows an example of assurance case described with GSN.
Figure 2.4: GSN example [12]
First, the notation supports both the multiplicity and the optionality of the
elements. In the example, “Claim 1” needs to be supported by at least one of the
two items of evidence. In a pattern, this is very important for generalizing the use of multiple
similar claims supporting an argument or another claim.
Secondly, GSN supports the representation of abstract entities. An uninstantiated entity should be used when, at some later stage, the abstract entity needs to
be replaced with a more concrete instance, while an undeveloped entity is useful
when an abstract entity requires both development and instantiation. Again, these
features suit the purpose of a pattern, because most of the elements in a pattern are
represented as abstract entities. In Figure 2.4, “Evidence 3” is provided to support
“Claim 3” but it needs to be instantiated in order to support it concretely, while
“Claim 2” is a statement supporting the argumentation but it needs to be developed
through a further argument or item of evidence to be effective.
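The two abstraction markers can be read as a list of work still to be done before a pattern becomes a concrete assurance case. The following sketch encodes only the elements of Figure 2.4 that are named in the text; the class, the flags and the function are hypothetical illustrations rather than part of the GSN standard.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    uninstantiated: bool = False  # must be replaced by a concrete instance
    undeveloped: bool = False     # still needs a supporting argument or evidence


def pending_work(nodes):
    """List the (element, action) pairs left before the case is complete."""
    todo = []
    for n in nodes:
        if n.uninstantiated:
            todo.append((n.name, "instantiate"))
        if n.undeveloped:
            todo.append((n.name, "develop"))
    return todo


figure_2_4 = [
    Node("Claim 1"),
    Node("Claim 2", undeveloped=True),
    Node("Claim 3"),
    Node("Evidence 3", uninstantiated=True),
]

for name, action in pending_work(figure_2_4):
    print(f"{name}: {action}")
```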
Last, GSN supports modular extension, which allows multiple assurance cases
to be interconnected. This feature can also be used when a large assurance case
needs to be split into several smaller parts in order to explain them separately.
Figure 2.5 shows the use of the public indicator, which allows any kind of element to
be referenced as an away element. In this case, a goal element is made public.
Figure 2.5: Use of Public Indicator
As shown in this chapter, the reasoned and compelling argument provided by
the assurance case is well supported by Goal Structuring Notation. It also allows
the construction of complex argumentations that can be used for different purposes
and in different phases of the system life cycle.
2.4 Safety case patterns
As discussed in the previous sections, the purpose of a safety case is to demonstrate
that a system is sufficiently safe to operate. The argumentation is provided through
claims and evidence related to the specific system under discussion.
However, since many systems need to be certified with respect to the same critical
functionalities, the safety of these systems can be assured by similar
safety cases. Of course, no two safety cases are exactly the same: for example, two
of them can argue over different requirements, or they can use as evidence similar
results provided by two different analyses (e.g. fault tree or reliability block diagram).
Nevertheless, it is valuable when a reproducible pattern
emerges from the argumentation of similar safety cases.
The concept of pattern is well known in many contexts as a general, reusable
solution to a commonly occurring problem. In the context of software engineering,
with regard to design patterns, Christopher Alexander states that “each pattern
describes a problem that occurs over and over again in our environment, and then
describes the core of the solution to that problem, in such a way that you can use
this solution a million times over, without ever doing it the same way twice” [13].
In industry, the concept of safety case pattern has arisen whenever engineers needed to certify their systems with similar argumentations. For example,
the same argument over the satisfaction of all the system requirements could be
developed and specified for two different railway systems. Moreover, if the pattern is
sufficiently generic, it might be used in different domains by instantiating it according to the standards used in the domain concerned. However, this reusable structure
was shared only within a single company, or at most within a specific domain if regulators
requested the same safety case. Other organizations could not access this knowledge
for their purposes.
Kelly was the first to assess and organize the concepts behind safety case
patterns [14]. He defined the safety case pattern as “a means of documenting
and reusing successful safety argument structures” described by a graphical notation,
such as GSN. This work was also very important for the creation of the GSN standard,
since many topics about safety cases and GSN were developed in it.
The two most important features defined in [14] are the abstract representation
of a generalized safety argument and the formalized documentation of a safety case
pattern.
2.4.1 Representation through GSN
The first step in the creation of a safety case pattern is its representation. As described in the previous section, the GSN standard has been extended by Kelly’s work,
which introduced features such as multiplicity and undeveloped entities in order to support abstraction and modularity. Specifically, we can identify structural abstraction
and entity abstraction, where the former allows generalization of the structure of an
argument while the latter allows generalization of an element in the structure (claim,
context, evidence, justification). This is the starting point for instantiating, many times over,
a structure which can be specialized and extended depending on system and context. Figure 2.6 shows a safety case pattern, the “Functional Decomposition Pattern”,
described using GSN.
Figure 2.6: Example: Functional Decomposition Pattern [14]
This pattern argues about the system’s safety by claiming that each implemented
function is safe and there are no hazardous interactions between functions. As shown
in the figure, the pattern uses a multiplicity arrow to merge the claims about the functions’
safety into a single claim; moreover, the abstract representation of two claims and one
context element is supported by “undeveloped” and “uninstantiated” elements. This
kind of pattern represents the initial part of a safety case, so it needs to be extended
and specialized by developing and instantiating the elements described above. The
important point is that the main structure is preserved and can be reused.
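To make the idea of instantiation concrete, the sketch below expands the multiplicity over a list of functions and adds the claim about interactions. The wording of the goals, the function name and the example system are hypothetical; the actual text of the pattern elements is the one given in [14].

```python
def instantiate_functional_decomposition(system, functions):
    """Expand the pattern for a concrete system and its list of functions."""
    if not functions:
        raise ValueError("at least one function is needed")
    # One sub-goal per function replaces the single multiplicity arrow.
    sub_goals = [f"Function {f} is safe" for f in functions]
    sub_goals.append(f"No hazardous interactions between functions of {system}")
    return {
        "goal": f"{system} is safe",
        "strategy": f"Argument over all identified functions of {system}",
        "sub_goals": sub_goals,  # each one is still undeveloped
    }


# Hypothetical example system with two functions.
case = instantiate_functional_decomposition("Braking system", ["brake", "release"])
print(case["goal"])
for goal in case["sub_goals"]:
    print(" -", goal)
```

The returned sub-goals are still undeveloped, which reflects the fact that the pattern covers only the initial part of a safety case.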
2.4.2 Documentation
The second step to complete the creation of a safety case pattern is the documentation. Representation is important to describe graphically the structure and the
relationships between elements in the safety case. However, it is difficult for engineers
or other stakeholders to pick a pattern from a catalogue just by looking at the
graphical representation. Information such as name, context of applicability and
examples of use is fundamental to manage the pattern in the correct way.
Starting from the work of the Gang of Four on the documentation of design patterns
[15], Kelly defined the format for the description of safety case patterns. The
pattern format was first described in [14] and then summarized in [16]. The
following table is adapted from reference [16], showing the fields and the related
descriptions.
Pattern Name and Classification: The pattern’s name should convey the essence of the pattern succinctly. A good name is vital because, with use, it will become part of your design vocabulary.
Intent: A short statement that answers the following questions: What does the pattern do / represent? What particular safety issue / requirement / process does it address?
Also Known As: Other well known names for the pattern, if any.
Motivation: A scenario that illustrates a safety issue / process and how the elements of the goal structure solve the problem. The scenario will help you understand the more abstract description of the pattern that follows.
Structure: A graphical representation of the pattern using the extended form of the goal structuring notation. The representation can describe a product or a process style goal structure. Where the structure indicates generality or optionality, it should be clear how the pattern can be instantiated.
Participants: The elements of the goal structure and their function in the pattern.
Collaborations: How the participants collaborate to carry out the function of the pattern.
Table 2.1: Documentation of a safety case pattern, part 1
Applicability (Necessary Context): What are the situations in which the safety case pattern can be applied? What information is required as context for the pattern to be successful (necessary inputs to the pattern)?
Consequences: How does the pattern support its objectives? What are the trade-offs and results of using the pattern?
Implementation: What pitfalls, hints or techniques should you be aware of when using the pattern? What degrees of flexibility are there in following the pattern?
Examples: Safety case sample that illustrates the instantiation of the pattern.
Known Uses: Examples of the pattern’s application in existing safety documentation should be cited. If possible, examples from two different applications should be shown.
Related Patterns: Safety case patterns that are related to this pattern, e.g. with the same motivation but different applicability conditions (e.g. different standards, different systems). For a process-oriented pattern, related product (argument) patterns. For a product-oriented pattern, related process patterns.
Table 2.2: Documentation of a safety case pattern, part 2
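The fields of Tables 2.1 and 2.2 can be kept as a structured record, so that a catalogue entry can be checked for completeness before it is published. The following is a minimal sketch; the class name, the attribute names and the choice of which fields are mandatory are our own assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PatternDoc:
    name: str
    intent: str
    also_known_as: List[str] = field(default_factory=list)
    motivation: str = ""
    structure: str = ""  # e.g. a reference to the GSN diagram
    participants: str = ""
    collaborations: str = ""
    applicability: str = ""
    consequences: str = ""
    implementation: str = ""
    examples: str = ""
    known_uses: List[str] = field(default_factory=list)
    related_patterns: List[str] = field(default_factory=list)


def missing_fields(doc):
    """Names of the documentation fields that are still empty."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


doc = PatternDoc(
    name="Functional Decomposition Pattern",
    intent="Argue system safety over the safety of each function",
)
print(missing_fields(doc))
```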
As stated before, many fields of this format have been adapted from the format
used to document design patterns. In his work, Kelly formalized a structure
that many industries already used as an internal asset. In this way, whoever needs
to instantiate a safety case template can search a catalogue to select the most
appropriate one for their purposes.
2.5 Related work
2.5.1 Safety case lifecycle
In the research area of safety assurance, several works have proposed
approaches to reuse information from safety cases. The concept of safety case lifecycle has turned out to be important when engineers need to reassess the safety
of a system involved in a failure. A mishap or an accident is evidence that the
system was not as safe as the associated safety case claimed. So, engineers
need to determine why the safety case was not correct and how they can
improve it.
The first main work dealing with this topic has been produced by the research
group of Prof. John Knight. The authors in [17] have proposed a framework to guide
the failure analysis and the development of lessons and recommendations, starting
from a pre-failure safety case and producing an enhanced post-failure safety case.
The conceptual schema of the framework is shown in Figure 2.7.
Figure 2.7: Safety-case lifecycle [17]
The inputs of failure analysis are the original safety case, which has proved to
be ineffective and needs to be improved, and the failure evidence. At the end of this
process, both an updated safety case and a list of lessons and recommendations are
produced to be used in the system revision. The details of how the failure analysis
should be performed are provided in a subsequent work, in which a taxonomy of
safety-argument fallacies is defined as well [18].
The presented lifecycle has led the way to the structuring of accident knowledge.
However, since it requires a pre-failure safety case to be revised, this approach cannot
always be applied, because this condition may not be satisfied. In fact, it is possible
that a pre-failure safety case is not available, because the system concerned is too old
to be documented by a safety case or because it has been upgraded too many times
with studies that are not well related and documented.
2.5.2 Reuse of safety cases
Another research work related to the reuse of safety case is presented in reference
[19]. In this work, the authors have developed a strategy to create safety cases
by retrieving and reusing previous similar artifacts. The approach is based on the
concept of case-based reasoning in which the process of solving new problems is based
on the solutions of similar past problems. Figure 2.8 summarizes the approach.
Figure 2.8: Case-based reasoning for safety case [19]
The proposed methodology starts with a new case that needs to be processed
and stored in a case repository. The process includes the case description, which
is used to make the case retrievable from the repository whenever a user is looking
for it. Once a case has been picked from the repository, a user can revise it
depending on the intended use; then, the validated solution can be stored in the repository.
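The retrieve, revise and store steps described above can be sketched as follows. This is not the implementation of [19]: the repository class, the keyword-based case description and the Jaccard similarity used for retrieval are assumptions chosen only to keep the example small.

```python
def jaccard(a, b):
    """Similarity between two keyword sets (an assumed retrieval measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


class CaseRepository:
    def __init__(self):
        self._cases = []  # list of (description keywords, safety case)

    def retain(self, keywords, safety_case):
        """Store a case together with the description used to retrieve it."""
        self._cases.append((frozenset(keywords), safety_case))

    def retrieve(self, keywords):
        """Return the stored case whose description is most similar, if any."""
        if not self._cases:
            return None
        return max(self._cases, key=lambda item: jaccard(item[0], keywords))[1]


repo = CaseRepository()
repo.retain({"railway", "braking"}, "SC-1")
repo.retain({"avionics", "autopilot"}, "SC-2")

found = repo.retrieve({"railway", "signalling"})  # retrieve a similar case
revised = found + " (revised for signalling)"     # revise it for the new use
repo.retain({"railway", "signalling"}, revised)   # store the validated solution
print(found, "->", revised)
```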
However, in this approach, the knowledge is embodied in previous safety cases,
and it is not related to the accident experience. Moreover, it assumes that at least one
safety case is available to be reused, while we consider that no pre-failure safety
cases are available.
Chapter 3
ECFMA and Accident Avoidance Pattern: the methodology
In this chapter we present the proposed methodology to improve the accident knowledge in safety critical domains. In the first section we describe the phases
of the approach in general, showing the inputs, the tools and the outputs. In the second
section we detail the accident causation model ECFA and our enhanced
ECFMA. In the third section we discuss existing safety case patterns and the motivations which have led us to the development of a new reusable
structure, the Accident Avoidance Pattern.
3.1 The methodology
The research work closest to our methodology is the safety-case lifecycle developed
by Prof. Knight and his group. As described in the previous chapter, in their
works the authors use both the pre-failure safety case and the information from the
accident to perform a failure analysis that generates lessons and recommendations
and a more accurate post-failure safety case [17][18].
Our approach differs because it starts from a different assumption,
namely that a pre-failure safety case is not available. At least three reasons can support this
assumption:
• the system could be too old and many upgrades could have been applied during its
life cycle, with studies and assessments not well related to one another;
• another company, which has been commissioned to improve the current system,
might not be able to access the previous information because of confidentiality or
unavailability of data;
• the system could be too complex to be completely assured by a safety case in
all of its parts or functions.
In [18] the author confirms the lack of documented safety arguments in most
digital systems, and he proposes a way to derive them retroactively. Instead, we
suppose that it is quite difficult to reconstruct a safety case just from the observation
of the system.
Moreover, since the community has worked on a standard for assurance cases by
generalizing the concept of a safety case, our approach can be applied not only to
safety critical systems, but also in domains where other system properties are relevant
(e.g. availability, reliability or maintainability).
The proposed approach is summarized in Figure 3.1.
Figure 3.1: The methodology
The failure analysis is conducted using the reports published by public investigative agencies. From these documents, not only agencies’ lessons and recommendations but also events and system descriptions are used to determine the accident
causes. Event and Causal Factor Mitigation Analysis (ECFMA), an enhanced version of the adopted ECFA, is used to reconstruct the events, discover the
causes (root, direct and contributory causes) and provide possible solutions. After
the analysis, a post-failure assurance case is developed directly from the analysis
outcomes through the instantiation of the Accident Avoidance Pattern, which
argues how the identified problems can be mitigated by the provided solutions.
The following sections illustrate how the two techniques are used in the methodology.
3.2 Event and Causal Factor Mitigation Analysis (ECFMA)
3.2.1 The standard ECFA
The first step of our methodology is the reconstruction of events and conditions and the
determination of causes and solutions through the use of ECFMA. In order to present
this tool, it is better to first illustrate the standard technique, ECFA.
As described in the first chapter, Event and Causal Factor Analysis is widely
used to describe the events leading to an accident in order to relate conditions and
causal factors. An example of an ECF chart is provided in the first chapter in Figure
1.1. In this section, we describe how this analysis is performed.
The ECF chart is a flow chart with the events and decisions plotted on a timeline.
As the event timeline is established, the related conditions and information are linked
to the events and decisions. Understanding why workers did what they did and why
their decisions and actions made sense to them is an essential goal of the accident
investigation.
Figure 3.2 shows the basic elements used in an ECF chart.
Figure 3.2: ECF chart elements [4]
The rectangular element is the description of an event or a decision; if available,
it also contains time information. In order to reconstruct the facts, events are
connected together with straight arrows. It is also possible to create a secondary
chain to describe a sequence of events that has led to a critical event, which represents
the beginning of a primary chain leading to the accident. The primary chain usually
ends with the “accident” element; if the goal is to describe a lack of mitigation after
the accident, this element will be at the beginning or in the middle of the chain of
events.
After the reconstruction of the events, investigators usually try to understand
why a certain event has happened, e.g. why a technician has not closed a valve to
mitigate a leak, why an alarm has not been activated, or why a system has provided
incorrect values. Whenever they identify conditions that were in place at
the time of an event, they connect the oval element “condition” to the event
concerned through a dotted arrow. A more complex structure of conditions can also be
attached in order to form a “chain of conditions”.
The ECF analysis is performed by reconstructing all the conditions in place
during the events. Investigators need to ask why a given condition was not
adequate for the situation. At the end of this process, they should find the particular
condition that originated the unsafe situation. This will be a causal factor of
the accident; in ECFA it is represented by a hexagonal element attached to
events or conditions.
Figure 3.3 describes the actions performed to determine the causal factors. The
figure is taken from the DOE handbook [4].
Figure 3.3: ECF analysis [4]
In the determination of causal factors, we can distinguish three kinds of causes,
all of them represented by the same hexagon in ECFA:
• direct cause: it is the immediate event or condition that has caused the
accident. Typically, the direct cause of the accident may be derived from the
immediate, proximate events and conditions close to the accident;
• root cause: it is the causal factor that, if corrected, would prevent recurrence
of the same or similar accidents. A root cause may be derived from several
contributing unsafe conditions. It is a higher-order, fundamental causal factor
which addresses classes of deficiencies, rather than single problems or faults;
• contributory cause: it is an event or condition that collectively with other
causes increased the likelihood of an accident but that individually did not
cause the accident itself. It also represents an event or condition which has
not mitigated the unsafe chain of events leading to the accident.
For instance, in a power plant the direct cause of a blackout can be the failure of
a reactive power control, which has caused the outage. However, the failure was
not prevented by the management of the power control, which may represent
the root cause of this accident. Finally, the blackout was not mitigated by the
redundant power control, which has turned out to be a contributory cause.
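The elements of an ECF chart and the three kinds of causes can be represented with a small data model. The sketch below encodes the blackout example above; all class names, attribute names and event texts are hypothetical and are not taken from the DOE handbook.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CauseKind(Enum):
    DIRECT = "direct"
    ROOT = "root"
    CONTRIBUTORY = "contributory"


@dataclass
class CausalFactor:  # hexagon in the chart
    text: str
    kind: CauseKind


@dataclass
class Condition:  # oval, linked to an event by a dotted arrow
    text: str
    causal_factors: List[CausalFactor] = field(default_factory=list)


@dataclass
class Event:  # rectangle on the timeline
    text: str
    conditions: List[Condition] = field(default_factory=list)
    causal_factors: List[CausalFactor] = field(default_factory=list)


def causes_by_kind(chain, kind):
    """Collect the causal factors of one kind attached to events or conditions."""
    found = []
    for event in chain:
        found += [c.text for c in event.causal_factors if c.kind is kind]
        for cond in event.conditions:
            found += [c.text for c in cond.causal_factors if c.kind is kind]
    return found


chain = [
    Event("Reactive power control fails",
          conditions=[Condition("Power control not adequately managed",
                                [CausalFactor("Management of the power control",
                                              CauseKind.ROOT)])],
          causal_factors=[CausalFactor("Failure of the reactive power control",
                                       CauseKind.DIRECT)]),
    Event("Redundant power control does not take over",
          causal_factors=[CausalFactor("Redundant power control ineffective",
                                       CauseKind.CONTRIBUTORY)]),
    Event("Blackout"),
]

for kind in CauseKind:
    print(kind.value, causes_by_kind(chain, kind))
```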
3.2.2 The enhanced ECFMA
ECFA is useful to illustrate the events and the conditions in place at the time they occurred.
From this chart, the analysis is performed to find the causal factors that have generated the accident.
ECFA is basically an accident causation model, a diagram used to assess how
and why an accident has happened. So, it does not provide any information about
possible countermeasures that would avoid the accident or mitigate it, because its
original purpose is just to analyze the causes.
However, the simple but straightforward structure of the ECF chart can also be used
to provide a clear depiction of a sequence of events. In our case, it can be used after
the investigation process itself to represent graphically the textual narrative of the
events contained in the final reports published by investigative agencies.
Moreover, if a possible solution for a discovered problem is already available,
a user can easily relate it to the problem and provide an enhanced depiction of
what happened and how it could have been avoided. Usually this information is
contained in the final reports, in which countermeasures to direct causes, solutions
to contributory causes and recommendations for root causes are described or listed.
The enhancement to ECFA consists in the introduction of a new element, not
related to the field of accident causation models, which is the solution element.
Figure 3.4 shows an example of ECFMA, with the use of element “solution”.
Figure 3.4: ECFMA example
As shown in the figure, the solution is directly introduced in the ECF chart by
attaching it to the related causal factor. In this way, it is possible for an engineer to
take into account both causes and solutions before demonstrating that such a solution
is actually effective. This is an intermediate step towards the construction of the
assurance case. In fact, as will be shown in the next section, the claims of the
assurance case are directly derived from these two elements.
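The step from ECFMA to the claims of the assurance case can be sketched by attaching a list of solutions to each causal factor. The claim wording, the names and the example solutions below are hypothetical and only illustrate the pairing of a cause with its solution.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CausalFactor:
    text: str
    kind: str  # "direct", "root" or "contributory"
    solutions: List[str] = field(default_factory=list)  # the ECFMA addition


def derive_claims(factors):
    """Pair every causal factor with each solution attached to it."""
    claims = []
    for f in factors:
        for s in f.solutions:
            claims.append(f"'{s}' mitigates the {f.kind} cause '{f.text}'")
    return claims


def unaddressed(factors):
    """Causal factors for which no solution has been attached yet."""
    return [f.text for f in factors if not f.solutions]


factors = [
    CausalFactor("Failure of the reactive power control", "direct",
                 ["Replace the faulty controller"]),
    CausalFactor("Management of the power control", "root",
                 ["Introduce periodic maintenance reviews"]),
    CausalFactor("Redundant power control ineffective", "contributory"),
]

for claim in derive_claims(factors):
    print(claim)
print("Still without a solution:", unaddressed(factors))
```

Listing the causal factors without a solution shows which parts of the future argument cannot yet be supported.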
3.3 Accident Avoidance Pattern
The second step of our methodology is the construction of the assurance case from
the results of ECFMA in order to structure the achieved knowledge through an
argumentation.
For this purpose, we first investigated the existing safety case patterns in
order to reuse an accepted structure to elucidate the knowledge. Our starting point
was the pattern catalogue provided in reference [20]. Table 3.1 lists the 23
patterns taken from it, all used in industry.
Pattern Name
Functional Decomposition Pattern
High-Level Software Safety Argument
Software Contribution Safety Argument
SSR Identification Software Safety Argument
Hazardous Contribution Software Safety Argument
SW Contribution Safety Argument with Grouping
Hazard Avoidance Pattern
Fault Free Software Pattern
ALARP (As-Low-As-Reasonably-Practicable) Pattern
Component Contributions to System Hazards
Hazardous SW Failure Mode Decomposition Pattern
Hazardous Software Failure Mode Classification Pattern
Software Argument Approach Pattern
Absence of Omission Hazardous Failure Mode Pattern
Absence of Commission Hazardous Failure Mode Pattern
Absence of Early Hazardous Failure Mode Pattern
Absence of Late Hazardous Failure Mode Pattern
Absence of Value Hazardous Failure Mode Pattern
Effects of Other Components Pattern
Handling of Hardware/Other Component Failure Mode
Handling of Software Failure Mode
At Least As Safe Argument
Requirements Breakdown Pattern
Table 3.1: Pattern catalogue taken from reference [20]
Such patterns have been completely described in other research works, according to Kelly’s representation and formalization [21][22][23][24][25]. Starting from this
summary of safety case patterns, we have surveyed all the proposed templates
in order to choose the most appropriate one.
Yet, most of them do not seem directly applicable for our purpose. In some cases
the argumentation is conducted over system requirements, which we do not obtain
from the accident knowledge (e.g. SSR Identification Software Safety Argument);
in other cases the argument is conducted over the risk probabilities, which need to
be acceptably low (e.g. ALARP). In others, the argumentation on hazard
limitation is provided through both the avoidance of hazards and the addressing
of system requirements (e.g. Hazardous Contribution Software Safety Argument,
Figure 3.5), or the pattern argues the safety of the system over the safety of single
system functions (e.g. Functional Decomposition Pattern, Figure 2.6, section 2.4.1).
Figure 3.5: Example: Hazardous Contribution Software Safety Argument [23]
Moreover, the goal of most patterns is to argue over the safety of the system
itself, without dealing with a specific accident experience, and they are used at the
highest level of a complex safety case.
However, among these patterns, the Hazard Avoidance Pattern comes close to our requirements (Figure 3.6).
Figure 3.6: Example: Hazard Avoidance Pattern [14]
The approach is to argue that the system is safe by proving that every identified
hazard has been mitigated or eliminated. Yet, it is too generic for our work, because
it can be used only at the highest level of a safety case and it does not provide claims
about the solutions to the discovered hazards. For these reasons, we have built a
new pattern, namely the “Accident Avoidance Pattern”, which elucidates the
accident knowledge in a way similar to the Hazard Avoidance Pattern but with a more
specific structure.
3.3.1 Construction and formalization
Basically, at the highest level it is possible to identify two types of argument approach:
• Functional decomposition argument
• Hazard directed argument
The first type argues over different system functions or requirements. Examples
of this category are Functional Decomposition Pattern and ALARP. The second type
argues over different hazards that could affect the system. Examples are Hazard
Avoidance Pattern and Hazardous Contribution Software Safety Argument. Since
we have knowledge about discovered hazards, our pattern will belong to the second
category.
In order to create a new pattern, we have followed the process described by Kelly
for building any kind of argumentation. Figure 3.7 illustrates the six-step process for
goal structure development, adapted to assurance case terminology.
As shown in the figure, this is an iterative process. The first step is to identify
the claims; specifically, the first action is the definition of the top-level claim, which
is the goal of the whole argumentation. Along with this, you need to define the
elements that complete the claim: at least a context and, if necessary, a justification
of why the top-level claim has been chosen, and possible assumptions.
Then, you need to develop a strategy to support the top-level claim by choosing
an argument. This element, too, can be completed by contexts and assumptions. Once the argument has been set up, you need to elaborate it by developing
the argumentation through different sub-claims (in the figure, you go back to
phase 1). The process of structuring the sub-claims follows the same instructions
used for the top-level claim.
Figure 3.7: Six step process
At the end, when all the claims and arguments have been defined, you need
to support them by attaching evidence to each sub-claim. However, if you are
defining a pattern, you may omit the evidence. In this case, it is up to the user
to choose the appropriate data and results to be used as evidence.
As a result of this six-step process, we have developed the “Accident
Avoidance Pattern” presented in Figure 3.8. The formalized documentation of this
pattern is reported in Appendix A.
Figure 3.8: Accident Avoidance Pattern
In this pattern, the intent is to argue that a specific accident, which could have severe consequences, can never occur in the future, by showing that
the identified hazards have been addressed. It can be used either stand-alone or in support of a
higher-level safety case in which safety is assured by showing the satisfaction of
system safety requirements and/or how different identified accidents can be avoided.
This pattern has been inspired by the Hazard Avoidance Pattern, but it provides a
more specific structure for the argumentation over the hazards that can lead to the
accident.
Since our objective is to assure that a similar accident can never happen again
in the future, we have chosen as top-level claim the statement “Accident X can never
occur using system Y”. So that it can be attached to another, higher-level assurance
case, we have marked it with the public indicator.
The justification of why the accident has been chosen is that “It can cause consequences U if some hazards haven’t been addressed”. Then, we need to define which
kind of system and which accident we are talking about: the two context elements
“System X operating role and context” and “Description of the accident Y” are employed for this purpose.
A sub-claim that better clarifies the top-level claim follows: it states that “All
possible hazards in accident Y have been addressed”. The context element “Context of
identified hazards” is used to list all the hazards discovered through the reports and
analyses. The assumption “All possible hazards in accident Y have been identified”
assumes that no hazards other than those experienced in this accident can occur
using this system. If this assumption is considered too strong, it can be supported by
one or more pieces of evidence.
After this, we need to develop the strategy. In this case, as in the Hazard
Avoidance Pattern, the strategy is an “argument over each hazard”. So, we need
to specify that “each hazard W has been mitigated or eliminated”. Unlike
the other patterns, in our case the argumentation continues by addressing
each hazard and showing the specific proposed solution.
In this phase, we can directly use the results from ECFMA, where solutions have
already been structured and related to the problems. The claim “Solution Z will
avoid hazard W” is constructed from this information. As depicted in the figure,
we have used the multiplicity element to describe the argumentation over more than
one hazard and to provide more than one solution to each hazard, if necessary.
As usual, evidence is not described in the pattern, so once the pattern is
instantiated it needs to be provided and attached to prove that the solutions are
effective.
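The structure described above can be sketched in code. The following Python fragment is only an illustration and is not part of the formalized documentation in Appendix A: the names `Node`, `instantiate_pattern` and `count` are hypothetical, and the claim texts simply follow the wording used in this section. Evidence is left out on purpose, as in the pattern itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the argument (hypothetical representation)."""
    kind: str   # "claim", "strategy", "context", "assumption" or "justification"
    text: str
    children: list = field(default_factory=list)


def instantiate_pattern(accident, system, consequences, hazards):
    """Build the argument tree for one accident.

    `hazards` maps each hazard to a list of proposed solutions, which plays
    the role of the multiplicity elements. No evidence is attached here:
    it has to be supplied by whoever instantiates the pattern.
    """
    top = Node("claim", f"{accident} can never occur using {system}")
    top.children += [
        Node("justification",
             f"It can cause {consequences} if some hazards haven't been addressed"),
        Node("context", f"{system} operating role and context"),
        Node("context", f"Description of {accident}"),
    ]
    sub = Node("claim", f"All possible hazards in {accident} have been addressed")
    sub.children += [
        Node("context", "Context of identified hazards"),
        Node("assumption", f"All possible hazards in {accident} have been identified"),
    ]
    strategy = Node("strategy", "Argument over each hazard")
    for hazard, solutions in hazards.items():
        hazard_claim = Node("claim", f"{hazard} has been mitigated or eliminated")
        for solution in solutions:
            hazard_claim.children.append(
                Node("claim", f"{solution} will avoid {hazard}"))
        strategy.children.append(hazard_claim)
    sub.children.append(strategy)
    top.children.append(sub)
    return top


def count(node, kind):
    """Count the nodes of a given kind in the tree rooted at `node`."""
    return int(node.kind == kind) + sum(count(c, kind) for c in node.children)


# Placeholder instantiation with two hazards, the second having two solutions.
example = instantiate_pattern(
    "Accident X", "system Y", "consequences U",
    {"hazard W1": ["solution Z1"], "hazard W2": ["solution Z2a", "solution Z2b"]},
)
```

In this sketch every hazard claim sits at the same depth of the tree and every solution claim one level below it, which mirrors the layout of the pattern.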
Chapter 4
Case studies
In this chapter we present two case studies which have been performed to illustrate
how the developed methodology works in practice. They have been taken from two
different safety-critical domains: the first belongs to the aerospace domain and the
second concerns a critical communication network.
4.1 DART spacecraft collision
4.1.1 Accident and system role context
The Demonstration of Autonomous Rendezvous Technology (DART) program began in May 2001. It was designated by NASA (National Aeronautics and Space
Administration) as a high-risk technology project, with the objective of demonstrating that a spacecraft could autonomously rendezvous with the orbiting Multiple Paths, Beyond-Line-of-Sight Communications (MUBLCOM) satellite,
without human intervention.
The mission was launched on April 15, 2005. DART operated as planned
during the first eight hours of the mission, accomplishing all objectives up to that time.
However, during proximity operations with MUBLCOM, the spacecraft started to use
much more fuel than expected. Approximately 11 hours after launch, DART
detected that its propellant supply was almost exhausted, and it began a series of
maneuvers for retirement. Although it was not known to ground personnel at the
time, DART had actually collided with MUBLCOM 3 minutes and 49 seconds before
initiating retirement. Out of a total of 27 defined mission objectives, DART had met only
11 by the end of the mission.
For this reason, NASA convened a Mishap Investigation Board (MIB). At
the end of this process, an overview of the DART mishap investigation results was
publicly released [26]. Most of the following information about the mission and
the system has been taken from this report.
The DART navigational system was guided by a pre-programmed, autonomous
software system designed to use data from both an Advanced Video Guidance Sensor
(AVGS) on DART and three Global Positioning System (GPS) receivers (two on
DART and one on MUBLCOM). Using a complex algorithm to combine data
from the AVGS and GPS sensors, the navigational system was meant to calculate
the velocity and position of DART relative to MUBLCOM in order to determine how to use its
thrusters to approach the satellite.
The DART Mission Plan consisted of four phases: (I) Launch and Early Orbit,
(II) Rendezvous, (III) Proximity Operations, and (IV) Departure and Retirement.
In the first phase, the DART spacecraft, together with its Pegasus launch vehicle,
was to be carried aloft by a carrier aircraft. From there, the Pegasus rocket
was to ignite, carrying DART into an early orbit below MUBLCOM. In the
Rendezvous phase, after completing systems checks, DART was to fire its
thrusters to move into a second phasing orbit; in this phase, the navigational system
was to be guided only by GPS data. In the third phase, DART was to
be led into MUBLCOM’s orbit; here, it was to use the AVGS data instead
of GPS data to perform a series of precise maneuvers in proximity to the satellite.
In the last phase, DART was to move away from MUBLCOM, expel its
remaining propellant, and remain in a retirement orbit.
4.1.2 ECFMA analysis
In order to determine why the DART spacecraft collided with MUBLCOM, we
have reconstructed the events using the description of the mishap provided by the MIB final
report. The causes and the possible solutions have been taken from the report
itself.
Figure 4.1 gives a bird’s-eye view of the ECFMA diagram built for the DART collision. Since
it is quite large, we have summarized the content of each element with just one word,
in the style of the ECF chart example in chapter 1 (Figure 1.1). The purpose of this
view is just to give an idea of the final structure of the ECFMA chart.
The chart has then been split into several parts in order to make it more readable
and to give details about every element. Figures 4.2 to 4.5 show these parts.
Figure 4.1: ECFMA chart of DART collision: bird’s-eye view
Figure 4.2: ECFMA chart of DART collision: initial events
Figure 4.3: ECFMA chart of DART collision: middle events
Figure 4.4: ECFMA chart of DART collision: upper conditions
Figure 4.5: ECFMA chart of DART collision: final events
As shown in the diagram, several causes contributed to the collision.
Specifically, we have identified 9 causes, which are listed in Table 4.1.
HAZARD | TYPE
Inadequate assessment of project technical risks and review of project’s risk level classification | ROOT CAUSE #1
Lack of adequate documentation of flight code changes, and pre-flight simulation and testing that had not taken these changes into account | ROOT CAUSE #2
Reuse of software architecture from a launch vehicle that was inadequate for autonomous space operations because of its lack of adaptability to unanticipated inputs | ROOT CAUSE #3
Failure to utilize lessons learned from past NASA projects | ROOT CAUSE #4
Inaccurate navigation system measurements from the primary GPS receiver to determine DART’s position and velocity relative to MUBLCOM | DIRECT CAUSE #1
Gain set at such a level that the calculations could never converge once the initial reset had happened, which caused the infinite reset loop | CONTRIBUTORY CAUSE #1
Waypoint for the switchover too small | CONTRIBUTORY CAUSE #2
The software logic of the collision avoidance system was dependent on the same navigation system | CONTRIBUTORY CAUSE #3
Ground operators did not have the capability to control DART remotely | CONTRIBUTORY CAUSE #4
Table 4.1: DART’s collision: list of identified hazards
First of all, the direct cause of the accident was the inaccurate measurements
from the primary GPS receiver. This failure was mitigated neither by the
collision avoidance system, because of its dependence on the same navigation system
(contributory cause), nor by ground personnel, because of the completely autonomous
nature of the mission (contributory cause). Moreover, the area for the switchover
from GPS to AVGS had been designed too small to tolerate such measurement
errors (contributory cause).
Since the project changed its significance during design and implementation, its risk level increased; however, it was not reviewed (root cause).
Moreover, even though it was an important, high-risk mission for NASA, it was
also designated as a low-budget project; for this reason, the reuse of an architecture
from a launch vehicle and the modifications to the flight code were performed without
thorough analysis and simulation (root causes). Specifically, a parameter causing
an infinite reset loop in the navigation system was changed without proper validation (contributory cause). Most of these problems would have been avoided by
applying lessons learned from past NASA projects (root cause).
4.1.3 Assurance case
After using ECFMA to analyze the accident, we need to create the assurance
case in order to argue that the proposed solutions to the identified problems will
actually avoid the same accident.
Figure 4.6 shows the instantiation of the Accident Avoidance Pattern for the DART
collision. As for the ECFMA chart, some excerpts have been provided to zoom in on
the details of the hazard and solution claims. Figure 4.7 shows the highest part of
the assurance case.
Figure 4.6: Assurance case of DART accident: bird’s-eye view
Figure 4.7: Assurance case of DART accident: top-level claim
The instantiation of the Accident Avoidance Pattern follows the process described in the previous chapter in Figure 3.7. The top-level claim “Collision with
MUBLCOM can never occur using DART spacecraft” refers to a kind of accident
whose consequences - “damages and the premature end of the mission” - are well
known. The three context elements - “DART operating role and context”, “Collision
with MUBLCOM context” and “Context of the identified hazards” - give the basis
for the argumentation; since these elements would be too large for the assurance case if
explained in full, we have indicated a reference to the previous section in which
the information has already been provided. The assumption about the identification of all the hazards is supported by the existence of a fully detailed investigative
report, the MIB report.
Figures 4.8 to 4.12 illustrate the argumentation provided for each hazard and
solution. It has been split into five parts, showing the details of the involved elements.
Figure 4.8: Assurance case of DART accident: first excerpt
Figure 4.9: Assurance case of DART accident: second excerpt
Figure 4.10: Assurance case of DART accident: third excerpt
Figure 4.11: Assurance case of DART accident: fourth excerpt
Figure 4.12: Assurance case of DART accident: last excerpt
In these other excerpts we can see the instantiation of the claims referring to hazards
and problems, obtained directly from the results of the ECFMA analysis. There are 9
discovered causes (root, direct and contributory) and one solution for
each of them. In order to demonstrate the effectiveness of this argumentation, we
need to support the claims with evidence. Since we were not able
to perform a thorough and complete analysis on a real system, we have attached a list of
possible evidence items that engineers can use to prove the proposed solutions. This is a
common practice when, at a certain stage of safety case development, some elements
remain uninstantiated or undeveloped.
Each attached evidence item denotes a category rather than a specific kind of analysis
or tool: for example, we have indicated “Reliability testing results” to support the
claim that “the use of a minimum fault tolerant system will avoid the hazard”, so
engineers could use different tools, such as Reliability Block Diagrams or Fault Tree
Analysis, to obtain these results. The information about all the possible kinds of evidence - both qualitative and quantitative - that can be used to support an argumentation
has been taken from a survey on provision of evidence for safety certification [27].
In this way, the attachment of evidence completes the construction of the assurance case that
structures the knowledge of the DART collision.
4.2 Multistate 911 outage
4.2.1 Accident and system role context
During the night between April 9 and April 10, 2014, a 911 call-routing facility
in Englewood, Colorado, stopped routing 911 calls to seven U.S. states
- California, Florida, Minnesota, North Carolina, Pennsylvania, South Carolina,
and Washington - causing the failure of about 7,000 emergency calls. The loss
of 911 service affected more than 11 million people and lasted six
hours. Fortunately, there were no deaths or severe injuries as a result of the emergency
communication loss.
The accident involved the Next Generation 911 (NG911) system, a new
network that relies on an IP-based architecture instead of the traditional circuit-switched time division multiplexing (TDM) architecture, with the aim of providing
new capabilities such as dynamic call routing and video transmission. However, as
demonstrated by this outage, there are also new challenges regarding reliability and
safety. This is because, as 911 has become a more technological network, the interaction of new and old systems has introduced new vulnerabilities and hazards.
In order to understand the problems in such an evolving infrastructure and to improve
its complete deployment, the Public Safety and Homeland Security Bureau
(PSHSB) investigated this accident, concluding the investigation with a final
report released in October 2014 [28].
On the day of the outage, the 911 architecture was in a transition stage between
the conventional 911 network and NG911. Different companies are involved in this
infrastructure. The main ones are Intrado, a provider of 911 and emergency communications infrastructure, systems, and services for state and local public safety agencies
throughout the United States, and CenturyLink, which maintains Washington’s
Emergency Services IP Network. Several service providers originate emergency
calls to be routed through the network to an answering center.
Figure 4.13, taken from the report [28], depicts the transition architecture used in
the State of Washington at the time of the accident. The red elements are managed
by Intrado, the green ones by CenturyLink. The Englewood
facility in which the failure originated is also indicated.
Figure 4.13: Washington NG911 Transition architecture [28]
In this infrastructure, “a caller dials 911, and the call is routed through the
network of the originating service provider to one of four Intrado gateways serving
Washington State, two in the Seattle area (western Washington State), and two in
the Spokane area (eastern Washington State). This gateway, which converts the
signal from TDM (Time Division Multiplexing) to IP, is also 911-aware and queries
other databases to determine the primary Internet Protocol Selective Router (IPSR)
for the PSAP that serves the caller’s location. Under normal conditions, the gateway
then routes the call to the primary IPSR through a managed IP network, some
of which belongs to CenturyLink and other parts of which are provided for those
purposes by Intrado. The IPSR is also 911-aware. It queries various databases
(shown as “911 DB” in Figure 4.13) to identify the correct Public Safety Answering
Point (PSAP) and to properly address packets to that PSAP. The call is then routed
through the “CenturyLink IP Network” to the PSAP. The IPSR is no longer located
in the local exchange carrier (LEC) central office, or even in Washington State, but
is now in Colorado, with a single “manual failover” backup in Florida. As is often
the case in conventional 911 architecture, databases are also located in other states.”
[28]
Since both traditional TDM and IP-based service calls need to be routed through
the same architecture, engineers have provided the IPSR with the capability of assigning
a PSAP Trunk Member (PTM) when a TDM call is served by that IPSR. As we
will see in the next section, this function was involved in the proximate cause of
the outage.
4.2.2 ECFMA analysis
Starting from the facts and analysis published in the PSHSB report and using the
previous architecture description, we have performed the ECFMA analysis. As in the
DART case study, we provide both a bird’s-eye view and excerpts of the ECFMA chart.
Figure 4.14: ECFMA chart of 911 outage: bird’s-eye view
In this outage, we have identified 6 main causes. They are listed in the table
below.
HAZARD | TYPE
Lack of network workload analysis | ROOT CAUSE #1
Call control and management functions not adequately balanced among ECMC facilities | ROOT CAUSE #2
Too much dependence on few critical elements without adequate safeguards in place | ROOT CAUSE #3
Low threshold for PTM counter | DIRECT CAUSE #1
Inadequate alarm management to generate an alarm for a major outage | CONTRIBUTORY CAUSE #1
Lack of communication among different involved providers | CONTRIBUTORY CAUSE #2
Table 4.2: 911 outage: list of identified hazards
The proximate cause of the outage was a software error regarding the PSAP
Trunk Member (PTM) counter, which tracks the number of calls handled by the
IPSR. It has been assessed that the threshold of the PTM counter was too low for the
generated workload. This setting stems from the lack of a correct workload analysis
on the network (root cause).
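The effect of an undersized counter threshold can be illustrated with a toy model. The sketch below is not the IPSR software, whose internals are not described here: the class `PtmCounter`, the function `route_calls` and the numeric limits are all hypothetical, and the behaviour on reaching the limit (no further assignment, so the call is not routed) is an assumption made for illustration.

```python
class PtmCounter:
    """Toy counter that hands out one identifier per routed call.

    Assumption for illustration: once the counter reaches `limit`,
    no further identifier can be assigned and the call is not routed.
    """

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def assign(self):
        if self.count >= self.limit:
            return None       # threshold reached: the call cannot be routed
        self.count += 1
        return self.count


def route_calls(counter, n_calls):
    """Return (routed, failed) for a workload of `n_calls` calls."""
    routed = sum(1 for _ in range(n_calls) if counter.assign() is not None)
    return routed, n_calls - routed


# A limit below the workload makes calls fail; a limit derived from a
# workload analysis does not.
undersized = route_calls(PtmCounter(limit=5), 8)
adequate = route_calls(PtmCounter(limit=10), 8)
```

With a workload of 8 calls, the undersized counter routes 5 calls and fails 3, while the adequately sized one routes all of them. This is the kind of relation between threshold and workload that a workload analysis is meant to check.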
Moreover, the report identifies two architectural problems, regarding the balance
of call control and management functions and the overload of a few facilities in routing
the calls, respectively (root causes).
Finally, after the outage it was difficult to pinpoint the facility generating the
problem. Two contributory causes have been discovered: an inadequate alarm system, which generated only low-level alarms in response to the outage, and a lack
of communication among service providers, which prevented them from helping each other locate the problem.
The effect was a delay of six hours in resolving the outage and recovering the emergency infrastructure. Figures 4.15 to 4.17 show the detailed events with the related
causes.
Figure 4.15: ECFMA chart of 911 outage: initial events
Figure 4.16: ECFMA chart of 911 outage: accident
Figure 4.17: ECFMA chart of 911 outage: post-accident events
4.2.3 Assurance case
Figure 4.18 illustrates the bird’s-eye view of the assurance case derived for the 911 outage. Again,
as for the DART case, we have extracted excerpts to better show the details of the argumentation. Figure 4.19 shows the highest part of the assurance case.
The creation of the case follows the same six-step process used for the previous
DART case. The first element to be instantiated is the top-level claim “Multistate
911 outage can never occur using NG911 infrastructure”, which has been chosen
because the outage can cause “potential damages and injuries”. Note that the outage did not
cause deaths or other severe losses, but they could have occurred.
Moreover, this case can be used as support in a higher-level assurance case in
which we want to argue about the reliability of the system, instead of its safety.
The three context elements - “NG911 operating role and context”, “Multistate 911
outage” and “Context of the identified hazards” - give the basis for the argumentation;
as in the previous case, we have used a reference to the description section in which
the information has already been provided, in order to make the assurance case more
readable. The assumption about the identification of all the hazards is supported
by the PSHSB report, released after a five-month investigation.
Figure 4.18: Assurance case of 911 outage: bird’s-eye view
Figure 4.19: Assurance case of 911 outage: top-level claim
Excerpts about hazards and the related solutions provided by ECFMA follow
in the assurance case (Figures 4.20 to 4.22). In this case study, we have identified six causes. The solutions have been taken directly from the analysis chapters
of the report. In this example, we have also identified two possible solutions
to eliminate a single hazard: specifically, the “inadequate alarm management” hazard can be
solved by both “an adaptive alarm management” and “the update of alarm severity
and troubleshooting instructions”.
Finally, in order to complete the argumentation, we need to attach the evidence
to the solution claims. Wherever possible, we have used concrete evidence. This
is the case of the solution claim “higher limit value for PTM”, supported by the
“Post-accident actions” evidence, since the limit was modified some days after
the outage and its effectiveness as a countermeasure has been demonstrated. However,
for all other solutions we have attached a list of possible evidence items that engineers
can use to prove them. Again, we have used as generic evidence the suggestions
provided in the survey from reference [27].
Figure 4.20: Assurance case of 911 outage: first excerpt
Figure 4.21: Assurance case of 911 outage: second excerpt
Figure 4.22: Assurance case of 911 outage: third excerpt
4.3 Discussion on the methodology
In the previous sections we have presented two case studies; we now use their results
to discuss the methodology. We want to show the improvement in accident knowledge by comparing the use of the sole list of recommendations, issued by
investigative agencies in their final reports, with the use of our methodology.
First of all, we focus on the nature of recommendations and their effectiveness.
As indicated in the NTSB Investigative process in reference [2], safety recommendations “are based on the findings of investigation, and may address deficiencies which
do not pertain directly to what is ultimately determined to be the cause of the
accident”. This means that they address underlying problems and organizational
deficiencies, most of them corresponding to the discovered root causes.
In Table 4.1 of Section 4.1.2 we have listed the identified causes for the DART
case study. Table 4.3 indicates, instead, the corresponding safety
recommendations that address the hazards, as named in the report, if available.
DART’s HAZARD | RECOMMENDATION
ROOT CAUSE #1 | “Risk posture management”
ROOT CAUSE #2 | “Guidance, Navigation and Control (GN&C) Software Development Process”
ROOT CAUSE #3 | “High Risk, Low Budget Nature of the Procurement”
ROOT CAUSE #4 | “Lessons Learned Analysis”
DIRECT CAUSE #1 | “FMEA” recommendation
CONTRIBUTORY CAUSE #1 | none
CONTRIBUTORY CAUSE #2 | none
CONTRIBUTORY CAUSE #3 | “FMEA” recommendation
CONTRIBUTORY CAUSE #4 | none
Table 4.3: DART’s collision: correspondence with recommendations
As we can see, 6 out of 9 causes (about 67%) have been reported in the final recommendations with suggested ways to correct the problems. Among them, there are
all the root causes and the direct cause, but only 1 out of 4 contributory causes. In
fact, the remaining causes not mentioned in the recommendations list are related to
specific, technical problems: the incorrect gain parameter, which caused the
infinite reset loop, the small waypoint and the lack of remote control from the
ground. Moreover, among the covered causes, 4 out of 6 refer to the “process”, intended as
the design and operation actions that people should take, while the remaining 2 deal
with the final “product”, which is the system itself.
The correspondences for 911 outage have been summarized in the following table.
911’s HAZARD | RECOMMENDATION
ROOT CAUSE #1 | none
ROOT CAUSE #2 | none
ROOT CAUSE #3 | none
DIRECT CAUSE #1 | none
CONTRIBUTORY CAUSE #1 | none
CONTRIBUTORY CAUSE #2 | “Contractual relationship monitoring”
Table 4.4: 911 outage: correspondence with recommendations
In this case, only 1 out of 6 causes (about 17%) has been worked out in detail in the final list of recommendations. It is a contributory cause regarding the communication
problem - an organizational deficiency - and it refers to the “process” of managing
the system. Solutions and countermeasures for the other hazards, most of which
are technical issues, are described throughout the report. However, they are not referenced by the safety recommendations; there is just one recommendation, “Develop and Implement
NG911 Transition Best Practices”, in which the authors state that they have “shed light on a number of measures that providers can take to improve service reliability during this
transition”, although they do not give any reference to these measures.
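The coverage figures for the two case studies can be recomputed from the correspondences between causes and recommendations. The following Python sketch is only a worked check of this arithmetic; the function name `coverage` and the dictionary layout are our own choices for illustration, with `None` standing for a cause that has no corresponding recommendation.

```python
def coverage(correspondence):
    """Return (covered, total, percentage) for a cause -> recommendation map.

    A cause counts as covered when its recommendation is not None.
    """
    covered = sum(1 for rec in correspondence.values() if rec is not None)
    total = len(correspondence)
    return covered, total, 100.0 * covered / total


dart = {
    "ROOT CAUSE #1": "Risk posture management",
    "ROOT CAUSE #2": "Guidance, Navigation and Control (GN&C) Software Development Process",
    "ROOT CAUSE #3": "High Risk, Low Budget Nature of the Procurement",
    "ROOT CAUSE #4": "Lessons Learned Analysis",
    "DIRECT CAUSE #1": "FMEA",
    "CONTRIBUTORY CAUSE #1": None,   # gain parameter
    "CONTRIBUTORY CAUSE #2": None,   # small waypoint
    "CONTRIBUTORY CAUSE #3": "FMEA",
    "CONTRIBUTORY CAUSE #4": None,   # no remote control from ground
}

outage_911 = {
    "ROOT CAUSE #1": None,
    "ROOT CAUSE #2": None,
    "ROOT CAUSE #3": None,
    "DIRECT CAUSE #1": None,
    "CONTRIBUTORY CAUSE #1": None,
    "CONTRIBUTORY CAUSE #2": "Contractual relationship monitoring",
}

for name, table in (("DART", dart), ("911 outage", outage_911)):
    covered, total, pct = coverage(table)
    print(f"{name}: {covered}/{total} causes covered ({pct:.1f}%)")
# DART: 6/9 causes covered (66.7%)
# 911 outage: 1/6 causes covered (16.7%)
```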
With these two case studies, we have shown that the list of recommendations is
not enough to cover all the causes and avoid a similar accident. This is due to
the nature of the recommendations, which mainly deal with high-level problems. Instead,
our approach aims to work out all the causes - root, direct and contributory - using all the knowledge provided by the reports. In this way, we can eliminate both
organizational deficiencies and technical problems, so that we can avoid both the specific
hazards that emerged in the accident and potential problems that did not emerge
in the episode.
The other main problem of safety recommendations is their understandability. We
can define this property as the quality of information which makes it understandable
by people with reasonable background knowledge of business and technical activities.
If we imagine an organization that is not directly involved in the accident but
that can experience a similar mishap in the same domain, the simplest way to reuse
this knowledge is to read and apply the published recommendations. However,
these statements are addressed to the stakeholders in such a way that it is difficult for this
third-party organization, not involved in the episode, to implement the advice.
In our approach, an organization can use all the knowledge provided by the final
reports - not only the recommendations, but also facts, analyses and descriptions,
even from other documents - to reconstruct events and conditions. After this phase,
performed through ECFMA, engineers can structure the knowledge in an argumentation. The use of ECFMA, in which solutions have already been connected to the
problems, is useful to identify and solve the problems, while the instantiation of
the assurance case pattern is performed to demonstrate the validity of the solutions
themselves. In fact, even if some solutions are taken from the same recommendations,
their effectiveness is argued through the use of specific evidence, which
engineers can use to confirm or refute the argumentation.
Another property of the knowledge that we are elucidating, related to understandability,
is learnability. We can consider the concept behind this property using the standard ISO/IEC 9126-1:2001 on product quality, in which
learnability is defined as “the capability of the software product to enable the user
to learn its application”. In our context, we can define it as “the capability of a
piece of information to enable the user to learn its application” [29].
We can imagine a spectrum: the best knowledge is a piece of information in which,
given a problem, there is a straightforward solution to work it out, while the worst
is a piece of information with just an identified problem, without any suggestion on
how to solve it. In our case, we have seen that the nature of recommendations
is deliberately general, with a brief, textual synthesis of a problem and a generic
suggestion on how to mitigate it. This is true not only for high-level
problems, such as the organizational deficiencies or the communication troubles,
but also for the few technical issues that are reported. For example, in the DART
case study the “FMEA recommendation” states that “NASA should define the
minimum fault tolerance required for spacecraft performing rendezvous missions in
order to protect space assets from collision”. It can be considered a requirement
more than an indication of how to design and develop a fault-tolerant system for the
spacecraft. Instead, in our methodology we can insert in the argumentation both
technical countermeasures, which are usually found in the report analyses, and
solutions from the recommendations. In this way, even if they are considered as
requirements to be implemented, their effectiveness is proved by evidence. This
makes the gained knowledge easier to apply in practice.
Moreover, along with understandability and learnability, we can discuss the
reusability of our methodology, where this property is defined as the ability of an
item to be used repeatedly. In our context, we can imagine engineers using the results
of our methodology as an asset. If they want to identify a similar occurrence in an
accident catalogue, they need to read the report and find the causes
and solutions in order to verify whether these elements are relevant for them.
This operation can be performed by reusing the assurance case previously instantiated
through the Accident Avoidance Pattern, in which problems and solutions
are well defined and supported by evidence. For example, a problem such as “the
inadequate balance of call control and management functions” in the 911 outage
can occur again in another emergency communication network: if the engineers
consider this problem relevant for them, they can apply the provided solution
about “the reviewed ingress trunking configuration distribution”. By demonstrating
the effectiveness of the solution through the “configuration management plan” and
the “performance testing results”, they will have avoided a potential undiscovered
hazard. Moreover, all the problems are described at the same level of the assurance
case, and all the solutions are represented at the same sub-level. These features
allow the whole artifact, or parts of it, to be used repeatedly, even by other engineers
operating in the same domain.
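As a rough illustration of this kind of reuse, the following sketch stores the hazard branches of an instantiated assurance case in a small catalogue and retrieves those that an engineer might consider relevant. The names HazardEntry and find_relevant, and the keyword-matching rule, are assumptions introduced only for this example; the single entry is taken from the 911 outage case discussed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HazardEntry:
    """One hazard branch of an instantiated Accident Avoidance Pattern."""
    accident: str
    hazard: str
    solutions: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


def find_relevant(catalogue: List[HazardEntry], keywords: List[str]) -> List[HazardEntry]:
    """Return entries whose hazard text contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [e for e in catalogue if all(k in e.hazard.lower() for k in wanted)]


# Hypothetical catalogue with one entry, built from the 911 outage example.
catalogue = [
    HazardEntry(
        accident="April 2014 multistate 911 outage",
        hazard="inadequate balance of call control and management functions",
        solutions=["reviewed ingress trunking configuration distribution"],
        evidence=["configuration management plan", "performance testing results"],
    ),
]

matches = find_relevant(catalogue, ["call control"])
for entry in matches:
    print(entry.hazard, "->", entry.solutions, "supported by", entry.evidence)
```

Keyword matching is only a stand-in here: deciding whether a past problem is relevant to a new system remains an engineering judgement.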
Finally, a quality of our assurance case pattern is its flexibility. This refers to two
features. The first one is its possible use either as a stand-alone assurance case or
as support to a higher-level assurance case; the second one concerns the system
property that is argued in the higher-level assurance case. In fact, the top-level
claim referring to an accident, namely “Accident X can never occur using system Y”,
allows such a higher-level case to argue not only over the system’s safety, but
also over other properties, such as reliability, availability or maintainability, by proving
that different identified accidents cannot occur in the concerned system.
There are also some possible drawbacks in our methodology. One of them could
be the lack of a logical relationship in the assurance case between root causes, direct
causes and contributory causes, which are treated at the same level. However, the use
of the context element about the identified hazards makes the nature of each
cause clear. Moreover, in order to avoid a similar accident we claim that it is necessary
to address all the hazards, even the ones which have been contributory causes,
since addressing them would have mitigated the mishap. For this reason, it may be easier to
argue them one by one at the same level of the assurance case by developing the
argumentation in a vertical way.
Regarding the hazards, we assume that all the possible hazards highlighted by
the accident have been identified by the investigative agency. This assumption is
quite acceptable, since the investigation lasts many months, involves contracting
authorities, system providers, regulators and emergency bodies, and is conducted
by a Board that has experience of similar accidents in the same domain. However, it
is possible that other hazards for the same system have not turned up in this specific
accident, and their absence is quite challenging to assure completely. Our approach
nevertheless aims to reduce the risk of new potential problems by solving the root causes and,
in this way, minimizing the probability of other undiscovered hazards.
Future work
We have also highlighted some features of our methodology that can be
revisited in the future. One of them is the use of the Accident Avoidance Pattern
attached to a higher-level assurance case. In our case studies we have considered
the instantiation of the pattern as stand-alone, without making claims about any system
property. If, instead, it is used in a higher-level assurance case, that case could argue over both the
addressing of system requirements and the avoidance of several identified accidents.
In this case, some solutions provided to solve discovered hazards may
already have been implemented according to a system requirement, so that they would
be redundant in the assurance case. For this reason, it is valuable to identify these
overlapping claims and eliminate the redundancy.
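A minimal sketch of how such overlapping claims could be flagged is given below. It assumes that the solutions of the instantiated pattern and the claims of the higher-level case are available as plain text, and it treats two claims as overlapping only when they are identical after normalising case and whitespace. The function names and the sample strings are hypothetical; a realistic comparison would need a richer notion of equivalence and a manual review.

```python
from typing import Iterable, List


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different wordings compare equal."""
    return " ".join(text.lower().split())


def redundant_solutions(pattern_solutions: Iterable[str],
                        requirement_claims: Iterable[str]) -> List[str]:
    """Return the pattern solutions already covered by a system-requirement claim."""
    covered = {normalise(claim) for claim in requirement_claims}
    return [s for s in pattern_solutions if normalise(s) in covered]


# Hypothetical claims, named after the placeholders used in the pattern.
pattern_solutions = ["Solution Z1 is implemented", "Solution Z2 is implemented"]
requirement_claims = ["solution  z1 is implemented"]

overlap = redundant_solutions(pattern_solutions, requirement_claims)
print(overlap)
```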
Moreover, we can evaluate other properties of the methodology through the use
of a real case. One of them could be efficiency: we could show that our approach
also saves time when the recommendations are not clear and specific enough to be
applied directly, as happens in most cases. However, this property could be evaluated
only through quantitative measurements of the time spent in applying either
the methodology or the list of recommendations in a real case.
Conclusion
In this thesis the goal was to develop a methodology for reusing and elucidating
accident knowledge in safety critical domains. We have used the concepts of the
new standards on assurance cases and GSN (Sections 2.2 and 2.3). The assurance
case is a novelty in research, since it represents a generalization of the safety case,
which is, instead, a de facto standard in industry for the certification of safety
critical systems. Further, the thesis’ contribution includes a survey of works
on the safety case lifecycle (Section 2.5.1), which, despite its relevance, has not
been deeply developed in the research area of safety assurance. Moreover, we have
illustrated ECFA (Section 3.2.1), an investigation tool used by public agencies to
reconstruct what happened in an accident. By combining these elements we have
developed a methodology whose results can be used as an agreement between a
supplier and an acquirer, or as an asset by developers to increase knowledge of a
safety critical domain.
We have also highlighted some features of our methodology that can
be revisited in the future. For example, applying the approach to a real case
would be an excellent way to evaluate its efficiency and its use as part of a complex
assurance case. These ideas are in line with the objective with which we began
this thesis: improving the knowledge of computer systems employed in
social domains.
Appendix A
Accident Avoidance Pattern formalization
The following table reports the formalized documentation of the Accident Avoidance
Pattern, in the format described by Kelly and presented in Section 2.4.2.
Accident Avoidance Pattern
Author: Mirko Napolano
Created: 17/11/2014
Last modified: 13/02/2015
Intent
The intent is to argue that a specific accident, which could generate severe
consequences, can never occur in the future, by showing that the
identified hazards have been addressed. It can be used either stand-alone or as support
for a higher-level safety case in which safety is assured by showing the
satisfaction of safety system requirements and/or how different identified
accidents can be avoided.
Motivation
This pattern has been inspired by the Hazard Avoidance Pattern, but it
provides a more specific structure for the argumentation over the hazards
that can lead to the accident.
By defining the hazards for a specific accident, arguing about the solutions that avoid them is easier than arguing over the safety of the whole
system.
Moreover, it can be attached to a higher-level assurance case in which the
avoidance of different accidents is argued along with the addressing of
system design requirements.
Structure
Participants
accidentAvoidance: it is the top-level claim, which defines the objective of the pattern. This is a public goal, which may be referenced by a
higher-level safety case that argues over the safety of the system by avoiding different accidents. The linking can be performed using an away goal
reference.
accidentCause: this is used to justify that, if the {accident Y} is not
prevented, it will cause {consequences U}.
systemDef: it describes the characteristics of the concerned system and
its operating role. If it has been already used in a higher-level safety
case, it can be omitted.
accidentContext: this element describes the context of the {accident
Y}.
hazAddr: it explains why the top-level claim is true, stating that
the hazards in the {accident Y} have been addressed. The goal is supported by the argumentation that each identified hazard has been mitigated or eliminated.
hazDef: it describes the nature and the characteristics of the identified
hazards.
hazIdent: this assumption claims that all the hazards that can lead to
the {accident Y} have been discovered and identified. If needed,
evidence supporting this assumption can be attached.
argHazAddr: this strategy provides an argumentation for the attenuation of the identified hazards by discussing each of them separately.
There can be more than one hazard for {accident Y}.
specHazAddr: This goal is used to claim that the specific {hazard W}
has been attenuated, either by mitigating or eliminating it.
specHazContext: This element describes the context of the {hazard
W}.
specHazSolution: This element introduces the {solution Z} for the
{hazard W}. There could be more than one solution for the hazard W.
Collaborations
• accidentAvoidance introduces the claim about the avoidance of
an identified accident. This claim is supported by the argumentation over the n addressed hazards introduced by hazDef.
• hazIdent is useful to assume that no other hazards from the {system X} can contribute to the {accident Y}.
• all the specHazSolution elements together provide a complete explanation
of how the hazards are resolved.
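To make the relationships among the participants more concrete, the following sketch represents the pattern as a small tree and instantiates it with the placeholders used above. It is only an illustration under stated assumptions: the Node class, the instantiate and count functions and the claim wordings are hypothetical, and the GSN element kinds assigned to accidentCause (justification) and systemDef (context) are read off the descriptions in the Participants list rather than taken from the Structure figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str   # participant name from the pattern
    kind: str   # assumed GSN element kind: goal, strategy, context, ...
    text: str
    children: List["Node"] = field(default_factory=list)


def instantiate(system: str, accident: str, consequences: str,
                hazards: Dict[str, List[str]]) -> Node:
    """Build the argument; `hazards` maps each hazard to its solutions."""
    strategy = Node("argHazAddr", "strategy",
                    "Argument over each identified hazard separately")
    for hazard, solutions in hazards.items():
        spec = Node("specHazAddr", "goal",
                    f"{hazard} has been mitigated or eliminated")
        spec.children.append(
            Node("specHazContext", "context", f"Context of {hazard}"))
        for solution in solutions:
            spec.children.append(
                Node("specHazSolution", "goal", f"{solution} addresses {hazard}"))
        strategy.children.append(spec)
    haz_addr = Node("hazAddr", "goal", f"Hazards in {accident} have been addressed", [
        Node("hazDef", "context", "Nature and characteristics of the identified hazards"),
        Node("hazIdent", "assumption",
             f"All hazards that can lead to {accident} have been identified"),
        strategy,
    ])
    return Node("accidentAvoidance", "goal",
                f"{accident} can never occur using {system}", [
        Node("accidentCause", "justification",
             f"If not prevented, {accident} will cause {consequences}"),
        Node("systemDef", "context", f"Characteristics and operating role of {system}"),
        Node("accidentContext", "context", f"Context of {accident}"),
        haz_addr,
    ])


def count(node: Node, name: str) -> int:
    """Count the nodes called `name` in the subtree rooted at `node`."""
    return (node.name == name) + sum(count(c, name) for c in node.children)


# Placeholder values mirroring {system X}, {accident Y}, {hazard W}, {solution Z}.
case = instantiate("system X", "Accident Y", "consequences U",
                   {"Hazard W": ["Solution Z"]})
print(case.text, "| hazards argued:", count(case, "specHazAddr"))
```

A tree is sufficient here because, in this reading of the pattern, each element has a single parent; a link to a higher-level case through an away goal would need an additional reference outside this structure.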
Applicability
This is a specific pattern that can support a higher-level hazard-directed
safety argument, in which the system safety is argued by avoiding different identified accidents. This pattern should be applied when the hazards
in the concerned accident are clear and have been completely identified.
One possibility is to apply this pattern by reusing the experience of an
accident or an event that happened in the concerned context.
The elements accidentContext, hazDef and specHazContext need
to be clearly described.
Consequences
After instantiating this pattern a number of undeveloped goals (n, where
n = # of identified hazards) will remain:
• specHazSolution (n of): for each solution it is necessary to support this claim, namely that the implemented solution will actually avoid the related hazard. Both qualitative and quantitative
evidence can be provided to support this claim.
In addition, if the hazIdent assumption needs to be supported,
evidence supporting it should be attached.
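The bookkeeping implied here can be sketched as follows: given the specHazSolution claims of an instantiated pattern and the evidence attached to each so far, the function lists the claims that are still undeveloped. The function name and the sample data are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def undeveloped(solution_evidence: Dict[str, List[str]]) -> List[str]:
    """Return the specHazSolution claims that still have no supporting evidence."""
    return [claim for claim, evidence in solution_evidence.items() if not evidence]


# Hypothetical state of an instantiated pattern with two hazards.
solution_evidence = {
    "Solution Z1 avoids hazard W1": ["test report"],
    "Solution Z2 avoids hazard W2": [],
}

remaining = undeveloped(solution_evidence)
print(remaining)
```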
Implementation
A top-down approach should be used by instantiating the goals and the
related contexts before the strategy element.
This pattern assumes that all the possible hazards in the concerned accident have already been identified. The hazards should be discussed
one by one, each starting from the specHazAddr claim and proceeding to the required
evidence.
Possible pitfalls
• Not correctly describing the concerned accident in the upper part of
the pattern may lead to an ambiguous explanation of the assurance
case.
• Not exhaustively identifying all the hazards in the concerned accident described in the hazDef context element may lead to an incomplete argumentation on the accident avoidance.
• Not providing all the needed evidence supporting the specHazSolution claim may lead to an unconvincing argumentation on
how the hazards are addressed.
Example
Related Patterns
• Hazard Avoidance Pattern: this is a pattern that assures the
safety of a system by arguing over different identified hazards.
However, it is too generic to be directly used to assure a system’s
property in a specific situation.
Bibliography
[1] O. Bowen and V. Stavridou, “Safety Critical Systems, Formal Methods and
standards”, Software Engineering Journal, 1982
[2] U.S. National Transportation Safety Board (NTSB), “The Investigative Process”, http://www.ntsb.gov/investigations/process/Pages/default.aspx
[3] C.C. Lebow, L.P. Sarsfield, W.L. Stanley, E. Ettedgui, and G. Henning, “Safety in the Skies: Personnel and Parties in NTSB Aviation Accident
Investigations”, Santa Monica, California: RAND, 1999
[4] U.S. Department Of Energy (DOE) Handbook, “Accident and Operational
Safety Analysis Volume I: Accident Analysis Techniques”, July 2012
[5] The Dutch Safety Board, “Crashed during approach, Boeing 737-800, near Amsterdam Schiphol airport, 25 February 2009”, May 2010
[6] “IEEE Standard Adoption of ISO/IEC 15026-1 - Systems and Software Engineering - Systems and Software Assurance - Part 1: Concepts and Vocabulary”,
November 2014
[7] “IEEE Standard Adoption of ISO/IEC 15026-2:2011 - Systems and Software
Engineering - Systems and Software Assurance - Part 2: Assurance case”,
September 2011
[8] “IEEE Standard Adoption of ISO/IEC 15026-3 - Systems and Software Engineering - Systems and Software Assurance - Part 3: System integrity levels”,
June 2013
[9] “IEEE Standard Adoption of ISO/IEC 15026-4 - Systems and Software Engineering - Systems and Software Assurance - Part 4: Assurance in the life cycle”,
August 2013
[10] S. Wilson, J. McDermid, P. Fenelon and P. Kirkham, “No More Spineless safety
Cases: A Structured Method and Comprehensive Tool Support for the Production of Safety Cases”, 2nd International Conference on Control and Instrumentation in Nuclear Installations (INEC’95), Cambridge, UK 1995
[11] S. Toulmin, “The Uses of Argument”, (1958; 2nd edn, 2003)
[12] “GSN Community Standard Version 1”, November 2011,
http://www.goalstructuringnotation.info
[13] C. Alexander, “A Pattern Language: Towns, Buildings, Construction”, Oxford
University Press, 1977
[14] T.P. Kelly, “Arguing Safety: A Systematic Approach to Managing Safety Cases”,
PhD Thesis, University of York, 1998
[15] E. Gamma, R. Helm, R. Johnson, J. Vlissides, “Design Patterns: Abstraction
and Reuse of Object-Oriented Design”, ECOOP’93 - Object Oriented Programming, 7th European Conference, Kaiserslautern, Germany 1993
[16] T.P. Kelly, J.A. McDermid, “Safety Case Construction and Reuse using Patterns”, in 16th International Conference on Computer Safety, Reliability and
Security, SAFECOMP, 1997
[17] W.S. Greenwell, E.A. Strunk, J.C. Knight, “Failure analysis and the safety-case lifecycle”, IFIP Working Conference on Human Error, Safety and System
Development (HESSD), Toulouse, France 2004
[18] W.S. Greenwell, “Pandora: An Approach to Analyzing Safety-Related Digital-System Failures”, PhD Thesis, University of Virginia, 2007
[19] A. Ruiz, I. Habli, H. Espinoza, “Towards a Case-Based Reasoning Approach for
Safety Assurance Reuse”, 1st Workshop on Next Generation of System Assurance Approaches for Safety Critical Systems (SASSUR), Magdeburg, Germany
2012
[20] Y. Matsuno, “A design and implementation of an Assurance case language”,
44th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN), 2014
[21] R. Alexander, T. Kelly, Z. Kurd, J. McDermid, “Safety cases for advanced control software: Safety case patterns”, Technical report, Department of Computer
Science, University of York, 2007
[22] E. Denney, G. Pai, “A formal basis for safety case patterns”, 2nd International
Conference, SAFECOMP, Toulouse, France 2013
[23] R. Hawkins, T. Kelly, “A software safety argument pattern catalogue”, Technical
report, The University of York, 2013
[24] T. Kelly, J. McDermid, “Safety case construction and reuse using patterns”, in
SAFECOMP, pages 55-69, 1997
[25] R. A. Weaver, “The Safety of Software - Constructing and Assuring Arguments”,
PhD thesis, Department of Computer Science, University of York, 2003
[26] NASA, “Overview of the DART Mishap Investigation Results, for Public Release”, May 2006
[27] S. Nair, J.L. de la Vara, M. Sabetzadeh, L.C. Briand, “An extended systematic
literature review on provision of evidence for safety certification”, Information
and Software Technology Volume 56, Issue 7, July 2014
[28] Public Safety and Homeland Security Bureau (PSHSB), “April 2014 Multistate
911 Outage: Cause and Impact”, October 2014
[29] International Organization for Standardization (ISO), “ISO Standard 9126-1:
Software engineering - Product quality - Part 1: Quality model”, 2001