Accident Avoidance Pattern: Improving Knowledge for Safety critical
Scuola Politecnica e delle Scienze di Base
Master's Degree in Computer Engineering
Master's Thesis in Computer Systems (Impianti di Elaborazione)

Accident Avoidance Pattern: Improving Knowledge for Safety critical domains

Academic Year 2013/2014

Supervisor: Prof. Domenico Cotroneo
Co-supervisors: Dr. Roberto Pietrantuono; Ing. Fumio Machida, NEC (Japan)
Candidate: Mirko Napolano, student ID M63000382

To Mum and Dad, who have supported me so much and made so many sacrifices to allow me to get this far.

Acknowledgements

First of all, I thank my supervisor, Prof. Domenico Cotroneo, who supported me during the thesis work and gave me the opportunity to carry out an important internship at the laboratories of NEC Corporation in Japan. Above all, he has been a guide and a constant point of reference for me.

I thank my co-supervisor, Ing. Roberto Pietrantuono, who kindly followed me in the preparation of the thesis, making his experience and his time available.

I also thank my supervisor at NEC, Ing. Fumio Machida, who welcomed me into his research group and allowed me to start the thesis work. He has been an example of professionalism and kindness to me.

Thanks to my university companions, in order Dario, Fabrizio, Gaetano, Giovanni, Mario, Pierluca and Raffaele, with whom I shared joys, anxieties, projects and late nights over Na tazzulell ’e café.

Thanks to my die-hard classmates, who after so many years still make the days more cheerful and light. Special thanks to Luigi, Davide, Giuseppe and Emanuele, because with them it is Tutta ’nata storia.

Thanks to Marika, a trusted friend who understands me and is always there, for better or for worse.

Thanks to my family, who allowed me to pursue my dreams by believing in me.

Thanks to Marialuisa, a pure soul who has accompanied me on this journey and has always been by my side. About her, Dubbi non ho.
I hope I have lived up to your esteem and will continue to do so in the future.

Contents

Introduction
1 Accident knowledge in safety domains
  1.1 Safety critical systems
  1.2 Accident investigation
    1.2.1 NTSB investigation process
  1.3 Considerations
2 Assurance case, GSN and patterns
  2.1 Safety case
  2.2 ISO/IEC 15026 Standard: Systems and Software Assurance
    2.2.1 Part 1 & Part 2: Formalization of assurance case concepts and structure
          Guidelines
    2.2.2 Part 3: System Integrity Levels
    2.2.3 Part 4: Assurance in the life cycle
  2.3 Goal Structuring Notation (GSN)
    2.3.1 The description of the assurance case
  2.4 Safety case patterns
    2.4.1 Representation through GSN
    2.4.2 Documentation
  2.5 Related work
    2.5.1 Safety case lifecycle
    2.5.2 Reuse of safety cases
3 ECFMA and Accident Avoidance Pattern: the methodology
  3.1 The methodology
  3.2 Event and Causal Factor Mitigation Analysis (ECFMA)
    3.2.1 The standard ECFA
    3.2.2 The enhanced ECFMA
  3.3 Accident Avoidance Pattern
    3.3.1 Construction and formalization
4 Case studies
  4.1 DART spacecraft collision
    4.1.1 Accident and system role context
    4.1.2 ECFMA analysis
    4.1.3 Assurance case
  4.2 Multistate 911 outage
    4.2.1 Accident and system role context
    4.2.2 ECFMA analysis
    4.2.3 Assurance case
  4.3 Discussion on the methodology
Future work
Conclusion
A Accident Avoidance Pattern formalization

List of Figures

1.1 ECF chart example
2.1 Example of safety case [12]
2.2 Assessment of integrity levels [8]
2.3 GSN basic elements [12]
2.4 GSN example [12]
2.5 Use of Public Indicator
2.6 Example: Functional Decomposition Pattern [14]
2.7 Safety-case lifecycle [17]
2.8 Case-based reasoning for safety case [19]
3.1 The methodology
3.2 ECF chart elements [4]
3.3 ECF analysis [4]
3.4 ECFMA example
3.5 Example: Hazardous Contribution Software Safety Argument [23]
3.6 Example: Hazard Avoidance Pattern [14]
3.7 Six step process
3.8 Accident Avoidance Pattern
4.1 ECFMA chart of DART collision: bird’s view
4.2 ECFMA chart of DART collision: initial events
4.3 ECFMA chart of DART collision: middle events
4.4 ECFMA chart of DART collision: upper conditions
4.5 ECFMA chart of DART collision: final events
4.6 Assurance case of DART accident: bird’s view
4.7 Assurance case of DART accident: top-level claim
4.8 Assurance case of DART accident: first excerpt
4.9 Assurance case of DART accident: second excerpt
4.10 Assurance case of DART accident: third excerpt
4.11 Assurance case of DART accident: fourth excerpt
4.12 Assurance case of DART accident: last excerpt
4.13 Washington NG911 Transition architecture [28]
4.14 ECFMA chart of 911 outage: bird’s view
4.15 ECFMA chart of 911 outage: initial events
4.16 ECFMA chart of 911 outage: accident
4.17 ECFMA chart of 911 outage: post-accident events
4.18 Assurance case of 911 outage: bird’s view
4.19 Assurance case of 911 outage: top-level claim
4.20 Assurance case of 911 outage: first excerpt
4.21 Assurance case of 911 outage: second excerpt
4.22 Assurance case of 911 outage: third excerpt

List of Tables

2.1 Documentation of a safety case pattern, part 1
2.2 Documentation of a safety case pattern, part 2
3.1 Pattern catalogue taken from reference [20]
4.1 DART’s collision: list of identified hazards
4.2 911 outage: list of identified hazards
4.3 DART’s collision: correspondence with recommendations
4.4 911 outage: correspondence with recommendations

Acronyms

DART: Demonstration of Autonomous Rendezvous Technology
ECFA: Event and Causal Factor Analysis
ECFMA: Event and Causal Factor Mitigation Analysis
GSN: Goal Structuring Notation
ISO/IEC 15026: IEEE Standard: Systems and Software Assurance
MIB: Mishap Investigation Board
MUBLCOM: Multiple Paths, Beyond-Line-of-Sight Communications
NASA: National Aeronautics and Space Administration
NG911: Next Generation 911
NTSB: National Transportation Safety Board
PSHSB: Public Safety and Homeland Security Bureau
SIL: Safety Integrity Level

Introduction

In traditional safety critical domains, like avionics, aerospace, automotive and railway, computer systems are used intensively to perform regular operations and accomplish objectives. Moreover, the use of such systems to monitor and manage critical functionalities has become important for other kinds of infrastructure, such as emergency communication networks, gas pipelines and nuclear plants, in which safety must be guaranteed.
System providers need to reduce the risk of system failures as much as possible, since such failures can lead to catastrophic consequences, such as infrastructure damage, injuries and business losses. However, even if engineers follow safety standards and apply proven methodologies during system design, accidents can still happen. Moreover, similar accidents often happen again. It is therefore critical to analyze the events and assess the causes in order to prevent the same accident from recurring.

Whenever a significant accident occurs, the public agencies responsible for safety in that domain and geographic area investigate the mishap to reconstruct its events and causes. The process lasts many months and involves all the relevant stakeholders. As a result of this work, the investigative body releases a final report along with a list of safety recommendations. This list contains guidelines that the involved companies and regulators should apply to mitigate or eliminate the identified hazards. Yet the main problem with such recommendations is that they are released to the stakeholders in an unstructured manner. Although this information would also be useful to other companies working in the same domain, its unstructured form makes it difficult for them to effectively improve their knowledge about similar accidents.

The goal of this thesis is to present a methodology to analyze the mishap starting from the final reports, extract the causes and provide a structured way to present the acquired knowledge. For the accident analysis, we used Event and Causal Factor Analysis (ECFA). It is a tool widely used by investigative agencies to describe events and conditions and identify accident causes.
We have introduced an enhancement that provides a logical relationship between causes and possible solutions, producing the Event and Causal Factor Mitigation Analysis (ECFMA). After the analysis, we have used an “assurance case” to argue that the solutions are adequate to mitigate the discovered hazards. An assurance case is a structured argumentation, supported by a body of evidence, intended to justify a system property. In order to provide an effective, systematic structure, a new assurance case pattern, namely the Accident Avoidance Pattern, has been created to make the accident knowledge explicit through arguments and evidence. This approach allows engineers belonging to another company to reuse the accident knowledge in a more understandable and effective way to improve design and operation.

In order to evaluate the methodology, it has been applied to two case studies from two different domains, aerospace and communication networks.

Part of the thesis was developed during a three-month internship at the NEC Laboratory for Analysis of System Dependability (LASD) in Kawasaki City, Japan, where the author had the opportunity to define the methodology.

The thesis is structured as follows:

Chapter 1 provides an overview of safety critical systems and of accident investigations as performed by most investigative boards, highlighting the limitations of this approach in improving accident knowledge.

Chapter 2 introduces the assurance case and GSN, which are basic preliminary concepts for the proposed methodology. The chapter also describes the use of safety case patterns as a way to repeatedly instantiate a structured and successful argumentation about the safety of a system.

Chapter 3 illustrates the proposed methodology.
Specifically, it shows how to perform the analysis through ECFMA and how to improve the accident knowledge through the use of the Accident Avoidance Pattern.

Chapter 4 describes two case studies used to evaluate the approach. They deal with different domains, aerospace and communication networks, in order to show how the methodology can be applied in different contexts. The last part of this chapter discusses the methodology: we list the advantages and drawbacks of the approach, focusing on the results of the case studies. The discussion closes with a look at future work, listing possible improvements to the approach and the pattern.

Chapter 1
Accident knowledge in safety domains

1.1 Safety critical systems

Computer systems are increasingly employed in contexts where their failure can have catastrophic consequences. Software and hardware systems control several critical infrastructures which, because of their extent and complexity, could not accomplish their tasks without such support.

Traditionally, there are critical domains in which computer systems have always played a fundamental role in controlling the process. In domains such as aviation, aerospace and railway, the computer architecture must be ready to react to unforeseen events and conditions so that a mishap can be avoided. If the system does not react correctly and/or quickly, an accident is very likely to happen.

With the increasing complexity of social infrastructures like nuclear plants, power grids and communication networks, the role of IT systems has become crucial.
Originally, computers were used only to monitor a process, such as the flow of water in a hydroelectric power plant or the temperature in a nuclear plant, while other capabilities such as remote control or automated scheduled procedures were not implemented. Nowadays, these kinds of functionality can also be provided in such infrastructures by integrating IT systems with mechanical or hydraulic systems.

Since these complex systems have a great impact on both people and business processes, it is very important to minimize the probability that a mishap occurs in these environments. A system of this kind is called a safety-critical system. It is defined as “a system whose failure or malfunction can cause damage to people (death or injuries), damage to properties, environmental harm and/or loss of money (direct or indirect)”, while safety is “a measure of the continuous delivery of service free from occurrences of catastrophic failures” [1].

Safety is an internal property of the system, but a completely safe system can never be guaranteed. However, if the risks of catastrophic failures can be controlled and brought within acceptable limits, then such a system can be considered safe. Although several techniques and methodologies are applied to control and limit such risks, accidents still occur, and not so rarely. Some famous examples are the 2011 Fukushima Daiichi nuclear disaster, the 2003 Italy blackout and the 2009 Washington Metro train collision.

1.2 Accident investigation

As soon as an accident occurs, it is very important for system providers to understand what happened in order to improve their safety knowledge and avoid the occurrence of similar accidents. If the mishap is severe enough in terms of damage or injuries, the analysis is conducted through a detailed investigation.
After a severe accident has happened in a certain geographic area, the independent public agency responsible for that area starts an investigation to understand the facts and determine the causes. Examples of public agencies are the American National Transportation Safety Board (NTSB), the European Aviation Safety Agency (EASA) and the Japan Transport Safety Board (JTSB). Some of them, such as the NTSB, investigate several different transport domains (aviation, highway, intermodal, marine, pipeline, railway), while other organizations analyze events that occurred in a specific, related domain (e.g. NASA can carry out an investigation into an accident involving one of its spacecraft). In both cases, however, a final report is published at the end of the investigation to present the findings.

To support their investigations, these agencies use different tools and techniques to assess facts and causes. The first step of the investigation process is usually to understand what happened before the accident. One of the most popular tools for this purpose is Event and Causal Factor Analysis (ECFA). It is a tool widely used by investigative agencies to describe events and conditions and represent accident causes. It is adopted and described by the U.S. Department of Energy (DOE) in its handbook as the first stage of the accident investigation [4]. Figure 1.1 provides the basic structure of an ECF chart, which will be presented further in Chapter 3.

This represents only the first step of an investigation, in which the different pieces of evidence are structured and organized. The overall process is quite long and is composed of several different accident analyses.
The role of ECFA in the investigation is to assess the timeline of events, together with the conditions in place at the moment of the accident. After this, the ECF chart is updated by the subsequent accident analyses, whose goal is to assess the causes. Once the causes have been identified, they are connected to the ECF chart’s elements.

Figure 1.1: ECF chart example

1.2.1 NTSB investigation process

The investigation process performed by most public agencies and organizations is basically the same. We therefore describe the process used by the NTSB, because it is considered “the most important independent safety investigative authority in the world” and “the international standard” for accident investigations [3]. The Board has investigated approximately 124,000 aviation accidents and 10,000 surface transportation accidents since its inception in 1967.

From reference [2], “the National Transportation Safety Board was established in 1967 to conduct independent investigations of all civil aviation accidents in the United States and major accidents in the other modes of transportation. It is not part of the U.S. Department of Transportation, nor affiliated with any of DOT’s modal agencies, including the Federal Aviation Administration (FAA). The Safety Board investigations focus only on improving transportation safety”.

Within the first hours after the notification of a severe accident, the NTSB forms a Go Team that heads to the accident scene as quickly as possible to begin the investigation. The Go Team is composed of different specialists, each responsible for a clearly defined portion of the accident investigation.
The NTSB has published the list of specialties involved in an investigation in the aviation domain [2]:

• Operations: description of the narrative of the flight and the crew members’ duties.
• Structures: documentation of the airframe wreckage and the accident scene.
• Power plants: examination of engines (and propellers) and engine accessories.
• Systems: study of components of the plane’s hydraulic, electrical, pneumatic and associated systems, together with instruments and elements of the flight control system.
• Air Traffic Control: reconstruction of the air traffic services given to the plane, including acquisition of ATC radar data and transcripts of radio transmissions.
• Weather: collection of weather data for a broad area around the accident scene.
• Human Performance: study of crew performance and all before-the-accident factors that might be involved in human error.
• Survival Factors: documentation of impact forces and injuries, evacuation, community emergency planning and all crash-fire-rescue efforts.

Under the direction of the Investigator-in-Charge, each of these NTSB investigators heads a “working group” in one area of expertise. The groups are staffed by representatives of the “parties” to the investigation (specifically, the Federal Aviation Administration, the airline, the pilots’ and flight attendants’ unions, and the airframe and engine manufacturers). The NTSB designates other organizations or corporations as parties to the investigation.
Other than the FAA, which by law is automatically designated a party, the NTSB has complete discretion over which organizations it designates as parties to the investigation: “only those organizations or corporations that can provide expertise to the investigation are granted party status and only those persons who can provide the Board with needed technical or specialized expertise are permitted to serve on the investigation” [2].

The investigation lasts many months and involves all the relevant entities (companies, regulators, emergency bodies, etc.) in order to reconstruct the events and assess the possible causes. At the end of this process, the descriptions of facts and the analyses are summarized in a draft final report by the Safety Board staff. Once a major report is adopted, an abstract of that report - containing the Board’s conclusions, probable cause and safety recommendations - is published.

One of the results of the investigation is the list of safety recommendations. These are guidelines and advice that stakeholders should implement and address in order to avoid a similar accident. Recommendations “usually address a specific issue uncovered during an investigation or study and specify how to correct the situation. Letters containing the recommendations are sent to the organization best able to address the safety issue, either a public or a private one” [2]. In fact, they usually refer to underlying problems and organizational deficiencies, while technical problems are only indicated in the analysis without being addressed by a recommendation.

1.3 Considerations

After the investigation, a source of information is available: the accident knowledge. Such experience is valuable not only for the stakeholders but also for third-party companies, which can effectively use the lessons learned.
Although the recommendations provided by the report are relevant for third-party engineers, these engineers need to contextualize the guidelines before they can effectively apply them to their own systems. Moreover, as stated before, technical problems are not addressed by recommendations, even when they are described in detail or at least mentioned in the description of facts or in the post-failure analysis.

For example, this is a recommendation from the final report issued by the Dutch Safety Board for an aircraft crash that occurred in February 2009 in the Netherlands: “The FAA (Federal Aviation Administration) and EASA (European Aviation Safety Agency) should ensure that the undesirable response of the autothrottle and flight management computer caused by incorrect radio altimeter values is evaluated and that the autothrottle and flight management computer is improved in accordance with the design specifications” [5]. However, the report contains neither descriptions of technical solutions nor references to them.

Our goal is to reuse all the knowledge gained from an accident, considering both technical and operational inadequacies, and to structure the information in an effective way. For this purpose, we have developed a methodology that combines ECFA, as a standard way to reconstruct the events and identify the causal factors, with the assurance case, as a way to provide a structured argumentation for a system’s property [6]. ECFA has been improved so that it can be used not only as an accident causation model, but also as a guide to identify possible solutions directly connected to the causal factors. Regarding the assurance case, a new pattern has been developed to provide a recurring way to structure the accident knowledge identified in the previous step.
Chapter 2
Assurance case, GSN and patterns

In this chapter we present the context of assurance cases, providing the preliminary background needed for the proposed methodology. The discussion also covers a graphical way to represent the assurance case, namely the Goal Structuring Notation, and a way to reproduce the basic structure of an argument as a template.

2.1 Safety case

The idea behind the assurance case is not new in industry. As the complexity of critical systems increases, it has become important to assess the safety of this kind of system. Originally, safety cases were widely used as a way to demonstrate that a system is acceptably safe. A safety case is a structured argumentation composed of claims and supported by evidence. The term originates from the HSE (Health and Safety Executive) in the UK, but it has been widely accepted in different critical domains as a certification tool. Figure 2.1 shows an example of a safety case, provided in [12] by Origin Consulting (York) Limited on behalf of the Contributors.

Figure 2.1: Example of safety case [12]

The safety case has long been a de facto industry standard for the certification of safety critical systems. Since the structured argumentation supporting the safety case has turned out to be persuasive, the idea arose to generalize this artefact to argue about different system properties. The concept can be used to assure that any kind of claim is true, provided there is a convincing argument to support it.
2.2 ISO/IEC 15026 Standard: Systems and Software Assurance

Starting from this point, in 2011 the IEEE Software and Systems Engineering Standards Committee (S2ESC) undertook a long-term program to harmonize its standards with those of ISO/IEC JTC 1/SC 7, the international standards committee for software and systems engineering. The goal of the committee’s work was to define and organize a set of concepts and relationships in order to establish a basis for shared understanding of assurance across user communities. The final result has been the development of an IEEE Standard, ISO/IEC 15026 - Systems and Software Engineering - Systems and Software Assurance.

The ISO/IEC 15026 standard consists of the following parts:

• Part 1: Concepts and vocabulary [6]
• Part 2: Assurance case [7]
• Part 3: System integrity levels [8]
• Part 4: Assurance in the life cycle [9]

Each document was first issued as a draft and was later accepted and adopted as an IEEE standard. The final assessment of the documents was completed in November 2014. We present below an overview of the contents of each part.

2.2.1 Part 1 & Part 2: Formalization of assurance case concepts and structure

Part 1 introduces the basic concepts and definitions of assurance-related terms. From reference [6], the assurance case is a “reasoned, auditable artifact created which supports the contention that its top-level claim (or set of claims) is satisfied, including systematic argumentation and its underlying evidence and explicit assumptions which support the claim(s)”.
An assurance case contains the following elements and their relationships:

• one or more claims about properties;
• arguments that logically relate the evidence and any assumptions to the claim(s);
• a body of evidence and possibly assumptions supporting these arguments for the claim(s);
• justification of the choice of top-level claim and the method of reasoning.

Assurance is defined as “grounds for justified confidence that a claim has been or will be achieved”.

Part 2 describes the minimum requirements for the structure and contents of an assurance case, in order to improve the consistency and comparability of assurance cases and to facilitate stakeholder communications and engineering decisions.

The following elements represent the structure of an argumentation according to the ISO/IEC 15026 standard. The definitions are taken from both Part 1 and Part 2:

• claim: “a true-false statement about the limitations on the values of an unambiguously defined property - called the claim’s property - and limitations on the uncertainty of the property’s values falling within these limitations during the claim’s duration of applicability under stated conditions”
• justification: “a reason why a claim has been chosen” (e.g. result of risk assessments, result of requirements analysis, explanations)
• assumption: “a proposition without any reason why it is true”
• evidence: “a fact, datum or object” supporting a claim (e.g. documents, test results, measurement results, process, product)
• argument: “a reason why a claim is true”. An argument is used to show how the components directly underlying it, such as claims and evidence, are related to a claim or a set of claims. It can use different methods of reasoning:
  – quantitative (deterministic - e.g. formal proof; non deterministic - e.g.
probabilistic, game theoretic, fuzzy sets)
– qualitative (e.g. staff performance evaluation, court judgements, qualitative statements of event causality)

Formally, we can define the assurance case as a quadruple of a claim $c$, a justification $j$ of $c$, a set $es$ of evidence and an argument $g$ which assures $c$ using $es$. We can also provide a recursive definition of the assurance case, where $\times$ is the direct product:

\[ A_0 = C \times \{ j_0 \in J(c_0) \mid c_0 \in C \} \times \phi_f(E) \times \{ g_0 \in G(c_0, es_0) \mid c_0 \in C,\ es_0 \in \phi_f(E) \} \tag{2.1} \]

Given $A_0$, the set $A$ of assurance cases and the set $E$ of evidence are defined as follows:

\[ A = \{ (c, j, es, g) \in A_0 \mid j \in J(c),\ g \in G(c, es) \} \tag{2.2} \]

\[ E = F + D + O + C + A \tag{2.3} \]

where $J(c)$ is the set of all the justifications for a claim $c$; $C$ is the set of claims; $\phi_f(E)$ is the set of all the finite subsets of $E$; $G(c_0, es_0)$ is the set of arguments which assure a claim $c_0$ using a set $es_0$ of evidence; $F$ is the set of facts; $D$ is the set of data; $O$ is the set of objects. Since $C$ and $A$ appear in (2.3), claims and assurance cases can themselves be used as evidence, which is what makes the definition recursive.

Guidelines

In order to be accepted as an effective argumentation, the assurance case needs to follow some rules:
• The components of an assurance case shall be unambiguous, identifiable, and accessible
• An assurance case shall have one top-level claim that is the ultimate goal of its argumentation
• An argument shall be supported by one or more claims, evidence, or assumptions
• A claim shall be supported either by just one argument, or by one or more claims, evidence, or assumptions.
Therefore, a claim is never a bottom element of an assurance case
• A claim, evidence, or assumption shall not support itself either directly or indirectly
• A top-level claim shall have a justification for its choice
• If an assumption is partially warranted or contradicted by evidence, this evidence shall be associated with it
• If an assurance case incorporates another assurance case, the incorporated assurance case’s top-level claim shall be placed within the original assurance case’s structure at points where the claim is allowed
• Each item of evidence should be uniquely identified (so that arguments can reference it unambiguously), verifiable and auditable

2.2.2 Part 3: System Integrity Levels

Part 3 of the standard specifies the concept of system integrity level as a way to reach an agreement among stakeholders about the achievement of an objective. The most common use is to assure that the system or product has property values which limit related risks during operations. Integrity levels and the standards utilizing them have a significant history, especially in safety. A previous standard about Safety Integrity Levels (SIL) exists: IEC 61508, “Functional safety of electrical/electronic/programmable electronic safety-related systems”, defines functional safety as the part of the overall safety which depends on a system or equipment operating correctly in response to its inputs. In particular, this standard defines four SILs according to the average probability of a failure, where SIL 1 corresponds to the highest range of probability, namely the least critical level. The ISO/IEC 15026 standard provides the basis for a generalized concept of system integrity level which can be applied not only to safety but also to other system properties such as reliability, maintainability and security.
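To make the relation between failure probability and SIL concrete, the following minimal sketch maps an average probability of failure on demand to one of the four levels. The band limits are the low-demand-mode figures commonly quoted for IEC 61508; they are not given in this thesis and are an assumption to be checked against the standard itself. The names `SIL_BANDS` and `sil_for_pfd` are hypothetical.

```python
from typing import Optional

# Assumed low-demand-mode bands (average probability of failure on demand).
# Each entry: (SIL, inclusive lower bound, exclusive upper bound).
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def sil_for_pfd(pfd_avg: float) -> Optional[int]:
    """Return the SIL whose band contains pfd_avg, or None if no band matches."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


print(sil_for_pfd(5e-2))  # highest probability band, least critical level
print(sil_for_pfd(5e-5))  # lowest probability band, most critical level
```

The sketch only illustrates the ordering stated above: the higher the tolerated failure probability, the lower the level. The generalized integrity levels of ISO/IEC 15026 are defined through claims rather than through fixed numeric bands.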
According to the standard, an integrity level is a claim that “includes limitations on a property’s values, the claim’s scope of applicability, and the allowable uncertainty regarding the claim’s achievement”. An integrity level requirement is “a set of specified requirements imposed on aspects related to a system, product or element and associated activities in order to show the achievement of the assigned integrity level. This includes the evidence to be obtained”. The assessment of the integrity level of the system, and of the elements composing it, is based on risk analysis results and system decomposition. Given that the set of integrity levels is used correctly and that the integrity level claim concerning the system or product operations is true, the applicable risks are limited or managed acceptably. Figure 2.2 shows an overview of the process for determining integrity levels.

Figure 2.2: Assessment of integrity levels [8]

In order to show conformance to the integrity levels defined in this standard, documentation shall exist that is accurate, available as required, controlled, traceable, and reviewable.

2.2.3 Part 4: Assurance in the life cycle

Part 4 specifies how an assurance case can be used during the system life cycle. Moreover, it presents a property-independent list of processes, activities and tasks for achieving the claim about the critical system property and for showing that the claim has been achieved.
The three main uses of an assurance case are summarized here:
• for an agreement: a supplier needs to show an acquirer that an assurance claim about the values of a critical property of the system or software product has been achieved. The agreement might be either a written contract or a verbal communication.
• for regulation: an authoritative body can use the assurance case developed by the provider to verify whether some critical system properties have been correctly and/or accurately implemented. The need for such regulation can arise to certify a critical property of a system or software product.
• for development: the assurance case can be used as an internal asset by engineers and developers to verify whether some objectives have been accomplished at a certain stage of the system life cycle.

As will be shown in the chapter about the Accident Avoidance Pattern, the proposed assurance case can be used either during an agreement between acquirer and provider or during the development of the system. Part 4 then lists the activities and tasks which require the use and interpretation of an assurance case when a system property or an integrity level is to be assured during the life cycle. The following is the list of all the cited processes:
• Acquisition
• Supply
• Project planning
• Decision management
• Risk management
• Configuration management
• Information management
• Stakeholder requirements definition
• Requirements analysis
• Verification
• Operation
• Maintenance

For each process, guidance about the activities and considerations to be performed is thoroughly described. However, since this part is out of scope for this thesis, it is not investigated further.
2.3 Goal Structuring Notation (GSN)

The assurance case provides a structured argument that, if sufficiently convincing, is an effective way to gain assurance among the stakeholders. However, it is usually difficult to follow an argumentation just by reading the textual inferences deriving from claims, arguments and evidence. Readers can be confused by the statements and may reject the argumentation even if it is actually sufficient. For this reason, a graphical notation lets users see how the elements of the case are connected. Naturally, a physical connection in the diagram denotes a logical relationship. As described in the previous chapter, safety cases have been widely adopted as a certification tool as well as an artifact to support the development process. The need for a graphical notation had already arisen to describe the safety case. Since the assurance case generalizes the safety case, the same graphical model can be applied to both.

Goal Structuring Notation (GSN) is a graphical argument notation which can be used to document explicitly the elements of an argument and the relationships that exist between these elements. GSN originated at the University of York in the early 1990s as part of the ASAM-II project [10], and has undergone significant development and refinement since then. The early development of GSN was heavily influenced by Toulmin’s work on argumentation [11]. Later, in his PhD work, Kelly added features to GSN in order to support the reuse of safety case patterns [14]. With the increasing popularity of GSN for representing safety cases, industries and organizations created the GSN Community with the aim of providing clear guidance on the use of the notation.
The standard was developed between 2007 and 2011, when version 1 was published [12].

2.3.1 The description of the assurance case

The purpose of GSN is to highlight how claims are supported by sub-claims and evidence through the use of argument elements. Using the Goal Structuring Notation, we can find a one-to-one correspondence between assurance case components and GSN elements. In GSN, the claims of the argument are documented as goals, the arguments are called strategies and the items used as evidence are documented as solutions. For assumptions and justifications, elements with the same name exist. Moreover, a context element is used to explain the context in which the claim or the argument should be interpreted. Figure 2.3 illustrates the basic GSN elements.

Figure 2.3: GSN basic elements [12]

The GSN standard draws claims (goals) as rectangles, contexts as rounded rectangles, arguments (strategies) as parallelograms, evidence (solutions) as circles, and assumptions and justifications as ovals, with a letter to distinguish between them (A and J, respectively). A connection to claims, arguments and evidence is represented by a black (solid) arrow, while a connection to assumptions, justifications and contexts is drawn with a white (hollow) arrow. This is the basic structure provided by GSN. Moreover, it is possible to use additional features to improve the explanation of the assurance case. Specifically, these extensions are useful to support argument patterns, as will be shown later.
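The basic structure described above can be sketched as a small data model. This is a minimal illustration, not part of the GSN standard: the class `GoalStructure`, its method names and the example identifiers (`G1`, `S1`, ...) are hypothetical. It encodes the two arrow types by the kind of element they point to, and checks the guideline from Section 2.2.1 that no element may support itself directly or indirectly.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Element kinds, named after the GSN terms used in the text.
ARGUMENT_KINDS = {"goal", "strategy", "solution"}            # reached by black arrows
CONTEXT_KINDS = {"context", "assumption", "justification"}   # reached by white arrows


@dataclass
class GoalStructure:
    kinds: Dict[str, str] = field(default_factory=dict)               # id -> kind
    supported_by: Dict[str, List[str]] = field(default_factory=dict)  # black arrows
    in_context_of: Dict[str, List[str]] = field(default_factory=dict) # white arrows

    def add(self, elem_id: str, kind: str) -> None:
        if kind not in ARGUMENT_KINDS | CONTEXT_KINDS:
            raise ValueError(f"unknown element kind: {kind}")
        self.kinds[elem_id] = kind

    def support(self, parent: str, child: str) -> None:
        """Black arrow: the child must be a goal, strategy or solution."""
        if self.kinds[child] not in ARGUMENT_KINDS:
            raise ValueError(f"{child} cannot be reached by a black arrow")
        self.supported_by.setdefault(parent, []).append(child)

    def contextualise(self, parent: str, child: str) -> None:
        """White arrow: the child must be a context, assumption or justification."""
        if self.kinds[child] not in CONTEXT_KINDS:
            raise ValueError(f"{child} cannot be reached by a white arrow")
        self.in_context_of.setdefault(parent, []).append(child)

    def has_cycle(self) -> bool:
        """True if some element supports itself directly or indirectly."""
        state: Dict[str, str] = {}  # id -> "visiting" or "done"

        def visit(node: str) -> bool:
            if state.get(node) == "visiting":
                return True
            if state.get(node) == "done":
                return False
            state[node] = "visiting"
            found = any(visit(c) for c in self.supported_by.get(node, []))
            state[node] = "done"
            return found

        return any(visit(node) for node in self.kinds)


# Hypothetical example: one top-level goal argued through a strategy.
gs = GoalStructure()
for elem_id, kind in [("G1", "goal"), ("C1", "context"), ("S1", "strategy"),
                      ("G2", "goal"), ("Sn1", "solution")]:
    gs.add(elem_id, kind)
gs.contextualise("G1", "C1")
gs.support("G1", "S1")
gs.support("S1", "G2")
gs.support("G2", "Sn1")
print(gs.has_cycle())  # prints False
```

The sketch checks only the kind of the element an arrow points to, as described in the text; a fuller model would also restrict which elements may be the source of each arrow.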
Figure 2.4 shows an example of an assurance case described with GSN.

Figure 2.4: GSN example [12]

First, the notation supports both the multiplicity and the optionality of the elements. In the example, “Claim 1” needs to be supported by at least one of the two items of evidence. In a pattern, this is very important for generalizing the use of multiple similar claims supporting an argument or another claim.

Secondly, GSN can represent abstract entities. An uninstantiated entity should be used when, at some later stage, the abstract entity needs to be replaced with a more concrete instance, while an undeveloped entity is used when the element still requires further development (possibly together with instantiation). These features also fit the purpose of patterns, because most of the elements in a pattern are represented as abstract entities. In Figure 2.4, “Evidence 3” is provided to support “Claim 3” but needs to be instantiated in order to support it concretely, while “Claim 2” is a statement supporting the argumentation that needs to be developed through a further argument or evidence to be effective.

Last, GSN supports a modular extension that allows multiple assurance cases to be interconnected. This feature can also be used when a large assurance case needs to be split into several smaller parts that are easier to explain separately. Figure 2.5 shows the use of the public indicator, which allows any kind of element to be referenced as an away element. In this case, a goal element is made public.
Figure 2.5: Use of Public Indicator

As shown in this chapter, the reasoned and compelling argument provided by the assurance case is well supported by the Goal Structuring Notation. GSN also allows the construction of complex argumentations that can be used for different purposes and in different phases of the system life cycle.

2.4 Safety case patterns

As discussed in the previous sections, the purpose of a safety case is to demonstrate that a system is sufficiently safe to operate. The argumentation is provided through claims and evidence related to the specific system under discussion. However, since many systems need to be certified with respect to the same critical functionalities, the safety of these systems can be assured by similar safety cases. Of course, the safety cases are not exactly the same: for example, two of them can argue over different requirements, or they can use as evidence similar results provided by two different analyses (e.g. fault tree or reliability block diagram). Nevertheless, it is valuable that a reproducible pattern emerges from the argumentation of similar specific safety cases.

The concept of pattern is well known in many contexts as a general, reusable solution to a commonly occurring problem. Regarding design patterns, later taken up by software engineering, the architect Christopher Alexander states that “each pattern describes a problem that occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice” [13]. In industry, the concept of safety case pattern has arisen whenever engineers needed to certify their systems with similar argumentations.
For example, the same argument over the satisfaction of all the system requirements could be developed and specified for two different railway systems. Moreover, if the pattern is sufficiently generic, it might be used in different domains by instantiating it according to the standards used in the domain concerned. However, this reusable structure was shared only within a single company, or at most within a specific domain if regulators requested the same safety case. Other organizations could not access this knowledge for their purposes.

Kelly was the first to assess and organize the concepts behind safety case patterns [14]. He defined the safety case pattern as “a means of documenting and reusing successful safety argument structures” described by a graphical notation, such as GSN. This work was also very important for the creation of the GSN standard, since many topics about safety cases and GSN were developed in it. The two most important features defined in [14] are the abstract representation of a generalized safety argument and the formalized documentation of a safety case pattern.

2.4.1 Representation through GSN

The first step in the creation of a safety case pattern is its representation. As described in the previous section, GSN has been extended by Kelly’s work with features, such as multiplicity and undeveloped entities, that support abstraction and modularity. Specifically, we can identify structural abstraction and entity abstraction, where the former allows generalization of the structure of an argument while the latter allows generalization of an element in the structure (claim, context, evidence, justification). This is the starting point for instantiating many times a structure which can be specialized and extended depending on system and context.
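The two kinds of abstraction can be illustrated with a minimal sketch, loosely based on the requirements-satisfaction argument for two railway systems mentioned above. Everything in it is hypothetical: `GoalTemplate`, `instantiate`, the `{item}` placeholder and the system and requirement names are invented for illustration and are not Kelly's notation. Placeholders in braces stand for uninstantiated entities (entity abstraction), while `repeat_over` stands for a multiplicity link (structural abstraction).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GoalTemplate:
    text: str              # may contain {placeholders}, i.e. uninstantiated parts
    repeat_over: str = ""  # name of a list-valued parameter (multiplicity), or ""


# Hypothetical pattern: one top-level goal plus one sub-goal per requirement.
PATTERN = [
    GoalTemplate("{system} satisfies all its safety requirements"),
    GoalTemplate("{system} satisfies requirement {item}", repeat_over="requirements"),
]


def instantiate(templates: List[GoalTemplate], params: Dict[str, object]) -> List[str]:
    """Fill the placeholders and expand each multiplicity link once per list item."""
    goals: List[str] = []
    for template in templates:
        if template.repeat_over:
            for value in params[template.repeat_over]:
                goals.append(template.text.format(**params, item=value))
        else:
            goals.append(template.text.format(**params))
    return goals


# The same pattern instantiated for two different (invented) systems.
for name, reqs in [("Railway system A", ["R1", "R2"]), ("Railway system B", ["R7"])]:
    for goal in instantiate(PATTERN, {"system": name, "requirements": reqs}):
        print(goal)
```

The structure of the argument is written once in `PATTERN` and preserved across instantiations; only the parameters change between systems.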
Figure 2.6 shows a safety case pattern, the “Functional Decomposition Pattern”, described using GSN.

Figure 2.6: Example: Functional Decomposition Pattern [14]

This pattern argues about the system’s safety by claiming that each implemented function is safe and that there are no hazardous interactions between functions. As shown in the figure, the pattern uses a multiplicity arrow to merge the claims about the safety of the individual functions into a single claim; moreover, the abstract representation of two claims and one context element is supported by “undeveloped” and “uninstantiated” elements. This kind of pattern represents the initial part of a safety case, so it needs to be extended and specialized by developing and instantiating the elements described above. The important point is that the main structure is preserved and can be reused.

2.4.2 Documentation

The second step in the creation of a safety case pattern is its documentation. Representation is important to describe graphically the structure of the safety case and the relationships between its elements. However, it is difficult for engineers or other stakeholders to pick a pattern from a catalogue just by looking at the graphical representation. Information such as the name, the context of applicability and examples of use is fundamental to apply the pattern correctly. Starting from the work of the Gang of Four on the documentation of design patterns [15], Kelly defined the format for the description of safety case patterns. The pattern format is first described in [14], and has then been summarized in [16].
The following table is adapted from reference [16], showing the fields and the related descriptions.

Pattern Name and Classification: The pattern’s name should convey the essence of the pattern succinctly. A good name is vital because, with use, it will become part of your design vocabulary.
Intent: A short statement that answers the following questions: What does the pattern do / represent? What particular safety issue / requirement / process does it address?
Also Known As: Other well known names for the pattern, if any.
Motivation: A scenario that illustrates a safety issue / process and how the elements of the goal structure solve the problem. The scenario will help you understand the more abstract description of the pattern that follows.
Structure: A graphical representation of the pattern using the extended form of the goal structuring notation. The representation can describe a product or a process style goal structure. Where the structure indicates generality or optionality, it should be clear how the pattern can be instantiated.
Participants: The elements of the goal structure and their function in the pattern.
Collaborations: How the participants collaborate to carry out the function of the pattern.

Table 2.1: Documentation of a safety case pattern, part 1

Applicability (Necessary Context): What are the situations in which the safety case pattern can be applied? What information is required as context for the pattern to be successful (necessary inputs to the pattern)?
Consequences: How does the pattern support its objectives? What are the trade-offs and results of using the pattern?
Implementation: What pitfalls, hints or techniques should you be aware of when using the pattern? What degrees of flexibility are there in following the pattern?
Examples: Safety case sample that illustrates the instantiation of the pattern.
Known Uses: Examples of the pattern’s application in existing safety documentation should be cited. If possible, examples from two different applications should be shown.
Related Patterns: Safety case patterns that are related to this pattern, e.g. with the same motivation but different applicability conditions (e.g. different standards, different systems). For a process-oriented pattern, related product (argument) patterns. For a product-oriented pattern, related process patterns.

Table 2.2: Documentation of a safety case pattern, part 2

As stated before, many fields of this format have been adapted from the one used to define design patterns. In his work, Kelly formalized a structure that many industries already used as an internal asset. In this way, whoever needs to instantiate a safety case template can search a catalogue and choose the most appropriate one for their purposes.

2.5 Related work

2.5.1 Safety case lifecycle

In the research area of safety assurance, some works have addressed approaches to reuse information from safety cases. The concept of safety case lifecycle has turned out to be important when engineers need to reassess the safety of a system involved in a failure. A mishap or an accident is evidence that the system was not as safe as the associated safety case assured. Engineers therefore need to determine why the safety case was not correct and how they can improve it. The first main work dealing with this topic was produced by the research group of Prof. John Knight.
The authors in [17] have proposed a framework to guide the failure analysis and the development of lessons and recommendations, starting from a pre-failure safety case and producing an enhanced post-failure safety case. The conceptual schema of the framework is shown in Figure 2.7.

Figure 2.7: Safety-case lifecycle [17]

The inputs of the failure analysis are the original safety case, which has proved to be ineffective and needs to be improved, and the failure evidence. At the end of this process, both an updated safety case and a list of lessons and recommendations are produced to be used in the system revision. The details of how the failure analysis should be performed are provided in a subsequent work, in which a taxonomy of safety-argument fallacies is defined as well [18].

The presented lifecycle has paved the way for the structuring of accident knowledge. However, since it needs a pre-failure safety case to revise, this approach cannot always be applied, because this condition may not be satisfied. In fact, it is possible that a pre-failure safety case is not available, because the system concerned is too old to be documented by a safety case or because it has been upgraded too many times, with studies that are poorly related and documented.

2.5.2 Reuse of safety cases

Another research work related to the reuse of safety cases is presented in reference [19]. In this work, the authors have developed a strategy to create safety cases by retrieving and reusing previous similar artifacts. The approach is based on the concept of case-based reasoning, in which the process of solving new problems is based on the solutions of similar past problems.
Figure 2.8 summarizes the approach.

Figure 2.8: Case-based reasoning for safety case [19]

The proposed methodology starts with a new case that needs to be processed and stored in a case repository. The process includes the case description, which is used to make the case retrievable from the repository whenever a user is looking for it. Once a case has been picked from the repository, a user can revise it depending on its future use; then, the validated solution can be stored in the repository.

However, in this approach the knowledge is embodied in previous safety cases, and it is not related to the accident experience. Moreover, it assumes that at least one safety case is available to be reused, while we consider the situation in which no pre-failure safety cases are available.

Chapter 3 ECFMA and Accident Avoidance Pattern: the methodology

In this chapter we present the proposed methodology to improve accident knowledge in safety critical domains. In the first section we describe the phases of the approach in general, showing the inputs, the tools and the outputs. The second section details the accident causation model ECFA and our enhanced ECFMA. In the third section we discuss existing safety case patterns and the motivations which have led us to the development of a new reusable structure, the Accident Avoidance Pattern.

3.1 The methodology

The research work closest to our methodology is the safety-case lifecycle developed by Prof. Knight and his group.
As described in the previous chapter, in their works the authors use both the pre-failure safety case and the information from the accident to perform a failure analysis that generates lessons and recommendations and a more accurate post-failure safety case [17][18]. Our approach differs because it starts from a different assumption, namely that a pre-failure safety case is not available. At least three reasons support this assumption:
• the system could be too old, and many upgrades may have been applied during its life cycle, with studies and assessments that are poorly related to each other;
• another company, commissioned to improve the current system, may not be able to access the previous information because of confidentiality or unavailability of data;
• the system could be too complex to be completely assured by a safety case in all of its parts or functions.

In [18] the author confirms the lack of documented safety arguments in most digital systems, and proposes a way to derive them retroactively. We assume instead that it is quite difficult to reconstruct a safety case just from the observation of the system. Moreover, since the community has worked on a standard for assurance cases by generalizing the concept of safety case, our approach can be applied not only to safety critical systems, but also to domains where other system properties are relevant (e.g. availability, reliability or maintainability).

The proposed approach is summarized in Figure 3.1.

Figure 3.1: The methodology

The failure analysis is conducted using the reports published by public investigative agencies. From these documents, not only the agencies’ lessons and recommendations but also the events and system descriptions are used to determine the accident causes.
Event and Causal Factor Mitigation Analysis (ECFMA), an enhanced version of the widely adopted ECFA, is used to reconstruct the events, discover the causes (root, direct and contributory causes) and provide possible solutions. After the analysis, a post-failure assurance case is developed directly from the analysis outcomes through the instantiation of the Accident Avoidance Pattern, which argues how the identified problems can be mitigated by the provided solutions. The following sections illustrate how the two techniques are used in the methodology.

3.2 Event and Causal Factor Mitigation Analysis (ECFMA)

3.2.1 The standard ECFA

The first step of our methodology is the reconstruction of events and conditions and the determination of causes and solutions through the use of ECFMA. In order to present this tool, it is better to first illustrate the standard technique, ECFA. As described in the first chapter, Event and Causal Factor Analysis is widely used to describe the events leading to an accident in order to relate conditions and causal factors. An example of an ECF chart is provided in Figure 1.1 of the first chapter. In this section, we describe how this analysis is performed.

The ECF chart is a flow chart with the events and decisions plotted on a timeline. As the event timeline is established, the related conditions and information are linked to the events and decisions. Understanding why workers did what they did, and why their decisions and actions made sense to them, is an essential goal of the accident investigation. Figure 3.2 shows the basic elements used in an ECF chart.
Figure 3.2: ECF chart elements [4]

The rectangular element is the description of an event or a decision; if available, it also contains time information. In order to reconstruct the facts, events are connected together with straight arrows. It is also possible to create a secondary chain to describe a sequence of events that has led to a critical event, which represents the beginning of a primary chain leading to the accident. The primary chain usually ends with the “accident” element; if the goal is to describe a lack of mitigation after the accident, this element will be at the beginning or in the middle of the chain of events.

After the reconstruction of the events, investigators usually try to understand why a certain event has happened, e.g. why a technician has not closed a valve to mitigate a leak, why an alarm has not been activated, or why a system has provided incorrect values. Whenever they identify conditions that were in place at the time of an event, they connect the oval “condition” element to the event concerned through a dotted arrow. A more complex structure of conditions can also be attached in order to determine a “chain of conditions”.

The ECF analysis is performed by reconstructing all the conditions in place during the events. Investigators need to ask why a given condition was not adequate for the situation. At the end of this process, they should find the particular condition that originated the unsafe situation. This is a causal factor of the accident; in ECFA it is represented by a hexagonal element attached to events or conditions. Figure 3.3 describes the actions performed to determine the causal factors. The figure is taken from the DOE handbook [4].
Figure 3.3: ECF analysis [4]

In the determination of causal factors, we can distinguish three kinds of causes, all of them represented by the same hexagon in ECFA:
• direct cause: it is the immediate event or condition that has caused the accident. Typically, the direct cause of the accident may be derived from the immediate, proximate events and conditions close to the accident;
• root cause: it is the causal factor that, if corrected, would prevent recurrence of the same or similar accidents. A root cause may be derived from several contributing unsafe conditions. It is a higher-order, fundamental causal factor which addresses classes of deficiencies, rather than single problems or faults;
• contributory cause: it is an event or condition that, together with other causes, increased the likelihood of an accident but that individually did not cause the accident itself. It also represents an event or condition which has not mitigated the unsafe chain of events leading to the accident.

For instance, in a power plant the direct cause of a blackout can be the failure of a reactive power control, which has determined the outage. However, the failure has not been prevented by the management of the power control, which may represent the root cause of this accident. Finally, the blackout has not been mitigated by the redundant power control, which has turned out to be a contributory cause.

3.2.2 The enhanced ECFMA

ECFA is useful to illustrate the events and the conditions in place at the time they occurred. From this chart, the analysis is performed to find the causal factors that have generated the accident.
ECFA is basically an accident causation model, a diagram used to assess how and why an accident has happened. It therefore provides no information about possible countermeasures that would avoid or mitigate the accident, because its original purpose is just to analyze the causes. However, the simple and straightforward structure of the ECF chart can also be used to provide a clear depiction of a sequence of events. In our case, it can be used after the investigation process itself to represent graphically the textual narrative of the events written in the final reports published by investigative agencies. Moreover, if a possible solution for a discovered problem is already available, a user can easily relate it to the problem and provide an enhanced depiction of what happened and how it could have been avoided. Usually this information is contained in the final reports, in which countermeasures to direct causes, solutions to contributory causes and recommendations for root causes are described or listed.

The enhancement to ECFA consists in the introduction of a new element, not related to the field of accident causation models: the solution element. Figure 3.4 shows an example of ECFMA, with the use of the “solution” element.

Figure 3.4: ECFMA example

As shown in the figure, the solution is introduced directly in the ECF chart by attaching it to the related causal factor. In this way, an engineer can take into account both causes and solutions before demonstrating that such a solution is actually effective. This is an intermediate step towards the construction of the assurance case. In fact, as will be shown in the next section, the claims of the assurance case are directly derived from these two elements.
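The structure described above can be sketched as a small data model. This is only an illustrative sketch, not the notation used in the thesis: the class names (`Event`, `Condition`, `CausalFactor`, `CauseType`) and the helper `unresolved_factors` are assumptions introduced here, and the example reuses the power plant scenario of Section 3.2.1 with one solution that is purely hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CauseType(Enum):
    DIRECT = "direct cause"
    ROOT = "root cause"
    CONTRIBUTORY = "contributory cause"


@dataclass
class CausalFactor:
    """Hexagon element; in ECFMA it can carry attached solutions."""
    description: str
    cause_type: CauseType
    solutions: List[str] = field(default_factory=list)  # the ECFMA addition


@dataclass
class Condition:
    """Oval element attached to an event."""
    description: str
    causal_factors: List[CausalFactor] = field(default_factory=list)


@dataclass
class Event:
    """Rectangle element: an event or decision, optionally timestamped."""
    description: str
    time: Optional[str] = None
    conditions: List[Condition] = field(default_factory=list)
    causal_factors: List[CausalFactor] = field(default_factory=list)


def all_factors(chain: List[Event]) -> List[CausalFactor]:
    """Collect causal factors attached to events or to their conditions."""
    factors = []
    for event in chain:
        factors.extend(event.causal_factors)
        for condition in event.conditions:
            factors.extend(condition.causal_factors)
    return factors


def unresolved_factors(chain: List[Event]) -> List[CausalFactor]:
    """Causal factors that have no solution attached yet."""
    return [f for f in all_factors(chain) if not f.solutions]


# Illustrative chain; the solution text is hypothetical, not from any report.
direct = CausalFactor("Failure of the reactive power control", CauseType.DIRECT,
                      solutions=["(hypothetical) replace the faulty control unit"])
root = CausalFactor("Inadequate management of the power control", CauseType.ROOT)
contributory = CausalFactor("Redundant power control did not mitigate the failure",
                            CauseType.CONTRIBUTORY)

chain = [
    Event("Reactive power control fails",
          conditions=[Condition("Power control poorly managed", causal_factors=[root])],
          causal_factors=[direct]),
    Event("Blackout", causal_factors=[contributory]),
]
pending = unresolved_factors(chain)
```

In this sketch, `pending` holds the causal factors for which no solution has been related yet, which is the information an engineer would need to complete before deriving the claims of the assurance case.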
3.3 Accident Avoidance Pattern

The second step of our methodology is the construction of the assurance case from the results of ECFMA, in order to structure the acquired knowledge through an argumentation. For this purpose, we have first investigated the existing safety case patterns in order to reuse an accepted structure to elucidate the knowledge. Our starting point has been the pattern catalogue provided in reference [20]. Table 3.1 lists about 20 patterns used in industry.

Pattern Name
Functional Decomposition Pattern
High-Level Software Safety Argument
Software Contribution Safety Argument
SSR Identification Software Safety Argument
Hazardous Contribution Software Safety Argument
SW Contribution Safety Argument with Grouping
Hazard Avoidance Pattern
Fault Free Software Pattern
ALARP (As-Low-As-Reasonably-Practicable) Pattern
Component Contributions to System Hazards
Hazardous SW Failure Mode Decomposition Pattern
Hazardous Software Failure Mode Classification Pattern
Software Argument Approach Pattern
Absence of Omission Hazardous Failure Mode Pattern
Absence of Commission Hazardous Failure Mode Pattern
Absence of Early Hazardous Failure Mode Pattern
Absence of Late Hazardous Failure Mode Pattern
Absence of Value Hazardous Failure Mode Pattern
Effects of Other Components Pattern
Handling of Hardware/Other Component Failure Mode
Handling of Software Failure Mode
At Least As Safe Argument
Requirements Breakdown Pattern

Table 3.1: Pattern catalogue taken from reference [20]

Such patterns have been completely described in other research works, according to Kelly’s representation and formalization [21][22][23][24][25].
Starting from this summary of safety case patterns, we have surveyed all the proposed templates in order to choose the most appropriate one. Yet, most of them do not seem directly applicable for our purpose. In some cases the argumentation is conducted over system requirements, which we do not obtain from the accident knowledge (e.g. SSR Identification Software Safety Argument); in other cases the argument is conducted over risk probabilities, which need to be acceptably low (e.g. ALARP). In other cases the argumentation on hazard limitation is provided through both the avoidance of hazards and the addressing of system requirements (e.g. Hazardous Contribution Software Safety Argument, Figure 3.5), or the pattern argues the safety of the system over the safety of single system functions (e.g. Functional Decomposition Pattern, Figure 2.6, Section 2.4.1).

Figure 3.5: Example: Hazardous Contribution Software Safety Argument [23]

Moreover, the goal of most patterns is to argue over the safety of the system itself, without dealing with a specific accident experience, and they are used at the highest level of a complex safety case. However, among these patterns, the Hazard Avoidance Pattern is close to our requirements (Figure 3.6).

Figure 3.6: Example: Hazard Avoidance Pattern [14]

Its approach is to argue that the system is safe by proving that every identified hazard has been mitigated or eliminated. Yet, it is too generic for our work, because it can be used only at the highest level of a safety case and it does not provide claims about the solutions to the discovered hazards. For these reasons, we have built a new pattern, named “Accident Avoidance Pattern”, which can elucidate the accident knowledge in a way similar to the Hazard Avoidance Pattern but more specifically.
3.3.1 Construction and formalization

At the highest level it is possible to identify two types of argument approach:
• Functional decomposition argument
• Hazard directed argument

The first type argues over different system functions or requirements. Examples of this category are the Functional Decomposition Pattern and ALARP. The second type argues over different hazards that could affect the system. Examples are the Hazard Avoidance Pattern and the Hazardous Contribution Software Safety Argument. Since we have knowledge about discovered hazards, our pattern belongs to the second category.

In order to create a new pattern, we have followed the process described by Kelly to build every kind of argumentation. Figure 3.7 illustrates the six-step process for goal structuring development, adapted to the terminology of assurance cases. As shown in the figure, this is an iterative process. The first step is to identify the claims; specifically, the first action is the definition of the top-level claim, which is the goal of the whole argumentation. Along with this, you need to define the elements that complete the claim: at least a context and, if necessary, a justification of why the top-level claim has been chosen, and possible assumptions. Then, you need to develop a strategy to support the top-level claim by choosing an argument. This element, too, can be completed by contexts and assumptions. Once the argument has been set up, you need to elaborate it by developing the argumentation through different sub-claims (in the figure, you go back to phase 1). The process of structuring the sub-claims follows the same instructions used for the top-level claim.
Figure 3.7: Six step process

At the end, when all the claims and arguments have been defined, you need to support them by adding evidence to each sub-claim. However, if you are defining a pattern, you may leave the evidence out. In this case, it is up to the user to choose the appropriate data and results to be used as evidence. As a result of this six-step process, we have developed the “Accident Avoidance Pattern” presented in Figure 3.8. The formalized documentation of this pattern is reported in Appendix A.

Figure 3.8: Accident Avoidance Pattern

The intent of this pattern is to argue that a specific accident, which could have severe consequences, can never occur in the future, by showing that the identified hazards have been addressed. It can be used either stand-alone or as support for a higher-level safety case in which safety is assured by showing the satisfaction of system safety requirements and/or how different identified accidents can be avoided. This pattern has been inspired by the Hazard Avoidance Pattern, but it provides a more specific structure for the argumentation over the hazards that can lead to the accident.

Since our objective is to assure that a similar accident can never happen again in the future, we have chosen as top-level claim the statement “Accident X can never occur using system Y”. To allow it to be attached to another, higher-level assurance case, we have generalized it with the public indicator.
The justification of why the accident has been chosen is that “It can cause consequences U if some hazards haven’t been addressed”. Then, we need to define which kind of system and what accident we are talking about: the two context elements “System X operating role and context” and “Description of the accident Y” are employed for this purpose. A sub-claim that better clarifies the top-level claim follows: it states that “All possible hazards in accident Y have been addressed”. The context element “Context of identified hazards” is used to list all the hazards discovered through the reports and analyses. The assumption “All possible hazards in accident Y have been identified” assumes that any other hazards not experienced in this accident cannot happen using this system. If this assumption is considered too strong, it can be supported by one or more items of evidence.

After this, we need to develop the strategy. In this case, as in the Hazard Avoidance Pattern, the strategy is an “argument over each hazard”. So, we need to specify that “each hazard W has been mitigated or eliminated”. Unlike the other patterns, in our case the argumentation continues by addressing each hazard through the specific proposed solution. In this phase, we can directly use the results from ECFMA, where solutions have already been structured and related to the problems. The claim “Solution Z will avoid hazard W” is constructed from this information. As depicted in the figure, we have used the multiplicity element to describe the argumentation over more than one hazard and to provide more than one solution to each hazard, if necessary. As usual, evidence is not described in the pattern, so once the pattern is instantiated it needs to be provided and attached to prove that the solutions are effective.
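The instantiation step described above can be illustrated with a short sketch that builds the claim tree from a mapping of hazards to solutions, as it could be extracted from an ECFMA chart. The claim wording follows the templates quoted in the text; everything else (the `Node` class, the functions `instantiate_pattern`, `render` and `undeveloped_claims`, and the placeholder accident, system, hazard and solution names) is an assumption made for illustration and is not part of the formal pattern documentation in Appendix A.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str   # "claim", "context", "justification", "assumption" or "strategy"
    text: str
    children: List["Node"] = field(default_factory=list)
    undeveloped: bool = False  # True while no evidence is attached


def instantiate_pattern(accident: str, system: str, consequences: str,
                        hazards: Dict[str, List[str]]) -> Node:
    """Build the claim tree; `hazards` maps each hazard to its solutions."""
    top = Node("claim", f"{accident} can never occur using {system}")
    top.children += [
        Node("justification",
             f"It can cause {consequences} if some hazards haven't been addressed"),
        Node("context", f"{system} operating role and context"),
        Node("context", f"Description of {accident}"),
    ]
    addressed = Node("claim", f"All possible hazards in {accident} have been addressed")
    addressed.children += [
        Node("context", "Context of identified hazards"),
        Node("assumption", f"All possible hazards in {accident} have been identified"),
    ]
    strategy = Node("strategy", "Argument over each hazard")
    for hazard, solutions in hazards.items():  # multiplicity over hazards
        hazard_claim = Node("claim", f"Hazard '{hazard}' has been mitigated or eliminated")
        for solution in solutions:             # multiplicity over solutions
            hazard_claim.children.append(
                Node("claim", f"Solution '{solution}' will avoid hazard '{hazard}'",
                     undeveloped=True))
        strategy.children.append(hazard_claim)
    addressed.children.append(strategy)
    top.children.append(addressed)
    return top


def render(node: Node, depth: int = 0) -> List[str]:
    """Indented text view of the argument."""
    mark = " (evidence to be provided)" if node.undeveloped else ""
    lines = [f"{'  ' * depth}[{node.kind}] {node.text}{mark}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def undeveloped_claims(node: Node) -> List[Node]:
    """Solution claims that still need evidence."""
    found = [node] if node.undeveloped else []
    for child in node.children:
        found.extend(undeveloped_claims(child))
    return found


# Placeholder inputs, purely hypothetical.
case = instantiate_pattern("Accident A", "system S", "consequences U",
                           {"H1": ["S1"], "H2": ["S2a", "S2b"]})
print("\n".join(render(case)))
```

Marking every solution claim as undeveloped mirrors the fact that the pattern itself carries no evidence: after instantiation, `undeveloped_claims(case)` lists the claims for which the user still has to attach data or analysis results.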
Chapter 4 Case studies

In this chapter we present two case studies which have been performed to illustrate how the developed methodology works in practice. They have been taken from two different safety-critical domains: the first belongs to the aerospace domain and the second concerns a critical communication network.

4.1 DART spacecraft collision

4.1.1 Accident and system role context

The Demonstration of Autonomous Rendezvous Technology (DART) program began in May 2001, designated by NASA (National Aeronautics and Space Administration) as a high-risk technology project, with the objective of demonstrating that a spacecraft could autonomously rendezvous with the orbiting Multiple Paths, Beyond-Line-of-Sight Communications (MUBLCOM) satellite, without human intervention.

The mission was launched on April 15, 2005. DART operated as planned during the first eight hours of the mission, accomplishing all objectives up to that time. However, during proximity operations with MUBLCOM, the spacecraft started to use much more fuel than expected. Approximately 11 hours after launch, DART detected that its propellant supply was almost exhausted, and it began a series of maneuvers for retirement. Although it was not known to ground personnel at the time, DART had actually collided with MUBLCOM 3 minutes and 49 seconds before initiating retirement.

Out of a total of 27 defined mission objectives, DART had met only 11 at the end of the mission. For this reason, NASA convened a Mishap Investigation Board (MIB). At the end of this process, an overview of the DART mishap investigation results was publicly released [26]. Most of the following information about the mission and the system description has been taken from this report.
The DART navigational system was guided by a pre-programmed, autonomous software system designed to use data from both an Advanced Video Guidance Sensor (AVGS) on DART and three Global Positioning System (GPS) receivers (two on DART and one on MUBLCOM). Using a complex algorithm to combine data from the AVGS and GPS sensors, the navigational system was meant to calculate the velocity and position of DART relative to MUBLCOM in order to determine how to use its thrusters to approach the satellite.

The DART Mission Plan consisted of four phases: (I) Launch and Early Orbit, (II) Rendezvous, (III) Proximity Operations, and (IV) Departure and Retirement. In the first phase, the DART spacecraft, together with its Pegasus launch vehicle, would be carried aloft by a carrier aircraft. From there, the Pegasus rocket would ignite, carrying DART into an early orbit below MUBLCOM. In the Rendezvous phase, after completing systems checks, DART would fire its thrusters to move into a second phasing orbit; in this phase, the navigational system would be guided only by GPS data. In the third phase, DART would be led into MUBLCOM’s orbit; here, it would use the AVGS data instead of GPS data to perform a series of precise maneuvers around the satellite. In the last phase, DART would move away from MUBLCOM, expel its remaining propellant, and remain in a retirement orbit.

4.1.2 ECFMA analysis

In order to determine why the DART spacecraft collided with MUBLCOM, we have reconstructed the events using the description of the mishap provided by the MIB final report. The causes and the possible solutions have been taken from the report itself. Figure 4.1 gives a bird’s-eye view of the ECFMA diagram built for the DART collision.
Since it is quite large, we have summarized the content of each element with just one word, in the style of the ECF chart example in Chapter 1 (Figure 1.1). The purpose of this view is just to give an idea of the final ECFMA structure. The chart has then been split into several parts to make it more readable and to give details about every element. Figures 4.2 to 4.5 provide the split parts.

Figure 4.1: ECFMA chart of DART collision: bird’s view

Figure 4.2: ECFMA chart of DART collision: initial events

Figure 4.3: ECFMA chart of DART collision: middle events

Figure 4.4: ECFMA chart of DART collision: upper conditions

Figure 4.5: ECFMA chart of DART collision: final events

As shown in the diagram, several causes contributed to the collision. Specifically, we have identified 9 causes, which are listed below in Table 4.1.
HAZARD | TYPE
Inadequate assessment of project technical risks and review of project’s risk level classification | ROOT CAUSE #1
Lack of adequate documentation of flight code changes and pre-flight simulation and testing that had not taken these changes into account | ROOT CAUSE #2
Reuse of software architecture from a launch vehicle that was inadequate for autonomous space operations because of its lack of adaptability to unanticipated inputs | ROOT CAUSE #3
Failure to utilize lessons learned from past NASA projects | ROOT CAUSE #4
Inaccurate navigation system measurements from the primary GPS receiver to determine DART’s position and velocity relative to MUBLCOM | DIRECT CAUSE #1
Gain set at such a level that the calculations could never converge once the initial reset happened, which caused the infinite-loop reset | CONTRIBUTORY CAUSE #1
Waypoint for the switchover too small | CONTRIBUTORY CAUSE #2
The software logic for the collision avoidance system was dependent on the same navigation system | CONTRIBUTORY CAUSE #3
Ground operators had no capability to control DART remotely | CONTRIBUTORY CAUSE #4

Table 4.1: DART’s collision: list of identified hazards

First of all, the direct cause of the accident was the inaccurate measurements from the primary GPS receiver. This failure was mitigated neither by the collision avoidance system, because of its dependence on the same navigation system (contributory cause), nor by ground personnel, because of the completely autonomous nature of the mission (contributory cause). Moreover, given that the measurements were inaccurate, the area for the switchover from GPS to AVGS had been designed too small to tolerate such an error (contributory cause).
Since the project changed its significance during design and implementation, its risk level increased; however, the risk level was not reviewed (root cause). Moreover, even though it was an important, high-risk mission for NASA, it was also designated as a low-budget project; for this reason, the reuse of an architecture from a launch vehicle and the modifications to the flight code were performed without thorough analysis and simulation (root causes). Specifically, a parameter causing an infinite-loop reset in the navigation system was changed without correct validation (contributory cause). Most of these problems could have been avoided by using the lessons learned from past NASA projects (root cause).

4.1.3 Assurance case

Following the use of ECFMA to analyze the accident, we need to create the assurance case in order to argue that the proposed solutions to the identified problems will actually avoid the same accident. Figure 4.6 shows the instantiation of the Accident Avoidance Pattern for the DART collision. As for the ECFMA chart, some excerpts have been provided to zoom in on the details of the hazard and solution claims. Figure 4.7 shows the highest part of the assurance case.

Figure 4.6: Assurance case of DART accident: bird’s view

Figure 4.7: Assurance case of DART accident: top-level claim

The instantiation of the Accident Avoidance Pattern follows the process described in the previous chapter in Figure 3.7.
The top-level claim “Collision with MUBLCOM can never occur using DART spacecraft” refers to a kind of accident whose consequences, “damages and the premature end of the mission”, are well known. The three context elements, “DART operating role and context”, “Collision with MUBLCOM context” and “Context of the identified hazards”, give the basis for the argumentation; since these elements would be too large for the assurance case if fully spelled out, we have included a reference to the previous section in which the information has already been provided. The assumption about the identification of all the hazards is supported by the existence of a fully detailed investigative report, the MIB report.

Figures 4.8 to 4.12 illustrate the argumentation provided for each hazard and solution. It has been split into five parts, showing the details of the involved elements.

Figure 4.8: Assurance case of DART accident: first excerpt

Figure 4.9: Assurance case of DART accident: second excerpt

Figure 4.10: Assurance case of DART accident: third excerpt

Figure 4.11: Assurance case of DART accident: fourth excerpt

Figure 4.12: Assurance case of DART accident: last excerpt
In these excerpts we can see the instantiation of the claims referring to hazards and solutions, obtained directly from the results of the ECFMA analysis. There are 9 discovered causes (root, direct and contributory) and one solution for each of them.

In order to demonstrate the effectiveness of this argumentation, we need to support the claims with evidence. Since we were not able to perform a thorough and complete analysis on a real system, we have attached a list of possible evidence that engineers can use to prove the proposed solutions. This is a common practice when, at a certain stage of safety case development, some elements remain uninstantiated or undeveloped. Each attached item of evidence denotes a category rather than a specific kind of analysis or tool: for example, we have indicated “Reliability testing results” to support the claim that “the use of a minimum fault tolerant system will avoid the hazard”, so engineers could use different tools, such as Reliability Block Diagrams or Fault Tree Analysis, to obtain these results. The information about all the possible evidence (both qualitative and quantitative) that can be used to support an argumentation has been taken from a survey on the provision of evidence for safety certification [27]. In this way, the attachment of evidence completes the building of the assurance case that structures the knowledge of the DART collision.
4.2 Multistate 911 outage

4.2.1 Accident and system role context

During the night between April 9 and April 10, 2014, a 911 call-routing facility in Englewood, Colorado, stopped routing 911 calls for seven American states (California, Florida, Minnesota, North Carolina, Pennsylvania, South Carolina, and Washington), causing the failure of about 7,000 emergency calls. The loss of 911 service affected more than 11 million people and lasted six hours. Fortunately, there were no deaths or severe injuries as a result of the emergency communication loss.

The accident involved the Next Generation 911 (NG911) system, a new network that relies on an IP-supported architecture instead of the traditional circuit-switched time division multiplexing (TDM) architecture, with the aim of providing new capabilities such as dynamic call routing and video transmission. However, as demonstrated by this outage, there are also new challenges for reliability and safety. As 911 has become a more technologically complex network, the interaction of new and old systems has introduced new vulnerabilities and hazards. In order to understand the problems in such an evolving infrastructure and improve its complete deployment, the Public Safety and Homeland Security Bureau (PSHSB) investigated this accident, concluding the investigation with a final report released in October 2014 [28].

On the day of the outage, the 911 architecture was in a transition stage between the conventional 911 network and NG911. Different companies are involved in this infrastructure.
The main ones are Intrado, a provider of 911 and emergency communications infrastructure, systems, and services for state and local public safety agencies throughout the United States, and CenturyLink, which maintains Washington’s Emergency Services IP Network. Several service providers originate emergency calls to be routed through the network to an answering center. Figure 4.13, taken from report [28], depicts the transition architecture used in the State of Washington at the time of the accident. The red elements are managed by Intrado, the green ones by CenturyLink. The Englewood facility in which the failure originated is also indicated.

Figure 4.13: Washington NG911 Transition architecture [28]

In this infrastructure, “a caller dials 911, and the call is routed through the network of the originating service provider to one of four Intrado gateways serving Washington State, two in the Seattle area (western Washington State), and two in the Spokane area (eastern Washington State). This gateway, which converts the signal from TDM (Time Division Multiplexing) to IP, is also 911-aware and queries other databases to determine the primary Internet Protocol Selective Router (IPSR) for the PSAP that serves the caller’s location. Under normal conditions, the gateway then routes the call to the primary IPSR through a managed IP network, some of which belongs to CenturyLink and other parts of which are provided for those purposes by Intrado. The IPSR is also 911-aware.
It queries various databases (shown as “911 DB” in Figure 4.13) to identify the correct Public Safety Answering Point (PSAP) and to properly address packets to that PSAP. The call is then routed through the “CenturyLink IP Network” to the PSAP. The IPSR is no longer located in the local exchange carrier (LEC) central office, or even in Washington State, but is now in Colorado, with a single “manual failover” backup in Florida. As is often the case in conventional 911 architecture, databases are also located in other states.” [28]

Since both traditional TDM and IP-based service calls need to be routed through the same architecture, engineers have provided the IPSR with the capability of assigning a PSAP Trunk Member (PTM) when a TDM call is served by that IPSR. As we will see in the next section, this function was involved as the proximate cause of the outage.

4.2.2 ECFMA analysis

Starting from the facts and analysis published in the PSHSB report and using the previous architecture description, we have performed the ECFMA analysis. As in the DART case study, we provide both a bird’s-eye view and excerpts of the ECFMA chart.

Figure 4.14: ECFMA chart of 911 outage: bird’s view

In this outage, we have identified 6 main causes. They are listed in the table below.
HAZARD | TYPE
Lack of network workload analysis | ROOT CAUSE #1
Call control and management functions not adequately balanced among ECMC facilities | ROOT CAUSE #2
Too much dependence on few critical elements without adequate safeguards in place | ROOT CAUSE #3
Low threshold for PTM counter | DIRECT CAUSE #1
Inadequate alarm management to generate an alarm for a major outage | CONTRIBUTORY CAUSE #1
Lack of communication among different involved providers | CONTRIBUTORY CAUSE #2

Table 4.2: 911 outage: list of identified hazards

The proximate cause of the outage was a software error regarding the PSAP Trunk Member (PTM) counter, which tracks the number of calls handled by the IPSR. It has been assessed that the threshold of the PTM counter was too low for the generated workload. This setting stems from the lack of a correct workload analysis of the network (root cause). Moreover, the report identifies two architectural problems, regarding respectively the balance of call control and management functions and the overload of a few facilities in routing the calls (root causes). Finally, after the outage it was difficult to pinpoint the facility generating the problem. Two contributory causes have been discovered: an inadequate alarm system, which generated only low-level alarms in response to the outage, and a lack of communication among service providers to help each other in locating the problem. The effect was a delay of six hours in resolving the outage and recovering the emergency infrastructure. Figures 4.15 to 4.17 show the detailed events with the related causes.
Figure 4.15: ECFMA chart of 911 outage: initial events

Figure 4.16: ECFMA chart of 911 outage: accident

Figure 4.17: ECFMA chart of 911 outage: post-accident events

4.2.3 Assurance case

Figure 4.18 illustrates the bird’s-eye view of the assurance case derived for the 911 outage. Again, as for the DART case, we have derived excerpts to better show the details of the argumentation. Figure 4.19 shows the top part of the assurance case. The creation of the case follows the same six-step process used for the previous DART case. The first element to be instantiated is the top-level claim “Multistate 911 outage can never occur using NG911 infrastructure”, which has been chosen because such an outage can cause “potential damages and injuries”. Note that the outage did not cause deaths or other severe losses, but they could potentially have occurred. Moreover, this case can be used as support in a higher-level assurance case in which we want to argue about the reliability of the system instead of its safety. The three context elements - “NG911 operating role and context”, “Multistate 911 outage” and “Context of the identified hazards” - give the basis for the argumentation; as in the previous case, we refer to the description section in which the information has already been provided, in order to make the assurance case more readable.
The assumption about the identification of all the hazards is supported by the PSHSB report, released after a five-month investigation.

Figure 4.18: Assurance case of 911 outage: bird’s view

Figure 4.19: Assurance case of 911 outage: top-level claim

Excerpts about the hazards and the related solutions provided by ECFMA follow in the assurance case (Figures 4.20 to 4.22). In this case study, we have identified six causes. The solutions have been taken directly from the analysis chapters of the report. In this example, we have also identified two possible solutions to eliminate a single hazard: specifically, the “inadequate alarm management” hazard can be solved by both “an adaptive alarm management” and “the update of alarm severity and troubleshooting instructions”. Finally, in order to complete the argumentation, we need to attach the evidence to the solution claims. Wherever possible, we have used concrete evidence. This is the case for the solution claim “higher limit value for PTM”, supported by the “Post-accident actions” evidence, since the limit was modified some days after the outage and its effectiveness as a countermeasure has been demonstrated. For all other solutions, however, we have attached a list of possible pieces of evidence that engineers can use to prove them. Again, as generic evidence we have used the suggestions provided in the survey in reference [27].
Figure 4.20: Assurance case of 911 outage: first excerpt

Figure 4.21: Assurance case of 911 outage: second excerpt

Figure 4.22: Assurance case of 911 outage: third excerpt

4.3 Discussion on the methodology

In the previous sections we have presented two case studies in order to discuss the methodology in light of their results. We want to show the improvement in accident knowledge by comparing the use of the list of recommendations alone, as issued by investigative agencies in their final reports, with the use of our methodology. First of all, we focus on the nature of the recommendations and their effectiveness. As indicated in the NTSB investigative process in reference [2], safety recommendations “are based on the findings of investigation, and may address deficiencies which do not pertain directly to what is ultimately determined to be the cause of the accident”. This means that they address underlying problems and organizational deficiencies, most of them corresponding to the discovered root causes. In Table 4.1 of Section 4.1.2 we have listed the identified causes for the DART case study. Table 4.3 instead indicates the corresponding safety recommendations that address each hazard, as named in the report, where available.
DART’s HAZARD            RECOMMENDATION
ROOT CAUSE #1            “Risk posture management”
ROOT CAUSE #2            “Guidance, Navigation and Control (GN&C) Software Development Process”
ROOT CAUSE #3            “High Risk, Low Budget Nature of the Procurement”
ROOT CAUSE #4            “Lessons Learned Analysis”
DIRECT CAUSE #1          “FMEA” recommendation
CONTRIBUTORY CAUSE #1    none
CONTRIBUTORY CAUSE #2    none
CONTRIBUTORY CAUSE #3    “FMEA” recommendation
CONTRIBUTORY CAUSE #4    none

Table 4.3: DART’s collision: correspondence with recommendations

As we can see, 6 out of 9 causes (about 67%) have been reported in the final recommendations with suggested ways to correct the problems. They include all the root causes and the direct cause, but only 1 out of 4 contributory causes. In fact, the remaining causes not mentioned in the recommendations list are related to specific, technical problems: the incorrect gain parameter, which caused the infinite reset loop, the small waypoint and the lack of remote control from the ground. Moreover, among the 6 covered causes, 4 refer to the “process”, intended as the design and operation actions that people should take, while the remaining 2 deal with the final “product”, which is the system itself. The correspondences for the 911 outage are summarized in the following table.

911’s HAZARD             RECOMMENDATION
ROOT CAUSE #1            none
ROOT CAUSE #2            none
ROOT CAUSE #3            none
DIRECT CAUSE #1          none
CONTRIBUTORY CAUSE #1    none
CONTRIBUTORY CAUSE #2    “Contractual relationship monitoring”

Table 4.4: 911 outage: correspondence with recommendations

In this case, only 1 out of 6 causes (about 17%) has been worked out in detail in the final list of recommendations. It is a contributory cause regarding the communication problem - an organizational deficiency - and it refers to the “process” of managing the system.
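The coverage figures discussed for Tables 4.3 and 4.4 can be recomputed with a short Python sketch. The dictionaries below simply transcribe the two tables, with `None` standing for “none”; the names `DART_RECOMMENDATIONS`, `OUTAGE_911_RECOMMENDATIONS` and `coverage` are introduced here for illustration only and do not appear in the reports.

```python
from typing import Dict, Optional, Tuple

# Transcription of Table 4.3 (None means no recommendation addresses the cause).
DART_RECOMMENDATIONS: Dict[str, Optional[str]] = {
    "ROOT CAUSE #1": "Risk posture management",
    "ROOT CAUSE #2": "Guidance, Navigation and Control (GN&C) Software Development Process",
    "ROOT CAUSE #3": "High Risk, Low Budget Nature of the Procurement",
    "ROOT CAUSE #4": "Lessons Learned Analysis",
    "DIRECT CAUSE #1": "FMEA",
    "CONTRIBUTORY CAUSE #1": None,
    "CONTRIBUTORY CAUSE #2": None,
    "CONTRIBUTORY CAUSE #3": "FMEA",
    "CONTRIBUTORY CAUSE #4": None,
}

# Transcription of Table 4.4.
OUTAGE_911_RECOMMENDATIONS: Dict[str, Optional[str]] = {
    "ROOT CAUSE #1": None,
    "ROOT CAUSE #2": None,
    "ROOT CAUSE #3": None,
    "DIRECT CAUSE #1": None,
    "CONTRIBUTORY CAUSE #1": None,
    "CONTRIBUTORY CAUSE #2": "Contractual relationship monitoring",
}


def coverage(table: Dict[str, Optional[str]]) -> Tuple[int, int, float]:
    """Return (covered causes, total causes, percentage covered)."""
    covered = sum(1 for rec in table.values() if rec is not None)
    return covered, len(table), 100.0 * covered / len(table)


for name, table in [("DART", DART_RECOMMENDATIONS),
                    ("911 outage", OUTAGE_911_RECOMMENDATIONS)]:
    covered, total, pct = coverage(table)
    print(f"{name}: {covered}/{total} causes covered ({pct:.1f}%)")
```

The exact ratios are 6/9 (about 66.7%) and 1/6 (about 16.7%).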
Solutions and countermeasures for the other hazards, most of which are technical issues, are described throughout the report. However, they are not referenced by the safety recommendations. Only one recommendation, “Develop and Implement NG911 Transition Best Practices”, mentions them: there the authors of the report state that they have “shed light on a number of measures that providers can take to improve service reliability during this transition”, but they do not give any reference to those measures.

With these two case studies, we have shown that the list of recommendations is not enough to cover all the causes and avoid a similar accident. This is due to the nature of the recommendations, which mainly deal with high-level problems. Our approach instead aims to work out all the causes - root, direct and contributory causes - using all the knowledge provided by the reports. In this way, we can eliminate both organizational deficiencies and technical problems, so that we can avoid both the specific hazards that turned up in the accident and potential problems that did not turn up in the episode.

The other main problem of safety recommendations is their understandability. We can define this property as the quality of information which makes it understandable by people with reasonable background knowledge of business and technical activities. If we imagine an organization that is not directly involved in the accident but that can experience a similar mishap in the same domain, the simplest way to reuse this knowledge is to read and apply the published recommendations. However, these statements are directed to the stakeholders in such a way that it is difficult for this third-party organization, not involved in the episode, to implement the advice.
In our approach, an organization can use all the knowledge provided by the final reports - not only the recommendations, but also facts, analyses and descriptions, even from other documents - to reconstruct events and conditions. After this phase, performed through ECFMA, engineers can structure the knowledge in an argumentation. The use of ECFMA, in which solutions have already been connected to the problems, is useful to identify and solve the problems, while the instantiation of the assurance case pattern is performed to demonstrate the validity of the solutions themselves. In fact, even if some solutions are taken from the same recommendations, their effectiveness is argued through the use of specific evidence, which engineers can use to confirm or refute the argumentation.

Another property of the knowledge we are elucidating, related to understandability, is learnability. We can consider the concept behind this property using the standard ISO/IEC 9126-1:2001 for product quality, in which learnability is defined as “the capability of the software product to enable the user to learn its application”. In our context, we can define it as “the capability of an information to enable the user to learn its application” [29]. We can imagine a spectrum: the best knowledge is information in which, given a problem, there is a straightforward solution to work it out, while the worst is information with just an identified problem and no suggestion on how to solve it. In our case, we have seen that the nature of recommendations is deliberately general, with a brief, textual synthesis of a problem and a generic suggestion on how to mitigate it.
This is true not only for high-level problems, such as organizational deficiencies or communication troubles, but also for the few technical issues that are reported. For example, in the DART case study the “FMEA recommendation” states that “NASA should define the minimum fault tolerance required for spacecraft performing rendezvous missions in order to protect space assets from collision”. It can be considered a requirement more than an indication of how to design and develop a fault-tolerant system for the spacecraft. In our methodology, instead, we can insert in the argumentation both technical countermeasures, which are usually found in the report analyses, and solutions from the recommendations. In this way, even if they are considered as requirements to be implemented, their effectiveness is proved by evidence. This makes the gained knowledge easier to apply in practice.

Moreover, along with understandability and learnability, we can discuss the reusability of our methodology, where this property is defined as the ability of an item to be used repeatedly. In our context, we can imagine engineers using our methodology as an asset. If they want to identify a similar occurrence in an accident catalogue, they need to read the report and find causes and solutions in order to verify whether these elements are relevant for them or not. This operation can be performed by reusing the assurance case previously instantiated through the Accident Avoidance Pattern, in which problems and solutions are well defined and supported by evidence.
For example, a problem such as “the inadequate balance of call control and management functions” in the 911 outage can occur again in another emergency communication network: if the engineers consider this problem relevant for them, they can apply the provided solution about “the reviewed ingress trunking configuration distribution”. By demonstrating the effectiveness of the solution through the “configuration management plan” and the “performance testing results”, they will have avoided a potential undiscovered hazard. Moreover, all the problems are described at the same level of the assurance case, and all the solutions are represented at the same sub-level. These features allow the whole artifact, or parts of it, to be used repeatedly even by other engineers operating in the same domain.

Finally, a quality of our assurance case pattern is its flexibility. This refers to two features. The first one is its possible use either as a stand-alone assurance case or as support to a higher-level assurance case; the second one deals with the system property which is argued in the higher-level assurance case. In fact, the top-level claim referring to an accident, namely “Accident X can never occur using system Y”, allows such a higher-level case to argue not necessarily over the system’s safety, but also over other properties, such as reliability, availability or maintainability, by proving that different identified accidents cannot occur in the concerned system.

There are also some possible drawbacks in our methodology. One of them could be the lack of a logical relationship in the assurance case between root causes, direct causes and contributory causes, which are treated at the same level. However, the use of the context element about the identified hazards makes the nature of each cause clear.
Moreover, in order to avoid a similar accident we claim that it is necessary to address all the hazards, even the ones which have been contributory causes, since addressing them would have mitigated the mishap. For this reason, it may be easier to argue them one by one at the same level of the assurance case by developing the argumentation in a vertical way. Regarding the hazards, we assume that all the possible hazards highlighted by the accident have been identified by the investigative agency. This assumption is quite acceptable, since the investigation lasts many months, involves contracting authorities, system providers, regulators and emergency bodies, and is conducted by a Board that has experience of similar accidents in the same domain. However, other hazards for the same system may not have turned up in this specific accident, and their absence is quite challenging to assure completely. Our approach nevertheless aims to reduce the risk of new potential problems by solving the root causes and thus minimizing the probability of other undiscovered hazards.

Future work

We have also highlighted some possible features of our methodology that can be reviewed in the future. One of them is the use of the Accident Avoidance Pattern attached to a higher-level assurance case. In our case studies we have considered the instantiation of the pattern as stand-alone, without making claims about any system property. If, instead, it is used in a higher-level assurance case, that case could argue over the addressing of system requirements and the avoidance of several identified accidents. In this situation, some solutions provided to solve discovered hazards may have already been implemented according to a system requirement, so that they would be redundant in the assurance case. For this reason, it is valuable to identify these overlapping claims and eliminate the redundancy. Moreover, we can evaluate other properties of the methodology through the use of a real case.
One of them could be efficiency: we could show that our approach is also efficient in time when the recommendations are not clear and specific enough to be applied, which is true in most cases. However, this property could be evaluated only through quantitative measurements of the time spent in applying either the methodology or the list of recommendations in a real case.

Conclusion

In this thesis the goal was to develop a methodology for reusing and elucidating accident knowledge in safety critical domains. We have used the concepts of the new standards about Assurance case and GSN (Sections 2.2 and 2.3). The assurance case is a novelty in research, since it represents a generalization of the safety case, which is instead a de facto standard in industry for the certification of safety critical systems. Further, the contribution of the thesis includes the investigation of works about the safety case lifecycle (Section 2.5.1), which, despite its relevance, has not been deeply developed in the research area of safety assurance. Moreover, we have illustrated ECFA (Section 3.2.1), an investigation tool used by public agencies to reconstruct what happened in an accident. By combining these elements we have developed a methodology whose results can be used as an agreement between a supplier and an acquirer, or as an asset by developers to increase knowledge of a safety critical domain. We have also highlighted some possible features of our methodology that can be reviewed in the future. For example, the application of the approach in a real case can be an excellent way to evaluate its efficiency and its use as part of a complex assurance case. These ideas are consistent with the objective with which we began this thesis: improving the knowledge of computer systems employed in social domains.
Appendix A

Accident Avoidance Pattern formalization

The following table reports the formalized documentation of the Accident Avoidance Pattern, as described by Kelly and presented in Section 2.4.2.

Accident Avoidance Pattern
Author: Mirko Napolano
Created: 17/11/2014
Last modified: 13/02/2015

Intent
The intent is to argue that a specific accident, which could generate severe consequences, can never occur in the future, by showing that the identified hazards have been addressed. It can be used either as stand-alone or as support for a higher-level safety case in which safety is assured by showing the satisfaction of the system safety requirements and/or how different identified accidents can be avoided.

Motivation
This pattern has been inspired by the Hazard Avoidance Pattern, but it provides a more specific structure for the argumentation over the hazards that can lead to the accident. By defining the hazards for a specific accident, arguing about the solutions that avoid them is easier than arguing over the safety of the whole system. Moreover, it can be attached to a higher-level assurance case in which the avoidance of different accidents is argued along with the addressing of system design requirements.

Structure

Participants
accidentAvoidance: it is the top-level claim, which defines the objective of the pattern. This is a public goal, which may be referenced by a higher-level safety case that argues over the safety of the system by avoiding different accidents. The linking can be performed using an away goal reference.
accidentCause: this is used to justify that, if the {accident Y} is not prevented, it will cause {consequences U}.
systemDef: it describes the characteristics of the concerned system and its operating role. If it has already been used in a higher-level safety case, it can be omitted.
accidentContext: this element describes the context of the {accident Y}.
hazAddr: it explains in more detail why the top-level claim is true, stating that the hazards in the {accident Y} have been addressed. The goal is supported by the argumentation that each identified hazard has been mitigated or eliminated.
hazDef: it describes the nature and the characteristics of the identified hazards.
hazIdent: this assumption claims that all the hazards that can lead to the {accident Y} have been discovered and identified. If needed, evidence supporting this assumption can be attached.
argHazAddr: this strategy provides an argumentation for the attenuation of the identified hazards by discussing each of them separately. There can be more than one hazard for {accident Y}.
specHazAddr: this goal is used to claim that the specific {hazard W} has been attenuated, either by mitigating or by eliminating it.
specHazContext: this element describes the context of the {hazard W}.
specHazSolution: this element introduces the {solution Z} for the {hazard W}. There could be more than one solution for the {hazard W}.

Collaborations
• accidentAvoidance introduces the claim about the avoidance of an identified accident. This claim is supported by the argumentation over the n addressed hazards introduced by hazDef.
• hazIdent is useful to assume that no other hazards from the {system X} can contribute to the {accident Y}.
• all the specHazSolution elements provide a complete explanation of the resolution of the hazards.
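As an informal companion to the participants and collaborations listed above, the pattern can be sketched as a small data structure. This is a minimal Python sketch and not part of the pattern documentation: the class names `Solution`, `Hazard` and `AccidentAvoidanceCase` and their methods are hypothetical, and the example values are taken from the 911 case study of Section 4.2. The two alarm-management solutions are left without evidence purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Solution:
    """specHazSolution: a {solution Z} plus the evidence attached to it."""
    claim: str
    evidence: List[str] = field(default_factory=list)


@dataclass
class Hazard:
    """specHazAddr with its specHazContext and one or more solutions."""
    name: str
    context: str
    solutions: List[Solution] = field(default_factory=list)


@dataclass
class AccidentAvoidanceCase:
    system: str          # {system X}, described by systemDef
    accident: str        # {accident Y}, described by accidentContext
    consequences: str    # {consequences U}, used by accidentCause
    hazards: List[Hazard] = field(default_factory=list)

    def top_level_claim(self) -> str:
        """Text of the accidentAvoidance goal."""
        return f"{self.accident} can never occur using {self.system}"

    def undeveloped_solutions(self) -> List[str]:
        """Solution claims that still lack supporting evidence."""
        return [s.claim for h in self.hazards
                for s in h.solutions if not s.evidence]


case = AccidentAvoidanceCase(
    system="NG911 infrastructure",
    accident="Multistate 911 outage",
    consequences="potential damages and injuries",
    hazards=[
        Hazard("Low threshold for PTM counter", "DIRECT CAUSE #1",
               [Solution("higher limit value for PTM",
                         ["Post-accident actions"])]),
        Hazard("Inadequate alarm management", "CONTRIBUTORY CAUSE #1",
               [Solution("an adaptive alarm management"),
                Solution("the update of alarm severity and "
                         "troubleshooting instructions")]),
    ],
)

print(case.top_level_claim())
print(case.undeveloped_solutions())
```

Keeping the hazards in one flat list mirrors the choice of arguing all causes at the same level, while `undeveloped_solutions` gives a quick view of the solution claims that still need evidence.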
Applicability
This is a specific pattern that can support a higher-level hazard-directed safety argument, in which the system safety is argued by avoiding different identified accidents. This pattern should be applied when the hazards in the concerned accident are clear and have been completely identified. One possibility is to apply this pattern by reusing the experience of an accident or an event that happened in the concerned context. The elements accidentContext, hazDef and specHazContext need to be clearly described.

Consequences
After instantiating this pattern a number of undeveloped goals (n, where n = # of identified hazards) will remain:
• specHazSolution (n of): for each solution it is necessary to support this claim, namely that the implemented solution will actually avoid the related hazard. Both qualitative and quantitative evidence can be provided to support this claim.
In addition, if the hazIdent assumption needs to be supported, evidence supporting it should be provided.

Implementation
A top-down approach should be used by instantiating the goals and the related contexts before the strategy element. This pattern assumes that all the possible hazards in the concerned accident have already been identified. Each hazard should be discussed one by one, starting from the specHazAddr claim down to the needed evidence.

Possible pitfalls
• Not correctly describing the concerned accident in the upper part of the pattern may lead to an ambiguous explanation of the assurance case.
• Not exhaustively identifying all the hazards in the concerned accident described in the hazDef context element may lead to an incomplete argumentation on the accident avoidance.
• Not providing all the needed evidence supporting the specHazSolution claim may lead to an unconvincing argumentation on the addressing of the hazards.

Example

Related Patterns
• Hazard Avoidance Pattern: this is a pattern that assures the safety of a system by arguing over different identified hazards. However, it is too generic to be directly used to assure a system property in a specific situation.

Bibliography

[1] O. Bowen and V. Stavridou, “Safety Critical Systems, Formal Methods and standards”, Software Engineering Journal, 1982
[2] U.S. National Transportation Safety Board (NTSB), “The Investigative Process”, http://www.ntsb.gov/investigations/process/Pages/default.aspx
[3] C.C. Lebow, L.P. Sarsfield, W.L. Stanley, E. Ettedgui, G. Henning, “Safety in the Skies: Personnel and Parties in NTSB Aviation Accident Investigations”, Santa Monica, California: RAND, 1999
[4] U.S. Department Of Energy (DOE) Handbook, “Accident and Operational Safety Analysis Volume I: Accident Analysis Techniques”, July 2012
[5] The Dutch Safety Board, “Crashed during approach, Boeing 737-800, near Amsterdam Schiphol airport, 25 February 2009”, May 2010
[6] “IEEE Standard Adoption of ISO/IEC 15026-1 - Systems and Software Engineering - Systems and Software Assurance - Part 1: Concepts and Vocabulary”, November 2014
[7] “IEEE Standard Adoption of ISO/IEC 15026-2:2011 - Systems and Software Engineering - Systems and Software Assurance - Part 2: Assurance case”, September 2011
[8] “IEEE Standard Adoption of ISO/IEC 15026-3 - Systems and Software Engineering - Systems and Software Assurance - Part 3: System integrity levels”, June 2013
[9] “IEEE Standard Adoption of ISO/IEC 15026-4 - Systems and Software Engineering - Systems and Software Assurance - Part 4: Assurance in the life cycle”, August 2013
[10] S. Wilson, J. McDermid, P. Fenelon and P. Kirkham, “No More Spineless safety Cases: A Structured Method and Comprehensive Tool Support for the Production of Safety Cases”, 2nd International Conference on Control and Instrumentation in Nuclear Installations (INEC’95), Cambridge, UK 1995
[11] S. Toulmin, “The Uses of Argument”, (1958; 2nd edn, 2003)
[12] “GSN Community Standard Version 1”, November 2011, http://www.goalstructuringnotation.info
[13] C. Alexander, “A Pattern Language: Towns, Buildings, Construction”, Oxford University Press, 1977
[14] T.P. Kelly, “Arguing Safety: A Systematic Approach to Managing Safety Cases”, PhD Thesis, University of York, 1998
[15] E. Gamma, R. Helm, R. Johnson, J. Vlissides, “Design Patterns: Abstraction and Reuse of Object-Oriented Design”, ECOOP’93 - Object Oriented Programming, 7th European Conference, Kaiserslautern, Germany 1993
[16] T.P. Kelly, J.A. McDermid, “Safety Case Construction and Reuse using Patterns”, in 16th International Conference on Computer Safety, Reliability and Security, SAFECOMP, 1997
[17] W.S. Greenwell, E.A. Strunk, J.C. Knight, “Failure analysis and the safety-case lifecycle”, IFIP Working Conference on Human Error, Safety and System Development (HESSD), Toulouse, France 2004
[18] W.S. Greenwell, “Pandora: An Approach to Analyzing Safety-Related Digital-System Failures”, PhD Thesis, University of Virginia, 2007
[19] A. Ruiz, I. Habli, H. Espinoza, “Towards a Case-Based Reasoning Approach for Safety Assurance Reuse”, 1st Workshop on Next Generation of System Assurance Approaches for Safety Critical Systems (SASSUR), Magdeburg, Germany 2012
[20] Y. Matsuno, “A design and implementation of an Assurance case language”, 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014
[21] R. Alexander, T. Kelly, Z. Kurd, J. McDermid, “Safety cases for advanced control software: Safety case patterns”, Technical report, Department of Computer Science, University of York, 2007
[22] E. Denney, G. Pai, “A formal basis for safety case patterns”, 2nd International Conference, SAFECOMP, Toulouse, France 2013
[23] R. Hawkins, T. Kelly, “A software safety argument pattern catalogue”, Technical report, The University of York, 2013
[24] T. Kelly, J. McDermid, “Safety case construction and reuse using patterns”, in SAFECOMP, pages 55-69, 1997
[25] R. A. Weaver, “The Safety of Software - Constructing and Assuring Arguments”, PhD thesis, Department of Computer Science, University of York, 2003
[26] NASA, “Overview of the DART Mishap Investigation Results, for Public Release”, May 2006
[27] S. Nair, J.L. de la Vara, M. Sabetzadeh, L.C. Briand, “An extended systematic literature review on provision of evidence for safety certification”, Information and Software Technology Volume 56, Issue 7, July 2014
[28] Public Safety and Homeland Security Bureau (PSHSB), “April 2014 Multistate 911 Outage: Cause and Impact”, October 2014
[29] International Organization for Standardization (ISO), “ISO Standard 9126-1: Software engineering - Product quality - Part 1: Quality model”, 2001