by Julius DeSilva

Accidents and failures, whether in maritime, aviation, healthcare, or nuclear settings, are often subjected to intense scrutiny to determine their root causes. However, the challenge lies in distinguishing whether an event is an anomaly or a symptom of a deeper systemic issue. This analysis is crucial as it directly influences the actions taken to prevent a recurrence or occurrence elsewhere. A management system approach, such as those outlined in ISO 45001 for occupational health and safety, ISO 9001 for quality management, or ISO 14001 for environmental management, provides a structured framework for systematically and proactively addressing risks when data exists.

Analysis of root causes: systemic failures

Root cause analysis is a fundamental investigative tool used to trace an incident to its origins. However, many organizations focus on immediate, apparent causes rather than examining systemic contributors and true root causes. Systemic failures result from weaknesses in policies, processes, or culture and therefore often recur in different forms over time.

The management system approach advocated by ISO standards and other industry-specific standards, such as the ISM Code, emphasizes continual improvement and risk-based thinking. The intent of these standards is to reduce the probability of systemic failures by integrating safety, quality, efficiency, security, and environmental management into everyday operations.

Systemic failure example: Chernobyl

I recently read the book Midnight in Chernobyl, which outlined the 1986 Chernobyl nuclear disaster and the underlying systemic failures that contributed to this incident. Unlike isolated accidents, Chernobyl resulted from a combination of design flaws, operational errors, and a deficient safety culture. Key systemic issues included:

  • Design flaws. The RBMK reactor used in Chernobyl had an inherent positive void coefficient, meaning an increase in steam production could accelerate the reaction uncontrollably.
  • Operational failures. A safety test was conducted under unsafe conditions, including a reduced power level and disengaged emergency shutdown mechanisms.
  • Cultural and regulatory gaps. A weak safety culture, insufficient training (and thus insufficient competency), and an authoritarian management style bred complacency and discouraged questioning of unsafe practices.

These root causes culminated in an explosion that released massive amounts of radioactive material. European countries are so tightly packed that winds freely spread the fallout across borders. The systemic nature of the disaster was later addressed through international nuclear safety reforms, including the strengthening of the International Atomic Energy Agency’s safety standards and ISO frameworks such as ISO 19443, which outlines quality management system requirements for organizations working within the nuclear sector.

Other systemic failures

Deepwater Horizon oil spill (2010)

Another example of a systemic failure is the Deepwater Horizon oil spill. This incident was not merely the result of a single mistake but a consequence of systemic lapses in safety practices, regulatory oversight, and risk management. Contributing factors included:

  • Cultural deficiencies. The organization prioritized cost cutting over risk mitigation.
  • Inadequate risk assessments. There was poor well-integrity testing and misinterpretation of pressure data.
  • Regulatory weaknesses. There was insufficient government oversight and a lack of stringent industrywide safety protocols.

This catastrophe led to significant regulatory changes, including the implementation of stricter safety and environmental policies within the oil and gas industry, aligned with ISO 45001 and ISO 14001.

The Boeing 737 MAX crashes (2018, 2019)

The Boeing 737 MAX crashes further illustrate systemic failure. Investigations revealed that flaws in the aircraft’s Maneuvering Characteristics Augmentation System (MCAS) were not adequately addressed due to:

  • Design and engineering oversights. Critical safety features were made optional rather than standard.
  • Regulatory gaps. The FAA relied excessively on Boeing’s self-certification.
  • Organizational pressures. The corporate culture emphasized speed-to-market delivery over comprehensive safety testing.

This resulted in significant regulatory reforms, including tighter oversight and compliance with international aviation safety standards.
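
The contributing factors listed for the three incidents above can be compared side by side. The following is a minimal Python sketch, not a method taken from any of the investigations: it tallies how many incidents share each factor category. The category labels, and the mapping of each bullet to a label (for example, treating "organizational pressures" as a culture issue), are assumptions made purely for illustration.

```python
from collections import Counter

# Contributing factors as listed in the three case studies above.
# The category labels are an assumed simplification, not an official taxonomy.
incidents = {
    "Chernobyl": {"design", "operations", "culture", "regulation"},
    "Deepwater Horizon": {"culture", "risk_assessment", "regulation"},
    "Boeing 737 MAX": {"design", "regulation", "culture"},
}


def recurring_factors(incidents, min_incidents=2):
    """Return {category: count} for categories seen in at least min_incidents incidents."""
    counts = Counter(cat for cats in incidents.values() for cat in cats)
    return {cat: n for cat, n in counts.items() if n >= min_incidents}


recurring = recurring_factors(incidents)
# Print the most widely shared categories first, ties broken alphabetically.
for cat, n in sorted(recurring.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{cat}: {n} of {len(incidents)} incidents")
```

Under this assumed mapping, culture and regulation appear in all three incidents and design in two. Factors that recur across otherwise unrelated events are the kind of pattern that points to systemic rather than isolated causes.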

Fixes vs. systemic longer-term improvement

Addressing failures can be approached through quick fixes or long-term systemic improvements. Each approach has its advantages and disadvantages:

Quick fixes

Pros:

  • Immediate resolution of pressing issues
  • Cost-effective in the short term
  • Prevents further damage or loss

Cons:

  • Does not address underlying systemic issues
  • Can lead to recurring problems if not supplemented with deeper analysis
  • Often reactive rather than proactive

Systemic longer-term improvements

Pros:

  • Addresses root causes, reducing the likelihood of recurrence
  • Enhances organizational resilience and safety culture
  • Aligns with ISO management systems, ensuring continuous improvement

Cons:

  • Requires significant time and resources
  • May face resistance from stakeholders due to cultural inertia
  • Implementation complexity can slow down immediate corrective actions

A balanced approach is often necessary—implementing short-term fixes to mitigate immediate risks while developing long-term systemic improvements to ensure sustainable safety and risk management practices.

What if we cannot foresee all risks?

Even with rigorous management systems and risk assessments, not all risks can be predicted. Organizations must be prepared to address unforeseen risks through:

  • Resilient systems. It is important to develop adaptable and robust safety management frameworks that can respond effectively to new threats.
  • Proactive learning. The organization can encourage a culture of continuous learning and scenario planning to anticipate emerging risks.
  • Redundancies and safeguards. Implementing fail-safe redundancies and contingency plans can mitigate the effects of unforeseen events.
  • Stakeholder collaboration. Engaging industry experts, regulators, and other stakeholders to share knowledge can help improve collective risk awareness.

Despite the lessons from Chernobyl, 25 years later the Fukushima disaster occurred. An earthquake of the magnitude that triggered it was not foreseen as a risk, even though, as an engineer on the project highlighted, an earthquake of magnitude 8.5 had hit in 1896 near the coast where the reactor was to be built. After Chernobyl, the 1970s-built reactor in Fukushima was not upgraded with the latest safety features due to high costs. Japan’s nuclear industry had a history of regulatory complacency and reluctance to accept international recommendations.

ISO 31000, which addresses risk management, emphasizes the importance of resilience and adaptability in the face of unpredictable risks. By fostering a commitment to learning and preparedness across the organization, businesses can better navigate uncertainties while maintaining operational safety and efficiency.

The benefits of a management system approach

A management system approach, as defined by ISO standards, provides the following advantages:

  • Structured risk management. ISO 31000 provides guidance for the systematic identification, assessment, and mitigation of risks.
  • Continuous improvement. The Plan-Do-Check-Act (PDCA) cycle described in ISO 9001, ISO 45001, and ISO 14001 encourages learning from incidents to prevent recurrence.
  • Organizational culture change. Implementing ISO standards fosters a risk-oriented mindset, reducing the likelihood of systemic failures.

ISO management systems, when implemented and sustained, can act as a preventive tool to proactively manage risk.
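
The identification, assessment, and mitigation steps mentioned above are often recorded in a risk register. The sketch below is one hedged illustration in Python. ISO 31000 does not prescribe a scoring formula; the likelihood-times-severity score, the 1 to 5 scales, the rating thresholds, and the example entries are all assumptions chosen for illustration, and each organization would define its own.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    severity: int    # assumed scale: 1 (negligible) to 5 (catastrophic)

    @property
    def score(self) -> int:
        # Common convention, not an ISO requirement: likelihood x severity.
        return self.likelihood * self.severity


def rating(score: int) -> str:
    # Hypothetical thresholds; real criteria are set by the organization.
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries, invented for this example.
register = [
    Risk("Safety test run with shutdown systems disabled", 2, 5),
    Risk("Near-miss reports not reviewed", 4, 3),
    Risk("Outdated procedure in use", 3, 2),
]

for r in prioritize(register):
    print(f"{r.score:>2} {rating(r.score):<6} {r.name}")
```

A simple score like this only ranks the risks that have already been identified. It does nothing for the unforeseen risks discussed earlier, which is why the register would need to be reviewed and updated as part of the PDCA cycle.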

Conclusion

Understanding whether an accident is an anomaly or a systemic failure is critical in determining the appropriate response. Sadly, at times industry must incur the cost of the nonconformity to learn the lesson. Organizational “can-do” attitudes lead to risk normalization, where dangerous conditions are seen as normal. Further, some organizational and demographic cultures do not encourage challenging authority or questioning decisions. An absence of accidents, incident reports, and near misses gives a false sense of security that things are working well. This may lead to overconfidence in decision making, lapses in regulatory oversight, and the diversion of resources to other “priorities.”

Systemic failures indicate deeper vulnerabilities requiring long-term corrective actions. The application of ISO management systems offers a proactive and structured approach to accident prevention, ensuring that organizations move beyond reactive responses to fostering a culture of continuous improvement and risk management. By embracing these principles, industries can mitigate systemic risks, ensuring safer and more resilient operations.

Note – The above article was recently featured in Exemplar Global’s publication ‘The Auditor’.
