Human Error or a Bigger Problem? When to Dig Deeper

by Julius DeSilva

In the world of process improvement and problem-solving, human “user” error can often become the go-to explanation when things go wrong. A mis-entered data point, a forgotten step in a procedure, or a misconfigured setting—blaming the user is quick and easy. But how do you know when an issue is bigger than just user error?

Understanding when to dig deeper and identify systemic flaws is critical. By integrating structured approaches like Root Cause Analysis (RCA) and the PDCA (Plan-Do-Check-Act) cycle, organizations can shift from a reactive blame culture to a proactive, continual improvement mindset that eliminates recurring problems at their source.

The Prevalence of User Error in Different Industries

Human error has been identified as a significant contributor to operational failures across multiple sectors:

  • Cybersecurity: According to the World Economic Forum, 95% of cybersecurity breaches result from human error.
  • Manufacturing: A study by Vanson Bourne found that 23% of unplanned downtime in manufacturing is due to human error, making it a key contributor to production inefficiencies. The American Society for Quality (ASQ) reports that 33% of quality-related problems in manufacturing are due to human error.
  • Healthcare: The British Medical Journal (BMJ) estimates that medical errors—many due to human factors—cause approximately 250,000 deaths per year in the U.S. alone.
  • Aviation & Transportation: The Federal Aviation Administration (FAA) attributes 70-80% of aircraft incidents to human error, but deeper analysis often reveals process design issues, poor training, or missing safeguards.

These statistics reinforce a key point: Human error isn’t always the root cause—it’s often a symptom of a deeper, systemic issue.

Recognizing When to Look Beyond User Error

Here’s how to tell when an issue isn’t just a one-time mistake but a signal that the system itself needs improvement:

  1. Recurring Issues Across Multiple Users – If multiple employees are making the same mistake, the problem likely isn't individual human error; it is a flaw in the process, system design, or training (a rough sketch of this check follows the list). For example, if multiple operators incorrectly configure a machine setting, it might indicate confusing controls, inadequate training, or unclear documentation rather than simple user mistakes.
  2. Workarounds and Process Deviations – If employees consistently find alternative ways to complete a task, the system may not be designed for real-world conditions. If workers routinely bypass a safety feature because it "slows them down," the process needs reevaluation, whether through retraining, redesign, or better automation. At QMII, we always reinforce building the system for its users: start from the as-is of how work is actually done, then make incremental improvements.
  3. High Error Rates Despite Training – If errors persist even after proper training, the issue might be process complexity, unclear instructions, or a lack of intuitive system design. If employees consistently make minor mistakes, the system interface or workflow rules might need simplification rather than just retraining staff.
  4. Error Spikes in High-Stress Situations – Mistakes often increase under time pressure, fatigue, or stress, which suggests a workload or process issue rather than simple carelessness. In a maritime environment, high error rates during critical operations could signal staffing shortages, inefficient safety interlocks, or poor user interfaces on devices.
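As a rough illustration of the first point, a simple review of an error log can show whether a mistake is concentrated in one person or spread across many users. The log entries, field names, and threshold in the Python sketch below are hypothetical, not data from any real system.

```python
# A minimal sketch of the "recurring issue" check described in point 1 above.
# The error log, field names, and threshold are hypothetical illustrations.
from collections import defaultdict

error_log = [
    {"user": "op_01", "error": "wrong_calibration_setting"},
    {"user": "op_02", "error": "wrong_calibration_setting"},
    {"user": "op_03", "error": "wrong_calibration_setting"},
    {"user": "op_01", "error": "missed_logbook_entry"},
]

# Count how many distinct users made each type of error.
users_per_error = defaultdict(set)
for entry in error_log:
    users_per_error[entry["error"]].add(entry["user"])

SYSTEMIC_THRESHOLD = 3  # hypothetical: 3+ distinct users suggests a process flaw

for error, users in users_per_error.items():
    label = "likely systemic" if len(users) >= SYSTEMIC_THRESHOLD else "possibly individual"
    print(f"{error}: {len(users)} distinct user(s) -> {label}")
```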

Instead of just fixing errors after they happen, organizations should use the PDCA (Plan-Do-Check-Act) cycle to continually improve processes and reduce the probability of recurring failures.

The PLAN-DO-CHECK-ACT Approach

PLAN – Identify the Context and Potential Risks

  1. Identify the context of the process, including the competence of personnel, the user environment, complexity, and other influencing factors.
  2. Apply Failure Mode and Effects Analysis (FMEA) to predict where failures are likely to happen before they occur (a simple scoring sketch follows this list).
  3. Identify and involve representatives of the users throughout the development of the FMEAs and the process itself.
  4. When determining the controls and resources needed, assess the feasibility of implementing and providing them.
  5. Simplify procedures, redesign workflows, or introduce automation to eliminate failure points.
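To make step 2 concrete, the sketch below shows one common way an FMEA is scored: each failure mode is rated for severity, occurrence, and detection, and the product of the three (the Risk Priority Number, or RPN) is used to prioritize mitigation. The failure modes, ratings, and threshold are hypothetical examples, not values from this article.

```python
# A minimal FMEA prioritization sketch. The failure modes, 1-10 ratings, and
# the RPN action threshold are hypothetical examples, not values from the text.
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    severity: int     # 1 (negligible) to 10 (catastrophic)
    occurrence: int   # 1 (rare) to 10 (frequent)
    detection: int    # 1 (easily detected) to 10 (almost undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number = Severity x Occurrence x Detection
        return self.severity * self.occurrence * self.detection

failure_modes = [
    FailureMode("Operator enters wrong calibration value", severity=7, occurrence=6, detection=5),
    FailureMode("Safety interlock bypassed under time pressure", severity=9, occurrence=4, detection=6),
    FailureMode("Procedure step skipped due to unclear wording", severity=5, occurrence=7, detection=4),
]

RPN_THRESHOLD = 150  # hypothetical threshold for mandatory mitigation

for fm in sorted(failure_modes, key=lambda f: f.rpn, reverse=True):
    action = "mitigate before go-live" if fm.rpn >= RPN_THRESHOLD else "monitor"
    print(f"RPN {fm.rpn:>3} ({action}): {fm.description}")
```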

DO – Implement the Process and Improvements

  1. Implement the process and test it to check its effectiveness. In the initial stages, more frequent monitoring and measurement will be required; the frequency of checks can be reduced as the process matures.
  2. Provide user training and assess its effectiveness. When errors occur, retrain personnel, but only if training is truly the issue; don't use training as a Band-Aid for bad system design.
  3. Look beyond documented standard operating procedures. As an example: the company implements a visual step-by-step guide near machines to ensure operators follow a standard calibration process.

CHECK – Evaluate the Results

  1. Track performance data to see whether the changes have reduced errors (a minimal sketch follows this list).
  2. Get user feedback to ensure the new system is intuitive and efficient. For example, error rates drop by 40%, but operators still struggle with a specific step, prompting another refinement.
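As a minimal illustration of step 1, the sketch below compares error rates before and after a process change and tests the result against a reduction target. The counts and the 40% target are illustrative assumptions, not data from the article.

```python
# A minimal sketch of the CHECK step: compare error rates before and after a
# process change. The counts and the 40% reduction target are illustrative
# assumptions, not data from the article.
def error_rate(errors: int, opportunities: int) -> float:
    return errors / opportunities

baseline = error_rate(errors=48, opportunities=1200)  # before the change
current = error_rate(errors=27, opportunities=1150)   # after the change

reduction = (baseline - current) / baseline
TARGET_REDUCTION = 0.40  # e.g., the 40% drop mentioned in the example above

print(f"Baseline error rate: {baseline:.2%}")
print(f"Current error rate:  {current:.2%}")
print(f"Reduction achieved:  {reduction:.1%}")
print("Target met, standardize in ACT" if reduction >= TARGET_REDUCTION
      else "Refine the change and run PDCA again")
```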

ACT – Standardize & Scale

  1. If the improvement is successful, integrate it as the new standard process.
  2. Scale the change across other departments or sites where similar issues might exist. For example, the company implements the same calibration guide and training approach across all locations, preventing similar errors company-wide.

Conclusion: From Blame to Solutions

While human error is a reality, it's often a symptom of a deeper process flaw, not the root cause. Those conducting a root cause analysis or investigation must ask, "How did the system fail the individual?" and "Why did the system fail the individual?" By shifting from a blame mindset to a continual improvement approach, organizations can:

  • Reduce costly errors and downtime
  • Improve employee engagement (less frustration = higher productivity)
  • Enhance conformity and compliance
  • Increase process reliability and efficiency

Monitoring of the system must continue, because as the context changes, the controls implemented may no longer be as effective as before. A proactive system will not guarantee that things never go wrong. When they do, however, the key is to dig deeper. Tools like PDCA, FMEA, and RCA help identify long-term solutions to recurring problems. Because in most cases, fixing the system is better than blaming the human.

One-Off or Systemic: The Search for Root Causes

by Julius DeSilva

Accidents and failures, whether in maritime, aviation, healthcare, or nuclear settings, are often subjected to intense scrutiny to determine their root causes. However, the challenge lies in distinguishing whether an event is an anomaly or a symptom of a deeper systemic issue. This analysis is crucial as it directly influences the actions taken to prevent a recurrence or occurrence elsewhere. A management system approach, such as those outlined in ISO 45001 for occupational health and safety, ISO 9001 for quality management, or ISO 14001 for environmental management, provides a structured framework for systematically and proactively addressing risks when data exists.

Analysis of root causes: systemic failures

Root cause analysis is a fundamental investigative tool used to trace an incident to its origins. However, many organizations focus on immediate, apparent causes rather than examining systemic contributors and true root causes. Systemic failures result from weaknesses in policies, processes, or culture, and therefore, often recur in different forms over time.

The management system approach advocated by ISO standards and by industry-specific standards such as the ISM Code emphasizes continual improvement and risk-based thinking. The intent of these standards is to reduce the probability of systemic failures by integrating safety, quality, efficiency, security, and environmental management into everyday operations.

Systemic failure example: Chernobyl

I recently read the book Midnight in Chernobyl, which outlined the 1986 Chernobyl nuclear disaster and the underlying systemic failures that contributed to this incident. Unlike isolated accidents, Chernobyl resulted from a combination of design flaws, operational errors, and a deficient safety culture. Key systemic issues included:

  • Design flaws. The RBMK reactor used in Chernobyl had an inherent positive void coefficient, meaning an increase in steam production could accelerate the reaction uncontrollably.
  • Operational failures. A safety test was conducted under unsafe conditions, including a reduced power level and disengaged emergency shutdown mechanisms.
  • Cultural and regulatory gaps. A lack of safety culture, insufficient training (and thus competency), and an authoritarian management style discouraged the questioning of unsafe practices and bred complacency.

These root causes culminated in an explosion that released massive amounts of radioactive material; winds carried the fallout across borders throughout much of Europe. The systemic nature of the disaster was later addressed through international nuclear safety reforms, including the establishment of the International Atomic Energy Agency's safety standards and ISO frameworks such as ISO 19443, which outlines quality management system requirements for organizations working within the nuclear sector.

Other systemic failures

Deepwater Horizon oil spill (2010)

Another example of a systemic failure is the Deepwater Horizon oil spill. This incident was not merely the result of a single mistake but a consequence of systemic lapses in safety practices, regulatory oversight, and risk management. Contributing factors included:

  • Cultural deficiencies. The organization prioritized cost-cutting over risk mitigation.
  • Inadequate risk assessments. There was poor well-integrity testing and misinterpretation of pressure data.
  • Regulatory weaknesses. There was insufficient government oversight and a lack of stringent industrywide safety protocols.

This catastrophe led to significant regulatory changes, including the implementation of stricter safety and environmental policies within the oil and gas industry, aligned with ISO 45001 and ISO 14001.

The Boeing 737 MAX crashes (2018, 2019)

The Boeing 737 MAX crashes further illustrate systemic failure. Investigations revealed that flaws in the aircraft’s Maneuvering Characteristics Augmentation System (MCAS) were not adequately addressed due to:

  • Design and engineering oversights. Critical safety features were made optional rather than standard.
  • Regulatory gaps. The FAA relied excessively on Boeing’s self-certification.
  • Organizational pressures. The corporate culture emphasized speed-to-market delivery over comprehensive safety testing.

This resulted in significant regulatory reforms, including tighter oversight and compliance with international aviation safety standards.

Fixes vs. systemic longer-term improvement

Addressing failures can be approached through quick fixes or long-term systemic improvements. Each approach has its advantages and disadvantages:

Quick fixes

Pros:

  • Immediate resolution of pressing issues
  • Cost-effective in the short term
  • Prevents further damage or loss

Cons:

  • Does not address underlying systemic issues
  • Can lead to recurring problems if not supplemented with deeper analysis
  • Often reactive rather than proactive

Systemic longer-term improvements

Pros:

  • Addresses root causes, reducing the likelihood of recurrence
  • Enhances organizational resilience and safety culture
  • Aligns with ISO management systems, ensuring continuous improvement

Cons:

  • Requires significant time and resources
  • May face resistance from stakeholders due to cultural inertia
  • Implementation complexity can slow down immediate corrective actions

A balanced approach is often necessary—implementing short-term fixes to mitigate immediate risks while developing long-term systemic improvements to ensure sustainable safety and risk management practices.

What if we cannot foresee all risks?

Even with rigorous management systems and risk assessments, not all risks can be predicted. Organizations must be prepared to address unforeseen risks through:

  • Resilient systems. It is important to develop adaptable and robust safety management frameworks that can respond effectively to new threats.
  • Proactive learning. The organization can encourage a culture of continuous learning and scenario planning to anticipate emerging risks.
  • Redundancies and safeguards. Implementing fail-safe redundancies and contingency plans can mitigate the effects of unforeseen events.
  • Stakeholder collaboration. Engaging industry experts, regulators, and other stakeholders to share knowledge can help improve collective risk awareness.

Despite the lessons from Chernobyl, the Fukushima disaster occurred 25 years later. An earthquake of that magnitude was not foreseen as a risk, even though in 1896 (as highlighted by an engineer on the project) an earthquake of magnitude 8.5 had struck near the coast where the reactor was to be built. After Chernobyl, the 1970s-built reactor at Fukushima was not upgraded with the latest safety features due to high costs. Japan's nuclear industry also had a history of regulatory complacency and a reluctance to accept international recommendations.

ISO 31000, which addresses risk management, emphasizes the importance of resilience and adaptability in the face of unpredictable risks. By fostering a commitment to learning and preparedness across the organization, businesses can better navigate uncertainties while maintaining operational safety and efficiency.

The benefits of a management system approach

A management system approach, as defined by ISO standards, provides the following advantages:

  • Structured risk management. ISO 31000 ensures systematic identification, assessment, and mitigation of risks.
  • Continuous improvement. The Plan-Do-Check-Act (PDCA) cycle described in ISO 9001, ISO 45001, and ISO 14001 encourages learning from incidents to prevent recurrence.
  • Organizational culture change. Implementing ISO standards fosters a risk-oriented mindset, reducing the likelihood of systemic failures.

ISO management systems, when implemented and sustained, can act as a preventive tool to proactively manage risk.

Conclusion

Understanding whether an accident is an anomaly or a systemic failure is critical in determining the appropriate response. Sadly, at times industry must incur the cost of the nonconformity to learn the lesson. Organizational "can-do" attitudes lead to risk normalization, where dangerous conditions come to be seen as normal. Further, some organizational and demographic cultures do not encourage challenging authority or questioning decisions. An absence of accidents, incident reports, and near misses gives a false sense of security that things are working well. This may lead to overconfidence in decision making, lapses in regulatory oversight, and the deferral of resources to other "priorities."

Systemic failures indicate deeper vulnerabilities requiring long-term corrective actions. The application of ISO management systems offers a proactive and structured approach to accident prevention, ensuring that organizations move beyond reactive responses to fostering a culture of continuous improvement and risk management. By embracing these principles, industries can mitigate systemic risks, ensuring safer and more resilient operations.

Note – The above article was recently featured in Exemplar Global's publication 'The Auditor'.

Can We Trust AI? 

We see Artificial Intelligence (AI) in use all around us, in applications that are visible to us as well as in ones that are not directly visible. It is here to stay, and as we learn to live with it, there remains a concern about whether we can fully trust AI. Hollywood may have painted a picture of the rise of the machines that instills fear in some of us: fear of AI taking over jobs, of AI making human beings less intelligent, and of AI being used for illegal purposes. The idea is as old as 1909 and E.M. Forster's "The Machine Stops." In this article we discuss what actions organizations can take to build trust in AI so that it becomes an effective asset.

What does it mean to trust an AI system? 

For people to begin to trust AI, there must be sufficient transparency about what information the AI has access to, what its capabilities are, and what programming its outputs are based on. While I may not be a guru in AI systems, I have been following their development over the last seven to eight years and have delved into several types of AI. IBM has a helpful article that outlines the different types of AI. I recently tried to use ChatGPT to provide me with information and realized the information was outdated by at least a year. To better understand how we can trust AI, let us look at the factors that contribute to AI trust issues.

Factors Contributing to AI Trust Issues 

A key trust issue lies in the algorithm used within the neural network that delivers the outputs. Another key factor is the data that the outputs are based upon; knowing what data the AI is using is important to being able to trust the output. It is also important to know how well the algorithm was tested and validated prior to release. AI systems are run through a test data set to determine whether the neural network produces the desired results; the system is then tested on real-world data and refined (a minimal validation sketch follows below). AI systems may also carry biases introduced by the programming and the data set. Companies also face security and data privacy challenges when using AI applications. Additionally, as stated earlier, there remains the issue of AI being misused, much as cryptocurrency was in its initial phases.
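To illustrate the testing-and-validation point, the sketch below evaluates a stand-in model on a held-out test set before it would be approved for release. The data, the trivial model, the label noise, and the acceptance threshold are all hypothetical assumptions, not a description of any real AI system.

```python
# A minimal sketch of pre-release validation against a held-out test set, as
# described above. The "model" is a trivial stand-in, and the data, label
# noise, and acceptance threshold are hypothetical.
import random

random.seed(42)

def noisy_label(x: float) -> int:
    # Hypothetical ground truth with ~10% label noise.
    true_label = int(x > 0.5)
    return true_label if random.random() > 0.10 else 1 - true_label

data = [(x, noisy_label(x)) for x in (random.random() for _ in range(1000))]

random.shuffle(data)
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]  # hold out 20% for evaluation

def model(x: float) -> int:
    # Stand-in for a trained model; a real system would be fit on train_set.
    return int(x > 0.5)

accuracy = sum(model(x) == label for x, label in test_set) / len(test_set)

ACCEPTANCE_THRESHOLD = 0.95  # hypothetical release criterion
print(f"Held-out accuracy: {accuracy:.1%}")
print("Proceed to real-world pilot" if accuracy >= ACCEPTANCE_THRESHOLD
      else "Refine before release")
```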

What can companies do to improve trust in AI? 

While there is much for organizations to do to address the issues listed above, and it may take a few years to improve public trust in AI, companies developing and using AI systems can take a system-based approach to implementing them. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) recently published ISO/IEC 42001, a management system standard for artificial intelligence. The standard provides a process-based framework to identify and address AI risks effectively, with the commitment of personnel at all levels of the organization.

The standard follows the harmonized structure of other ISO management system requirement standards such as ISO 9001 and ISO 14001. It also outlines 10 control objectives and 38 controls. The controls, based on industry best practices, ask the organization to take a lifecycle approach to developing and implementing AI systems, including conducting an impact assessment, system design (including verification and validation), control of the quality of the data used, and processes for the responsible use of AI, to name a few. Perhaps one of the first things organizations can do to protect themselves is to develop an AI policy that outlines how AI is used within the ecosystem of their business operations.

Using a globally accepted standard can give customers confidence (and address trust issues) that the organization is using a process-based approach to responsibly perform its role with respect to AI systems.

To learn more about how QMII can support your journey should you decide to use ISO/IEC 42001, or to learn about our training options, contact our solutions team at 888-357-9001 or email us at info@qmii.com.  

-by Julius DeSilva, Senior Vice-President

Are Medical Audits Improving Systems Or Only Driving Fixes? 

Is there a potential downside to medical audits when the audits are focused on finding and fixing problems? A recent discussion with a medical professional piqued my interest in the value of medical audits, given that QMII, a subject matter expert in auditing, has ventured into the medical auditing field. This led to conversations with a few additional healthcare professionals to understand a little more about medical audits, their findings, and how organizations address them. My additional reading pointed to a lack of effective systemic corrective action. In this article, I discuss some aspects of the medical audit process and what organizations can do to improve both the audit process and the implementation of corrective action.

There are various types of medical audits, including clinical audits, billing/coding audits, financial audits, operational audits, and compliance audits. While there are regulations, protocols, and standards against which these audits are conducted, in many cases industry best practices are also used as audit criteria. This brings subjectivity into the audit, as "best practices" knowledge varies from auditor to auditor based on their experience. Auditing to an auditor's experience has a major drawback, not just in the medical industry but in all industries: it takes auditors away from the requirements, which results in biased and potentially inaccurate inputs to leadership. It also leaves the auditee (the organization being audited) on the receiving end of findings for which there are no defined requirements. That is, they may change their system based on the finding of one auditor, only to find that another auditor objects to the very actions implemented on the basis of the previous auditor's finding.

Medical Audits and Recommendations 

In medical audits, it is common practice for auditors to provide recommendations to address findings. These recommendations are based on experience and industry best practices. In ISO audits this is not allowed. In most industries, including healthcare, there is no obligation to act upon any of an auditor's recommendations. However, if auditors are perceived to be in a position of authority, there is an underlying implication that the recommendation must be implemented, for fear that the nonconformity will recur and someone will say, "the auditor told you what to do and no action was taken." It also implies that audits do not delve deeply enough to identify systemic weaknesses within the processes or the workflow.

In speaking with the medical professionals within my professional circle, it was surprising to hear that in many cases the personnel asked to address audit findings are unaware of root cause analysis methodologies and have not been given any formal training in the subject. Further, they are not clear about what a CAPA (corrective and preventive action) is, but they do know that they need to provide some action to close out the finding. In such cases, is it fair to expect effective corrective action? Perhaps the lack of effective corrective actions has perpetuated the need for auditor recommendations.

Without proper training, it is only natural for personnel responding to audit findings to default to the recommendations of the auditor and implement the prescribed actions as the corrective action in and of themselves. Sadly, in such cases the root cause of the issue goes unaddressed. Sometimes it may lie in inadequate resources or technology, or even a lack of guidance or policy from leaders. While the aim of the audits is to identify where the process may require additional controls, all in the interest of better healthcare for the patient, the outcome may be only a band-aid.

What can be done to change this? 

While change may not come overnight, there are a few key steps that can be taken to improve the overall audit process, right through to corrective action, and meet the end goal of providing better healthcare.

Auditor training – Auditors must be trained to remain objective throughout the audit process, to focus on the requirements (criteria) of their audit, and to focus on factual evidence and assess it objectively (yes, not on experience!). Further, they must understand the implications of providing recommendations and therefore refrain from providing any. Auditors should instead focus on assessing the effectiveness of the corrective action plan submitted and on verifying the effectiveness of the actions taken.

Root Cause Analysis Training – Healthcare organizations must invest in training their personnel in the different root cause analysis methodologies and in how to apply them to identify the root cause(s) of a problem (a minimal example follows).
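As a minimal example of one such methodology, the sketch below simply documents a "5 Whys" chain for an audit finding. The finding, the chain of answers, and the identified root cause are hypothetical illustrations only.

```python
# A minimal sketch of documenting a "5 Whys" chain, one common root cause
# analysis method. The finding and the chain of answers are hypothetical.
problem = "Audit finding: discharge summaries missing from patient records"

whys = [
    "Clinicians did not complete the summary before discharge.",
    "The discharge workflow allows a record to be closed without one.",
    "The EHR template does not flag the summary as a required field.",
    "The template was configured before the summary became mandatory.",
    "There is no process to review templates when requirements change.",
]

print(problem)
for i, answer in enumerate(whys, start=1):
    print(f"  Why #{i}: {answer}")

print(f"Candidate root cause (target of corrective action): {whys[-1]}")
```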

Reinforcing that recommendations need not be accepted or addressed – Organizations must build the professional courage to stand up to auditors and decline recommendations. Auditors do not know all facets of the process from the short sample of the organization they witness. If the "advice" in their recommendations is wrong or ineffective, who then pays the price?

Auditor Selection – ISO 19011 provides guidance on the behaviors and skills an auditor should exhibit, and these are applicable to an auditor selected to conduct any type of audit. Auditors must be evaluated periodically to ensure they remain objective throughout an audit and work to determine the effectiveness of controls and the adequacy of resources when assessing whether the overall objectives have been met. To learn more about how QMII can support your organization's audit process, contact our solutions team at 888-357-9001 or email us at info@qmii.com.

Julius DeSilva, Senior Vice-President