System Safety, Part 3: Identifying Overlooked Practices

Written by Sonia Bot and Tony Zenga

Editor’s Note: September is Safety Month in the North American railway industry. This month, Railway Age “recalls to active duty” the three-part series on System Safety by Sonia Bot and Tony Zenga, with accompanying podcasts. – William C. Vantuono

RAILWAY AGE, DECEMBER 2020 ISSUE: Industry 4.0 (also known as the Fourth Industrial Revolution) is a reality. Railroads, including their partners in the transportation supply chain, are at the beginning of their journey to establishing true end-to-end digital continuity. For example: Industrial Internet of Things (IIoT); Positive Train Control (PTC) and Enhanced Train Control (ETC); and AI (artificial intelligence)-based automation such as expanding autonomous inspection to include predictive analytics for track data. How do we know that these solutions and systems are safe and that there are no lurking issues? How do we know that the integration of multiple components from vendors, partners, and even from within meet safety objectives? How do we know if safety integrity is preserved after a change is made? How do we shift the paradigm where safety moves from a cost center to a value-added business driver?

In Part 1 (October 2020), we made the case for system safety as the necessary discipline for railroads to embed as they move forward in innovating and advancing in the 21st century.

In Part 2 (November 2020), we stepped through proven guiding principles, how they can be applied to embedding system safety, and resulting paradigm shifts; all with the goal of improving safety performance and opening up new opportunities for revenue streams.

In Part 3, we draw attention to three often neglected or not fully understood aspects of system safety practices. They are integration into the systems engineering lifecycle, designing for safety, and process-based safety performance management. By having these technical aspects in place, a system safety practice can effectively achieve its potential and influence the maturation of the organization’s safety culture.


System Safety Engineering comprises the processes used to prevent accidents by identifying and eliminating or controlling hazards. Hazards are system states or conditions that, together with a particular set of worst-case environmental conditions, will lead to unsafe circumstances.

We often find that system safety is disconnected from, or handled in isolation from, the systems engineering lifecycle. Safety ends up being treated as a postmortem or backward-looking assurance activity. The domino effect kicks in. Safety-related design flaws are found late and are very expensive to fix. Arguments arise over the validity of the design flaws, in an attempt to show that they don’t need fixing. When these arguments fall apart, the efforts to deal with safety design flaws are often expensive and the solutions are not highly effective. Redundancy (through processes, materials and software apps) is bolted on where an optimized design would not have required it. Far-from-ideal procedural mitigations are imposed on the operators, in the hope that safety improves if they are followed.

By integrating system safety practices into the systems engineering lifecycle, the costs of engineering for safety are considerably reduced, while safety effectiveness and outcomes improve. Naturally, rework (a form of waste) is decreased, which in turn compresses project schedules, lowers costs and lowers risks.

Figure 1: System safety is present at every stage of the systems engineering lifecycle.

Figure 1 shows where key system safety elements fit within the systems engineering lifecycle. These elements address the various levels of safety: component failures, subsystem hazards, functional hazards, operating- and support-related hazards, software anomalies, system safety, and system of systems safety. The lifecycle is driven and monitored by business and safety objectives. The system safety practices and deliverables fit neatly into system development and integration lifecycles (for example, V-model, Disciplined Agile, DevOps).

Preparation and approval of a system safety program plan is done at the start of a system safety program and is monitored and managed throughout the life of the program. It is a management document that describes the system safety objectives and program requirements. It provides regulators and managing or contracting agencies a basis of understanding on how the safety hazard management efforts will be integrated into the systems development or system integration process. There are seven components:

An Operating and Support Hazard Analysis is performed to identify hazards that may arise during operations of a system and to recommend risk reduction alternatives or constraints during all phases of tasks or operations to ensure safety-related risks are controlled or eliminated.

A Subsystem Hazard Analysis is used to identify design hazards in subsystems of a larger major system. The analysis evaluates functional failures or hazardous functions of the subsystem that may result in accidental loss.

A System Hazard Analysis examines the entire system for its state of safety. It integrates the essential output of the subsystem hazard analysis to identify safety weaknesses in the total system design by analyzing the system interfaces, including safety-critical human activities or omissions. Similarly, the system of systems hazard analysis examines the entire system of systems for its state of safety.

A Component Level Analysis provides a systematic evaluation of the components based on their failure modes and the effects those failure modes have at the component (local effect), subsystem (next higher), or system level (for example, train level). Each failure mode is categorized by the severity and criticality of its safety or service effect on the system being evaluated.
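To make the three effect levels concrete, here is a minimal sketch of how a failure mode worksheet could be recorded and filtered. The severity scale, field names and sample rows are illustrative assumptions, not taken from the white paper or from any particular standard; a real program would use the categories defined in its own system safety program plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a real program defines its own categories."""
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    local_effect: str        # effect at the component itself
    next_higher_effect: str  # effect at the subsystem level
    system_effect: str       # effect at the system (for example, train) level
    severity: Severity
    safety_affecting: bool   # False means service-affecting only


def safety_critical(modes, threshold=Severity.CRITICAL):
    """Return safety-affecting failure modes at or above the threshold, worst first."""
    selected = [m for m in modes if m.safety_affecting and m.severity >= threshold]
    return sorted(selected, key=lambda m: m.severity, reverse=True)


# Invented sample rows, for illustration only.
worksheet = [
    FailureMode("brake valve", "stuck closed", "no pressure release",
                "brake demand not applied", "extended stopping distance",
                Severity.CRITICAL, True),
    FailureMode("door chime", "silent", "no audible tone",
                "door warning degraded", "delayed boarding",
                Severity.MARGINAL, False),
    FailureMode("speed sensor", "no output", "speed signal lost",
                "speed supervision degraded", "overspeed not detected",
                Severity.CATASTROPHIC, True),
]

for fm in safety_critical(worksheet):
    print(f"{fm.component}: {fm.mode} -> {fm.system_effect} ({fm.severity.name})")
```

Keeping the local, next-higher and system effects as separate fields is what lets the same worksheet feed both the subsystem and the system hazard analyses.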

A Functional Hazard Analysis examines the system functions to identify potential functional failures or functional behavior, and classifies the hazards associated with specific functional outcomes of failure conditions. The Functional Hazard Analysis is developed early in the development process and is updated as new functions or failure conditions are identified. The Functional Hazard Analysis is an ongoing activity throughout the design development cycle.

Safety Verification provides the necessary evidence that hardware, software and procedures comply with the safety requirements. As a closed-loop activity, safety verification is performed by means of analyses, tests and demonstrations. When the safety test results show that the system behaves as specified, it can be permitted to enter revenue service, where it is monitored as changes take effect during the system’s operational lifecycle.

The Safety Case is a written demonstration of evidence and due diligence provided by an organization, such as a railroad, to demonstrate that it can operate within the risk safety margins defined in the system safety program plan.

Table 1 provides some examples of common gaps in the execution of system safety programs with consequences that lead to hazards and mishaps occurring. Intrinsically, these gaps also lead to system safety projects failing or requiring unsustainable overwork. By integrating system safety into the systems engineering lifecycle, these issues are mitigated.

Table 1: Examples of system safety program gaps that can serve as early indicators for hazards and mishaps to occur.


System safety is well poised to deliver superior safety solutions because of its risk-based strategy and systems engineering approach. By having a framework outlining the design order of precedence, as shown in Figure 2, more intentional safety solution design strategies are possible. Those tasked with implementing solutions for controlling and minimizing identified hazards work through each of the levels, with the most effective control measure at the top and least effective at the bottom. 

Figure 2: System Safety Design Order of Precedence.

Stepping through the levels of the control approaches, the best control possible is to eliminate the hazard completely. Here the hazard will no longer exist and cannot cause any harm. The second-best option is to reduce the risk by altering the design, such as substituting the hazard with a different one that carries less of the risk or none of it. The next option is to isolate the hazard by incorporating engineered features or devices, so that it is separated from the people and resources that it can harm or damage. This is followed by the option to engineer controls, that is, to make engineering changes to the hazardous situation that safeguard against the hazard, namely by providing warning mechanisms. The lowest and least effective level of hazard control deals with incorporating signage, procedures, training and personal protective equipment (PPE). This should be the defense of last resort; however, many see and use this as the first line of hazard mitigation.
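The order of precedence can be sketched as an ordered enumeration from which the most effective feasible control is selected. This is only an illustration: the level names paraphrase the five levels described above, and the feasibility sets are hypothetical inputs that a real program would derive from its hazard analyses and business criteria.

```python
from enum import IntEnum


class Control(IntEnum):
    """Paraphrased levels of the design order of precedence; lower value = more effective."""
    ELIMINATE = 1
    REDUCE_BY_DESIGN = 2
    ISOLATE = 3
    ENGINEERED_CONTROLS = 4
    PROCEDURES_TRAINING_PPE = 5


def best_feasible(feasible):
    """Pick the most effective control among those judged feasible."""
    if not feasible:
        raise ValueError("no feasible control identified for this hazard")
    return min(feasible)


def levels_passed_over(chosen):
    """More effective levels that were not adopted, so the tradeoff stays visible."""
    return [c for c in Control if c < chosen]


# Hypothetical feasibility assessment for an overspeed derailment hazard.
overspeed_feasible = {Control.ENGINEERED_CONTROLS, Control.PROCEDURES_TRAINING_PPE}
chosen = best_feasible(overspeed_feasible)
print(chosen.name)                                   # ENGINEERED_CONTROLS
print([c.name for c in levels_passed_over(chosen)])  # ['ELIMINATE', 'REDUCE_BY_DESIGN', 'ISOLATE']
```

Recording the levels that were passed over, and not only the control that was chosen, keeps the residual position on the hierarchy explicit for everyone who relies on the design.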

For example, let’s look at the hazard where train overspeed results in a derailment. At the lowest level of protection, track signage and speed-related bulletins and procedures are provided for the train crew. However, these are highly prone to human error. The introduction of PTC raises the safety protection by providing speed warnings and other engineered features such as stopping the train if the speed warnings are not properly addressed. As much as PTC reduces the overspeed derailment risk, it does not fully eliminate the overspeed derailment hazard. Imagine the full elimination of this hazard: a solution that prevents trains from overspeeding in the first place, with trains programmed in real time to travel at the specified speeds according to track, weather, consist and other relevant conditions.

Another example deals with electric trains docked at their maintenance shop where they are powered through a 750 VDC facility connector. The hazard is that the train can move unintentionally. One consequence is that it collides with personnel, causing bodily harm or fatalities. Providing signage, floor markings, procedures and training for personnel to stay out of the way of potential unintended train movement is the lowest level of protection. The next level of protection would include a dual visual-auditory warning whenever personnel cross hazardous boundaries. Higher levels of protection address the hazard at the source, such as tethering the train to prevent it from moving or reducing the power through electromechanical interlocks to safer levels to disable train propulsion.

The train overspeed derailment example shows a multi-generational and high investment approach to mitigating the hazard. Meanwhile, the electric train example is one that adds no cost to a program when addressed upfront in the assessment and definition stages of the systems engineering lifecycle; however, it becomes very costly if addressed later in the lifecycle.

A typical mistake that organizations make is in choosing a control method because its implementation is fast and easy, regardless of its required effectiveness. Rather, each layer of the hierarchy must be assessed on its own merit from broader business feasibility criteria. Sometimes controls need to be introduced in stages, such as in a dire emergency, adopting PPE and administrative procedural controls until a more permanent and safer solution is in place.

While the best approach to every hazard is to eliminate it completely, this may not always be possible or easy to do. Tradeoffs may be required. Tradeoffs should be selected by conscientious and systematic design. Regardless of tradeoffs, the hierarchy of hazard controls still lays out where the hazard control and its associated risk reside, so that organizations and the collective partners of the ecosystem are aware of where they truly stand.


Organizations often overcomplicate performance measurement, monitoring and control, even when it comes to safety. Often, KPIs (Key Performance Indicators), measures and metrics are confused with one another, and data collection is haphazard. Effort and resources are spent unnecessarily. The risks of misreading the data, gaming the system for vanity, and ultimately misleading the business are heightened. A process-based approach alleviates these issues.

KPIs must be defined top-down with traceability from one process level to the next through a KPI tree. Each KPI must be explicitly attached to a process step, so that it is clear where and when to take measurements, which provides a clear starting point for working through investigations and interventions. See Figure 3.

Figure 3: Relationship between Process Levels and Key Performance Indicators (KPIs).

KPIs are specifically designed to flag performance issues so that direct action can be taken. Measures and metrics do not have this feature. Instead, they are useful for investigative purposes after an issue is flagged by the KPI. KPIs, measures and metrics should be restricted to the vital few. If you have more than a handful for a process area or level, then it’s time to simplify.

KPIs are designed to specific financial, customer and safety quality requirements. After all, a company must be good at making money top-line and bottom-line (financial KPI), attracting and retaining customers (customer KPI), and operating safely (safety KPI).

To define effective KPIs, a process architecture must be in place. Often, the process architecture definition is missing, and the various levels of process are heaped together. This results in processes that are misaligned and poorly adaptable. Digitizing processes that are not properly designed is an exercise in futility. Instead, process areas, processes, subprocesses, and detailed process flows must all be tied together in a process hierarchy. Any supporting technology solution and system must also be systematically tied into the process, as well as the data flow. With this in place, the traceable KPI tree can be readily established from the top level through to the lower levels of the enterprise.
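As a rough sketch of what traceability through a KPI tree can look like, the structure below attaches each KPI to a process step and lets any lower-level KPI be traced back to the top. The class, the KPI names and the process steps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KpiNode:
    name: str
    process_step: str  # where and when the measurement is taken
    children: List["KpiNode"] = field(default_factory=list, repr=False)
    parent: Optional["KpiNode"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        """Attach a lower-level KPI and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def trace(self):
        """Names on the path from the top-level KPI down to this one."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Invented example tree: process area -> process -> subprocess.
top = KpiNode("Safe train operations", "Process area: train operations")
mid = top.add(KpiNode("Overspeed events", "Process: train movement"))
leaf = mid.add(KpiNode("Unacknowledged speed warnings", "Subprocess: in-cab alerting"))

print(" > ".join(leaf.trace()))
print(leaf.process_step)
```

When a top-level KPI flags an issue, walking down the tree points to the process steps where the contributing measurements are taken, which is the starting point for investigation.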

Many organizations define only outcome KPIs, where one needs to wait for the unfavorable outcome to occur before taking corrective action, which makes the response a reactive one. The proactive, more effective and lower-risk approach is to couple the outcome indicators with predictive ones, so that corrective action can take place well in advance and thwart the occurrence of the unfavorable outcome. It is imperative that the predictive indicators are statistically correlated with their respective outcome indicator; otherwise there is no proof that they are true and reliable predictive indicators.
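One simple way to check the statistical link between a predictive indicator and its outcome indicator is a lagged Pearson correlation, sketched below. The choice of Pearson's coefficient, the one-period lag and the sample numbers are assumptions made for illustration; a real analysis would also need enough data points and a significance test before an indicator is accepted as predictive.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def lagged_correlation(predictive, outcome, lag=1):
    """Correlate the predictive indicator in period t with the outcome in period t + lag."""
    if lag < 1:
        raise ValueError("lag must be at least 1 period")
    return pearson(predictive[:-lag], outcome[lag:])


# Synthetic monthly values, invented purely to exercise the functions.
predictive = [1, 2, 3, 4, 5]
outcome = [0, 2, 4, 6, 8]
print(round(lagged_correlation(predictive, outcome, lag=1), 3))  # 1.0
```

A coefficient near zero suggests the candidate indicator does not move with the later outcome and should not be relied on as a predictor; a strong coefficient is a necessary starting point, not proof of cause.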


Railroads are committed to the opportunities fueled by Industry 4.0 and are in the early stages of transformation. Industry 4.0 requires system safety; this is not an option. Safety is embedded in every aspect and function of the railroad.

By embracing a holistic system safety program, railroads can shift safety from a business cost center to a value-added business driver. A place to start is by focusing on often overlooked or poorly understood key practices such as integration into the systems engineering lifecycle, designing for safety, and process-based safety performance management.

Guiding principles that focus on rewarding an entrepreneurial culture, exercising business rigor and relevancy, forging productive partnerships, safeguarding end-to-end flow, and fostering a learning organization well position system safety as a mechanism for railroads to leap ahead. Railroads must move quickly, though, or they risk being left behind.

Listen to the Rail Group On Air Podcast: Interview with Sonia Bot and Tony Zenga on Safety Doesn’t Happen by Accident – System Safety Coming of Age, Part 3: Identifying Overlooked Practices.

This article is based on the novella-sized white paper, “System Safety as a Value-Added Business Driver: The Evolution of Railroading in the Eras of Technology and Innovation.” (Bot & Zenga, July 2020).

Sonia Bot, chief executive of The BOT Consulting Group Inc., has played key roles in the inception and delivery of several strategic businesses and transformations in technology, media and telecommunications companies worldwide. By utilizing methodologies in entrepreneurship, business precision, Lean Six Sigma, systems and process engineering, and organizational behavior, she’s enabled organizations to deliver breakthrough results along with providing them a foundation to continue to excel. She was instrumental in PTC implementation on CN’s U.S. lines. Her approaches on the evolution of railroading and transportation are game changers that drive innovation and competitive advantage for adopters in a changing industry. Sonia can be reached at i[email protected].

Tony Zenga, owner of CMTIGroup Inc., is an accomplished specialty engineering consultant with international experience in operational reliability and safety for mission critical systems. He has played key roles in the implementation of System Safety engineering programs for aerospace, defense, high tech, mass transit and rail infrastructure projects worldwide. By leveraging his design and development experience of large-scale safety-critical systems, combined with his systems engineering knowledge, he enables organizations to deploy their systems safely into field operation. As advisor to CN, he was instrumental in the development of the PTC system safety engineering safety case and the creation of the system safety engineering discipline. Tony can be reached at [email protected].
