Rethink our view towards single-points-of-failure.

Andrew Bissot
Feb 1, 2022

The new James Webb Space Telescope is said to have 344 single-points-of-failure (SPOFs): single components that could disrupt the intentions of the mission. That is 344 mechanisms, and if any one of them fails, the mission is compromised. This amount of stress is not for the faint of heart. Their functions do not have redundancies within the system; a failure either immediately compromises the mission or requires the implementation of recovery scenarios. Within manufacturing, how should we evaluate our SPOFs to determine our acceptable risk and condition our strategies to mitigate disruptions? To face the challenges of managing the risk that accompanies SPOFs, we should start by simplifying our thinking with two strategic views: a micro view and a macro view.

Those who are not worried or even terrified about this are not understanding what we are trying to do.

The first view is a micro view, which looks at the inner workings of an individual process, a microsystem. Consider a single router that enables internet connectivity within an underground mining operation. If this single router fails, all of the components that use this connectivity are immediately disconnected and cannot function. Another example closer to home is the fan motor in a house’s central air conditioner condenser. If the fan motor fails, the rest of the system immediately becomes incapable of cooling the house on a hot summer day.

Or consider a single chain of a hundred individual links, holding up an offline roll on a process line. If any single link were to fail, the chain fails, and gravity drops the once-secured roll onto the process line. These are examples of single-point failures that prevent the subsystem from performing its intended function.
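The chain example can be put into numbers with a minimal sketch. It assumes every link fails independently and has the same probability of surviving; the 0.999 per-link figure is a made-up illustration, not a measured value.

```python
def series_reliability(link_reliability: float, n_links: int) -> float:
    """Probability that all n independent links survive.

    A chain is a series system: it holds only if every link holds,
    so the individual survival probabilities multiply.
    """
    if not 0.0 <= link_reliability <= 1.0:
        raise ValueError("link_reliability must be between 0 and 1")
    if n_links < 0:
        raise ValueError("n_links must not be negative")
    return link_reliability ** n_links


# Hypothetical figure: each link is 99.9% likely to hold.
chain = series_reliability(0.999, 100)
print(f"Chain of 100 links holds with probability {chain:.3f}")  # 0.905
```

Even with very reliable links, the chain as a whole is noticeably weaker than any one of them, which is why every link in a series is a single point of failure.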

The second is the macro view, which evaluates a collection of microsystems that together produce a product or achieve an overarching goal. One example is the Death Star in Star Wars, which had one microsystem that affected the entire macrosystem. The Rebels exploited this microsystem in the Battle of Yavin, when Luke fired two proton torpedoes down an exhaust port that led to the main reactor. Another example is the Medellin Cartel and the death of Pablo Escobar. The cartel was at its pinnacle with Pablo Escobar in control, yet once he was removed, competing organizations seized new markets because there was no effective succession plan.

One last example is an organization whose 20-year patent expires, so that the value of the invention transitions into the public domain. This is particularly fierce in the medical sector for prescription drugs, where competitors enter at lower prices to capture the market as soon as a patent expires. These failures resonate across a collection of systems: when one is removed, the holistic intent is compromised.

Macro and micro SPOFs require different approaches to minimize the exposure from failure. For microsystems, many proactive activities monitor these SPOFs. Consider activities such as predictive or preventive maintenance, which are intended to reduce exposure to unplanned failures and the possibility of collateral damage. Properly conducting failure mode and effects analyses (FMEAs) allows the organization to identify the needed inspections or justify the insertion of redundancies within the circuitry.
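One common way to prioritize the findings of an FMEA is the Risk Priority Number (RPN), the product of severity, occurrence, and detection scores. The article does not prescribe a scoring method, so the sketch below is only an illustration: the `FailureMode` class, the 1 to 10 scales, and every score are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always caught) to 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical scores for the micro examples used in this article.
modes = [
    FailureMode("mine router failure", severity=9, occurrence=3, detection=4),
    FailureMode("condenser fan motor failure", severity=6, occurrence=4, detection=3),
    FailureMode("chain link fracture", severity=10, occurrence=2, detection=7),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

The items at the top of the ranking are the candidates for extra inspections or added redundancy; the rest may be acceptable as they are.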

From a macrosystem perspective, organizations act through things like scenario exercises, fire drills, or specific engineering functions focused on design for reliability. Organizations might consider activities like war games to detect points of weakness or places where redundancies already exist. These war games play out scenarios and what-ifs in preparation for the real event.

Redundancies in the macro view are not only tangible; they can also be strategic, as when they are applied within supply chains. Consider spares procurement strategies that have alternative suppliers approved for refurbishments. The saying "don't put all your eggs in one basket" applies here: the basket may fail and drop all of your eggs. Another example is succession planning in organizations where only one person is qualified to do a job, and if the job isn't done, the company cannot remain in business. These succession plans allow the organization to remain nimble in the event of any circumstance that disrupts the continuity of leadership.

Every battle is won before it is fought.

Back to the Death Star example. The Death Star had a micro SPOF that Emperor Palpatine did not fully recognize. What he did recognize was that his macro view was vulnerable: with only one Death Star to complete his mission, he had a macro SPOF, so a second Death Star was needed.

Regardless of whether we are looking at SPOFs in a micro view or a macro view, organizations should strive to reduce them only to the point of diminishing returns. Organizations cannot put mitigations in place for every scenario, and they cannot have a spare for everything. They have to accept an effective level of planning to accomplish the mission and then produce.
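The point of diminishing return can be illustrated with a rough sketch. It assumes identical redundant units that fail independently, a single known cost of failure, and a fixed cost per extra unit; all of the figures and function names are hypothetical.

```python
def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Probability that at least one of n independent redundant units works."""
    return 1.0 - (1.0 - unit_reliability) ** n_units


def units_worth_installing(unit_reliability: float,
                           failure_cost: float,
                           unit_cost: float,
                           max_units: int = 10) -> int:
    """Add redundant units until the next one costs more than the
    expected loss it would avoid."""
    n = 1
    while n < max_units:
        gained = (parallel_reliability(unit_reliability, n + 1)
                  - parallel_reliability(unit_reliability, n))
        if gained * failure_cost <= unit_cost:
            break  # the next spare no longer pays for itself
        n += 1
    return n


# Hypothetical numbers: 90% reliable unit, $1,000,000 failure, $5,000 per unit.
print(units_worth_installing(0.9, 1_000_000, 5_000))  # 3
```

With these numbers the second and third units each avoid more expected loss than they cost, while a fourth would avoid roughly $900 of expected loss for $5,000, so the sketch stops at three.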

We cannot perform preventive maintenance all of the time, and Emperor Palpatine most likely never had plans for a third Death Star. At some point, we have to perform with what we have planned for. Having zero SPOFs is not always the optimal solution. What separates good organizations from great ones is finding the intersection between the value and the cost of eliminating SPOFs. Mike Menzel, the lead mission systems engineer, noted this intersection for the James Webb Space Telescope when he stated,

“We found the sweet spot between getting the control that we want, with these large flexible membranes, without adding too many single points of failure.”