Fault tolerance is "the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure..." (1)
Fault tolerance is integral to systems engineering of critical systems. The assumption is that systems will fail and that the impacts of those failures must be controlled. An example approach would be to mandate that a system be two fault tolerant. This would then drive the design of three inhibits, any one of which could prevent the failure of the system, so that two of them can fail and the system still holds. For a physical system, the failure may be physical failure. For an information system, the failure will likely be loss of confidentiality, integrity, or availability of critical data or a critical business process.
Fault tolerance can and should be applied to infosec as well. When we plan our defenses, we should assume that some defenses will fail to mitigate the risks they were designed to mitigate. We should assume that some vulnerable conditions will be exploited. To plan infosec inhibits, we must understand the attack path between the causes of our risk (likely a threat) and where the consequences are realized. It is along these paths that we can then document inhibits. Whether it be a firewall, code whitelisting, requiring a written form before granting privileges, or detecting and responding to threats before consequences are realized, we can capture each inhibit along the attack path. Currently we refer to this as 'defense in depth'. We all have a general understanding of what that means, but the above approach provides a concrete and quantifiable method for implementing it.
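To make "quantifiable" concrete, here is a minimal sketch of counting inhibits along attack paths. The path names, the inhibits assigned to each path, and the function names are all hypothetical, and the sketch assumes every listed inhibit is independent and could stop the attack on its own. Real inhibits often share failure modes, so treat the numbers as an upper bound.

```python
# Hypothetical attack paths, each mapped to the inhibits documented along it.
# All names are illustrative only.
attack_paths = {
    "phishing -> workstation -> file server": [
        "email filtering",
        "code whitelisting",
        "detection and response",
    ],
    "internet -> web app -> database": [
        "firewall",
        "written privilege request",
    ],
}


def fault_tolerance(inhibits):
    """Inhibits that can fail while one still stands (-1 means undefended)."""
    return len(set(inhibits)) - 1


def system_fault_tolerance(paths):
    """The system is only as fault tolerant as its weakest attack path."""
    return min(fault_tolerance(inhibits) for inhibits in paths.values())


def paths_below_standard(paths, required):
    """Names of attack paths that do not meet the required fault tolerance."""
    return sorted(
        name
        for name, inhibits in paths.items()
        if fault_tolerance(inhibits) < required
    )


print(system_fault_tolerance(attack_paths))
print(paths_below_standard(attack_paths, required=2))
```

Under a two fault tolerant standard, the second path in this made-up example would be flagged because it has only two inhibits.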
A few months ago there was a significant discussion on whether pen test teams should have a 'zero day' card they could use during a pen test. What this is really asking is, "Should we test whether the system's backup defenses can be relied upon in case of a failure of primary defenses?" Or, in other words, "Is the system fault tolerant?" The answer is that we should absolutely include a 'zero day' card. Part of any pen test should be to articulate which defenses, inhibits, or mitigating conditions (whatever you call them) failed. However, if a system is supposed to implement 'defense in depth', then a 'zero day' card allows us to test that assertion by simulating the failure of a defense to see how well the system's remaining defenses handle it. The documentation simply needs to articulate that a failure was simulated to test the fault tolerance of the system.
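As a rough sketch of what playing a 'zero day' card looks like on paper, the snippet below marks one inhibit as failed and reports whether each attack path is still blocked by a remaining inhibit. The paths, inhibit names, and function names are hypothetical, and it assumes any single surviving inhibit is enough to block a path.

```python
def path_is_blocked(inhibits, failed):
    """A path stays blocked if at least one of its inhibits has not failed."""
    return any(name not in failed for name in inhibits)


def play_zero_day_card(paths, failed_inhibit):
    """Simulate one inhibit failing; report per path whether backups held."""
    return {
        name: path_is_blocked(inhibits, {failed_inhibit})
        for name, inhibits in paths.items()
    }


# Hypothetical attack paths and inhibits, for illustration only.
paths = {
    "internet -> web app -> database": ["firewall", "detection and response"],
    "internet -> vpn -> admin console": ["firewall"],
}

report = play_zero_day_card(paths, "firewall")
for name, held in report.items():
    outcome = "backup defenses held" if held else "consequence realized"
    # Record that the failure was simulated, not found, per the documentation point above.
    print(f"{name}: {outcome} (failure of 'firewall' was simulated)")
```

The second path has no backup behind the firewall, so the simulated failure exposes it as zero fault tolerant.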
Ultimately, we should plan for our information systems to be fault tolerant, even under purposeful information security attack. One fault tolerant, two fault tolerant, three fault tolerant... Regardless of the standard set, not planning for fault tolerance means assuming the system's defenses will function perfectly, which is an inherently flawed assumption.