Subjects covered

  • Defining failure
  • Causes of failure
  • Anticipating failure
  • Reliability analysis
  • Risk assessment
  • Risk and margin management
  • Decision making under uncertainty
  • Human factors
  • Fault and event tree analysis
  • Failure modes, effects and criticality analysis
  • Understanding the state of a system
  • Resilience and recovering from failure
  • Case study: Shuttle disasters
  • Case study: Deepwater Horizon
  • Quality
  • Lean and value engineering

Risk, Reliability, Resilience


Producing systems that deliver value to their stakeholders requires us to understand the ways in which technology may fail to perform as required. This module investigates in detail how we can estimate the probability and impact of system failure using techniques such as Failure Modes, Effects and Criticality Analysis (FMECA), Event Trees and Fault Trees.
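As a rough illustration of how a fault tree turns component-level probabilities into an estimate for a top-level failure, the sketch below combines basic events through AND and OR gates. It assumes the basic events are statistically independent, which real systems often violate (for example through common-cause failures). The event names and probability values are hypothetical and chosen only for illustration.

```python
from math import prod

def and_gate(probabilities):
    """Probability that ALL input events occur, assuming independence."""
    return prod(probabilities)

def or_gate(probabilities):
    """Probability that AT LEAST ONE input event occurs, assuming independence."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical basic-event probabilities per demand (illustrative values only).
p_pump_a_fails = 0.01
p_pump_b_fails = 0.01
p_power_lost = 0.001

# Hypothetical top event: cooling is lost if both redundant pumps fail
# OR the power supply is lost.
p_both_pumps_fail = and_gate([p_pump_a_fails, p_pump_b_fails])
p_loss_of_cooling = or_gate([p_both_pumps_fail, p_power_lost])

print(f"P(both pumps fail) = {p_both_pumps_fail:.6f}")
print(f"P(loss of cooling) = {p_loss_of_cooling:.6f}")
```

In this made-up example the redundant pumps contribute far less to the top event than the single power supply, which is the kind of insight a fault tree is meant to expose.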

It also discusses the concept of risk and variability in performance, and investigates ways in which we can anticipate failure by understanding both technological and human factors that may predispose a system to failure. Through the use of theory and in-depth case studies, we discuss how decisions are made and how system failure can result from poor individual or group decision making.

In this module we explore a range of interpretations of risk, differentiating between individual risk events (which might threaten a project's ability to deliver its objectives) and the overall level of exposure to risk along different dimensions (as shown in the figure below).

Figure: Risk Triangle


One measure of reliability is the mean time between failures (MTBF): the average elapsed time between successive failures of a system in operation. But how much reliability is enough, and how much can we afford to pay for?
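To make the MTBF idea concrete, the sketch below estimates it from a log of failure times and then converts it into a survival probability. The failure log is hypothetical. The reliability formula R(t) = exp(-t / MTBF) holds only under the simplifying assumption of a constant failure rate, which is an assumption introduced here and not something every system satisfies.

```python
from math import exp

def mean_time_between_failures(failure_times):
    """Average gap between successive failure times (same units as the input)."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def reliability(t, mtbf):
    """Probability of running for time t without failure.

    ASSUMPTION: constant failure rate (exponential lifetime model).
    """
    return exp(-t / mtbf)

# Hypothetical cumulative operating hours at which failures were logged.
failure_hours = [120.0, 400.0, 950.0, 1320.0]

mtbf = mean_time_between_failures(failure_hours)
print(f"MTBF = {mtbf:.1f} hours")
print(f"R(100 h) = {reliability(100.0, mtbf):.3f}")
```

Because the times are expressed as operating hours, repair downtime is excluded from the gaps. A log based on calendar time would need the downtime subtracted first.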

Our understanding of reliability and the likelihood of failure starts to develop as the system is built. A key element in the management and control of a system's development is the knowledge of the baseline configuration at any point within the process. The baseline refers to the configuration items (CIs): the parameters whose values will ultimately define the performance of the system.

Configuration management is a process concerned with the identification, control and traceability of these baselines. Effective configuration management ensures that the status of each item is fully understood. A series of tests, variously called verification, validation or acceptance tests, is carried out at numerous levels. Testing of the product, or of individual parts of it, can reveal faults that have to be corrected in a controlled way, and the configuration management process assists in this as well.

The content and purpose of these tests will have been defined beforehand in the earlier planning stages. Part of this planning will determine which aspects of the product are to be tested in what way – the test matrix. There is a balance to be found between the thoroughness (and therefore cost) of the testing activities and the desired quality of the product.


No matter how well designed a product is, failure of one sort or another is eventually inevitable. Resilience is a measure of a system's ability to recover from a failure and continue to offer some level of performance (possibly below the original level). For safety-critical technologies, the ability of a system to fail safely, or to have its performance restored as quickly as possible, is a major concern.

Page last modified on 09 dec 13 12:36