Reliability

Summary

Reliability in applications refers to their ability to function correctly even when things go wrong. This involves anticipating and managing faults, which can be hardware, software, or human errors. Hardware faults, such as disk failures or power outages, can often be mitigated through redundancy like RAID setups or backup generators. However, as systems scale, software faults become a greater concern, as bugs or unexpected interactions between components can lead to cascading failures. Human errors also pose risks, but systems can be designed to minimize their impact through testing, isolation, and clear monitoring. While reliability can sometimes be sacrificed for cost or speed in certain cases, ensuring dependable performance remains crucial, as even non-critical applications can face significant consequences from failures, including financial loss, reputational damage, or emotional distress for users.

Details

In short, reliability means that the application continues to work correctly, even when things go wrong.

Faults

The things that can go wrong are called faults. Systems that anticipate faults and can cope with them are called fault-tolerant or resilient. It only makes sense to talk about tolerating certain types of faults.

A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec. A failure is when the system as a whole stops providing the required service to the user.

It's usually best to design fault-tolerance mechanisms that prevent faults from causing failures.

Hardware faults

Hard disks crash
RAM becomes faulty
The power grid has a blackout
Someone unplugs the wrong network cable

Hard disks have a mean time to failure (MTTF) of about 10 to 50 years.
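To see what that MTTF range means at scale, here is a rough back-of-the-envelope calculation. It assumes disks fail independently at a constant rate of 1/MTTF, and the cluster size of 10,000 disks is an illustrative figure, not a number from these notes.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average disk failures per day, assuming each disk fails
    independently at a constant rate of 1/MTTF."""
    days_per_year = 365.0
    return num_disks / (mttf_years * days_per_year)


# Illustrative cluster size (an assumption for this sketch).
for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: about {rate:.2f} failures per day")
```

Under these assumptions a cluster of that size sees somewhere between one failure every two days and a few failures per day, which is why hardware faults become routine once there are many machines.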

Possible solutions:

Disks may be set up in a RAID configuration
Servers may have dual power supplies and hot-swappable CPUs
Datacenters may have batteries and diesel generators for backup power.

When one component dies, the redundant component can take its place while the broken component is replaced.
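The same idea can be sketched in software: a fault in one replica does not have to become a failure of the whole read. The function and replica names below are hypothetical, and treating `OSError` as the fault signal is an assumption made for the sketch.

```python
from typing import Callable, Optional, Sequence


def read_with_failover(replicas: Sequence[Callable[[], bytes]]) -> bytes:
    """Try each redundant replica in order; return the first successful read."""
    last_error: Optional[Exception] = None
    for read in replicas:
        try:
            return read()
        except OSError as err:
            # A fault in one component: remember it and try the next replica.
            last_error = err
    # Every replica faulted, so the fault has become a failure.
    raise RuntimeError("all replicas failed") from last_error


# Hypothetical replicas used only to exercise the function.
def broken_disk() -> bytes:
    raise OSError("disk crashed")


def healthy_disk() -> bytes:
    return b"data"


print(read_with_failover([broken_disk, healthy_disk]))  # b'data'
```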

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare, as long as you could restore a backup. However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the number of hardware faults. There is a move towards systems that can tolerate the loss of entire machines by using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.

Such systems have operational advantages. A single-server system requires planned downtime if you need to reboot the machine. A system that can tolerate machine failure can be patched one node at a time without downtime of the entire system.
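A minimal sketch of patching one node at a time is shown below. The `patch` and `is_healthy` callbacks and the in-memory `versions` table are hypothetical stand-ins for whatever deployment tooling is actually in use.

```python
from typing import Callable, Dict, List


def rolling_patch(
    nodes: List[str],
    patch: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Patch nodes one at a time; halt at the first node that comes back unhealthy."""
    patched: List[str] = []
    for node in nodes:
        # Only this node is out of service; the others keep serving traffic.
        patch(node)
        if not is_healthy(node):
            raise RuntimeError(f"{node} unhealthy after patch; halting rollout")
        patched.append(node)
    return patched


# Hypothetical in-memory cluster: node name -> software version.
versions: Dict[str, int] = {"node-a": 1, "node-b": 1, "node-c": 1}


def patch(node: str) -> None:
    versions[node] = 2


def is_healthy(node: str) -> bool:
    return versions[node] == 2


print(rolling_patch(list(versions), patch, is_healthy))
```

Halting on the first unhealthy node keeps the remaining nodes on the old version, so a bad patch takes out at most one machine.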

Software errors

Systematic errors are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.

Examples could be:

A software bug that causes every instance of an application server to crash when given a particular bad input.
A runaway process that uses up some shared resource like CPU time, memory, disk space or network bandwidth.
A service that the system depends on that slows down, becomes unresponsive or starts returning corrupted responses.
Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. There is no quick solution to the problem of systematic faults in software. Lots of small things can help:

Carefully thinking about assumptions and interactions in the system
Thorough testing
Process isolation
Allowing processes to crash and restart
Measuring, monitoring, and analyzing system behaviour in production.
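As one illustration of allowing processes to crash and restart, here is a small supervisor loop. It is a sketch under simplifying assumptions: the task runs in the same Python process, whereas real isolation would use separate OS processes or a process manager, and `supervise`, `max_restarts`, and `backoff_seconds` are names invented for this example.

```python
import time
from typing import Callable


def supervise(
    task: Callable[[], None],
    max_restarts: int = 3,
    backoff_seconds: float = 0.0,
) -> int:
    """Run task, restarting it after a crash. Returns the number of restarts used."""
    restarts = 0
    while True:
        try:
            task()
            return restarts
        except Exception as err:
            if restarts >= max_restarts:
                # Restarting is not helping; surface the problem instead of looping forever.
                raise RuntimeError("giving up after repeated crashes") from err
            restarts += 1
            # Wait a little longer after each crash before restarting.
            time.sleep(backoff_seconds * restarts)


# Hypothetical worker that crashes on its first two runs, then succeeds.
attempts = {"count": 0}


def flaky_worker() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("unexpected input")


print(supervise(flaky_worker))  # 2
```

The restart limit matters: a bug triggered by every input would otherwise restart forever, hiding a systematic fault behind a busy loop.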

Human errors

Even when they have the best intentions, humans are known to be unreliable. How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures.) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.
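The gradual rollout mentioned above can be sketched with deterministic bucketing, so that a given user consistently sees either the old or the new code. This is one possible approach, not a prescribed one; `in_rollout` and the choice of 100 buckets are assumptions for the example.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Hash the user into one of 100 buckets and enable the new code
    for buckets below `percent`."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Roll out to roughly 5% of a hypothetical user population.
users = [f"user-{i}" for i in range(10_000)]
enabled = sum(in_rollout(u, 5) for u in users)
print(f"{enabled} of {len(users)} users see the new code")
```

Because the bucket depends only on the user ID, raising `percent` adds users without removing anyone already enabled, and setting it back to 0 acts as a quick rollback.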

How important is reliability?

There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.

Importance of Reliability: Not limited to critical systems like nuclear power stations or air traffic control software; even everyday applications need to be reliable.

Consequences of Bugs:

Business applications: Lead to lost productivity and potential legal risks if data is incorrect.
Ecommerce sites: Can result in significant financial losses and reputational damage.

Responsibility to Users:

Even in noncritical applications, reliability matters.
Example: If a parent loses photos and videos of their children due to database corruption in a photo app, the emotional impact can be significant. They may not know how to recover the data from backups.
