Understanding Fault Tolerance in Web Applications

2025/04/06

Why Web Application Fault Tolerance Matters

To keep applications working, the internet relies on many components such as DNS servers, web servers, load balancers, routers, and physical connections. These parts can, and often do, fail. To understand how applications stay available despite these failures, we can think about fault tolerance as operating at different layers, from the application itself (Layer 3), through the supporting infrastructure of load balancers and DNS (Layer 2), down to the underlying network plumbing (Layer 1).

Let’s explore how fault tolerance works within each of these layers.

Layer 3: Handling Web Application Failures

web_server_failure.excalidraw.svg

We generally run multiple interchangeable web servers behind load balancers. If a web server fails, the load balancer detects the failure (using health checks) and routes subsequent requests to the remaining healthy servers.
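The health-check behaviour described above can be sketched roughly as follows. This is a minimal illustration, not how any particular load balancer is implemented: the class name, the backend names, and the `is_healthy` probe are all hypothetical, and a real balancer would run its checks periodically in the background (for example with an HTTP request to a health endpoint).

```python
from typing import Callable


class RoundRobinBalancer:
    """Toy load balancer: rotates over backends that passed the last health check."""

    def __init__(self, backends: list[str], is_healthy: Callable[[str], bool]):
        self.backends = backends
        self.is_healthy = is_healthy  # hypothetical probe supplied by the caller
        self.healthy = list(backends)
        self._next = 0

    def run_health_checks(self) -> None:
        # Keep only the backends whose probe currently succeeds.
        self.healthy = [b for b in self.backends if self.is_healthy(b)]

    def pick(self) -> str:
        # Round-robin over the healthy set only.
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        backend = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return backend


# Simulate "web-2" failing: its probe returns False.
down = {"web-2"}
lb = RoundRobinBalancer(["web-1", "web-2", "web-3"], lambda b: b not in down)
lb.run_health_checks()
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Once the failed server is dropped from the healthy set, no request is sent to it until a later health check succeeds again.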

Layer 2: Infrastructure Fault Tolerance (Load Balancers and DNS)

Handling Load Balancer Failures

load_balancer_failure.excalidraw.svg

Using just one load balancer creates a single point of failure. Reliable services often use DNS load balancing and have multiple load balancers behind DNS (e.g., github.com). DNS service providers like Cloudflare and AWS Route 53 run their own health checks on these load balancers. If a load balancer fails, the DNS service finds out and stops returning its IP address in response to DNS queries.

However, clients may not immediately recover, as DNS responses can be cached locally. If a client’s cache has the IP of a load balancer that just failed, it might keep trying to reach it until the cache expires. To minimize this problem, multiple techniques are used:

  1. DNS servers return multiple IP addresses during resolution, and the client can try the other IP addresses if the first one fails.
  2. DNS records use short TTLs (Time-To-Live). This forces clients to perform DNS resolution more frequently, reducing the chance of using a bad IP for too long.
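Both techniques can be illustrated with a small simulation. This is only a sketch under stated assumptions: `CachingResolver`, `connect_with_fallback`, the fake lookup function, and the 30-second TTL are all hypothetical, the IP addresses come from the documentation range 192.0.2.0/24, and no real DNS queries or network connections are made.

```python
import time
from typing import Callable


class CachingResolver:
    """Toy client-side DNS cache that honours the record's TTL."""

    def __init__(self, lookup: Callable[[str], tuple[list[str], float]],
                 clock: Callable[[], float] = time.monotonic):
        self.lookup = lookup  # hypothetical: name -> (list of IPs, TTL in seconds)
        self.clock = clock
        self.cache: dict[str, tuple[list[str], float]] = {}

    def resolve(self, name: str) -> list[str]:
        now = self.clock()
        entry = self.cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # cached answer has not expired yet
        ips, ttl = self.lookup(name)
        self.cache[name] = (ips, now + ttl)
        return ips


def connect_with_fallback(ips: list[str], try_connect: Callable[[str], bool]) -> str:
    """Technique 1: try each returned address in order until one works."""
    for ip in ips:
        if try_connect(ip):
            return ip
    raise ConnectionError("all addresses failed")


# Simulated DNS data, clock, and a load balancer that has just failed.
now = [0.0]
records = {"example.com": (["192.0.2.10", "192.0.2.20"], 30)}
lookups = []

def fake_lookup(name):
    lookups.append(name)
    return records[name]

resolver = CachingResolver(fake_lookup, clock=lambda: now[0])
dead = {"192.0.2.10"}

chosen = connect_with_fallback(resolver.resolve("example.com"),
                               lambda ip: ip not in dead)

# The DNS provider's health check removes the dead IP from its answers.
records["example.com"] = (["192.0.2.20"], 30)
cached = resolver.resolve("example.com")  # still within TTL: stale answer
now[0] = 31.0                             # technique 2: the short TTL expires
fresh = resolver.resolve("example.com")   # re-resolved: dead IP is gone
print(chosen, cached, fresh)
```

The client reaches the second address even while its cache is stale, and the short TTL bounds how long the failed address keeps appearing at all.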

Alternatively, instead of relying solely on DNS load balancing with multiple IPs, load balancers themselves might use Anycast IPs behind a single DNS entry. We will discuss the Anycast approach next.

Keeping DNS Servers Working (Anycast)

The DNS system itself needs to be reliable. This includes the authoritative DNS servers for your application (the servers responsible for example.com), the root servers and TLD servers (the latter serve domains like .com), and the DNS resolvers in between. What if one of these DNS servers fails? The solution is often Anycast IP addressing.

dns_server_failure.excalidraw.svg

Anycast is a clever technique where many servers in different locations around the world all share the exact same IP address. They announce this shared IP address using BGP (Border Gateway Protocol). Internet routers see multiple paths to this Anycast IP. They usually send your request, like a DNS query, to the server that seems “closest” based on network topology (typically the shortest AS path under each network’s routing policy, which often, but not always, also means the lowest latency).

If one of these Anycast servers fails or becomes unreachable, it stops announcing the shared IP, or the path to it becomes less desirable. BGP routers automatically detect this change. They then reroute subsequent requests for that same IP address to the next best available server. This failover usually happens seamlessly, without the user noticing.
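A toy model of this failover, from the point of view of a single router, might look like the following. It is a deliberate simplification: the `AnycastRouter` class, the site names, and the path lengths are hypothetical, and real BGP best-path selection considers several attributes and local policy rather than path length alone.

```python
class AnycastRouter:
    """Toy router view of one anycast prefix announced from several sites."""

    def __init__(self) -> None:
        self.routes: dict[str, int] = {}  # site name -> AS-path length

    def announce(self, site: str, path_length: int) -> None:
        self.routes[site] = path_length

    def withdraw(self, site: str) -> None:
        # A failed site stops announcing the shared IP.
        self.routes.pop(site, None)

    def next_hop(self) -> str:
        if not self.routes:
            raise RuntimeError("no route to the anycast prefix")
        # Prefer the shortest path; break ties by name for determinism.
        return min(self.routes, key=lambda site: (self.routes[site], site))


router = AnycastRouter()
router.announce("frankfurt", 2)
router.announce("london", 3)
router.announce("singapore", 5)

before = router.next_hop()
router.withdraw("frankfurt")  # the closest site fails
after = router.next_hop()
print(before, "->", after)    # frankfurt -> london
```

The destination IP address never changes from the client's perspective; only the router's choice of site does.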

Anycast is widely used for DNS infrastructure at each layer, including DNS resolvers (like Cloudflare’s 1.1.1.1 or Google’s 8.8.8.8), authoritative domain servers (like those managed by AWS Route 53), and even the root servers, which have over 1900 instances globally behind just 13 named servers (a.root-servers.net through m.root-servers.net), each with its own IPv4 and IPv6 address (https://www.iana.org/domains/root/servers, https://root-servers.org/).

Large networks can also use Anycast at their edge and then route traffic internally to specific healthy machines without advertising any change externally via BGP.

Layer 1: Network Layer Failures (BGP)

Finally, how does traffic reliably travel between the different major networks (like ISPs and large tech companies, called Autonomous Systems or ASes) that make up the internet? This is managed by BGP (Border Gateway Protocol).

bgp_router_failure.excalidraw.svg

BGP’s resilience comes from path redundancy. Most networks connect to multiple other networks using various physical links. BGP routers learn and track the possible paths to destinations across the internet. If one path fails (maybe a cable is cut or a router goes down), BGP routers detect the problem. They then recalculate and choose the best alternative path available, so traffic automatically flows around the failed section. While BGP makes the network resilient, it’s still possible for all paths between two points to fail simultaneously, which would cause a temporary outage for that specific connection.
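The idea of routing around a failed link can be sketched with a small graph search. Note that this is not the BGP algorithm itself: BGP is a policy-driven path-vector protocol in which each router only sees what its neighbours advertise, whereas the hypothetical `shortest_as_path` function below runs a plain breadth-first search with full knowledge of the topology. The AS numbers are taken from the range reserved for documentation.

```python
from collections import deque


def shortest_as_path(links, src, dst):
    """Breadth-first search for the path with the fewest AS hops, or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # every path between src and dst is down


links = [
    ("AS64500", "AS64501"), ("AS64501", "AS64503"),  # short route
    ("AS64500", "AS64502"), ("AS64502", "AS64504"),  # longer backup route
    ("AS64504", "AS64503"),
]

primary = shortest_as_path(links, "AS64500", "AS64503")

# A cable cut takes down the AS64501 <-> AS64503 link.
after_cut = [l for l in links if l != ("AS64501", "AS64503")]
backup = shortest_as_path(after_cut, "AS64500", "AS64503")

# A second failure removes the only remaining route.
after_two = [l for l in after_cut if l != ("AS64502", "AS64504")]
outage = shortest_as_path(after_two, "AS64500", "AS64503")

print(primary, backup, outage)
```

The third call returns `None`, which corresponds to the case where all paths between two points fail at once.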

Summary: Fault Tolerance Comes in Layers

Fault tolerance is built into each layer of the internet. Each layer often includes health checks and mechanisms to handle local failures, contributing to the reliable online experience we depend on.

Further things to explore