Why Web Application Fault Tolerance Matters
To keep applications working, the internet relies on many components: DNS servers, web servers, load balancers, routers, and the physical connections between them. These parts can, and often do, fail. To understand how applications stay available despite these failures, we can think of fault tolerance operating at different layers, starting from the application itself and going down to the underlying network plumbing:
- Application Layer (Layer 3): This is where the application’s code runs on web servers. Fault tolerance at this layer focuses on ensuring the application remains functional even if individual servers encounter issues.
- Infrastructure Layer (Layer 2): This layer handles directing users to the right application servers. It includes critical components like Load Balancers (which distribute traffic) and DNS (which translates domain names to IP addresses).
- Network Layer (Layer 1): Underlying everything are the physical connections and routing protocols (like BGP) that allow data packets to travel across the internet. This layer ensures traffic can find alternative routes if primary paths become unavailable.
(Note: don’t confuse these layers with the formal OSI model layers.)
Let’s explore how fault tolerance works within each of these layers.
Layer 3: Handling Web Application Failures
We generally run multiple interchangeable web servers behind a load balancer. If a web server fails, the load balancer detects the failure through health checks and routes subsequent requests to the remaining healthy servers.
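To make this concrete, here is a minimal sketch of the two moving parts, assuming a hypothetical pool of backends and a /healthz endpoint (both invented for illustration): a periodic health check that marks servers up or down, and a picker that only hands out healthy servers. Real load balancers implement far more sophisticated versions of both.

```python
import itertools
import urllib.request

# Hypothetical backend pool; the /healthz path is a common convention, not a standard.
BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080", "http://10.0.0.13:8080"]
healthy = set(BACKENDS)

def run_health_checks(timeout=2):
    """Probe each backend; mark it unhealthy if the check fails or returns a non-200 status."""
    for backend in BACKENDS:
        try:
            with urllib.request.urlopen(f"{backend}/healthz", timeout=timeout) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False  # connection refused, timeout, or HTTP error
        (healthy.add if ok else healthy.discard)(backend)

_rr = itertools.cycle(BACKENDS)

def pick_backend():
    """Round-robin over the pool, skipping backends that failed their last check."""
    for _ in range(len(BACKENDS)):
        candidate = next(_rr)
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy backends available")

# In a real load balancer, run_health_checks() runs on a timer in the background
# while pick_backend() is called for every incoming request.
```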
Layer 2: Infrastructure Fault Tolerance (Load Balancers and DNS)
Handling Load Balancer Failures
Using just one load balancer creates a single point of failure. Reliable services therefore use DNS load balancing and place multiple load balancers behind a single DNS name (e.g., github.com). DNS providers like Cloudflare and AWS Route 53 run their own health checks against these load balancers. If a load balancer fails, the DNS service detects it and stops returning that load balancer's IP address in response to DNS queries.
However, clients may not recover immediately, because DNS responses can be cached locally. If a client's cache holds the IP of a load balancer that just failed, the client might keep trying to reach that address until the cache entry expires. Several techniques minimize this problem:
- DNS servers return multiple IP addresses during resolution, so the client can try another address if the first one fails (see the client-side sketch after this list).
- DNS records use short TTLs (Time-To-Live). This forces clients to perform DNS resolution more frequently, reducing the chance of using a bad IP for too long.
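Here is a minimal client-side sketch of the first technique, using only the Python standard library: resolve every address for a hostname and try each one until a TCP connection succeeds. Real clients (browsers, HTTP libraries) do something more refined, such as Happy Eyeballs, but the idea is the same.

```python
import socket

def connect_with_failover(host, port=443, timeout=3):
    """Resolve all addresses for host and try each one until a TCP connection succeeds."""
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            print(f"connected to {sockaddr[0]}")
            return sock
        except OSError as exc:
            last_error = exc  # this address (e.g., a failed load balancer) didn't answer; try the next one
    raise ConnectionError(f"all addresses for {host} failed") from last_error

# Example: sock = connect_with_failover("github.com")
```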
Alternatively, instead of relying solely on DNS load balancing with multiple IPs, load balancers themselves might use Anycast IPs behind a single DNS entry. We will discuss the Anycast approach next.
Keeping DNS Servers Working (Anycast)
The DNS system itself needs to be reliable. This includes the authoritative servers for your application's domain (e.g., example.com), the root and TLD servers (which delegate domains like .com), and the recursive resolvers in between. What if one of these DNS servers fails? The solution is often Anycast IP addressing.
Anycast is a clever technique where many servers in different locations around the world all share the exact same IP address. They announce this shared IP address using BGP (Border Gateway Protocol). Internet routers see multiple paths to this Anycast IP. They usually send your request, like a DNS query, to the server that seems “closest” based on network topology (meaning the fewest network hops or lowest latency/cost).
If one of these Anycast servers fails or becomes unreachable, it stops announcing the shared IP, or the path to it becomes less desirable. BGP routers automatically detect this change. They then reroute subsequent requests for that same IP address to the next best available server. This failover usually happens seamlessly, without the user noticing.
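A toy model can build intuition for this failover, with the caveat that it is a heavy simplification of BGP best-path selection, not real router behavior; the site names and path costs below are invented:

```python
# Hypothetical anycast sites announcing the same prefix (e.g., a DNS service IP),
# with an invented "path cost" standing in for BGP's path-selection criteria.
announcements = {
    "site-frankfurt": 2,
    "site-ashburn": 5,
    "site-singapore": 7,
}

def best_site(active):
    """Pick the reachable site with the lowest path cost, like a router choosing the 'closest' origin."""
    if not active:
        raise RuntimeError("prefix unreachable: no site is announcing it")
    return min(active, key=active.get)

active = dict(announcements)
print(best_site(active))     # site-frankfurt: traffic goes to the nearest announcing site

# The Frankfurt site fails and withdraws its BGP announcement...
del active["site-frankfurt"]
print(best_site(active))     # site-ashburn: routers fail over, clients keep using the same IP
```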
Anycast is widely used at every level of the DNS infrastructure, including recursive resolvers (like Cloudflare's 1.1.1.1 or Google's 8.8.8.8), authoritative servers (like those managed by AWS Route 53), and even the root servers, which have over 1,900 instances worldwide sharing just 13 core IP addresses (https://www.iana.org/domains/root/servers, https://root-servers.org/).
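You can observe anycast directly by asking a resolver which physical instance answered you. Many anycast DNS operators respond to the conventional id.server CHAOS-class TXT query with an identifier for the serving site; the sketch below uses the third-party dnspython library (pip install dnspython) and assumes the resolver you query supports this convention:

```python
import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

def which_instance(resolver_ip="1.1.1.1"):
    """Ask an anycast resolver to identify the specific instance that answered this query."""
    query = dns.message.make_query("id.server.", dns.rdatatype.TXT, rdclass=dns.rdataclass.CH)
    response = dns.query.udp(query, resolver_ip, timeout=2)
    for rrset in response.answer:
        print(rrset)  # typically a site/PoP identifier

which_instance()
```

Running the same query from different parts of the world typically prints different site identifiers, even though the destination IP never changes.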
Large networks can also use Anycast at their edge and then route traffic internally to specific healthy machines without advertising any change externally via BGP.
Layer 1: Network Layer Failures (BGP)
Finally, how does traffic reliably travel between the different major networks (like ISPs and large tech companies, called Autonomous Systems or ASes) that make up the internet? This is managed by BGP (Border Gateway Protocol).
BGP’s resilience comes from path redundancy. Most networks connect to multiple other networks using various physical links. BGP routers learn and track all these possible paths to destinations across the internet. If one path fails (maybe a cable is cut or a router goes down), BGP routers detect the problem. They then recalculate and choose the best alternative path available, and traffic automatically flows around the failed section. While BGP makes the network resilient, it is possible for all paths between two points to fail simultaneously, which would cause a temporary outage for that specific connection.
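As a rough sketch of the outcome (BGP is a distributed path-vector protocol, so no single router computes routes centrally like this; the topology below is invented): routers keep track of alternative routes and fall back to the next-best path when a link disappears.

```python
from collections import deque

# Toy topology of autonomous systems; edges are the links between them (illustrative only).
links = {
    "AS-A": {"AS-B", "AS-C"},
    "AS-B": {"AS-A", "AS-D"},
    "AS-C": {"AS-A", "AS-D"},
    "AS-D": {"AS-B", "AS-C"},
}

def shortest_path(links, src, dst):
    """Breadth-first search: the fewest-hop path currently available between two networks."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in links[path[-1]] - seen:
            seen.add(neighbor)
            queue.append(path + [neighbor])
    return None  # no path left: that connection is down until a link recovers

print(shortest_path(links, "AS-A", "AS-D"))   # e.g., ['AS-A', 'AS-B', 'AS-D']

# The A-B link fails (cable cut); routers drop it and recompute...
links["AS-A"].discard("AS-B"); links["AS-B"].discard("AS-A")
print(shortest_path(links, "AS-A", "AS-D"))   # traffic reroutes via AS-C
```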
Summary: Fault Tolerance Comes in Layers
Fault tolerance is built into each layer of the internet. Each layer often includes health checks and mechanisms to handle local failures, contributing to the reliable online experience we depend on.
Further things to explore
- Health checks themselves are a difficult problem due to unbounded network delays, slow machines, and partial failures.
- How does IP advertisement work? Can any node advertise any IP and cause disruption to internet infrastructure?
- One of the most difficult parts of keeping a web application fault tolerant is ensuring that stateful components like databases remain performant and resilient to failure.