digest

to update

1. Why coordination in a distributed system is so challenging

Having been introduced to Apache ZooKeeper and its role in the design and
development of a distributed application, let’s look more closely at why
coordination in a distributed system is a hard problem.
Take the example of configuration management for a distributed application that comprises
multiple software components running independently and concurrently across multiple physical servers.
Storing the cluster configuration on a master node, and having the worker nodes download it from that master
and auto-configure themselves, seems to be a simple and elegant solution.
However, this solution has a potential problem: the master node is a single point of failure.
Even if we assume that the master node is designed to be fault-tolerant,
designing a system where a change in the configuration is propagated to all worker nodes dynamically is not straightforward.
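
The propagation problem described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not how ZooKeeper or any real system implements it: the names `ConfigMaster`, `Worker`, `subscribe`, and `on_config` are hypothetical, everything runs in one process, and the sketch ignores network failures, which are exactly what makes the real problem hard.

```python
from typing import Callable, Dict, List

# Hypothetical callback type: receives (version, configuration snapshot).
Listener = Callable[[int, Dict[str, str]], None]


class ConfigMaster:
    """In-memory stand-in for the master node that stores the configuration."""

    def __init__(self) -> None:
        self._config: Dict[str, str] = {}
        self._version = 0
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # Register a worker and hand it the current snapshot right away.
        self._listeners.append(listener)
        listener(self._version, dict(self._config))

    def update(self, key: str, value: str) -> None:
        # Apply the change, bump the version, and push it to every worker.
        self._config[key] = value
        self._version += 1
        for listener in self._listeners:
            listener(self._version, dict(self._config))


class Worker:
    """In-memory stand-in for a worker node that configures itself."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.version = -1
        self.config: Dict[str, str] = {}

    def on_config(self, version: int, config: Dict[str, str]) -> None:
        # Ignore stale or duplicate notifications.
        if version > self.version:
            self.version = version
            self.config = config


master = ConfigMaster()
workers = [Worker(f"worker-{i}") for i in range(3)]
for worker in workers:
    master.subscribe(worker.on_config)

master.update("db.host", "10.0.0.5")
```

After the update, every worker holds version 1 of the configuration. In a real cluster the push in `update` can fail for some workers, and the master itself can crash mid-update, which is why the version check and a reliable notification mechanism matter.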

Another coordination problem in a distributed system is service discovery. Often,
to sustain the load and increase the availability of the application, we add more
physical servers to the system. However, making the client or worker nodes
aware of this change in cluster membership, and of the availability of the newer
machines that host different services in the cluster, is not trivial. It needs careful
design and implementation of logic in the client application itself.
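
One common way to approach the discovery problem above is a registry in which each server periodically announces itself and entries expire when the announcements stop. The following is a minimal sketch under assumptions: `ServiceRegistry`, `heartbeat`, `lookup`, the service name, and the addresses are all hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Dict, List


class ServiceRegistry:
    """Hypothetical in-memory registry: servers send heartbeats, and an
    entry counts as live only if its last heartbeat is within the TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        # service name -> {address -> time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, address: str, now: float) -> None:
        # Record (or refresh) the time this address was last heard from.
        self._last_seen.setdefault(service, {})[address] = now

    def lookup(self, service: str, now: float) -> List[str]:
        # Return only the addresses whose heartbeat has not expired.
        entries = self._last_seen.get(service, {})
        return sorted(
            address
            for address, seen in entries.items()
            if now - seen <= self._ttl
        )


registry = ServiceRegistry(ttl_seconds=10.0)
registry.heartbeat("search", "10.0.0.1:8080", now=0.0)
registry.heartbeat("search", "10.0.0.2:8080", now=0.0)
live_at_5 = registry.lookup("search", now=5.0)    # both servers are live

# Only the first server keeps sending heartbeats.
registry.heartbeat("search", "10.0.0.1:8080", now=8.0)
live_at_15 = registry.lookup("search", now=15.0)  # second server has expired
```

Clients that call `lookup` see new servers as soon as they announce themselves and stop seeing servers that have gone silent. The registry itself is still a single point of failure in this sketch, which is the same weakness noted for the master node.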

Scalability improves availability, but it complicates coordination. A horizontally
scalable distributed system that spans hundreds or thousands of physical
machines is prone to failures such as hardware faults, system crashes,
communication link failures, and so on. These failures don’t follow
any pattern, so handling them in the application logic and designing
the system to be fault-tolerant is a genuinely difficult problem.

From what has been noted so far, it’s apparent that architecting a distributed
system is not simple. Making cluster coordination correct, fast, and scalable is
hard and error-prone, and mistakes can leave the cluster in an inconsistent state.
This is where Apache ZooKeeper comes to the rescue as a robust coordination
service in the design and development of distributed systems.

2.

3.