Introduction

CAP stands for Consistency, Availability, and Partition tolerance. The CAP theorem, also known as Brewer’s theorem after computer scientist Eric Brewer, first appeared in autumn 1998 and was published as the CAP principle in 1999. It states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Fallacies of Distributed Computing

Distributed systems depend on many moving parts connected over a network and rely heavily on partitioning to achieve scalability, so network failure is inevitable: http://www.mypassionfor.net/2020/03/11/8-fallacies-of-distributed-computing/

Image credit: Denise Yu (Twitter): https://lnkd.in/eJYnpC

Can a system have all CAP?

The only system that can practically demonstrate all of CAP is a single-node system, which cannot be regarded as a distributed system. Moreover, it cannot scale to the level that a distributed system is designed for and capable of.

Note: A single node can utilize only a Scale-Up (Vertical-Scaling) strategy. Any system using Scale-Out (Horizontal-Scaling) has to abide by the CAP theorem.

Explanation

No distributed system is safe from network failures, thus network partitioning has to be tolerated. In the presence of a partition, one is then left with two options: consistency or availability. When choosing consistency over availability, the system will return an error or a timeout if particular information cannot be guaranteed to be up to date due to network partitioning.

When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning (which is often better for user experience). When the network connection is re-established, consistency is achieved again across nodes/partitions, i.e. the system is eventually consistent.
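
As a toy illustration of how replicas converge once the partition heals, here is a minimal Python sketch. All names are hypothetical, and last-write-wins on a timestamp stands in for the reconciliation strategies (read repair, anti-entropy, vector clocks) that real systems use:

```python
# Toy illustration of eventual consistency: after a partition heals,
# replicas reconcile via last-write-wins on a timestamp.

def reconcile(replica_a: dict, replica_b: dict) -> None:
    """Merge two replicas in place so both end up with the newest value per key."""
    for key in set(replica_a) | set(replica_b):
        ts_a, val_a = replica_a.get(key, (0, None))
        ts_b, val_b = replica_b.get(key, (0, None))
        winner = (ts_a, val_a) if ts_a >= ts_b else (ts_b, val_b)
        replica_a[key] = replica_b[key] = winner

# During the partition, each side accepted writes independently.
a = {"user:42": (1, "alice")}
b = {"user:42": (2, "alice-updated")}   # the newer write landed on this side

reconcile(a, b)                          # network connection restored
assert a == b                            # both replicas converge on the newest value
```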

Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability, whereas systems designed around the BASE philosophy, common among NoSQL databases, choose availability over consistency.

CAP in Practice

Let’s say you are designing a distributed system that has a cluster of 4 data nodes. The replication factor is 2, i.e. any data written to the cluster must be written to 2 nodes, so when one node goes down, the second can serve the data.

Image credit: https://www.linuxtopia.org/

Now try to apply the CAP theorem to this requirement. Remember, in a distributed system, two things may happen at any time: node failure (hard disk crash, EC2/VM failure) or network failure (the connection between two nodes goes down) [fallacies of distributed computing].

CP [Consistency/Partition Tolerance] Systems

Distributed systems ensure data consistency at read time by a voting mechanism, where all nodes that have a copy of the data mutually agree that they have the “same copy” of the requested data. Now let’s assume the requested data is present on two nodes, A-1 and A-2. The client tries to read the data; our CP system is partition tolerant as well, so an expected network failure occurs and A-1 is detected as down. The system cannot determine whether A-2’s copy of the data is the latest; it may be stale. So the system decides to send an error to the client.

Here the system chose to prefer data consistency over data availability. Similarly, at write time, if the replication factor is 2, the system may reject write requests until it finds two healthy nodes to write the data fully and consistently. Hence, the system is now considered not “Available”, i.e. CAP’s “A” is sacrificed.

A CP system is harder and more expensive to implement.
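
To make the CP behavior concrete, here is a minimal Python sketch of such a store for the 4-node, replication-factor-2 cluster described above. The Node and CpStore names are illustrative, not a real database API; the point is that when a replica is unreachable, both reads and writes are rejected rather than risk inconsistency:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

class CpStore:
    """Rejects reads and writes unless every replica can participate."""

    def __init__(self, nodes, replication_factor=2):
        self.nodes = nodes
        self.rf = replication_factor

    def replicas_for(self, key):
        # Simplistic placement: map the key onto RF consecutive nodes.
        start = hash(key) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key, value):
        healthy = [n for n in self.replicas_for(key) if n.up]
        if len(healthy) < self.rf:
            # Consistency preferred: refuse rather than store fewer copies.
            raise RuntimeError("write rejected: not enough healthy replicas")
        for node in healthy:
            node.data[key] = value

    def read(self, key):
        healthy = [n for n in self.replicas_for(key) if n.up]
        if len(healthy) < self.rf:
            # A replica is down, so the copies cannot vote: error out.
            raise RuntimeError("read rejected: cannot confirm latest copy")
        values = {node.data.get(key) for node in healthy}
        assert len(values) == 1, "replicas disagree"
        return values.pop()

nodes = [Node(f"A-{i}") for i in range(1, 5)]    # 4-node cluster
store = CpStore(nodes, replication_factor=2)

store.write("user:42", "alice")                  # all replicas up: accepted
store.replicas_for("user:42")[0].up = False      # one replica of the key fails

try:
    store.read("user:42")
except RuntimeError as err:
    print(err)                                   # read rejected: cannot confirm latest copy
```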

AP [Availability/Partition Tolerance] Systems

In today’s customer-obsessed world, a system is expected to be up and running 24×7; hence, you have to put proper thought in place and make informed decisions while choosing trade-offs. As seen with CP, the system is not “Available” in the event of a failure.

A resilient system design would, instead of sending an ERROR (in case A-1 is down), send the data received from A-2. Chances are that the data the client gets is not the latest (this is hard to determine). Hence, the system is now considered “Available”, i.e. CAP’s “C” is sacrificed.
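
Here is the same failure sketched from the AP side (again with purely illustrative names): when A-1 is down, the read falls back to A-2 and returns a possibly stale value instead of an error:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

def ap_read(replicas, key):
    """Return a value from any healthy replica instead of erroring out."""
    for node in replicas:
        if node.up:
            return node.data.get(key)    # possibly stale, but available
    raise RuntimeError("unavailable: every replica is down")

a1, a2 = Node("A-1"), Node("A-2")
a1.data["user:42"] = "alice-v2"          # the latest write landed on A-1 only
a2.data["user:42"] = "alice-v1"          # A-2 still holds the older copy

a1.up = False                            # A-1 goes down
print(ap_read([a1, a2], "user:42"))      # prints 'alice-v1': stale, but served
```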

CA [Consistency/Availability] Systems, Really?

In a distributed environment, we cannot avoid the “P” of CAP. So we have to choose between CP and AP systems. If we desire a consistent and available system, then we must forget about partition tolerance, and that is possible only in a non-distributed system such as a single-node RDBMS.

Closing Note

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

CAP is about choosing your trade-off in the event of a failure (network/node). In the absence of network failure, that is, when the distributed system is running normally, both availability and consistency can be satisfied.

In today’s world, we can achieve all 3 in a distributed system (if not fully, then at least partially), e.g. tunable consistency in Cassandra.
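
As an illustration, here is a minimal sketch of tunable consistency using the DataStax Python driver for Cassandra; the contact points, keyspace, and table are hypothetical. The same query can be issued at QUORUM (leaning toward consistency) or ONE (leaning toward availability):

```python
# Minimal sketch with the DataStax Python driver (pip install cassandra-driver).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # hypothetical contact points
session = cluster.connect("demo_keyspace")    # hypothetical keyspace

# QUORUM: a majority of replicas must respond -> leans toward consistency.
strong_read = SimpleStatement(
    "SELECT * FROM users WHERE id = 42",
    consistency_level=ConsistencyLevel.QUORUM,
)

# ONE: any single replica may answer -> leans toward availability.
fast_read = SimpleStatement(
    "SELECT * FROM users WHERE id = 42",
    consistency_level=ConsistencyLevel.ONE,
)

rows = session.execute(strong_read)           # same query, different trade-off
```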

Abstract

Gone are the days when an enterprise could happily serve its customers with just a handful of on-prem servers, deliver content within seconds to minutes, consider hours of downtime for maintenance and deployment acceptable, and measure data consumption in gigabytes.

Today’s distributed systems are meant to serve an ever-expanding number of customers “Anytime, Anywhere, on Any Device“. Users expect millisecond response times and 100% uptime, and data is measured in petabytes.

Today’s demands are simply not met by yesterday’s software architectures.

To meet such heightened expectations, our distributed systems must be Responsive, Resilient, Elastic, and Message Driven; such a system is called a Reactive System. Systems built as Reactive Systems are more flexible, loosely coupled, and scalable. They are significantly more tolerant of failure, and when failure does occur they meet it with elegance rather than disaster.

Four Reactive Principles

Responsive: The system responds in a timely manner if at all possible. Responsiveness is the cornerstone of usability and ensures early problem detection and resolution. Responsive systems focus on providing rapid and consistent response times and delivering a consistent quality of service. This consistent behavior, in turn, simplifies error handling, builds end-user confidence, and encourages continued interaction with the software application.

Resilient: The system stays responsive in the face of failure. Any system that is not resilient will be unresponsive after a failure. Microservice architecture lays the foundation for building a resilient system. Resilience is achieved by isolation, replication, and containment: failures are contained within each component, isolating components from each other and thereby ensuring that parts of the system can fail and self-heal without compromising the system as a whole.

Elastic: The system stays responsive under varying workloads and increased demand. Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated, supporting business continuity. This implies designs that have no contention points or central bottlenecks, resulting in the ability to shard or replicate components and distribute inputs among them.

Message Driven: The system relies on asynchronous message-passing to ensure loose coupling, isolation, and location transparency, and to provide the means to delegate errors as messages. Asynchronous messaging supports non-blocking communication, which allows recipients to consume resources only while active, leading to less overhead on the system due to non-blocking I/O.
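
To illustrate, here is a minimal sketch of asynchronous message passing using Python’s standard asyncio library; the in-process queue merely stands in for a real message broker. Producer and consumer never block each other, and the consumer consumes resources only when a message arrives:

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    for i in range(3):
        await queue.put(f"event-{i}")     # non-blocking hand-off to the queue
    await queue.put(None)                 # sentinel: no more messages

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        message = await queue.get()       # does work only when a message arrives
        if message is None:
            break
        print(f"processed {message}")

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    # Producer and consumer are decoupled: neither blocks the other.
    await asyncio.gather(producer(queue), consumer(queue))

asyncio.run(main())
```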

Seattle Code Camp 2018

September 19th, 2018 | Posted by Vidya Vrat in Architecture | Community

This was my 3rd year in a row, and on Saturday, September 15, I gave two sessions at Seattle Code Camp 2018.

Session slides are available on my GitHub.

The first session was on “How to become a Software Architect”. Many aspiring and practicing architects attended my session and asked various questions about the architect career path.

The second session was on “Microservices: what, why and how?”. This session also went very well, and I ran out of time to address all the questions about individuals’ architecture designs and the problems they were facing. Hence, a few topics were discussed after the session in the lobby.

What some of the participants said

I look forward to Seattle Code Camp 2019.