The Return of ACID in the Design of NoSQL Databases



The Return of ACID in the Design of NoSQL Databases

Questions about the CAP Theorem

The CAP theorem was inspired by a conjecture of Brewer [1] and proven by Gilbert and Lynch [2]. In its popular formulation, it states that a distributed system cannot simultaneously provide Consistency, Availability, and Partition-tolerance. Give the three properties (C, A, and P), a system can at best support two of the three.

Within the NoSQL community, the CAP theorem has been taken to provide persuasive justification for the adoption of weaker models of consistency. In particular, eventual consistency has been widely adopted, becoming something of a default for NoSQL design. Eventual consistency guarantees only that, given a sufficient period of time during which there are no writes, all previous writes will propagate through the system so that all replicated copies of the data will be consistent. This design is often lauded for allowing low latency and high availability in distributed systems on the assumption that stronger consistency is simply not attainable.

However, as the developer community has gained more experience with the use of NoSQL databases, questions have arisen regarding both the technical motivations and consequences of weak consistency.

As Gilbert and Lynch note, their proof relies on an asynchronous model in which "there is no clock, and nodes must make decisions based only on the messages received and local computation." This model is far more restrictive than actual networks, which "are not purely asynchronous. If you allow each node in the network to have a clock, it is possible to build a more powerful service" with stronger consistency.

In a recent retrospective on the CAP theorem and its influence on distributed database design, Brewer [3] notes that "CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare." In fact, "because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned." Yet systems that adopt eventual consistency forfeit C during all concurrent operations, not just when there are partitions.

The Weakness of Eventual Consistency

Eventually consistent systems can return stale data, i.e., data values that are not the most recently written. Moreover, the degree of staleness is unbounded in the sense that there is no constant time interval within which the system can guarantee that consistency will be restored. Rather, the degree of staleness is a complex probabilistic function based on a number of system parameters.

In an attempt to "quantify the degree to which eventual consistency is both eventual and consistent," Bailis et al. derive formulas for the probability of reading stale or inconsistent data in an eventually consistent system [4]. In particular, they provide probabilistic bounds on staleness as measured by either elapsed time or the number of elapsed versions. Such probabilistic bounds underscore the observation that developers employing such systems must account for and handle the possibility of inconsistent data.

Nevertheless, might there not be use cases for which the benefits of eventual consistency are worth its costs? Indeed, many discussions of eventual consistency motivate its adoption by its compatibility with a maximal degree of availability. For those systems that require such maximal availability, it is interesting to examine more closely the level of consistency that can be achieved.

Mahajan et al. [5] define an "always-available" system as one in which nodes never refuse reads or writes (as a transactional system might in the case of transaction failure). They revisit Gilbert and Lynch's work and strengthen its result by employing an asynchronous model allowing local clocks, but no clock globally visible to all nodes. In this context, they define a consistency model called Real Time Causal (RTC) consistency that is strictly stronger than eventual consistency. They prove that RTC consistency is a "tight bound" on the consistency achievable in an always-available system, in the sense that RTC consistency can be provided in such a system, and no stronger consistency model can be. Given this result, it is unclear that eventual consistency remains an attractive alternative even for systems that must be always-available.

Google Returns to Transactions

Brewer [3] highlights "the hidden cost of forfeiting consistency, which is the need to know the system's invariants. The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are." In other words, when an application requires concurrent operations, weakened consistency creates a "technical debt" that is pushed onto the developers making use of the data store.

Google has experienced this technical debt and responded by returning to transactions in three major systems, Percolator, Megastore, and, most recently, Spanner. This shift to transactions is significant in that Google's Bigtable [6], which lacks transactions, was the inspiration and model for many other NoSQL databases that adopt eventual consistency.

Percolator performs the core function around which Google's business was founded, indexing the web in support of search. Peng and Dabek [7] explain that they "built Percolator to create Google's large 'base' index, a task previously performed by MapReduce." The decision to employ transactions was directly motivated by the limitations of Bigtable. "Distributed storage systems like Bigtable can scale to the size of our repository but don't provide tools to help programmers maintain data invariants in the face of concurrent updates." In contrast, transactions "make it more tractable for the user to reason about the state of the system and to avoid the introduction of errors into a long-lived repository."

Percolator was built on top of Bigtable. It "stores its locks in special in-memory columns in the same Bigtable that stores data and reads or modifies the locks in a Bigtable row transaction when accessing data in that row." While Percolator was deemed successful in meeting its design goals, Peng and Dabek note that the decision to implement the system on top of Bigtable imposed a performance disadvantage. "The conventional wisdom on implementing databases is to 'get close to the iron' and use hardware as directly as possible since even operating system structures like disk caches and schedulers make it hard to implement an efficient database. In Percolator we not only interposed an operating system between our database and the hardware, but also several layers of software and network links. The conventional wisdom is correct: this arrangement has a cost."


Read full article from The Return of ACID in the Design of NoSQL Databases


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts