DATABASE DEBUNKINGS

Friday, November 8, 2019

The CAP Theorem and the RDM: Consistency, Availability, Partitioning

by David McGoveran

I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and "non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").

I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.

Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]

The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem do not have the same meanings as the identically named terms traditionally used in relational database theory and even in commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.

Comments on a Stonebraker Article

These comments were prompted by a LinkedIn post referencing Michael Stonebraker's "Those Who Forget the Past Are Doomed to Repeat It" -- something I often reiterate myself -- where he argues:

“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.

“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:

Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”

“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”

I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, ignored, or dismissed[1], and some important new aspects that stem from a new understanding of Codd's work[2].

Data Sublanguage Part 4: Conclusion

In Parts 1, 2, and 3 we showed that when the RDM is the data model:

A data sublanguage is short for data manipulation language (DML) that combines (1) a relationally complete retrieval component (i.e., that expresses the RA) with (2) a component that expresses updates as relation transformations;
A DBMS language is a careful combination, for practical purposes, of the data sublanguage with several other sublanguages, each expressing a data management function (e.g., data definition, transactions, concurrency, authorizations). These sublanguages are not relational, but they must be consistent with the RDM and must not include syntactic elements that are at odds with, or subvert, those of the DML.
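
Note: As an illustration only, the idea of updates as relation transformations in the first item can be sketched in Python by modeling a relation as an immutable set of tuples, so that every "update" is a function from a relation to a relation. The (name, salary) heading, the function names, and the sample values are all hypothetical; this is not a DBMS, and it ignores constraints and the catalog entirely.

```python
from typing import FrozenSet, Tuple

# Hypothetical heading for the example: (name, salary).
Row = Tuple[str, int]
Relation = FrozenSet[Row]


def restrict(r: Relation, min_salary: int) -> Relation:
    """Retrieval: keep only the tuples whose salary is at least min_salary."""
    return frozenset(t for t in r if t[1] >= min_salary)


def insert(r: Relation, new: Relation) -> Relation:
    """Insert as set union: the result is a new relation value."""
    return r | new


def delete(r: Relation, gone: Relation) -> Relation:
    """Delete as set difference."""
    return r - gone


def update(r: Relation, old: Relation, new: Relation) -> Relation:
    """Update as a difference followed by a union."""
    return (r - old) | new


emp: Relation = frozenset({("ann", 50), ("bob", 40)})
emp2 = insert(emp, frozenset({("cy", 60)}))
emp3 = update(emp2, frozenset({("bob", 40)}), frozenset({("bob", 45)}))
well_paid = restrict(emp3, 50)
```

Because each function returns a new set, the original relation value is left untouched, which is the sense in which an update here is a transformation rather than an in-place edit.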

Note: The RDM is the only data model consistent with Codd's definition that has been formalized [1].

We are now in a position to debunk the two quotes that triggered this series.

Brother, Spare Me the "Paradigms"

Note: This is a revised version of an old column @All Analytics in response to a recent LinkedIn exchange (check out my comments in the exchange).

“Consider dimensional design and Big Data as two additional paradigms ... Big Data paradigms like Hadoop and NoSQL will alleviate the temptation people have to try to use the relational database in unnatural ways.”

Every few years (and the intervals are getting shorter) a "fundamentally different" new way of doing data management -- a "paradigm shift" -- is promoted with the warning that if you don't adopt it, you'll be "left behind". In the above-mentioned online exchange it is argued that data management is undergoing a paradigm shift from application-centric to data-centric data management. For the very few who (1) understand what a paradigm is and (2) are familiar with data fundamentals and the history of the field, the irony could not be richer.

Testing Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by novices or self-taught "experts" that are (1) wrong, or (2) gobbledygook. Attempts to demonstrate the lack of foundation knowledge underlying these misconceptions are usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of supporting evidence and logic. Practitioners who cannot discern such misconceptions and understand their practical implications are insufficiently prepared for a professional career in data management. They cannot associate problems with their real causes and come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads. What about you?

This is another batch in the Test Your Foundation Knowledge series of online statements reflecting common misconceptions due to lack of foundation knowledge. Can you debunk them? Which of the two categories, (1) or (2), do they fall in? If not, check out the recommended references.

Comment: The kind of clueless exchange leading nowhere that takes place in the absence of foundation knowledge. See if you learned anything from it, then check out the references.

Data Sublanguage Part 3: DBMS Language

Note: 10/10/2019 significantly revised Part 2, which requires a re-read.

A formal data model consists of structure, integrity and manipulation[1,2] and so requires (1) a language that expresses data manipulation (retrievals and updates) augmented with (2) a metalanguage used to define the model's structural elements. As we showed in Part 1 and Part 2, when the data model is the RDM:

Data sublanguage is short for a relationally complete data manipulation sublanguage (DML) that expresses retrievals and updates, the latter correctly understood as set-theoretic relation transformations.
A data definition sublanguage (DDL) is a metalanguage for DML that is outside the theory but consistent with the RDM and at least as powerful expressively as the DML (e.g., a very carefully restricted SOL to avoid self-referencing).
The DML and the DDL can, for practical purposes, be carefully unified into what Codd called a "comprehensive data sublanguage", but we prefer DBMS language to avoid confusion.

By carefully we mean that because only the DML is, strictly speaking, based on relational theory, the DDL must be consistent with, but differentiated from it, such that the two can't be mixed in the same expression in a way that the former subverts the latter. A DML expression can be referenced as a sub-expression by a DDL expression (e.g., as in view definitions), but only if it contains strictly retrievals (e.g., SELECT) and no updates (e.g., INSERT, UPDATE, DELETE).
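
The restriction can be illustrated, very loosely, with SQLite through Python's standard sqlite3 module. SQL is not a relational data sublanguage, so this is an analogy at best: it only shows a language whose grammar lets a definition embed a retrieval but not an update. The table, view, and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT PRIMARY KEY, salary INTEGER NOT NULL)")
conn.execute("INSERT INTO emp VALUES ('ann', 50), ('bob', 40)")

# A definition (CREATE VIEW) that references a retrieval (SELECT) as a
# sub-expression is accepted.
conn.execute("CREATE VIEW well_paid AS SELECT name FROM emp WHERE salary >= 50")
rows = conn.execute("SELECT name FROM well_paid").fetchall()

# A definition that tries to embed an update (DELETE) is rejected by the
# SQL grammar, so the two cannot be mixed in one expression.
try:
    conn.execute("CREATE VIEW bad AS DELETE FROM emp")
    update_accepted = True
except sqlite3.OperationalError:
    update_accepted = False

conn.close()
```

Here the separation is enforced by syntax alone; nothing in the sketch says anything about how the catalog itself is represented or maintained.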

Note: Data definition and manipulation are possible without a DBMS. However:

“With a relational catalog, definition can be performed via the RA, which requires physical implementation to be determined exclusively by the catalog (behind the scenes as it were) -- a kind of skeletal, primitive, or rudimentary DBMS. This is why Codd created a relational catalog that contains a description of the database and could be managed using RA-based DML. It works well unless one is allowed to mix DDL (metalanguage) with DML (language) in the same expression. Otherwise put, the database can be read to modify the catalog, but not vice-versa (as far as the DML is concerned, the catalog that describes the database does not exist).

But with a data model that, unlike the RDM, does not define a catalog such that the same language can be used for both the database and the catalog, a rudimentary DBMS must provide a workaround, and if the model is computationally complete (like CODASYL was), there must be limits on how "active" the catalog is to prevent users from writing self-referencing expressions that cannot be automatically implemented because they may corrupt the database (same as would mixing data sublanguage and host language). This is one reason some of the pre-RDM directed graph DBMSs had limited notions of catalog that often required completely separate facilities to maintain.”

--David McGoveran

The DML and the DDL express two core data management functions centralized in the DBMS. When the data model is the RDM, only data management functions are permitted to access the data. At the end of Part 2 we alluded to other such functions that, like data definition, are outside relational theory, but must be consistent with the RDM.

Testing Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by novices or "experts", many self-taught, that are (1) wrong, or (2) gobbledygook. Attempts to demonstrate the lack of foundation knowledge underlying these misconceptions, and their practical implications, are usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

Practitioners who cannot detect such misconceptions and understand their practical implications and the importance thereof are insufficiently prepared for a professional career in data management. Worse, they cannot associate problems with their real causes and, thus, cannot come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the regular Test Your Foundation Knowledge series of posts: online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments, including which category, (1) or (2), they fall in. If you can't, proper education is in order.

POSTS

Friday, November 8, 2019

The CAP Theorem and the RDM: Consistency, Availability, Partitioning

by David McGoveran

Friday, November 1, 2019

Comments on a Stonebraker Article

Saturday, October 26, 2019

Data Sublanguage Part 4: Conclusion

Saturday, October 19, 2019

Brother, Spare Me the "Paradigms"

Friday, October 18, 2019

Testing Your Foundation Knowledge

Friday, October 11, 2019

Data Sublanguage Part 3: DBMS Language

Friday, October 4, 2019

Testing Your Foundation Knowledge