“The foundation of modern database technology is without question the relational model; it is the foundation that makes the field a science.”
--C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS
“Over the past decades mainstream economics in universities has become increasingly mathematical, focusing on complex statistical analyses and modeling to the detriment of the observation of reality.”
--J. Luyendijk, Don’t let the Nobel Prize fool you, economics is not a science
Science is the formulation and validation of theories about the real world in the context of discovery (CoD) and context of validation (CoV), respectively. There is "hard" science -- theories about the physical world (physics, chemistry, biology) -- and "soft" science -- theories about human behavior (political, economics, psychology). All science uses data, initially only in the CoV, but increasingly also in the CoD -- computerized discovery of patterns as potential hypotheses (i.e., "data mining").
“3rd normal form data models in data warehousing efforts struggle when changes impact parent child relationships. These impacts cause cascading changes to the data model, the queries, and the loading processes. [For example:]
- There are bank accounts
- Each account belongs to exactly one customer
- A customer can have more than one account
The bank introduces a new product: joint accounts, which means an account can now have more than one owner. It is clear that the 3NF model has to be extended in order to keep this new information; the data vault models seems to be able to fulfill the new requirement.
Some banks propose joint accounts, some don’t, therefore some use M:N relation between client and accounts and others 1:N. A model which is good for any possible case is actually awful model because it describes nothing: by looking at this model you can’t say if joint accounts exist among bank's products.”
--Data Vault and Model (in)Stability
Data warehousing/vault[1] are a red herring here -- the real issue is data independence. Some corrections and clarifications first:
- Normal forms do not pertain to the data model itself -- the RDM -- but to relations in logical models created using strictly the RDM[2].
- 3NF is insufficient -- relations are in 5NF by definition, otherwise correctness is not guaranteed[3].
- The RDM was introduced as a database representation superior to old directed graph -- hierarchic and network (CODASYL) -- systems for conceptual models focused on relationships among entity groups, rather than among individual entities[4]. Graph database representation (nodes and edges) corresponds to a worldview at the conceptual level of parents-children (network) relationships, of which parent-children (hierarchy) is a special case. The relational representation (relations) corresponds to M:N relationships among entity groups, of which M:1 is a special case[5].
Note: Correctness -- logical and semantic[6] -- requires adherence to three principles of database design that jointly imply 5NF[7].
Assume a conceptual model of a multigroup consisting of two related entity groups, Customers and Orders, where a customer can issue multiple orders. The conventional logical database design is:
CUSTOMERS
===============================================
| CID | NAME | AGE | ADDRESS | SALARY |
-=====-----------------------------------------
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | Kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
-----------------------------------------------
ORDERS
===================================
| OID | DATE | CID | AMOUNT |
-=====-----------------------------
| 102 | 2009-10-08 | 3 | 3000 |
| 100 | 2009-10-08 | 3 | 1500 |
| 101 | 2009-11-20 | 2 | 1560 |
| 103 | 2008-05-20 | 4 | 2060 |
-----------------------------------
where ORDERS.CID is an "embedded" foreign key (FK) referencing the primary key (PK) CUSTOMERS.CID.
Consider the query "For all orders, find the CID, name, OID, amount, and date" that applies a join of the two relations on CID. In SQL:
SELECT c.cid,c.name,o.oid,o.amount,o.date
FROM customers c
INNER JOIN orders o
ON c.cid = o.cid;
with the result displayed by the table:
====================================================
| C.CID | C.NAME | O.OID | O.AMOUNT | O.DATE |
-=======------------=======-------------------------
| 2 | Khilan | 101 | 1560 | 2009-11-20 |
| 3 | Kaushik | 102 | 3000 | 2009-10-08 |
| 3 | Kaushik | 100 | 1500 | 2009-10-08 |
| 4 | Chaitali | 103 | 2060 | 2008-05-20 |
----------------------------------------------------
Note: A table is just a tabular display of a relation and the two should not be confused[1,2]. Bear in mind that SQL tables are not relations.
It may surprise you to know that both the design and the result are problematic from a relational standpoint.
by David McGoveran
I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").
I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.
Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]
The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem are not the same terms as those traditionally used in relational database theory and even commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.
These comments were prompted by a LinkedIn post referencing Michael Stonebraker's Those Who Forget the Past Are Doomed to Repeat It -- something I often reiterate myself -- where he argues:
“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.
“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:
Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”
“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”
I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, has ignored, or dismissed[1]; and some important new aspects due to a new understanding of Codd's work[2].
In Parts 1, 2, and 3 we showed that when the RDM is the data model:
- A data sublanguage is short for data manipulation language (DML) that combines (1) a relationally complete retrieval component (i.e., that expresses the RA) with (2) a component that expresses updates as relation transformations;
- A DBMS language is a careful combination, for practical purposes, of the data sublanguage with several sublanguages, each of which expresses a data management function (e.g., data definition, transactions, concurrency, authorizations) -- that are not relational, but are consistent with the RDM, and must not include syntactic elements that are at odds with, or subvert those of the DML.
Note: The RDM is the only data model consistent with Codd's definition that has been formalized [1].
We are now in a position to debunk the two quotes that triggered this series.