09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE
11/09/22: Revised
Here's what's wrong with last week's picture, namely:
Q: "I'm currently trying to design a database and I'm not too sure about the best way to approach a dynamically sized array field of one of my objects. My first thought is to use a column in my object to store an array of integers. However the more I read, the more I think this isn't the best option. Concrete example wise, I have a player object that stores 0 to many items, which are represented by an integer. What is the best way to represent this?"
A: "If a collection of values is atomic, store them together. Meaning, if you always care about the entire group, if you never search for nested values and never sort by nested values, then they should be stored together as a single field value. If not, they should be stored in a separate table, each value bring a row, each assigned the parent ID (foreign key) of a record on the other table that "owns" them as a group. For more info, search on the term "database normalization".Focus on physical implementation ("dynamically sized array field") without well-defined conceptual and logical features it is supposed to represent ("a player object" is hardly enough) and confusion of levels of representation (a real world object does not "store" anything) are always a red flag, an indication of poor grasp of foundation knowledge. So let's introduce some.
Some databases, support an array as a data type. For example, Postgres allows you to define a column as a one-dimension array, or even a two dimension array. If your database does not support array as a type of column definition, transform you data collection into an XML or JSON support if your database your database supports that type. For example, Postgres has basic support for storing, retrieving, and non-indexed searching of XML using XPath. And Postgres offers excellent industry-leading support for JSON as a data type including indexed support on nested values. Going this XML/JSON route can be an exception to the normalization rules I mentioned above." --StackOverflow.com
--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------
As I explain in the forthcoming first paper [1] in my new "Understanding Codd's True RDM" series, even though the RDM was introduced almost 50 years ago, the current understanding -- such as it is -- is very far from its original intent. One of the least understood features of the RDM as defined by Codd is the "simple domain ... whose elements are atomic (non-decomposable) values" [2].
Value decomposability
Date has argued that because any value can be decomposed, value atomicity is not absolute and cannot be enforced. So, while according to Codd, only a relation with all attributes defined on simple domains with atomic values is in its simplest, normal form -- which today we refer to as 1st normal form (1NF), even though the two are not identical -- Date contends that a relation is by definition in 1NF because its attributes are defined on single-valued domains. Well, yes and no.
Date interprets Codd's atomicity as an inherent property of the data, while we believe that Codd meant a choice by the database designer driven by application needs (in the context of what is possible using relational operators). A designer can define any domain with values of arbitrary complexity -- even images, documents, relations (though see below) -- and as long as it does not have components that are meaningful to applications, they are atomic. For domains
defined to the DBMS as such, if the data sub-language does not allow relational
expressions -- constraints and relational operations -- to reference components thereof, it is in effect rendering their values non-decomposable. This is another way of saying that values are only
values of defined domains, not of undefined domain components [3], as
mandated by the Information Principle (IP), Rule 1 of the famous
12 Codd rules: "All information in a relational database must be
represented explicitly in exactly one way, as values of relation
attributes drawn from domains" (i.e., defined domains). Otherwise, reference to components -- which are
not defined domains -- means creating new domains on-the-fly in violation of the IP.
There is, of course, nothing to prevent atomic, non-decomposable document-, image-, or relation-valued domains, but note that their constraints and operators are built into the domain definition and "hidden" from the DBMS. The answer to the question above has the heart in the right place and cleaned from the logical-physical confusion (LPC) of which it suffers and expressed more rigorously, is: Define complex domains iff applications do not need the DBMS to reference components thereof. Otherwise, define the the components as domains in their own right. Given a design, if applications need access to components of the defined domains, the design is incorrect.
Normalization
Per Codd, a database relation is in 1NF (although we now contend also in 5NF) by definition. While EFC referred to "non-1NF relations", by our interpretation, the data language can enforce atomicity", there are none of those in RDM.
"It is clear to me -- and I agree with Date -- that the term relation must be reserved for those relations that are at least in 1NF. The formal definitions of the relational operators work correctly only with relations in their first normal form." --David McGoveranIf a relation was designed with attributes defined on relation-valued domains (RVDs), there is a simple normalization process that separating their values as relations in their own right, related to the original RVDs via FK-PK referential constraints [2]. But as I reiterated more than once, explicit normalization is necessary only for repair of poor designs. Proper design implicitly produces fully normalized (5NF) relations [4].
Domains that are not RVDs (e.g., XML/JSON-valued) are not easily amenable to normalization. Eliminating them is more complicated -- the information content of the documents' must be modeled such that they can be represented by 1NF relations. But the cost of not doing so may prove higher.
If database relations are in their normal form, first order predicate logic (FOPL) is sufficient for relational data sub-languages, which confers the following significant practical advantages:
- Language decidability;
- Physical independence (PI);
- System-guaranteed logical validity and by-design semantic consistency;
- Simplicity.
Application programs can, of course, use the hosted relational sub-language to retrieve data and process it in whichever way they desire. They can even update values if the new values satisfy the domain definition. But, of course, this defeats the purpose of the RDM: it requires higher logic than FOPL and a computationally complete programming language, the relational benefits are lost and all bets are off in that respect. The reason you never see applications enforce integrity of the elements of complex-valued domains is that it is prohibitive for programmers to analyze and enforce in code; and soundness cannot be system-guaranteed.
So:
- Yes, there is no absolute atomicity;
- No, the atomicity designed into domains is enforcible by the data sub-language;
- Yes, a relation is at least in 1NF by definition, in which case EFC's definition of simple domains is fine.
References
- Pascal, F., Interpretation and Representation of Database Relations, Understanding Codd's Real RDM, paper #1, forthcoming.
- Codd, E. F., A Relational Model of Data for Large Shared Data Banks (LINKS page).
- Pascal, F., First Normal Form in Theory and Practice, Parts 1,2 and 3, www.dbdebunk.com.
- Pascal, F., The Costly Illusion: Normalization, Integrity and Performance, PRACTICAL DATABASE FOUNDATIONS paper #2, www.dbdebunk.com.
"The answer to the question has the heart in the right place" -- I'm sincerely glad you came to the same conclusion as me, at first I feared a hatchet job! You application of foundation knowledge here is superb, as we've come to expect.
ReplyDeleteIf Codd meant "meant a choice by the database designer driven by application needs" then his position suffers from the same mistake that people make when they speak of the colour of an object. An object that appears to be blue under white light will appear to be black under yellow light. A design that will appear to be xNF under application needs X might appear to be 0NF under application needs Y. Properties that depend in any way on ambient conditions are not properties at all, they are illusions of them. And moreover they are not formally assessable unless and until everything that they are a function of is itself formally expressible. In this case, until "application needs" are formally expressible.
ReplyDelete>>All this would be possible with (1) a true RDBMS that supports simple domains (2) a truly relational data sublanguage that does not subvert atomicity
ReplyDeleteAre you aware of any active research going on with respect to the above, in either university or industry setting? Other than what is listed on Date/Darwen TTM site?