To my readers and their families: stay safe!
Note: To demonstrate correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing "oldies but goodies" from the old DBDebunk (2000-06), so that you can judge for yourself how well my then arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.
In Part 1 we re-published a reader's comments on "horizontal decomposition" -- Hugh Darwen's proposal on How to Handle Missing Information without Using NULLs relative to our The Final NULL in the Coffin: A Relational Solution to Missing Data; In Part 2 we re-published Darwen's response. Here's my reply revised for consistency with the current state of knowledge.
Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.
In Part 1 we re-published a reader's response to "horizontal decomposition" -- Hugh Darwen's How to Handle Missing Information without Using NULLs -- in comparison to our The Final NULL in the Coffin: A Relational Solution to Missing Data). Here's Hugh's response.
Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.
“I'm excited to share a data.world research partnership with Prof Leonid Libkin and Paolo Guagliardo from The University of Edinburgh. Our goal is to understand how NULL values are used in the real word to bridge theory and practice. Please help us by participating in a survey.”
Thus a recent announcement on LinkedIn, which triggered reactions in praise of this "much needed effort".
Sigh! SQL's NULL is a blunder unworthy of research. The commonly used "NULL value" is a contradiction in terms, indicating that industry surveys are not a path to enlightening. The real issue is, of course, missing data, which is governed by long studied and well understood logic[1,2,3,4], though apparently not in the industry and today's academia.
In 2004 we published The Final NULL in the Coffin: A Relational Solution to Missing Data (a paper revised since) that we believe is theoretically sound and, importantly, consistent with McGoveran's work re-interpreting, extending and formalizing Codd's RDM[5]. At the time it generated a series of exchanges with readers, which were posted at the old DBDebunk (2000-2006). In light of the above they warrant re-production.
I start with the first, split in three parts: In this Part 1 a reader's reaction to both our solution and Hugh Darwen's "horizontal decomposition" alternative, How to Handle Missing Information without Using NULLs; Hugh's reply is in Part 2 and mine -- re-written to bring up to date with current state of knowledge and for clarity -- is in Part 3.
Note: In a later book Darwen dedicated a chapter to a "multi-relation" approach which seems an allusion to our solution.
Note: To demonstrate the correctness and stability of database designs provided by a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" post from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions they were intended to dispel (I may break long pieces into multiple posts, and add comments and references).
In response to an online publication of a book appendix regurgitating Codd's 12 famous rules (some of which were, typically, incorrect[1]) I posted earlier a clarification of the rules. This is a revision thereof for better consistency with the new understanding of the RDM based on McGoveran's re-interpretation, extension and formalization[2] of Codd's work.
Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).
“The relational calculus is good in describing sets. But it´s bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data. To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It´s seems to be so important to store the data, that connecting the data moves to the background. That might be close to how we store filled in paper forms. But it´s so unlike how the mind works. There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen.” --Weblogs.asp.net
The lack of foundation knowledge exposed by the above paragraph is so complete that its claims are practically upside down and backwards.
As
we have demonstrated, in mathematical set theory a relation (set) is a
subset of a cross-product of domains (sets). In other words, it is a set
that is a relationship among sets. Being abstract (i.e., having no
real world meaning), the values of mathematical relations can be
arbitrary.
The RDM is an application of simple set theory
expressible in first order predicate logic (SST/FOPL) to database
management: a relational database represents a conceptual model of some
reality, namely (facts about) a multigroup in the real world -- a
collection of related entity groups -- each database relation
representing one such group; a database is also a set of related relations. The values in database relations (i.e., the
data) are, thus, not arbitrary, but must be consistent with the conceptual
model: relations and the database as a whole are semantically
constrainted to be so consistent: (1) individual properties of entities
and (2) collective properties of (a) groups (i.e., relationships among
entities within groups), and (b) the multigroup (i.e., relationships
among groups).
A primary key (PK) represents names given in the
real world to entities of a given type, and the corresponding PK
constraint (uniqueness) enforces consistency of a relation with the
distinguishability of those entities in the real world, the facts about which it represents. These are not
RDM artifacts, but rather part of the adaptation of SST/FOPL to database management.
For the primary advantage of
the RDM -- guaranteed correctness of query results (i.e., inferences
made from the database) -- to materialize, logical database design must
adhere to three core principles which, jointly, imply fully normalized
relations (5NF). In fact, in RDM relations are in 5NF by definition,
otherwise they are not relations -- relational algebra (RA) operations lose information and
all bets are off.
The RA is the manipulative
component of the RDM -- a collection of primitive and derived set
operations on relations that describe
relationships among relations. For example, the join operation r1 JOIN
r2 describes a relationship between r1 and r2 relation, the result itself a relation. Note that since every result of a RA operation on even one relation is always a relation and still describes a relationship -- between the "input" and "output" relations.
A data model -- and, industry claims notwithstanding,
the only one satisfying Codd's definition that has been formalized is the
RDM -- is by nature focused on data. However, the RDM supports
physical independence (PI) and, thus, not concerned with how data is
physically stored and accessed. The notion of "files stored in paper
form" is an example of the common and entrenched logical-physical
confusion (LPC) due to failure to understand the distinction between a
logical relation and its tabular visualization on a physical medium,
induced/reinforced by the industry's "direct image" implementation of
SQL DBMSs.
We rephrase the above paragraph as follows:
“The relational algebra describes relationships among relations (sets). Primary keys are one of the adaptations of the SST/FOPL for database management: a PK constraint -- uniqueness -- represents formally in the database a within-group relationship among all its entities.
Mandatory adherence to three core design principles jointly imply full normalization, which is necessary to guarantees correctness of query results. True RDBMSs:
- Implement the RA for logical data retrieval independent of how the data is physically stored and accessed. SQL DBMSs notwithstanding, vendors are free to store data whichever way they want as long as they don't expose it to users in applications.
- Enforce relational constraints that are formal database representations of relationships in the conceptual model represented by the database.”
The "brain" stuff is sheer nonsense.