DATABASE DEBUNKINGS

Sunday, February 18, 2018

This Week

1. Database Truth

"The RDM is a formal system. It has two parts. Semantics its outside the formal language (which is Deductive Subsystem), but not outside the interpretation (i.e., application) of that language (Interpretation Subsystem). Without an Interpretation Subsystem there is no possibility of applying the formal system and it remains an abstract game of symbols."

"Codd's 1979 paper described a way to "capture" semantics using the relational formalism. That formalism doesn't tell you how to discover semantics, but if you have them, then he showed (at least to some degree) how to express those semantics relationally."

"Semantics is about applying the RDM to some subject. In effect, what you do is restrict the power of the abstract formalism so that it is more closely aligned with your intended use. In my terminology, that means you:

Create axioms (expressed as constraints), limiting the vocabulary to the subject matter (and making it finite and usually fairly small); and,

Restrict the possible interpretations that can be used consistently with the resulting subset of the formalism."

--David McGoveran

2. Do You Know What's Wrong With This Picture?

"When someone refers to a relation in a database course, what does that mean?"

"It means that it is time to go to Wikipedia."

"A relation in the context of modeling a problem will include the fields and possibly the identification of fields which have relationships with other relations."

"A relation is a data structure which consists of a heading and an unordered set of tuples which share the same type."

"I was trying to show that, in SQL, a relation is more than just a table. Queries return relations. And within a query, relational math is happening, with many intermediate results, that themselves are relations."

"A relation is an abstract structure which contains a set of attributes, and a relvar is the dataset status in a particular moment of this relation. The first one can be considered as the table definition with columns, and the second one is dataset in this table."

"Tuples need not have a key (or any way of locating them?) Having tried to answer this question so that I could explain it to my students, I am forced to the conclusion that the theory has nothing whatsoever to do with "data" in the usual sense. Perhaps information Theory would have been a better basis for data systems, rather than Mathematics? Computer applications are notably different from Physics, which math was created to model, and its child, Engineering. I think data was never intended to be "true", it must be useful."

--What is a relation in database terminology?, StackOverflow.com

The Key to Relational Keys - A New Understanding

Follow @DBDebunk Follow @ThePostWest

Version 3 of paper #4 in the PRACTICAL DATABASE FOUNDATION series is now available to order here. The Key To Relational Keys: A New Understanding is a completely new, re-written version that incorporates David McGoveran's formalization and interpretation of Codd's true RDM and introduces a new perspective. It is distinct from and supersedes all previous versions.

Abstract

The dual theoretical foundation of the RDM — simple set theory (SST) expressible in first order predicate logic (FOPL) — is applicable to database management because it is a theory of unique objects and objects are unique in the real world that databases represent. In the real world objects are uniquely identified by (1) a combination of one or more defining properties and/or (2) names assigned to them as members of various groups. A database relation is a formal representation of an object group, with tuples representing (facts about) and attributes representing properties and/or names of object members. Relational keys are attributes that represent formally in the database those identifying properties and names.

Their fundamental database role notwithstanding, relational keys are poorly understood. 70% of hits @dbdebunk.com are about keys and misconceptions about their necessity, what kinds of keys are relational, their functions, their selection, and so on, abound.
This paper defines and explains:

The relational key concept
The kinds of relational keys, their properties and functions
The formal PK mandate
PK selection
RDBMS key support

and debunks common misconceptions.

This Week

Follow @DBDebunk Follow @ThePostWest

1. Database truth of the week

"Relvars introduce a concept of assignment, which has no counterpart in either FOPL or set theory. If you add it to those formalisms you introduce computational completeness, which destroys both decidability (the existence of a general algorithm by which you can determine if an expression is or is not logically valid) and the guarantee that there exists a (query) evaluation procedure that will halt (the existence of a general algorithm by which you can evaluate the truth or falsity of every instantiated predicate expression given those instantiations from any given database). Therefore we must forbid relvars." --David McGoveran

2. What's wrong with this database picture?

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys. --Martijn Evers, Kinds of Keys: On the Nature of Key Classifications, dm-unseen.blogspot.com

How to Think (and Not to Think) During Database Design

Follow @DBDebunk Follow @ThePostWest

"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:

NAME   VALUE
=================
dept   HR
dept   fin
role   engineer
role   designer
-----------------

UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.

"2nd:

NAME VALUE_JSON_CLOB
==================================
dept {["HR", "fin"]}
role {["engineer", "designer"}]
----------------------------------

UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.

"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Using a relational database to "maintain lists" probably does not merit attention and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice and SQL DBMSs and Codd's true RDM, as formalized and interpreted by McGoveran [1]. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".

Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].

This Week and The End of Empire

Follow @DBDebunk Follow @ThePostWest

1. Database truth of the week

"ALL names are human created, either by non-algorithmic assignment, or via some algorithm. We ONLY know that two types of objects are distinct because they have different sets of defining properties and, for a given object type, we ONLY know that two objects are distinct because the values (observed or measured) of that object type's defining properties are distinct. Names (of objects of some type) allow us to distinguish two such entities ONLY when they are 1:1 with the values of the object defining properties. Two sets of names (whether human assigned or machine generated) consistently identify the same set of entities ONLY when they are 1:1." --David McGoveran

2. What's wrong with this database picture?

"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:

NAME   VALUE
=================
dept   HR
dept   fin
role   engineer
role   designer
-----------------

UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.

"2nd:

NAME VALUE_JSON_CLOB
==================================
dept {["HR", "fin"]}
role {["engineer", "designer"}]
----------------------------------

UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.

"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2

Follow @DBDebunk Follow @ThePostWest

Note: This was originally posted at AllAnalytics, which no longer exists, so some links to other posts there no longer work, but I left them in to alert the reader that I have written on those specific subjects. Other links work.

In this two-part series I alert analysts that correct interpretation and assessment of media/industry claims without being misled requires a good grasp of data fundamentals. In Part 1, I discussed the logical-physical confusion and the erroneous missclassification of DBMSs as relational and non-relational underlying the argument that the latter are superior to the former for analytics applications. In Part 2, I discuss a third misconception behind the claim.

Re-writes (with Season's Greetings)

Follow @DBDebunk Follow @ThePostWest

The following posts have been re-written to bring in line with the McGoveran formalization and interpretation of Codd's true RDM. Re-reading is strongly recommended.

"But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics!" Database Design Relation Predicates and “Identical Relations”

"Some set defining properties are formed as the disjunction of two or more properties (a kind of relationship between two common properties). These disjuncts, taken together, are meaning criteria. Each meaning criterion (an individual disjunct) induces a partitioning of a set into two subsets, those that meet the criterion and those that do not. Alternatively, we can say that each meaning criterion serves to differentiate a possible subset of a set from other subsets of the set (some of the possible subsets will be disjoint, while others not). Each of the possible subsets of the set is then defined by (“inherits”): The defining properties of the set conjoined with at least one meaning criterion (that or those becoming the defining property, or properties, respectively, specific to the proper subset)." Meaning Criteria and Entity Supertype-Subtypes

"Although they are no longer used, inquiries about them persist and with the current proliferation of non-relational products (e.g., NoSQL, graph DBMSs) there is value in understanding them. The closest the industry came to implementing the RDM is SQL which, despite its poor relational fidelity, proved much superior relative to the complexity and inflexibility of preceding DBMSs. But the rules still expose poor relational fidelity of SQL DBMS's that have not been addressed for four decades, while new RDM violations were introduced.

We offer here our clarifications on the rules. For each rule, we:

Explain its intended objective;

Offer clarifications, some of which reflect our current understanding of the RDM -- distinct from conventional wisdom -- based on its dual theoretical foundation and a careful analysis of Codd's work;" --Interpreting Codd's 12 Rules

POSTS

Sunday, February 18, 2018

This Week

1. Database Truth

2. Do You Know What's Wrong With This Picture?

Sunday, February 11, 2018

The Key to Relational Keys - A New Understanding

Abstract

Sunday, January 28, 2018

This Week

1. Database truth of the week

2. What's wrong with this database picture?

Sunday, January 21, 2018

How to Think (and Not to Think) During Database Design

Monday, January 15, 2018

This Week and The End of Empire

1. Database truth of the week

2. What's wrong with this database picture?

Friday, December 29, 2017

DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2

Thursday, December 21, 2017

Re-writes (with Season's Greetings)