DATABASE DEBUNKINGS

Sunday, April 29, 2018

A New Understanding of Keys Part 3: Surrogate Key Illusions

Follow @DBDebunk Follow @ThePostWest

Note: This the third of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

(Continued from Part 2)

"When defining a surrogate primary key for a [SQL Server] table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."

--Jeffrey J. Keller, Vertabelo.com

As we explained in Parts 1 and 2, keys can be properly understood only within the RDM. We revealed a new perspective on keys, discussed relationally valid kinds of keys, and revised definitions of natural (NK) and surrogate keys (SK).

As we have seen, the formal PK mandate is distinct from PK selection, which may be pragmatic. A PK must represent a name -- either pre-assigned, or generated only when there is no simple name CK. Generated keys must ensure entity integrit and are managed by the DBMS transparently to users.

All this is absent from conventional wisdom and database practice, as the above example illustrates: generated SKs are overused for the wrong reasons, the most common being emulation of OIDs (a SK -- often database-wide and, so, unique across relationsn), followed by performance.

Note: While OIDs have unique values, they often also have some physical significance.

A New Understanding of Keys Part 2: Kinds of Keys

Follow @DBDebunk Follow @ThePostWest

Note: This the second of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

(Continued from Part 1)

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."

--Martijn Evers,dm-unseen.blogspot.com

I've deplored the misuse and abuse of terminology due a general lack of foundation knowledge in the industry [3] for longer than I care to remember, and keys are not an exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever it is, a data structure that contains NULLs is not a relation, one reason for which SQL tables are not relations, SQL databases are not relational and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs see[4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management. Thus, their "nomenclature on properties and behavior" should reflect what from the real world they represent, and what function they fulfill in the RDM. Which is precisely what the industry disregards.

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection

Follow @DBDebunk Follow @ThePostWest

Note: This the first of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

"The Internet is full of dogmatic commandments for choosing and using keys in relational databases. At times it verges on a holy war: should you use natural or artificial keys? Auto-incrementing integers, UUIDs? After wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow I think I’ve put the pieces together and have a recommendation to harmonize the various camps. Many arguments about keys boil down to false dichotomies and failures to acknowledge other points of view."

--Joe Nelson, begriffs.com

As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL), adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL. Dogma and holy war are products of ignorance. What Nelson did "piece together" from "sixty-four articles, five books and asking questions" is conventional wisdom, which cannot produce understanding because it has been off for decades.

Name the Relational Violation Part 2: Self-defeating Constraint

Follow @DBDebunk Follow @ThePostWest

Note: This two part series is a rewrite of of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part 1)

In Part 1 I how several data practitioners failed to pinpoint the relational violation by a a conditional uniqueness constraint that should have been obvious with foundation knowledge. The closest one came was "more than one kind of business entity here [that] share the same properties (not attributes)", but still missed the implications.

Name the Relational Violation Part 1: Conditional Uniqueness Constraint

Follow @DBDebunk Follow @ThePostWest

Note: This is a rewrite of of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"I'm seeing more [data] professionals implementing the following type of conditional unique constraints, typically related to the use of 'soft deletes'):
TABLEX (SK,A,B,C,ACT_FLAG)
Uniqueness is defined for (A,B) iff ACT_FLAG='Yes'. Any row with a ACT_FLAG = 'No' is excluded from the unique requirement."

"Most SQL DB implementations I know of do not allow this type of constraint to be enforced declaratively. Instead it relies on tricks within the index specification for enforcement."

"This conditional application of unique constraints troubles me. The prevalent use of a surrogate primary key avoids duplicates in the table as a whole. But this approach seems to declare a business rule that can be turned off and on based on the value of a non-key column. It definitely feels there is more than one kind of business entity here -- but they do share the same attributes. I also see a similar design pattern when OO classes are mapped to a table during implementation (when the implementation approach is to combine classes into a single table and an attempt is made to enforce two types of 'uniqueness'."

"However, I cannot definitively find a specific rule/guideline within relational theory that it violates. I know there are design alternatives. The fact that there is no true always-on business key other than the surrogate key IS an issue, but the fact that technically the surrogate PK prevents duplicates is almost always presented as a counter argument."

The heart is in the right place, but the violation is obvious. Yet, the several responses did not pinpoint it.

Data Modeling and NoSQL

Follow @DBDebunk Follow @ThePostWest

Revised 3/25/18

"To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.

The main purpose of data modeling is to understand the business, some application domain, some users world. The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain."

"Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model. Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL tool".

" ... increasingly the data already exists in some form. Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data."

--Gordon Everest, LinkedIn.com

I have written extensively on the three levels of representation and four types of model and I won't repeat it here -- readers can refresh their memory if necessary[1,2]. Everest's comments are at best ambiguous with respect to the levels and models (e.g., by data modeling he means business modeling, and his "business data model" lumps together business model and data model). It is to avoid such ambiguities and the resulting confusion that I recommend the three-fold terminology of conceptual modeling, logical database design and physical implementation, eschewing data modeling[3]. Here I will rely on my earlier writings to address strictly the issue of data modeling in the NoSQL context raised by Everest.

Physical Independence Part 2: Logical-physical Confusion

Follow @DBDebunk Follow @ThePostWest

Note: This is a rewrite of older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

Revised 3/17/18

(Continued from Part 1)

This is the second part of my response today to an old DBDebunk query:

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?" --DBDebunk.com

In Part 1 I explained physical independence (PI) and claimed that the industry has failed to internalize its importance. Here I provide evidence to that effect and discuss some consequences.

POSTS

Sunday, April 29, 2018

A New Understanding of Keys Part 3: Surrogate Key Illusions

Sunday, April 22, 2018

A New Understanding of Keys Part 2: Kinds of Keys

Sunday, April 15, 2018

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection

Saturday, April 7, 2018

Name the Relational Violation Part 2: Self-defeating Constraint

Tuesday, March 27, 2018

Name the Relational Violation Part 1: Conditional Uniqueness Constraint

Saturday, March 24, 2018

Data Modeling and NoSQL

Saturday, March 17, 2018

Physical Independence Part 2: Logical-physical Confusion