DATABASE DEBUNKINGS: SK

Monday, July 31, 2023

ON RELATIONAL KEYS (& DOMAINS) (t&n)

Note: "Then & Now" (t&n) is a new version of what used to be the "Oldies but Goodies" (OBG) series. To demonstrate the superiority of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, as well as the disregarded evolution/progress of RDM, I am re-visiting my old debunkings, bringing them up to the current state of knowledge. This will enable you to judge how well arguments have held up and realize the increasing gap between industry stagnation -- and scientific progress.

Then

ON KEYS (originally published September 2002)

“I am an application developer who is competent in SQL and have designed some small-scale schemas that have gotten the job done. I came to your website seeking enlightment on the issue of whether it is better to use strings to represent attributes (I believe these are called "speaking keys"), or to use integers and key them to a list of strings. I see that you are avidly pro-normalization, but I'm not even sure if this is a normalization issue. Here is a quote from a web site that captures my gut feeling. I hope you can comment or point me to an article that presents your view of the matter.

Many popular database tools make it very easy to create some sort of arbitrary numeric sequence for your primary keys. Be careful not to use these to excess. If you have are designing a music catalog system, it is definitely not useful to create a lot of arbitrary codes for music classification. There is unlikely to be much benefit in designating "1" to stand in for "Jazz", 2 for "Rock", 3 for "Classical" and 4 for "Latin Jazz" in an application. Why not just use the word "Jazz" for "Jazz" and use "Rock" to stand for "Rock"? It is certainly easier to remember. It also means that you won't have to lookup the value of 1 to see that it stands for "Jazz".”
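The two designs the reader is weighing can be sketched as follows. This is only an illustration, assuming SQLite (via Python's standard sqlite3 module) as a stand-in DBMS; the table names, column names and sample rows are hypothetical and do not come from the quoted site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
-- Design A (hypothetical names): the genre name itself is the key.
CREATE TABLE genre_a (name TEXT PRIMARY KEY);
CREATE TABLE album_a (
    title TEXT PRIMARY KEY,
    genre TEXT NOT NULL REFERENCES genre_a(name)
);
INSERT INTO genre_a VALUES ('Jazz'), ('Rock');
INSERT INTO album_a VALUES ('Kind of Blue', 'Jazz');

-- Design B (hypothetical names): an arbitrary integer code stands in
-- for the name, which must then be looked up.
CREATE TABLE genre_b (
    code INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE album_b (
    title TEXT PRIMARY KEY,
    genre_code INTEGER NOT NULL REFERENCES genre_b(code)
);
INSERT INTO genre_b VALUES (1, 'Jazz'), (2, 'Rock');
INSERT INTO album_b VALUES ('Kind of Blue', 1);
""")

# Design A: the genre is readable without a join.
rows_a = conn.execute("SELECT title, genre FROM album_a").fetchall()

# Design B: the same answer needs a lookup of the code.
rows_b = conn.execute(
    "SELECT a.title, g.name FROM album_b a "
    "JOIN genre_b g ON g.code = a.genre_code"
).fetchall()

# Either way, the set of permitted genres is enforced by the reference.
try:
    conn.execute("INSERT INTO album_a VALUES ('Untitled', 'Polka')")
    unknown_genre_rejected = False
except sqlite3.IntegrityError:
    unknown_genre_rejected = True
```

Both designs constrain an album's genre to a declared set of values; they differ in whether the referencing column carries the name or a code for it.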

KEYS & INDEXES (sms)

Note: "Setting Matters Straight" is a new format: I post on LinkedIn an online Q&A involving data fundamentals to encourage readers to test their foundation knowledge, which they can then compare with our debunking here, where we confirm what is correct and correct what is fallacious (with clarifications, wherever necessary). For in-depth treatment check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs).

Q: “What is the difference between a primary key, a unique key, and an index in databases?”

A: “Unique key is a field (or fields) with a set of unique values; the uniqueness is usually enforced with UNIQUE constraint. There might be one or more per table. Every PRIMARY key is always a unique key; there should be only one per table. It uniquely identifies record, and is used to enforce integrity - entity integrity, and, in tandem with FOREIGN key, referential integrity. Index is a data structure to facilitate records search. It might be created on PRIMARY key (best practice), unique key or any other field or combination thereof in the table. The limit on how many indices a table might have is defined in RDBMS implementation. An index might - or might not - speed up some queries.”

A: “The primary key is inherently indexed and unique and is the cross reference to related tables. Often the best primary key is an auto number integer as any value entered by humans is subject to error or delay that can be challenging to manage in the user interface ... whereas an auto number is assigned immediately upfront and eliminates any possible record conflict in tables during multi user entries. A unique key is somewhat of an informal definition. My view is that it is a definition of a field that is not being used as the primary key, but is unique unlike i.e. Last Name -- for instance a social security number field. So it is not the primary key as it is not the field/value being used to cross reference to related tables but it is unique in the table.”

A: “A primary key is a unique, non null value which can identify every tuple (row in the table) uniquely. A unique key/column/constraint ensures that no two rows contain the same value (almost the same as primary key). Unless specified explicitly for the column configuration, a NULL is a valid value for column with unique constraint. A index can be thought of as the appendix at the end of the book. The information is sorted in specific order so that look up is easy and it points to the location that is being searched for.”

--Quora.com
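The SQL behaviour the three answers describe can be tried out in a small sketch. It assumes SQLite through Python's standard sqlite3 module; the person table and its columns are hypothetical. Note that how UNIQUE treats NULL varies by product, so the result below holds for SQLite and should not be read as a general rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: a PRIMARY KEY, a UNIQUE column and a plain column.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    ssn       TEXT UNIQUE,
    last_name TEXT NOT NULL
);
-- An index is declared separately from any uniqueness constraint.
CREATE INDEX person_last_name_idx ON person (last_name);
""")

conn.execute("INSERT INTO person VALUES (1, '111-11-1111', 'Smith')")
# SQLite accepts several NULLs in a UNIQUE column (other products differ).
conn.execute("INSERT INTO person VALUES (2, NULL, 'Smith')")
conn.execute("INSERT INTO person VALUES (3, NULL, 'Jones')")

def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

dup_pk = rejected("INSERT INTO person VALUES (1, '222-22-2222', 'Doe')")
dup_ssn = rejected("INSERT INTO person VALUES (4, '111-11-1111', 'Doe')")
# The index on last_name does not enforce anything: duplicates are accepted.
dup_name = rejected("INSERT INTO person VALUES (5, '333-33-3333', 'Smith')")

index_names = [row[1] for row in conn.execute("PRAGMA index_list('person')")]
```

Here the duplicate person_id and the duplicate ssn are rejected, while the indexed but unconstrained last_name accepts a repeat.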

NO UNDERSTANDING WITHOUT FOUNDATION KNOWLEDGE PART 5: DEBUNKING AN ONLINE EXCHANGE 4 (obg)

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06). Judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references. You can acquire foundation knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, even better, organize one of our on-site SEMINARS, which can be customized to specific needs).

A 2001 review of my third book triggered an exchange on SlashDot. This six-part series comprises my debunking at the time of both the review and the exchange, in (slightly out of) the chronological order of the original publication.

Part 1: Clarifications on a Review of My Book Part 1 @DBDebunk.com

Part 2: Slashing a SlashDot Exchange Part 1 @DBAzine.com

Part 3: Slashing a SlashDot Exchange Part 2 @DBAzine.com

Part 4: Slashing a SlashDot Exchange Part 3 @DBAzine.com

Part 5: Slashing a SlashDot Exchange Part 4 @DBAzine.com

Part 6: Clarifications on a Review of My Book Part 2 @DBDebunk.com

Slashing a Slashdot Exchange - Part 1

(first published @DBAzine.com in 2001)

I was recently contacted by a reporter for an interview. When I expressed my disappointment with the trade media’s tendency to regurgitate vendor marketing claims instead of assessing them, he admitted "that is what happens about 98 percent of the time", but added "There are some outlets with a good piece from time to time that deal with serious architecture issues", mentioning SlashDot as one of them.

There is, of course, a Catch 22 here: to judge the seriousness of such outlets, foundation and substantive knowledge is necessary in the first place. And, alas, reporters possess even less of it than vendors and users (see, for example, The Ignorance Mechanism, On Trade Media’s "Balance"), without which sources may appear serious even when they are nothing of the sort. As luck would have it, I ran into a good opportunity to prove this point for SlashDot. It so happened that shortly after my exchange with the journalist, Database Debunkings experienced a sudden ten-fold increase in traffic. Now, [given that my target audience is thinking practitioners,] were my material to suddenly become "hot", I would worry as to where I did go wrong. But the odds for that are rather slim and, fortunately, there was no need for concern: an email from a reader informed me that "there recently was an article posted to SlashDot.org which refers to Dbdebunk.com and Mr. Pascal/Date" and "There [were] some 443 comments to that posting." Such volume is practically always indicative of heat (hot air, to be more precise), rather than light. Ah, well, I thought, yet another source of weekly quotes (as if one was needed).

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve hears it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com
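The two SQL designs contrasted in the exchange above can be sketched side by side. This is a sketch only, assuming SQLite via Python's standard sqlite3 module; all table and column names are hypothetical, and it is not presented as the design this post recommends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE building (building_id INTEGER PRIMARY KEY);

-- Design A ("identifying"): each child key includes its parent's key.
CREATE TABLE room (
    building_id INTEGER NOT NULL REFERENCES building(building_id),
    room_no     INTEGER NOT NULL,
    PRIMARY KEY (building_id, room_no)
);
CREATE TABLE bed (
    building_id INTEGER NOT NULL,
    room_no     INTEGER NOT NULL,
    bed_no      INTEGER NOT NULL,
    PRIMARY KEY (building_id, room_no, bed_no),
    FOREIGN KEY (building_id, room_no) REFERENCES room(building_id, room_no)
);

-- Design B ("non-identifying"): every table gets its own generated id.
CREATE TABLE room_s (
    room_id     INTEGER PRIMARY KEY,
    building_id INTEGER NOT NULL REFERENCES building(building_id),
    room_no     INTEGER NOT NULL,
    UNIQUE (building_id, room_no)
);
CREATE TABLE bed_s (
    bed_id  INTEGER PRIMARY KEY,
    room_id INTEGER NOT NULL REFERENCES room_s(room_id),
    bed_no  INTEGER NOT NULL,
    UNIQUE (room_id, bed_no)
);

INSERT INTO building VALUES (1);
INSERT INTO room VALUES (1, 101);
INSERT INTO bed VALUES (1, 101, 1);
INSERT INTO room_s VALUES (10, 1, 101);
INSERT INTO bed_s VALUES (500, 10, 1);
""")

# Design A: the bed row itself carries the building.
building_a = conn.execute(
    "SELECT building_id FROM bed WHERE room_no = 101 AND bed_no = 1"
).fetchone()[0]

# Design B: the building is reached through a join on the generated id.
building_b = conn.execute(
    "SELECT r.building_id FROM bed_s b "
    "JOIN room_s r ON r.room_id = b.room_id WHERE b.bed_id = 500"
).fetchone()[0]

# In both designs a bed cannot reference a room that does not exist.
try:
    conn.execute("INSERT INTO bed VALUES (1, 999, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

In Design B the UNIQUE constraints are what keep a room number from repeating within a building; the generated ids alone would not.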

Such questions, and ad-hoc terms like "identifying relationships"[1], come up because practice is driven by intuition and experience (if any), without the benefit of foundation knowledge[2]. Whether practitioners know/like it or not, a database is a formal computable representation of an informal conceptual model[3] and, therefore, data modeling (i.e., logical database design)[4] is impossible without (1) a well-defined and complete conceptual model and (2) a formal data model with which to formalize it as a logical model[5], and the two should not be confused[6]. Otherwise all bets are off.

Here's how foundation knowledge should have informed modeling and design.

A New Understanding of Keys Part 3: Surrogate Key Illusions

Note: This is the third of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

(Continued from Part 2)

"When defining a surrogate primary key for a [SQL Server] table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."

--Jeffrey J. Keller, Vertabelo.com
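The two options in the quote can be compared in a rough sketch. The quote concerns SQL Server; the code below assumes SQLite and Python's standard uuid module as stand-ins, and the table names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one with a DBMS-generated integer,
-- one with an application-generated GUID stored as text.
CREATE TABLE order_int  (order_id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT);
CREATE TABLE order_guid (order_id TEXT PRIMARY KEY, note TEXT);
""")

# Integer ids are assigned by the DBMS on insert.
int_ids = [
    conn.execute("INSERT INTO order_int (note) VALUES (?)", (note,)).lastrowid
    for note in ("first", "second", "third")
]

# GUIDs are generated here, outside the DBMS, before the insert.
guid_ids = []
for note in ("first", "second", "third"):
    guid = str(uuid.uuid4())
    conn.execute("INSERT INTO order_guid VALUES (?, ?)", (guid, note))
    guid_ids.append(guid)

# A GUID is 16 bytes in binary form and 36 characters in its usual text form.
guid_binary_len = len(uuid.UUID(guid_ids[0]).bytes)
guid_text_len = len(guid_ids[0])
```

The sketch shows only the mechanics behind the quoted trade-offs (readability and storage); it says nothing about whether a generated key is warranted in the first place.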

As we explained in Parts 1 and 2, keys can be properly understood only within the RDM. We revealed a new perspective on keys, discussed relationally valid kinds of keys, and revised definitions of natural (NK) and surrogate keys (SK).

As we have seen, the formal PK mandate is distinct from PK selection, which may be pragmatic. A PK must represent a name -- either pre-assigned, or generated only when there is no simple name CK. Generated keys must ensure entity integrity and are managed by the DBMS transparently to users.

All this is absent from conventional wisdom and database practice, as the above example illustrates: generated SKs are overused for the wrong reasons, the most common being emulation of OIDs (a SK -- often database-wide and, so, unique across relations), followed by performance.

Note: While OIDs have unique values, they often also have some physical significance.

A New Understanding of Keys Part 2: Kinds of Keys

Note: This is the second of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

(Continued from Part 1)

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."

--Martijn Evers, dm-unseen.blogspot.com

I've deplored the misuse and abuse of terminology due to a general lack of foundation knowledge in the industry [3] for longer than I care to remember, and keys are not an exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever it is, a data structure that contains NULLs is not a relation, one reason for which SQL tables are not relations, SQL databases are not relational and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs see [4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management. Thus, their "nomenclature on properties and behavior" should reflect what from the real world they represent, and what function they fulfill in the RDM. Which is precisely what the industry disregards.
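For readers who want the set-theoretic flavor made concrete, here is a minimal sketch of the conventional textbook notion of a candidate key (uniqueness plus irreducibility), not of McGoveran's formalization discussed in this series. The function names and sample data are hypothetical. A key is a constraint on every legal value of a relation, so checking one sample body, as this code does, can only rule candidates out, never confirm them.

```python
from itertools import combinations

def is_unique(body, attrs):
    """True if no two tuples of the body agree on all the given attributes."""
    projected = [tuple(t[a] for a in attrs) for t in body]
    return len(set(projected)) == len(projected)

def candidate_keys(body, heading):
    """Irreducible attribute combinations that are unique in this body.

    Combinations are tried smallest first, so any combination that
    contains an already accepted one is reducible and is skipped.
    """
    keys = []
    for size in range(1, len(heading) + 1):
        for attrs in combinations(heading, size):
            reducible = any(set(k) <= set(attrs) for k in keys)
            if not reducible and is_unique(body, attrs):
                keys.append(attrs)
    return keys

# Hypothetical sample body: tuples as dicts keyed by attribute name.
heading = ("emp_no", "ssn", "dept")
employees = [
    {"emp_no": 1, "ssn": "A", "dept": "D1"},
    {"emp_no": 2, "ssn": "B", "dept": "D1"},
    {"emp_no": 3, "ssn": "C", "dept": "D2"},
]

keys = candidate_keys(employees, heading)
```

With this sample, emp_no and ssn each qualify on their own, dept does not, and every larger combination is skipped as reducible.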

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection

Note: This is the first of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

"The Internet is full of dogmatic commandments for choosing and using keys in relational databases. At times it verges on a holy war: should you use natural or artificial keys? Auto-incrementing integers, UUIDs? After wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow I think I’ve put the pieces together and have a recommendation to harmonize the various camps. Many arguments about keys boil down to false dichotomies and failures to acknowledge other points of view."

--Joe Nelson, begriffs.com

As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL), adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL. Dogma and holy war are products of ignorance. What Nelson did "piece together" from "sixty-four articles, five books and asking questions" is conventional wisdom, which cannot produce understanding because it has been off for decades.

Name the Relational Violation Part 2: Self-defeating Constraint

Note: This two-part series is a rewrite of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part 1)

In Part 1 I showed how several data practitioners failed to pinpoint the relational violation by a conditional uniqueness constraint that should have been obvious with foundation knowledge. The closest one came was "more than one kind of business entity here [that] share the same properties (not attributes)", but still missed the implications.

Name the Relational Violation Part 1: Conditional Uniqueness Constraint

Note: This is a rewrite of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"I'm seeing more [data] professionals implementing the following type of conditional unique constraints, typically related to the use of 'soft deletes'):
TABLEX (SK,A,B,C,ACT_FLAG)
Uniqueness is defined for (A,B) iff ACT_FLAG='Yes'. Any row with a ACT_FLAG = 'No' is excluded from the unique requirement."

"Most SQL DB implementations I know of do not allow this type of constraint to be enforced declaratively. Instead it relies on tricks within the index specification for enforcement."

"This conditional application of unique constraints troubles me. The prevalent use of a surrogate primary key avoids duplicates in the table as a whole. But this approach seems to declare a business rule that can be turned off and on based on the value of a non-key column. It definitely feels there is more than one kind of business entity here -- but they do share the same attributes. I also see a similar design pattern when OO classes are mapped to a table during implementation (when the implementation approach is to combine classes into a single table and an attempt is made to enforce two types of 'uniqueness'."

"However, I cannot definitively find a specific rule/guideline within relational theory that it violates. I know there are design alternatives. The fact that there is no true always-on business key other than the surrogate key IS an issue, but the fact that technically the surrogate PK prevents duplicates is almost always presented as a counter argument."
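The kind of index "trick" the quote alludes to can be sketched with a partial unique index. The sketch assumes SQLite (via Python's standard sqlite3 module), which supports a WHERE clause on CREATE UNIQUE INDEX; the column types and the index name are assumptions, and only the TABLEX heading comes from the quote.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablex (
    sk       INTEGER PRIMARY KEY,
    a        TEXT NOT NULL,
    b        TEXT NOT NULL,
    c        TEXT,
    act_flag TEXT NOT NULL CHECK (act_flag IN ('Yes', 'No'))
);
-- Uniqueness of (a, b) applies only to rows flagged active.
CREATE UNIQUE INDEX tablex_active_ab ON tablex (a, b) WHERE act_flag = 'Yes';
""")

def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# Any number of inactive ("soft deleted") rows may repeat (a, b).
conn.execute("INSERT INTO tablex VALUES (1, 'x', 'y', 'c1', 'No')")
conn.execute("INSERT INTO tablex VALUES (2, 'x', 'y', 'c2', 'No')")
# One active row with the same (a, b) is accepted.
conn.execute("INSERT INTO tablex VALUES (3, 'x', 'y', 'c3', 'Yes')")

# A second active row with the same (a, b) is rejected.
second_active_rejected = rejected(
    "INSERT INTO tablex VALUES (4, 'x', 'y', 'c4', 'Yes')"
)
# So is reactivating an inactive row while an active duplicate exists.
reactivation_rejected = rejected(
    "UPDATE tablex SET act_flag = 'Yes' WHERE sk = 1"
)
```

The sketch reproduces the behaviour that troubles the writer: whether (a, b) must be unique depends on the value of a non-key column, and the surrogate sk is the only always-on key.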

The heart is in the right place, but the violation is obvious. Yet, the several responses did not pinpoint it.

The Key to Relational Keys - A New Understanding

Version 3 of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series is now available to order here. The Key To Relational Keys: A New Understanding is a completely new, re-written version that incorporates David McGoveran's formalization and interpretation of Codd's true RDM and introduces a new perspective. It is distinct from and supersedes all previous versions.

Abstract

The dual theoretical foundation of the RDM -- simple set theory (SST) expressible in first order predicate logic (FOPL) -- is applicable to database management because it is a theory of unique objects, and objects are unique in the real world that databases represent. In the real world objects are uniquely identified by (1) a combination of one or more defining properties and/or (2) names assigned to them as members of various groups. A database relation is a formal representation of an object group, with tuples representing (facts about) object members and attributes representing their properties and/or names. Relational keys are attributes that formally represent in the database those identifying properties and names.

Their fundamental database role notwithstanding, relational keys are poorly understood. 70% of hits @dbdebunk.com are about keys, and misconceptions abound about their necessity, what kinds of keys are relational, their functions, their selection, and so on.
This paper defines and explains:

The relational key concept
The kinds of relational keys, their properties and functions
The formal PK mandate
PK selection
RDBMS key support

and debunks common misconceptions.

POSTS

Monday, July 31, 2023

ON RELATIONAL KEYS (& DOMAINS) (t&n)

Friday, March 25, 2022

KEYS & INDEXES (sms)

Sunday, February 13, 2022

NO UNDERSTANDING WITHOUT FOUNDATION KNOWLEDGE PART 5: DEBUNKING AN ONLINE EXCHANGE 4 (obg)

Saturday, March 2, 2019

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys

Sunday, April 29, 2018

A New Understanding of Keys Part 3: Surrogate Key Illusions

Sunday, April 22, 2018

A New Understanding of Keys Part 2: Kinds of Keys

Sunday, April 15, 2018

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection

Saturday, April 7, 2018

Name the Relational Violation Part 2: Self-defeating Constraint

Tuesday, March 27, 2018

Name the Relational Violation Part 1: Conditional Uniqueness Constraint

Sunday, February 11, 2018

The Key to Relational Keys - A New Understanding