DATABASE DEBUNKINGS: POOD

Showing posts with label POOD. Show all posts

Monday, June 19, 2023

PREDICATE LOGIC, SEMANTICS AND RDM (sms)

Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

“As I have said many times, if the original relational model had been based on predicate logic and also the semantics and rules of definitions we'd all be better off now. It wasn't. Full stop.”
--Ronald Ross, LinkedIn.com

Assessing such arguments normally requires clarification of what exactly is meant by "the relational model". Ross does refer specifically to the "original" -- which we take to mean that introduced by Codd in 1969-70 -- but given the massive misuse and abuse in the industry, perceptions of it may well be corrupted (Nobody Understands the Relational Model Semantics, Relational Closure and Database Correctness). Moreover, there are many predicate logic (PL) systems and many ways of categorizing them (1st vs n-th order being only one way) -- we assume Ross means RDM is based on none.

MISSING DATA: RDM VS SQL -- A REAL WORLD COMPARISON (t&n)

Follow @DBDebunk Follow @ThePostWest

Note: "Then & Now" (t&n) is a new version of what used to be the "Oldies but Goodies" (OBG) series. To demonstrate the superiority of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, as well as the disregarded evolution/progress of RDM, I am re-visiting my old debunkings, bringing them up to the current state of knowledge. This will enable you to judge how well arguments have held up and realize the increasing gap between industry stagnation -- and scientific progress.

(This a revised version of an earlier post with clarity improvements).

Q: “What would you suggest for a datetime field where the value is not known and should therefore be not-applicable?”
A: ”NULL sounds good to me.”

While searching through records I came across an old consulting project involving the migration of a neo-natal research database from Focus -- an old hierarchic DBMS -- used to record extensive details about hundreds of monthly births at a university hospital for more than 20 years. The person who had designed the Focus database was the only one who knew and understood its complexity sufficiently to maintain it. Each time a researcher needed some subset of data to analyze, he would extract it and serve it upon request. Aside from the inefficiency of the process, the person was retiring at a time when hierarchic DBMSs reached the end of their usefulness, Focus experts were already few and expensive and "relational" (read: SQL) was the dominant fad.

NOBODY UNDERSTANDS DATABASE DESIGN 1 (sms)

Follow @DBDebunk Follow @ThePostWest

In a previous SMS post I debunked an attempt to express something important about database practice that was handicapped by lack of foundation knowledge. Here is another example.

“This Codd guy might have been onto something. Unfortunately, normalization is usually taught in a somewhat backwards, overly technical way. If you start with concepts, connections between them and details about them, you usually are already at a fairly high normal form without going through any formal normalization steps.”
--LinkedIn.com

DATABASE RELATIONS, DATABASE DESIGN & CORRECTNESS (sms)

Follow @DBDebunk Follow @ThePostWest

Note: In "Setting Matters Straight" posts I debunk online Q&As that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

“...The relational model organizes data through relations (aka tables). You then normalize it in one of six forms. By normalizing data you:
- Reduce redundancy
- Ensure consistency
- Optimize for atomic inserts, updates and deletes
The biggest drawback ... are keys that let you join different tables across multiple systems.”

--LinkedIn.com

QUOTA QUERIES (sms)

Follow @DBDebunk Follow @ThePostWest

Note: "Setting Matters Straight" (SMS) is a new format: I post on LinkedIn an online Q&A involving data fundamentals that I subsequently debunk in a post here. This is to encourage readers to test their foundation knowledge against our debunking here, where we confirm what is correct and correct what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs).

Q: “How do you return the most recent record in SQL?”

A: “There are many ways of doing it. I would suggest (first thing came to my mind):

Select Top 1
from YourTable
order by TablePrimaryKey Desc;”

A: “If you mean "the last inserted record which has no datetime stamp field" ... you have a few options.
If you cannot use date/time -- your next best bet would be an auto-increment/sequence field, which assigns increasing numbers to each inserted record.
If that’s not available, you would have to rely on business logic e.g. order # or some such.

Some vendors, like Oracle, provide ROWID pseudocolumn for each record which might help in some quick’n’dirty cases -- it is not guaranteed to be sequential but could be (e.g., when table has had no DELETE operations).” --Quora.com

If you don't know, I set matters straight @dbdebunk.com.

HOW NOT TO EXPLAIN THE RELATIONAL MODEL (tyfk)

Follow @DBDebunk Follow @ThePostWest

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“The key idea is "Parent-Child" relationship. Entities ~ Relations ~ Tables (tilde stands for "more or less like"). Concept of a Table resonates with most of the people just as everybody intuitively grasps a concept of "rows and columns” but might struggle with "tuples and attributes". Explain relations and relationships, 1:1, 1:N, N:N etc. Explain rationale for this way of collecting and storing data, touch upon data normalization, and tell a few anecdotes about cost of storage back in 1970 and Y2K problem it have caused; add that we have inadvertently created Y10K problem while fixing it (not exactly true but not wrong either). Show an ERD diagram, trace the relationships, introduce SQL, maybe run a few simple SELECT queries to help your listeners visualize it, including equijoin and ORDER BY. Save other JOIN types, data types and other, more advanced topics, and for the next encounter.”
--Quora.com

An excellent example that validates my claim of lack of foundation knowledge in the industry: most "explainers" of RDM have acquired relational jargon, but do not know or understand it at all.

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 2

Follow @DBDebunk Follow @ThePostWest

with David McGoveran

(Title inspired by Richard Feynman)

In Part 1 we explained that all database relations are, mathematically, relations, but not all relations are database relations, which are in both 1NF and 5NF and we agreed with a statement in a LinkedIn discussion ending as follows: "Update anomalies are not as big of a problem as an algebra where relations aren't closed under join". Unfortunately, update anomalies, closure, and how relational operators were defined are all interrelated and represent an even "bigger problem". Update anomalies are not "bugs", let alone irrelevant, but actually a reflection of that much bigger problem.

In this second part we delve into that problem.

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 1

Follow @DBDebunk Follow @ThePostWest

with David McGoveran

(Title inspired by Richard Feynman)

“As currently defined, relational algebra produces anomalies when applied to non-5NF relations. Since an algebra cannot have anomalies, they should have raised a red flag that RA was not defined quite right, especially defining "relation" as a 1NF table and claiming algebraic closure because 1NF was preserved. Being restricted to tabular representation as the "language" for relationships is like being restricted to arithmetic when doing higher mathematics like differential calculus -- you need more expressive power, not less! Defining RA operations in terms of table manipulations aided initial learning and implementations by making data management look simple and VISUAL. Unfortunately, it was never grasped how much was missing, let alone how much more "intelligent" the RA and the RDBMS needed to be made to fix the problems. And I can see that those oversights were, in part, probably due to having to spend so much time correcting the ignorance in the industry."
--David McGoveran

I recently posted the following Fundamental Truth of the Week on LinkedIn, together with links to more detailed discussions of 1NF and 5NF (see References):

“According to conventional understanding of the RDM (such as it is) [and I don't mean SQL], a relation is in at least first normal form (1NF) -- it has only attributes drawn from simple domains (i.e., no "nested relations") -- the formal way of saying that a relation represents at the logical level an entity group from the conceptual level that has only individual entities -- no groups thereof -- as members. 1NF is required for decidability of the data sublanguage.

However, correctness, namely (1) system-guaranteed logical validity (i.e., query results follow provably from the database) and (2) by-design semantic consistency (of query results with the conceptual model) requires that relations are in both 1NF and fifth normal form (5NF). Formally, the only dependencies that hold in a 5NF relation are functional dependencies of non-key attributes on the PK -- for each PK value there is exactly one value of every corresponding non-key attribute value. This is the formal way of saying that a relation represents facts about a group of entities of a single type.

Therefore we now contend that database relations are BY DEFINITION in both 1NF and 5NF, otherwise all bets are off.”

It triggered a discussion that raised some fundamental issues for which an online exchange is too limiting. This post offers further clarifications, including comments by David McGoveran, on whose interpretation of the RDM (LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming) I rely on. The portions of my interlocutor in the discussion are in quotes.

TYFK: Calculated Attributes -- Redundancy, Full Normalization and Relational Theory

Follow @DBDebunk Follow @ThePostWest

“If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory...”

“Having a "TOTAL" field in your "order" table *might* violate relational theory, but if you make it so that only a trigger can update it based on what's in your "order_item" table, then I think it's fine. You still get data integrity and that is what matters.”

“I still fail to see what you mean by the "calculated TOTALS field" (attribute, really) violates the Relational Model.”

“The result of having the field ... is what is called a DELETE ANOMALY.”

“Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed.”

“There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:
* No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.”
--Database Programmer blog

“...I'm now working with IT to normalize part of the database to remove calculated fields...:
`lineitems`.`extended total` = `lineitems`.`units` * `biditems`.`price`.
`jobs`.`jobvalue` = the sum of related `lineitems`.`extended total` records
`orders`.`ordervalue` = the sum of related `jobs`.`jobvalue` records.”
--mySQL.com

Do calculated attributes (not fields!) violate relational theory and must be "normalized" out of them? Determining that requires foundation knowledge that is scarce in the industry, which has a poor and outdated understanding of the RDM.

Understanding Relational Constraints

Follow @DBDebunk Follow @ThePostWest

“The data in a relational database is stored in form of a table. A table makes the data look organized. Yet in some cases we might face issues while working with the data like repetition. We might want enforce rules on the data to avoid such technical problems. Theses rules are called constraints. A constraint can be defined as a rule that has to enforced on the data to avoid faults. There are three kinds of constraints: entity, referential and semantic constraints. Listed below are the differences between these three constraints:
1. Entity constraints -- primary key, foreign key, unique, NULL -- are posed within a table and used to enforce uniqueness and to define no value [respectively].
2. Referential constraints -- foreign key -- are enforced with more than one table for referring other tables for analysis of the data.
3. Semantic constraints -- datatypes -- are enforced in a table on the values of a specific attribute and help the data segregate according to its type. Example: name varchar2(30).”
--GeeksforGeeks.com

Before we tackle the main subject, let's get some misconceptions out of the way. As we have explained so many times:

Data is not "stored in a form of a table" -- it can be stored in any number of physical formats, at the discretion of DBMS designers and DBAs. Physical independence is a core advantage of the RDM.
A table does not "make the data look organized". Data is by definition organized -- be it relationally or not -- otherwise it would be random noise not data. A database relation can be visualized as a R-table, but tables do not play any role in RDM.
While some "repetition" (i.e., redundancy) is prevented by constraints (e.g., uniqueness), others are avoided by database design (e.g., 5NF DB relations).

And now to constraints.

TYFK: Normalized, Fully Normalized, Non-Normalized, Denormalized -- Clearing the Mess

Follow @DBDebunk Follow @ThePostWest

“A non-normalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.

A normalized database has been organized so that each fact is stored in exactly one place (2nf and greater) and no more than one fact is stored in each place (1nf). In a normalized database there is a place for everything and everything is in its place.

A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain. Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregrations are made between child and parent tables.”
--database-programmer.blogspot.com

OBG: Database Design and Guaranteed Correctness Part 2

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of an earlier post (which now links here), to bring it into line with the current understanding of the RDM derived from McGoveran formalization and interpretation of Codd's work[1]. Reference [9] is also an important re-write and is recommended pre-requisite for this post.

Continued from Part 1

“The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views ... However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system(DBMS). The process of doing database design generally consists of a number of steps which will be carried out by the database designer. Usually, the designer must:

Determine the data to be stored in the database.
Determine the relationships between the different data elements.
Superimpose a logical structure upon the data on the basis of these relationships.
Within the relational model the final step above can generally be broken down into two further steps, that of determining the grouping of information within the system, generally determining what are the basic objects about which information is being stored, and then determining the relationships between these groups of information, or objects.”
--What is a Relational Database, Quora.com

There is, typically, much vagueness and confusion here and instead of debunking it makes more sense to provide a rigorous description of what database design really is: formalization of a conceptual model -- expressed as business rules -- as a logical model for representation in the database using a formal data model. If the data model is the RDM, the logical model consists of relations constrained for semantic consistency with the conceptual mode, the constraints being formalizations of the business rules.

TYFK: How (Not) to Compare NoSQL Systems and RDBMSs

Follow @DBDebunk Follow @ThePostWest

Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.

“But if you still want to compare NOSQL databases with RDBMS, they primarily vary in
1. "normalization" where RDBMS contains normalized (upto certain degree) data and NOSQL based database contains non-normalized data;
2. RDBMS based databases are (I MUST say, generally and it isn't a criteria) fully ACID compliant while NOSQL databases are partially ACID compliant.
3. RDBMS are much slower and difficult to scale while NOSQL databases are much faster and easily scalable.
4. RDBMS normalization was very useful 50 years ago when cost of disk and memory was high, and computation power was limited. With the revolution in computing power, cheapest disk and memory availability has made RDBMS normalization a matter of joke - many people do not really understand why they need to normalize data in today's time.”

First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Note: In what follows RDBMS refers to a truly relational DBMS (of which currently aren't any), not to be confused with a SQL DBMS.

Meaning Criteria and Entity Supertype-Subtypes Relationships

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of a previous post.

"I have a database for a school ... [with] numerous tables obviously, but consider these:

CONTACT - all contacts (students, faculty): has fields such as LAST, FIRST, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT."
"Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows? Or do you have just one of each and have a change log. Which is best?"

How would somebody who "does not know past, or new requirements, modeling, and database design" and messes with a working database just because "he heard something about (insert your favorite fad here)" figure out correct from bad answers? Particularly if the answers suffer from the same lack of foundation knowledge as the question?

Normalization and Further Normalization Part 3: Understanding Database Design

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of two older posts, to bring them into line with McGoveran's formalization, re-interpretation, and extension[1] of Codd's RDM.

In Part 1 we explained that for a database to be relational, database design must adhere to three core principles, in which case it consists of relations that are by definition in both 1NF and 5NF. In Part 2 we showed that whether tables visualize relations (i.e., are R-tables) can be determined only with reference to the conceptual model that the database designer intended the database to represent (not what any users might think it does). This is obscured by the common and entrenched confusion/conflation of levels of representation and, consequently, of types of model -- conceptual, logical, physical, and data model -- that we have so often debunked[2].

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys

Follow @DBDebunk Follow @ThePostWest

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve hears it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com

Such questions, and ad-hoc terms like "identifying relationships"[1] come up because practice is driven by intuition and experience (if any), without the benefit of foundation knowledge[2]. Whether practitioners know/like it or not, a database is a formal computable representation of an informal conceptual model[3] and, therefore, data modeling (i.e., logical database design)[4] is impossible without (1) a well-defined and complete conceptual model and (2) a formal data model with which to formalize it as a logical model[5]and the two should not be confused[6]. Otherwise all bets are off.

Here's how foundation knowledge should have informed modeling and design.

What Relations Really Are and Why They Are Important

Follow @DBDebunk Follow @ThePostWest

Note: Some of the References have been re-written to bring them into line with the McGoveran formalization and interpretation [1] of Codd's real RDM -- re-reading is recommended.

Here's what's wrong with the picture of two weeks ago, namely:

"In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation."

"A relation can be represented by a table in database. A relation in the context of modeling a problem will include the fields and possibly the identification of fields which have relationships with other relations..."

"Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data."

"In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc."

"A relation is a table, which is a set of data. A table is the result of a query."
--What is a relation in database terminology?, StackOverflow.com

Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness

Follow @DBDebunk Follow @ThePostWest

Note: This is a 8/24/17 rewrite of a 5/20/13 post to bring it in line with McGoveran's formal exposition of Codd's RDM [1] and its correct interpretation.

08/25/17: I have added formal definitions of logical validity and semantic correctness.
09/01/17: Minor revisions.
09/02/17: Added references.
03/15/18: Minor revisions.

Here's what's wrong with last week's picture, namely:

"In my experience, using an object model in both the application layer and in the database layer results in an inefficient system. This are my personal design goals:
- Use a relational data model for storage
- Design the database tables using relational rules including 3rd normal form
- Tables should mirror logical objects, but any object may encompass multiple tables- Application objects, whether you are using an OO language or a traditional language using structured programming techniques should parallel application needs which most closely correspond to individual SQL statements than to tables or "objects". --LinkedIn.com

The Principle of Orthogonal Database Design Part III

Follow @DBDebunk Follow @ThePostWest

Note: This is a 11/24/17 re-write of Part III of a three-part series that replaced several previous posts (the pages of which redirect here), to to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part II)

POOD and SQL

As we have seen, if relations are uniquely constrained, with a true RDBMS supporting logical independence (LI) and constraint inheritance, database design can adhere to the POOD and enable DBMS-enforced consistency. A RDBMS can also support ESS explicitly.

Industry misconceptions notwithstanding, SQL DBMSs are, of course, not relational. They have weak declarative integrity support, which, coupled with bad database designs, makes adherence to the POOD (as well as the other design principles) difficult. While even its weak relational fidelity was sufficient to render SQL superior to what preceded it, this is but one example of the many advantages of the RDM that SQL has failed to concretize.

The Principle of Orthogonal Database Design Part II

Follow @DBDebunk Follow @ThePostWest

Note: This is a 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. It requires at least:

A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) the correct subtype relation;

POSTS

Monday, June 19, 2023

Saturday, April 8, 2023

Sunday, August 28, 2022

Saturday, August 20, 2022

Sunday, April 10, 2022

Sunday, December 5, 2021

Thursday, November 11, 2021

Wednesday, October 27, 2021

Sunday, September 19, 2021

Saturday, September 4, 2021

Tuesday, August 31, 2021

Friday, January 1, 2021

Monday, March 23, 2020

Sunday, August 25, 2019

Friday, June 14, 2019

Saturday, March 2, 2019

Sunday, November 26, 2017

Sunday, August 27, 2017

Sunday, November 20, 2016

POOD and SQL

Monday, November 7, 2016