DATABASE DEBUNKINGS: IP

Saturday, June 11, 2022

ORDER & RELATIONAL DATABASES (sms)

Note: In "Setting Matters Straight" I post on LinkedIn online Q&As that involve fundamentals, under the header "What's Right and Wrong with this Database Picture", and then debunk them here. The purpose is to induce practitioners to test their foundation knowledge against our debunking, in which we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

Q: “I'm not sure what this means: "The order of the rows and columns is immaterial to the DBMS?" -- could anyone explain?”

A: “It means two things:
The engine is under no obligation to insert new rows immediately following the previously inserted row(s)... During processing of selects, the optimizer is free to use any index it finds efficient to use or none at all... For this reason, if the order of returned data is important to your processing, then you must include an ORDER BY clause.”

Q: “How do you reorder fields in the database?”

A: “Depends on how you define "reorder". What view of your data are you trying to set the order. Are you in Table Design view? ... Are you looking at form? The answer is different depending on what you are referring to.”
--Quora.com
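
The first answer's practical point can be sketched with Python's built-in sqlite3 module. The `film` table and its rows are hypothetical, chosen only for illustration; the sketch assumes nothing beyond the answer's claim that the sequence of returned rows is guaranteed only when the query requests it with ORDER BY.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT PRIMARY KEY, length INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO film (title, length) VALUES (?, ?)",
    [("Charlie", 120), ("Alpha", 90), ("Bravo", 60)],
)

# Without ORDER BY the DBMS may return rows in any sequence,
# so the result is compared as a set, never as a list.
unordered = set(conn.execute("SELECT title FROM film"))

# With ORDER BY the presentation sequence is part of the request.
ordered = [row[0] for row in conn.execute("SELECT title FROM film ORDER BY title")]
print(ordered)
```

The insertion sequence (Charlie, Alpha, Bravo) plays no role in either result: the first is a set, and the second is sequenced by an attribute value.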

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 2

Follow @DBDebunk Follow @ThePostWest

with David McGoveran

(Title inspired by Richard Feynman)

In Part 1 we explained that all database relations are, mathematically, relations, but not all relations are database relations, which are in both 1NF and 5NF, and we agreed with a statement in a LinkedIn discussion ending as follows: "Update anomalies are not as big of a problem as an algebra where relations aren't closed under join". Unfortunately, update anomalies, closure, and how relational operators were defined are all interrelated and represent an even "bigger problem". Update anomalies are not "bugs", let alone irrelevant, but actually a reflection of that much bigger problem.

In this second part we delve into that problem.

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 1

with David McGoveran

(Title inspired by Richard Feynman)

“As currently defined, relational algebra produces anomalies when applied to non-5NF relations. Since an algebra cannot have anomalies, they should have raised a red flag that RA was not defined quite right, especially defining "relation" as a 1NF table and claiming algebraic closure because 1NF was preserved. Being restricted to tabular representation as the "language" for relationships is like being restricted to arithmetic when doing higher mathematics like differential calculus -- you need more expressive power, not less! Defining RA operations in terms of table manipulations aided initial learning and implementations by making data management look simple and VISUAL. Unfortunately, it was never grasped how much was missing, let alone how much more "intelligent" the RA and the RDBMS needed to be made to fix the problems. And I can see that those oversights were, in part, probably due to having to spend so much time correcting the ignorance in the industry."
--David McGoveran

I recently posted the following Fundamental Truth of the Week on LinkedIn, together with links to more detailed discussions of 1NF and 5NF (see References):

“According to conventional understanding of the RDM (such as it is) [and I don't mean SQL], a relation is in at least first normal form (1NF) -- it has only attributes drawn from simple domains (i.e., no "nested relations") -- the formal way of saying that a relation represents at the logical level an entity group from the conceptual level that has only individual entities -- no groups thereof -- as members. 1NF is required for decidability of the data sublanguage.

However, correctness, namely (1) system-guaranteed logical validity (i.e., query results follow provably from the database) and (2) by-design semantic consistency (of query results with the conceptual model) requires that relations are in both 1NF and fifth normal form (5NF). Formally, the only dependencies that hold in a 5NF relation are functional dependencies of non-key attributes on the PK -- for each PK value there is exactly one value of every corresponding non-key attribute value. This is the formal way of saying that a relation represents facts about a group of entities of a single type.

Therefore we now contend that database relations are BY DEFINITION in both 1NF and 5NF, otherwise all bets are off.”
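
The functional dependency condition in the quoted statement (for each PK value, exactly one value of every non-key attribute) can be checked mechanically against a set of tuples. The following is a minimal sketch under assumed names: `fd_holds`, the attribute names, and the sample tuples are all hypothetical. It covers only the functional dependency part of the statement, and checking sample data can only refute a dependency, not establish it as a design-level constraint.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Row = Dict[str, Hashable]

def fd_holds(rows: Iterable[Row], determinant: Sequence[str], dependent: Sequence[str]) -> bool:
    """Return True if no two rows agree on `determinant` but differ on `dependent`."""
    seen: Dict[Tuple, Tuple] = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # setdefault records the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical tuples: emp_id is the assumed primary key.
employees = [
    {"emp_id": 1, "name": "Ann", "dept": "Sales"},
    {"emp_id": 2, "name": "Bob", "dept": "Sales"},
]
violating = employees + [{"emp_id": 1, "name": "Ann", "dept": "HR"}]

print(fd_holds(employees, ["emp_id"], ["name", "dept"]))  # True
print(fd_holds(violating, ["emp_id"], ["name", "dept"]))  # False
```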

It triggered a discussion that raised some fundamental issues for which an online exchange is too limiting. This post offers further clarifications, including comments by David McGoveran, on whose interpretation of the RDM (LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming) I rely. My interlocutor's portions of the discussion are in quotes.

OBG: Missing Data -- Many-valued Logics and NULL Part 1

Note: To demonstrate the correctness and stability of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In response to a LinkedIn exchange we continue the series about missing data, NULL and the RDM. In Parts 1, 2 and 3 we re-published a past exchange between myself and Hugh Darwen on the pros and cons of our relational solution to missing data vs. Hugh's "horizontal decomposition".

Here we re-publish my debunking of reactions to an article of mine exhibiting the common confusions evoked by NULL.

No Such Thing As "Current Relational Data Models"

“... the concept of a state group is indeed a missing modeling concept in relational/current data models...”

Thus in a LinkedIn exchange. I don't know what a "state group" is, but I have spent almost six decades debunking misuses of the term "data model" in general and abuses of the RDM in particular, and I smell them from miles away. While the time when lack of foundation knowledge shocked me is long gone, practitioners' total unawareness of and indifference to it, and their poor reasoning in a field founded on logic, never cease to amaze me.

What exactly are "relational/current data models"?

Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes

“What is the difference between an index and a key? How are they related?”

“There seams to be some confusion between what a Primary Key is, and what an Index is and how they are used. The Primary Key is a logical object. By that I mean that is simply defines a set of properties on one column or a set of columns to require that the columns which make up the primary key are unique and that none of them are null. Because they are unique and not null, these values (or value if your primary key is a single column) can then be used to identify a single row in the table every time. In most if not all database platforms the Primary Key will have an index created on it. An index on the other hand doesn’t define uniqueness. An index is used to more quickly find rows in the table based on the values which are part of the index. When you create an index within the database, you are creating a physical object which is being saved to disk.”

“A primary key by default creates a clustered index. A unique constraint/key by default creates a non-clustered index.”

“An index is a (logically) ordered list of rows. For example, an index on LastName means all values are already sorted in LastName order. Usually index rows contain far fewer columns in them than the table itself (except the clustered index, which is the table). A key is a column or columns that defines the order of an index. For example, on an index ordered by (LastName,FirstName), then LastName and FirstName are the keys. Btw, a primary key is a physical object, not a logical one. The db engine needs physical rows in order to insure unique values in the index.”

--Difference between an index and a key?, SQLTeam.com

I have recently published a paper[1], and posted a multipart series[2] on relational keys. In the latter I stated as follows:

"As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL) adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL[3]."

I have also written extensively on widespread logical-physical confusion (LPC)[4], recently specifically in the key-index context[5]. The replies above are examples -- if any more were needed -- that validate my repeated claim of lack of foundation knowledge in the industry -- can you tell what's wrong with, and what's correct in, them?
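
One way to see the distinction at issue, sketched with Python's sqlite3 module: a key constraint is a declared rule that the DBMS enforces, while a plain index is an access path that constrains nothing. Table, column, and index names are hypothetical, and how a given DBMS implements either (for example, whether it builds an index to enforce the constraint) is an implementation detail that varies by product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one declares a key, the other only gets an index.
conn.execute("CREATE TABLE keyed (emp_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL)")
conn.execute("CREATE TABLE indexed (emp_id INTEGER NOT NULL, last_name TEXT NOT NULL)")
conn.execute("CREATE INDEX indexed_emp_id ON indexed (emp_id)")  # non-unique index

def try_duplicate(table: str) -> bool:
    """Insert the same emp_id twice; return True if the DBMS accepted both rows."""
    conn.execute(f"INSERT INTO {table} (emp_id, last_name) VALUES (1, 'Smith')")
    try:
        conn.execute(f"INSERT INTO {table} (emp_id, last_name) VALUES (1, 'Jones')")
        return True
    except sqlite3.IntegrityError:
        return False

key_accepts_duplicate = try_duplicate("keyed")      # the constraint rejects it
index_accepts_duplicate = try_duplicate("indexed")  # the index does not care
print(key_accepts_duplicate, index_accepts_duplicate)
```

The declared key states what must hold regardless of how rows are stored or found; the index affects only how rows are found.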

DISTINCT and ORDER BY Are Not Relational

“One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query ... most people quickly understand:


SELECT DISTINCT length
FROM film

[that] returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates) ... Most people also understand:


SELECT length
FROM film
ORDER BY length

[that] will give us duplicates, but in order ... And, of course, we can combine the two:


SELECT DISTINCT length
FROM film
ORDER BY length

[But if] somewhat intuitively, we may want to order the lengths differently, e.g. by title:


SELECT DISTINCT length
FROM film
ORDER BY title

[m]ost databases [sic] fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight ... this

SELECT length
FROM film
ORDER BY title

works after all ... So, how are these different? We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.”

--How SQL DISTINCT and ORDER BY are Related, Jooq.org
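
The fourth query in the quote is ill-defined because each distinct length may correspond to several titles, so "order by title" does not say which title to use. A hedged sketch with Python's sqlite3 module: stating the choice explicitly (here the alphabetically first title per length, an assumption made purely for illustration) yields a well-defined request. The `film` rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT PRIMARY KEY, length INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO film (title, length) VALUES (?, ?)",
    [("Alpha", 120), ("Bravo", 90), ("Charlie", 120), ("Delta", 60)],
)

# One row per length; each group is sequenced by its first title,
# so the ordering expression has exactly one value per result row.
lengths = [
    row[0]
    for row in conn.execute(
        "SELECT length FROM film GROUP BY length ORDER BY MIN(title)"
    )
]
print(lengths)  # [120, 90, 60]
```

Length 120 belongs to both Alpha and Charlie; MIN(title) resolves the ambiguity that the DISTINCT version leaves open.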

Order Is For Society, Not Databases

8/18/18: I have re-written this post for a better explanation. If you read it prior to the revision, you should re-read it.

“I learned that there is no concept of order in terms of tuples (e.g. rows) in a table, but according to wikipedia "a tuple is an ordered list of elements". Does that mean that attributes do have an order? If yes why would they be treated differently, couldn't one add another column to a table (which is why the tuples don't have order)? [OTOH], "In this notation, attribute–value pairs may appear in any order." Does this mean attributes have no order?”

--Do the “columns” in a table in a RMDB have order?

“Is it possible to reorder rows in SQL database? For example, how can I swap the order of 2nd row and 3rd row's values? The order of the row is important to me since i need to display the value according to the order [and] 'Order by' won't work for me. For example, I put a list of bookmarks in database. I want to display based on the result I get from query. (not in alphabet order). Just when they are inserted. But user may re-arrange the position of the bookmark (in any way he/she wants). So I can't use 'order by'. An example is how the bookmark display in the bookmark in firefox. User can switch position easily. How can I mention that in DB?”

--How can I reorder rows in sql database

While some data professionals may know that rows and columns of "database tables" are "unordered", few of them know what that means or understand why. This is due to two related misconceptions among the many common ones[1] rooted in the lack of foundation knowledge in the industry: that relational databases consist of tables[2], and logical-physical confusion (LPC)[3]. They obscure understanding of the RDM and its practical implications, which is reflected in the answers to the above questions. Instead of debunking them, this post fills the gap in knowledge such that you can debunk them yourself -- try it before and after you read it.
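
As a minimal sketch for the bookmark question, assuming that an order meaningful to users has to be represented explicitly as data: the `bookmark` table and its `position` column are hypothetical names, and Python's sqlite3 module is used only to make the example runnable. The rows carry no inherent sequence; the user's arrangement is an attribute value, and ORDER BY on that attribute produces the display sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical design: the user's arrangement is recorded as an attribute.
# (No UNIQUE constraint on position here, to keep the swap to one statement.)
conn.execute(
    "CREATE TABLE bookmark (url TEXT PRIMARY KEY, position INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO bookmark (url, position) VALUES (?, ?)",
    [("a.example", 1), ("b.example", 2), ("c.example", 3)],
)

def swap(pos_a: int, pos_b: int) -> None:
    """Exchange the bookmarks at two positions in a single UPDATE."""
    conn.execute(
        "UPDATE bookmark SET position = CASE position WHEN ? THEN ? ELSE ? END "
        "WHERE position IN (?, ?)",
        (pos_a, pos_b, pos_a, pos_a, pos_b),
    )

swap(2, 3)  # the user drags the third bookmark above the second
display = [r[0] for r in conn.execute("SELECT url FROM bookmark ORDER BY position")]
print(display)  # ['a.example', 'c.example', 'b.example']
```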

Foreign Keys Part 2: Beware of Misconceptions

Note: This is the second part of a multipart re-write of several older posts to bring them into line with the McGoveran formalization and re-interpretation of Codd's real RDM, including revisions, refinements, and extensions of his own[1].

(Continued from Part 1)

Part 1 started with an online exchange triggered by the question “Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of?” Both the question and the replies exhibit misconceptions about FKs (there are misconceptions about almost everything in the RDM[2]) rooted in lack of foundation knowledge, so we provided some FK fundamentals. We are now in a position to debunk the replies.

A New Understanding of Keys Part 2: Kinds of Keys

Note: This is the second of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

(Continued from Part 1)

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."

--Martijn Evers, dm-unseen.blogspot.com

I've deplored the misuse and abuse of terminology due to a general lack of foundation knowledge in the industry[3] for longer than I care to remember, and keys are no exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever it is, a data structure that contains NULLs is not a relation, one reason for which SQL tables are not relations, SQL databases are not relational and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs see [4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management. Thus, their "nomenclature on properties and behavior" should reflect what from the real world they represent, and what function they fulfill in the RDM. Which is precisely what the industry disregards.

Relational Fidelity, Cursors and ORDER BY

Here's what's wrong with the last database picture, namely:

"In a book I am reading (QUERYING SQL SERVER 2012) the author talks about theory of how databases work. He mentions relations, attributes and tuples etc. He frequently stresses the fact that some aspect of T-SQL is not relational. Like in the following excerpt:

"T-SQL also supports an object called a cursor that is defined based on a result of a query, and that allows fetching rows one at a time in a specified order. You might care about returning the result of a query in a specific order for presentation purposes or if the caller needs to consume the result in that manner through some cursor mechanism that fetches the rows one at a time. But remember that such processing isn’t relational. If you need to process the query result in a relational manner--for example, define a table expression like a view based on the query--the result will need to be relational. Also, sorting data can add cost to the query processing. If you don’t care about the order in which the result rows are returned, you can avoid this unnecessary cost by not adding an ORDER BY clause."

I would like to know, since every implementation of SQL pretty much has an ORDER BY clause which makes it non-relational, why does it even matter that (the set after ORDER BY is used) its not relational anymore since its like that everywhere? I can understand if he said it was non-standard, for example using != instead of <> for inequality because that affects portability etc., but I do not understand why something is better being relational. Please enlighten." --stackoverflow.com

Simple Domains and Value Atomicity

09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE

11/09/22: Revised

Here's what's wrong with last week's picture, namely:

Q: "I'm currently trying to design a database and I'm not too sure about the best way to approach a dynamically sized array field of one of my objects. My first thought is to use a column in my object to store an array of integers. However the more I read, the more I think this isn't the best option. Concrete example wise, I have a player object that stores 0 to many items, which are represented by an integer. What is the best way to represent this?"

A: "If a collection of values is atomic, store them together. Meaning, if you always care about the entire group, if you never search for nested values and never sort by nested values, then they should be stored together as a single field value. If not, they should be stored in a separate table, each value bring a row, each assigned the parent ID (foreign key) of a record on the other table that "owns" them as a group. For more info, search on the term "database normalization".

Some databases, support an array as a data type. For example, Postgres allows you to define a column as a one-dimension array, or even a two dimension array. If your database does not support array as a type of column definition, transform you data collection into an XML or JSON support if your database your database supports that type. For example, Postgres has basic support for storing, retrieving, and non-indexed searching of XML using XPath. And Postgres offers excellent industry-leading support for JSON as a data type including indexed support on nested values. Going this XML/JSON route can be an exception to the normalization rules I mentioned above." --StackOverflow.com
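
The "separate table" option the answer describes can be sketched as follows with Python's sqlite3 module. All names (`player`, `player_item`, `item_id`) are hypothetical, and the sketch assumes a player holds each item at most once; it illustrates that one option only, not a judgment on the answer as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE player (player_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE player_item ("
    " player_id INTEGER NOT NULL REFERENCES player (player_id),"
    " item_id INTEGER NOT NULL,"
    " PRIMARY KEY (player_id, item_id))"
)
conn.execute("INSERT INTO player VALUES (1, 'Ann')")
conn.execute("INSERT INTO player VALUES (2, 'Bob')")  # Bob holds zero items
conn.executemany("INSERT INTO player_item VALUES (?, ?)", [(1, 10), (1, 42)])

# "Which players hold item 42?" needs no unpacking of a stored collection.
holders = [r[0] for r in conn.execute(
    "SELECT p.name FROM player p JOIN player_item i ON i.player_id = p.player_id "
    "WHERE i.item_id = 42")]

# An item row for a nonexistent player is rejected by the DBMS.
try:
    conn.execute("INSERT INTO player_item VALUES (99, 10)")
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False
print(holders, orphan_accepted)
```

"Zero to many items" is then simply zero to many rows for a given player, with each item value individually addressable and constrainable.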

A focus on physical implementation ("dynamically sized array field") without the well-defined conceptual and logical features it is supposed to represent ("a player object" is hardly enough), and confusion of levels of representation (a real world object does not "store" anything), are always a red flag, an indication of a poor grasp of foundation knowledge. So let's introduce some.

Outsmarting the DBMS: Analysts Should Beware

Revised 5/4/2020.

Last month I alerted you to the failure by data professionals to appreciate the importance, for a variety of critical reasons, of reliance on the DBMS rather than application code for integrity enforcement and data manipulation. The following is an example of the consequences:

"If you have multiple boolean fields in a record, consider combining them into a single Integer field. For instance in a User record create a single UserType field instead of 6 separate field for IsTrainee, IsManager, IsTrainer, IsHR, IsSupplier, IsSupport. By assigning 1,2,4,8 and 16, 32 as "yes" values for these then we can say that a value of 3 in this UserType field tell us that they are both Trainee and a Manager; 36 that they are the Trainer, and they are responsible for Support. The advantage of combining these into one field is that is another type can be added (e.g., IsFirstAider=64) without adding a field."

Note: "File," "record," and "field" are physical implementation concepts. The logical design concepts are relation (visualizable as R-table), tuple (visualizable as row) and attribute (visualizable as column). Using the proper terms reduces the likelihood of the confusion of levels of representation that is rampant in the industry and has deleterious consequences[1].
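
The arithmetic behind the quoted suggestion, and what it costs, can be sketched as follows. The flag values are the ones given in the quote; the function name `decode` and the `app_user` table are hypothetical. With the packed encoding, the meaning of 36 is known only to application code; with one attribute per fact, the DBMS itself can answer queries and enforce constraints.

```python
import sqlite3

# Flag values as assigned in the quoted advice.
FLAGS = {"IsTrainee": 1, "IsManager": 2, "IsTrainer": 4,
         "IsHR": 8, "IsSupplier": 16, "IsSupport": 32}

def decode(user_type: int) -> set:
    """Application-side decoding that the packed encoding makes necessary."""
    return {name for name, bit in FLAGS.items() if user_type & bit}

print(sorted(decode(3)))   # ['IsManager', 'IsTrainee']
print(sorted(decode(36)))  # ['IsSupport', 'IsTrainer']

# Alternative: one attribute per fact, declared to the DBMS (two shown for brevity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_user ("
    " user_id INTEGER PRIMARY KEY,"
    " is_trainer INTEGER NOT NULL CHECK (is_trainer IN (0, 1)),"
    " is_support INTEGER NOT NULL CHECK (is_support IN (0, 1)))"
)
conn.execute("INSERT INTO app_user VALUES (1, 1, 1)")
trainers = [r[0] for r in conn.execute(
    "SELECT user_id FROM app_user WHERE is_trainer = 1")]
print(trainers)  # [1]
```

Nothing in the packed design stops an application from storing 64 or 4096 before any such flag is defined; the declared attributes carry their own constraints.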

The Principle of Orthogonal Database Design Part I

Note: This is a 11/24/17 re-write of Part I of a three-part series that replaced several older posts (the pages of which now redirect here), to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org

Well, not quite.

Relation Predicates and Identical Relations

Note: This is a 11/25/17 re-write of an earlier post, to bring it in line with the McGoveran formalization and interpretation [1] of Codd's real RDM.

Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."

A1: "I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."

A2: "Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."

A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads." --StackOverflow.com

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant

Here's what's wrong with last week's picture, namely:

RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com
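
A minimal sketch, in plain Python, of why duplicate rows are redundant rather than informative. A relation is a set of tuples, so asserting the same fact twice changes nothing; a bag (list) that admits duplicates makes results depend on how many times a fact happened to be recorded. The sample facts are hypothetical.

```python
# Hypothetical facts: (supplier, part, quantity)
facts_as_bag = [("S1", "P1", 300), ("S1", "P1", 300), ("S2", "P1", 100)]
facts_as_relation = set(facts_as_bag)  # a set cannot hold the same tuple twice

total_from_bag = sum(qty for _, _, qty in facts_as_bag)
total_from_relation = sum(qty for _, _, qty in facts_as_relation)

print(len(facts_as_bag), total_from_bag)            # 3 700
print(len(facts_as_relation), total_from_relation)  # 2 400
```

If the two identical rows were meant to record two distinct shipments, the distinguishing information is missing from the data, and no count of duplicates can supply it reliably.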

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure":

First Normal Form in Theory and Practice Part 3

09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE

Note: This is a 11/23/17 revision of Part 3 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part 2)

"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?" or "What's the best way to store an array in a relational database?" do not seem to evoke associations with 1NF. This reveals poor foundation knowledge.

Part 1 introduced the poor understanding of 1NF and Part 2 provided a correct definition and explanation. Part 3 explains how 1NF can be enforced by the data sublanguage, which SQL fails to do.

First Normal Form in Theory and Practice Part 1

09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE

Note: This is a 11/23/17 revision of Part 1 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the McGoveran formalization and interpretation [1] of Codd's true RDM.

On the one hand:

"... there is no generally accepted definition of 1NF ... the word that you see most often is 'atomic'. It is common to say that a relation is in 1NF if all its attributes [sic] are atomic ... Does 1NF equate to “atomic attribute [values]”? ... what [do] people have in mind [when they claim] atomicity? ... the [meaning] behind definitions is that you should rarely need to extract information from a value of an attribute ... But that explains why one cannot decide, depending on theory only, whether a relation is in 1NF ... it is a habitual use of data that makes attributes atomic, not theory. No wonder, there is so much mess in theory about what 1NF should be."

--What is the actual definition of First Normal Form, Vertabelo.com

On the other, according to a DBDebunk reader:

"Codd in 1969/70 (and RM V/2 20 years later) gave a precise, theory-based definition of "atomic" aka "simple" aka "non-decomposable" (later aka non-"compound" aka non-"structured"): not relation-valued. And he gave a precise definition of "normalized" (1NF): relations free of relation-valued-domains (RVD)!"

All sorts of other definitions proliferate, for example:

"First normal form enforces these criteria:

- Eliminate repeating groups in individual tables.
- Create a separate table for each set of related data.
- Identify each set of related data with a primary key."
--First normal form, Wikipedia

Note: A relational database consists of relations that can be visualized as R-tables. Normal forms are a property of relations, not R-tables -- an "R-table in 1NF" is shorthand for consistency with the underlying relation.

Re-Write

Understanding Codd's 12 Rules for RDBMS

Sunday, February 1, 2015

Re-write

See series starting at

The First Normal Form in Theory and Practice Part I

POSTS

Saturday, June 11, 2022

Thursday, November 11, 2021

Wednesday, October 27, 2021

Saturday, January 9, 2021

Thursday, May 28, 2020

Tuesday, September 18, 2018

Wednesday, August 29, 2018

Wednesday, August 15, 2018

Sunday, June 17, 2018

Sunday, April 22, 2018

Sunday, August 13, 2017

Sunday, February 19, 2017

Tuesday, January 31, 2017

Monday, September 19, 2016

Saturday, September 3, 2016

Monday, July 25, 2016

Tuesday, April 19, 2016

Tuesday, March 1, 2016

Sunday, September 20, 2015

Sunday, February 1, 2015