DATABASE DEBUNKINGS: SC

Showing posts with label SC. Show all posts

Saturday, June 1, 2024

SMS: DOMAINS & SQL

I am working on entirely new papers (not re-writes) in the PRACTICAL DATABASE FOUNDATIONS series. I have already published two:

THE FIRST NORMAL FORM - A DEFINITIVE GUIDE
PRIMARY KEYS - A NEW UNDERSTANDING

available for ordering from the PAPERS page, and two more:

RELATIONAL DATABASE DOMAINS: A DEFINITIVE GUIDE
DATABASE RELATIONS: A DEFINITIVE GUIDE

are in progress and forthcoming, respectively.

In the process I am coming across common and entrenched industry "pearls" that I am using for my "Setting Matters Straight" (SMS) and "To Laugh or Cry" (TLC) posts on Linkedin. I do those posts to enable the few thinking database professionals left realize how scarce foundation knowledge is, and to illustrate fallacies that abound in the industry, of which they are unaware, and which the papers are intended to dispel.

Time permitting, I may expose and dispel some of those fallacies, treated in more depth in the papers, such that those thinking professionals can test their knowledge and decide whether the papers are a worthy educational investment.

Here's one.

“A domain in most SQL usage is essentially an alias name for an existing type + restrictions on an existing type that can be used in a column. As for an attribute, it's essentially a COLUMN in SQL, a field in other types of databases, etc.”

Can you identify the fallacies before you proceed?

Data and Meaning Part 2: Types of Business Rules

Follow @DBDebunk Follow @ThePostWest

Per Part 1, meaning is captured during conceptual modeling as information about objects of interest, specifically their properties (some of which are relationships), specified in business rules (BR). Because they are expressed informally in natural language, objects and BRs must be formalized into computable form. Data modeling (we prefer logical database design) uses a formal data model to formalize informal conceptual models as formal logical models for database representation: it assigns the meaning in the former to symbols and expressions in the latter[2]. Using the RDM:

Objects -- entities, entity groups, and multigroups -- formalize as tuples, relations, and databases, respectively;
Properties formalize as domains, and when associated with entities of specific types, as attributes;
Group and multigroup properties -- relationships among entities, and among groups[3] -- formalize as constraints on and among relations enforceable by the DBMS.

What Is a Data Model, and What It Is Not

Follow @DBDebunk Follow @ThePostWest

“The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.”

--Data Model, Wikipedia

What a True Data Model Is

Few practitioners realize that Codd invented the Relational Data Model (RDM) as the first exemplar of a data model, a concept that he formalized in 1980 as follows:

Conceptual Modeling Is Not Data Modeling

Follow @DBDebunk Follow @ThePostWest

“Ok, now that we have those two (Parts 3 and 4 of your series) 'on the table' so to speak, perhaps you would address these questions...

1. Would it be safe to say that facts expressed in a Conceptual model should be verifiable in reality?

2. Are the following facts logically equivalent or are they different:

a) The car with license number 62-JZK-6 has the color aquamarine blue
b) De auto met kenteken 62-JZK-6 heeft de kleur aquamarijnblauw

3. If a previously true fact is found in reality to be verifiably false, would that mean the Conceptual model is wrong or the Logical model, or reality?”

“I'm going to add another:

4. How does RDM handle temporal changes to the 'truth' of statement 2a) when:

a) The owner of the car paints it black.
b) The owner of the license plate legally transfers it to a truck.
c) The owner of the car replaces every single part except the chassis.”

John O'Gorman asked me these questions in a LinkedIn exchange[1] in response to my comments in another exchange on modeling[2], where I alerted to the confusion of levels of representation common in the industry, particularly conceptual-logical conflation(CLC)[3]: calling conceptual modeling data modeling both reflects and induces it.

Online exchanges are not a proper vehicle for learning, particularly foundation knowledge. Which is why I publish free blog posts, and papers and books, to which to refer interested serious data professionals. It just so happened that my just posted four-part series covers the subject at hand[4], so I referred to it, as well as other writings (the answers are already there if one cares to read them). I will not discuss the whole exchanges -- read them and judge for yourself -- but I promised to answer the questions here, where I can do them justice.

John raises primarily conceptual, not data model issues -- the latter are subservient to decisions in the former -- but then asks "how does RDM handle..." From experience, I recognize implicit doubts that the RDM can. As far as we know there is no formal data model[5] that is a superior alternative to the RDM with respect to "handling" conceptual issues (in fact, there is no other formal data model -- i.e., that satisfies Codd's definition -- period).

Since most of the issues involved are covered by McGoveran's work in progress[6] (in which my multi-part series is rooted), to ensure consistency with it I passed the questions by him. As he too pointed out, "Answers that work in all situations require highly complicated discussions and lots of time, and trying to teach someone without proper experience and educational background would be very cumbersome, or an oversimplication via online exchanges."

Here's what's possible within the constraints of a blog post -- the serious reader is referred to our writings.

DISTINCT and ORDER BY Are Not Relational

Follow @DBDebunk Follow @ThePostWest

“One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query ... most people quickly understand:


SELECT DISTINCT length
FROM film

[that] returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates) ... Most people also understand:


SELECT length
FROM film
ORDER BY length

[that] will give us duplicates, but in order ... And, of course, we can combine the two:


SELECT DISTINCT length
FROM film
ORDER BY length

[But if] somewhat intuitively, we may want to order the lengths differently, e.g. by title:


SELECT DISTINCT length
FROM film
ORDER BY title

[m]ost databases [sic] fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight ... this

SELECT length
FROM film
ORDER BY title

works after all ... So, how are these different? We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.”

--How SQL DISTINCT and ORDER BY are Related, Jooq.org

Order Is For Society, Not Databases

Follow @DBDebunk Follow @ThePostWest

8/18/18: I have re-written this post for a better explanation. If you read it prior to the revision, you should re-read it.

“I learned that there is no concept of order in terms of tuples (e.g. rows) in a table, but according to wikipedia "a tuple is an ordered list of elements". Does that mean that attributes do have an order? If yes why would they be treated differently, couldn't one add another column to a table (which is why the tuples don't have order)? [OTOH], "In this notation, attribute–value pairs may appear in any order." Does this mean attributes have no order?”

--Do the “columns” in a table in a RMDB have order?

“Is it possible to reorder rows in SQL database? For example, how can I swap the order of 2nd row and 3rd row's values? The order of the row is important to me since i need to display the value according to the order [and] 'Order by' won't work for me. For example, I put a list of bookmarks in database. I want to display based on the result I get from query. (not in alphabet order). Just when they are inserted. But user may re-arrange the position of the bookmark (in any way he/she wants). So I can't use 'order by'. An example is how the bookmark display in the bookmark in firefox. User can switch position easily. How can I mention that in DB?”

--How can I reorder rows in sql database

While some data professionals may know that rows and columns of "database tables" are "unordered", few of them know what that means, and understand why. This is due to two, not unrelated, of the many common misconceptions[1] rooted in the lack of foundation knowledge in the industry, namely that relational databases consist of tables[2], and logical-physical confusion (LPC)[3]. They obscure understanding of the RDM and its practical implications, which is reflected in the answers to the above questions. Instead of debunking them, this post fills the gap in knowledge such that you can debunk them yourself -- try it before and after you read it.

Industry Practice Is No Substitute for Foundation Knowledge

Follow @DBDebunk Follow @ThePostWest

“A short time ago a colleague asked me where he could find a "Databases 101" guide for the non-technical professional. As it turns out, the internet is littered with information, and mis-information, regarding data and databases. This makes it difficult for someone entering the field of data and databases to understand the bigger picture. He wanted something to help make sense of data and databases, specifically relational versus NoSQL. So, that’s what I decided to work on. This is the start of my Databases 101 guide for the non-database professional; something to help anyone understand why the word “database” is an overloaded term. We use the term "database" to describe a great many things ... I use the terms database and database engine interchangeably in this article.”
--Databases 101, ThomasLaRock.com

To “understand the bigger picture”, and “make sense of data and databases” requires foundation knowledge. In its absence, efforts to educate defeat their own purpose, because they are grounded in the very misinformation they should be correcting. This is the mechanism by which misconceptions[1] are perpetuated and reinforced.

Understanding Relations Part 2: Beware the Misconceptions

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of two older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation of Codd's real RDM, including his own refinements, corrections, and extensions[1].

(Continued from Part 1)

I started Part 1 with comments exposing confusion (among the many misconceptions rampant in the industry[2]) of database relations with tables (reinforced by SQL), and of levels of representation. Debunking them requires foundation knowledge lacking in the industry, so I provided some: I explained the difference between mathematical relations, database relations, and relational tables. We are now in a position to debunk the comments.

Understanding Relations Part 1: Tables? So What?

“Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data.”

“In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc.”

“Practically, a "Relation" in relational model can be considered as a "Table" in actual RDBMS products(Oracle, SQL Server, MySQL, etc), and "Tuples" in a relation can also be considered as "Rows" or "Records" in a table.”

“In common usage, however, when someone refers to a "relation" in a database course, they are referring to a tabular set of data either permanently stored in the database (a table) or derived from tables according to a mathematical description (a view or a query result).”

“In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation. Then from a table, a query can return a different relation.”

“Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables.”

“I read [that] "Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." on the back cover of CJ Date's DATABASE IN DEPTH. Can anyone help how to visualize this multidimensional nature of relations?”

Because SQL DBMSs have been sold as relational databases (which they are not), and in SQL the data structure is the table, in the absence of foundation knowledge[2] most practitioners think that relational databases consist of tables, but do not ask themselves why and how is that significant for database practice. The subtitle of this post is a question I used to ask in presentations years ago that always got silence. I see no evidence of improvement -- in fact, it's gotten worse. To emulate Feynman, "Nobody understands the RDM".

That such a simple and commonly understood structure can visualize relations is an advantage of the RDM, but a table is not a relation and, SQL notwithstanding, confusing the two reflects a lack of understanding of the RDM, misses its significance for database practice, and prevents taking full advantage of its benefits.

Note: The table is the preferred way to picture relations, there are others (e.g., array).

First, the fundamentals.

Foreign Keys Part 2: Beware of Misconceptions

Follow @DBDebunk Follow @ThePostWest

Note: This is the second part of a multipart re-write of several older posts to bring them into line with the McGoveran formalization and re-interpretation of Codd's real RDM, including revisions, refinements, and extesions of his own[1].

(Continued from Part 1)

Part 1 started with an online exchange triggered by the question “Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of?” Both the question and the replies exhibit misconceptions about FKs (there are misconceptions about almost everything in the RDM[2]) rooted in lack of foundation knowledge, so we provided some FK fundamentals. We are now in a position to debunk the replies.

Data Modeling and NoSQL

Follow @DBDebunk Follow @ThePostWest

Revised 3/25/18

"To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.

The main purpose of data modeling is to understand the business, some application domain, some users world. The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain."

"Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model. Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL tool".

" ... increasingly the data already exists in some form. Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data."

--Gordon Everest, LinkedIn.com

I have written extensively on the three levels of representation and four types of model and I won't repeat it here -- readers can refresh their memory if necessary[1,2]. Everest's comments are at best ambiguous with respect to the levels and models (e.g., by data modeling he means business modeling, and his "business data model" lumps together business model and data model). It is to avoid such ambiguities and the resulting confusion that I recommend the three-fold terminology of conceptual modeling, logical database design and physical implementation, eschewing data modeling[3]. Here I will rely on my earlier writings to address strictly the issue of data modeling in the NoSQL context raised by Everest.

Physical Independence Part 2: Logical-physical Confusion

Follow @DBDebunk Follow @ThePostWest

Note: This is a rewrite of older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

Revised 3/17/18

(Continued from Part 1)

This is the second part of my response today to an old DBDebunk query:

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?" --DBDebunk.com

In Part 1 I explained physical independence (PI) and claimed that the industry has failed to internalize its importance. Here I provide evidence to that effect and discuss some consequences.

DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2

Follow @DBDebunk Follow @ThePostWest

Note: This was originally posted at AllAnalytics, which no longer exists, so some links to other posts there no longer work, but I left them in to alert the reader that I have written on those specific subjects. Other links work.

In this two-part series I alert analysts that correct interpretation and assessment of media/industry claims without being misled requires a good grasp of data fundamentals. In Part 1, I discussed the logical-physical confusion and the erroneous missclassification of DBMSs as relational and non-relational underlying the argument that the latter are superior to the former for analytics applications. In Part 2, I discuss a third misconception behind the claim.

What Relations Really Are and Why They Are Important

Follow @DBDebunk Follow @ThePostWest

Note: Some of the References have been re-written to bring them into line with the McGoveran formalization and interpretation [1] of Codd's real RDM -- re-reading is recommended.

Here's what's wrong with the picture of two weeks ago, namely:

"In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation."

"A relation can be represented by a table in database. A relation in the context of modeling a problem will include the fields and possibly the identification of fields which have relationships with other relations..."

"Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data."

"In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc."

"A relation is a table, which is a set of data. A table is the result of a query."
--What is a relation in database terminology?, StackOverflow.com

Understanding Conceptual vs. Data Modeling Part 1: Data Model - The RDM Is, the E/RM Isn't

Follow @DBDebunk Follow @ThePostWest

Re-write 10/16/18

“E/RM is a data model -- So says Date, Chen, etc. So says the majority of current industry experts ... With very strong references to Codd (who he worked with), Date elegantly explains the differences between RM and E/RM -- but clearly believes both are data models (even allowing for the charitable comment). If we take a RDB as the ultimate target implementation of data, and an E/RM (or extended) can correctly design all the artifacts that are implemented, this means it is modeling the data. Granted, an E/RM does not explicitly model some of the non-structural aspects of the original Codd definition.”

“Out of interest, is there a common Relational Modeling tool, that is not also an E/RM tool and models the full Codd definition? There are also several other methods of modeling data -- E/RM is more a mechanism to represent the data. If E/RMs are used by IT professionals across the world to direct the design and build of the majority of applications guided by standard methodologies, is the view of this argument that these were all build wrongly? Regardless of success? Is the inferred conclusion that only the RM models data, and ERM, [or] any other techniques do not? [If so] that is a little limiting.”

Objects, Properties, and Ontological Commitment

We are culturally and linguistically conditioned to conceptualize the world as objects with properties. Objects in a universe thereof that share common properties are of the same type and form a class, distinguishing them from objects that are not and do not. Applying a class definition to the universe selects out the group of objects of that type from the universe.

Philosophical ontology is the study of being, existence, reality, as well as the basic categories of being, and their relationships -- what entities exist or may be said to exist, and how they may be grouped, related, and subdivided according to similarities and differences.

Note: 'Object' is used in the general, not OO sense. Ontology, as used herein, should not be confused with "computer science ontology", whereby the term ontology was usurped, and is understood by programmers as meaning a conceptual graph of directed semantic relationships among objects (and only sometimes among object types).

Conceptual modeling (1) identifies types of objects of interest, and (2) formulates business rules (BR) that specify their properties and relationships and, as such, makes an ontological commitment. Any approach to conceptual modeling must consider the ontological commitment upon which it is based, which has major implications for the data model used to formalize conceptual models as logical models for computable database representation -- it must be consistent with that commitment.

Unfortunately, due to lack of foundation knowledge in the industry[1], practitioners -- both vendors and users -- are largely unaware of, and oblivious to ontological underpinning and their implications for database technology and practice, one reason why they not only stagnated, but regressed in the last five decades. In this multipart series we explain the important distinction between conceptual, and data modeling (aka logical database design), which requires a formal data model. The E/RM is not, and while it can be used for conceptual modeling of reality, not data, we outline a new conceptual modeling approach that makes a different ontological commitment and requires adjustments to the RDM, both necessary for genuine progress.

The Importance of Understanding Classes, Sets, and Relations for Analytics

Follow @DBDebunk Follow @ThePostWest

One of the clearest indications of poor foundation knowledge in data management practice is misuse and abuse of terminology. Many data professionals are inducted into the industry without a formal education, via programming and software tools, and use terms indiscriminately, as jargon, without understanding them. This has produced weak DBMS implementations and poorly designed databases that put the correctness of databased analytics at risk).

Understanding the Division of Labor between Analytics Applications and DBMS

Follow @DBDebunk Follow @ThePostWest

I am coming across, on the one hand, instructions on how to do "analytics with SQL" and, on the other, tools purporting to enable "analytics without SQL." They are an umpteenth iteration of essentially similar ideas during my 30-plus years in data management and reflect common and entrenched fundamental misconceptions that I have documented and analyzed the costly consequences of in my writings and teachings. They will keep repeating, inhibiting genuine progress, as long as data fundamentals are ignored or dismissed. One of the least understood is the distinction between DBMS and application functions.

Database Management: No Progress Without Data Fundamentals

Follow @DBDebunk Follow @ThePostWest

I have recently -- yet again -- been accused in a LinkedIn exchange of "gibberish without any evidence" and of claiming that "nobody know what they're doing" with databases. I will leave it to readers to judge whether (1) five decades worth of writings and teaching is "no evidence" and (2) my comments in the exchange are gibberish. Here I would like to dare anybody to find claims to that effect in any of my pronouncements. What I did, do and will say is that most data professionals do not know and understand data and relational fundamentals -- an incontrovertible fact proved not just by me[1], but also by others[2,3] and that this inhibits real progress in database management.

As I wrote two weeks ago:

"The RDM put database management on a formal, scientific foot. Consequently, tool experience and relational terminology are insufficient -- foundation knowledge is necessary. Unfortunately, most data professionals do not possess it, in part because they have been misled by the industry and in part because few go through an education -- as distinct from training -- program that teaches the RDM and teaches it correctly. Consequently, even those with the heart in the right place defend the RDM without a full understanding, their views distorted by what passes for it (stay tuned for a debunking of such a recent example)."

I will now fulfill the promise by debunking just such a "heart-in-the-right-place" defense of the RDM.

Structure, Integrity, Manipulation: How to Compare Data Models

Follow @DBDebunk Follow @ThePostWest

The IT industry operates like the fashion industry: every few years -- and the number keeps getting smaller -- a "new" data technology pops up, with vendors, the trade media and various "experts" all stepping over each other to claim that it'll "revolutionize your business" and unless you jump on the bandwagon, you'll be "left behind." But time and again these prove to be fads lacking a sound foundation. Huge resources are invested in migrations from fad to fad, rather than in productive work (Don't believe the hype about Hadoop usage, Basta, Big Data It's Time to Say Arrivederci). Remember?

"Hadoop seems to take over relational database, as Hbase can store even unstructured data whereas relational data warehouse limits to structured data ... handles traditional structured data just fine, albeit in a different way than a RDBMS ... EDW vendors [will] incorporate Hadoop framework into their core architectures to enable advanced and high performance analytics."

What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency

Follow @DBDebunk Follow @ThePostWest

Note: This is a 10/23/17 rewrite of a 7/29/12 post to bring it in line with the McGoveran interpretation [1] of Codd's true RDM.

To understand what's wrong with the picture of two weeks ago, namely:

"If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion." --Alex Bunardzic, Should Database Manage The Meaning?

it helps to consider the quote in the context of another article by the author, "The Myth of Data Integrity", where he reveals that those "DBMS opinions" are constraints (the article has been deleted, but a few comments remain online and are highly recommended for a feel of the consequences of lack of foundation knowledge).

POSTS

Saturday, June 1, 2024

Tuesday, January 1, 2019

Sunday, December 2, 2018

What a True Data Model Is

Saturday, November 10, 2018

Wednesday, August 29, 2018

Wednesday, August 15, 2018

Monday, July 23, 2018

Friday, June 29, 2018

Sunday, June 24, 2018

Sunday, June 17, 2018

Saturday, March 24, 2018

Saturday, March 17, 2018

Friday, December 29, 2017

Sunday, November 26, 2017

Wednesday, November 8, 2017

Objects, Properties, and Ontological Commitment

Monday, October 30, 2017

Monday, October 2, 2017

Sunday, September 17, 2017

Tuesday, August 1, 2017

Sunday, June 11, 2017