DATABASE DEBUNKINGS: BR

Showing posts with label BR. Show all posts

Monday, June 19, 2023

PREDICATE LOGIC, SEMANTICS AND RDM (sms)

Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

“As I have said many times, if the original relational model had been based on predicate logic and also the semantics and rules of definitions we'd all be better off now. It wasn't. Full stop.”
--Ronald Ross, LinkedIn.com

Assessing such arguments normally requires clarification of what exactly is meant by "the relational model". Ross does refer specifically to the "original" -- which we take to mean that introduced by Codd in 1969-70 -- but given the massive misuse and abuse in the industry, perceptions of it may well be corrupted (Nobody Understands the Relational Model Semantics, Relational Closure and Database Correctness). Moreover, there are many predicate logic (PL) systems and many ways of categorizing them (1st vs n-th order being only one way) -- we assume Ross means RDM is based on none.

CONCEPTUAL MODELING, LOGICAL DATABASE DESIGN AND PHYSICAL IMPLEMENTATION (sms)

Follow @DBDebunk Follow @ThePostWest

“A conceptual data model usually just includes the main concepts (entities) required to store information and the relationships that exist between these entities. We don’t usually include any details about each piece of information. We can consider the conceptual stage as an initial model, without all the details required to create a database.

A logical data model is probably the most-used data model. It goes beyond the conceptual model; it includes entities, relationships, details on entities’ different attributes, and unique ways to identify entities (primary keys) and establish the relationships between them (foreign keys).

A physical data model is usually derived from a logical data model for a particular relational database management system (RDBMS), thus taking into account all technology-specific details. One big difference between logical and physical data models is that we now need to use table and column names rather than specifying entity and attribute names. This allows us to adapt to the limits and conventions of the desired database engine. We also provide the actual data types and constraints that allows us to store the desired information.”

--Vertabelo.com

CONCEPTUAL BUSINESS RULES AND LOGICAL CONSTRAINTS (sms)

Follow @DBDebunk Follow @ThePostWest

What's right/wrong about this database picture?

“Other than constraints on cardinality, business rules are not generally represented on data models of either kind. Even in the case of business data models, the models are supposed to represent fundamental structures, while business rules represent variable constraints.”

--TDan.com

OBG: Database Consistency and Physical Truth

Follow @DBDebunk Follow @ThePostWest

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may slightly revise, break into parts, and/or add comments and/or references.

This is an email exchange with a reader responding to my third book.
(Originally posted on 06/21/2001)

“I'm presently reading your book PRACTICAL ISSUES IN DATABASE MANAGEMENT and there are a couple of points that I find a little confusing. I'll start first by saying that I have no formal database oriented education, and I'm attempting to familiarize myself with some of the underlying theories and practices, so that I can further my personal education and career prospects (but aren't we all!). My questions may sound a little bit ignorant, but that would be because I am! (Please note ignorant, not stupid!) I'll quote you directly from the book for this (possibly I'm taking you out of context or missing something important)

Chapter 3, A Matter of Identity: Keys, pg. 75: "Databases represent assertions of fact - propositions - about entities of interest in the real world. The representation must be correct - only true propositions (facts) must be represented."

Now, correct me if I'm wrong with a basic assumption here, but isn't a database simply a model of a "real world" data collection? I would've thought that the intention of a database would be to model real life effectively (and accurately) enough to provide useful data for interpretation. Now obviously this is not an easy process with complex data types, but would it even be possible to have a 100% true proposition with only atomic data types? (i.e. can a simplified model contain only facts?) In my understanding of modeling, any model that fits real life closely enough to be a good statistical representation is a usable model. e.g. Newton's Laws are accurate enough when applied on a local scale, but we need to use Einstein's model of space-time across larger scales. Wouldn't recording only "facts" (which I would presume you mean to be statements that are provable in the objective sense i.e. no interpretation, only investigation or calculation) possibly eliminate the utility of some aspects of the database? Or do we account for the interpretative aspect in the metadata or in some other way?

Essentially, I can see what you're saying, but not necessarily how you've reached the conclusion. Admittedly in an ideal world we should be able to record only facts in a database, but this is not an ideal world. As an example, in surveys we see such questions as "Are you happy with this product?" followed by a rating system of 1-5, or 'completely unhappy to completely happy'. This is an artificial enforcement of a quantitative measure on a qualitative property. How do we account for the fact that this is interpreted data and not calculated or measured?

My questions may have little relevance to database theory in general, but the concept fascinates me!”

Understanding Relational Constraints

Follow @DBDebunk Follow @ThePostWest

“The data in a relational database is stored in form of a table. A table makes the data look organized. Yet in some cases we might face issues while working with the data like repetition. We might want enforce rules on the data to avoid such technical problems. Theses rules are called constraints. A constraint can be defined as a rule that has to enforced on the data to avoid faults. There are three kinds of constraints: entity, referential and semantic constraints. Listed below are the differences between these three constraints:
1. Entity constraints -- primary key, foreign key, unique, NULL -- are posed within a table and used to enforce uniqueness and to define no value [respectively].
2. Referential constraints -- foreign key -- are enforced with more than one table for referring other tables for analysis of the data.
3. Semantic constraints -- datatypes -- are enforced in a table on the values of a specific attribute and help the data segregate according to its type. Example: name varchar2(30).”
--GeeksforGeeks.com

Before we tackle the main subject, let's get some misconceptions out of the way. As we have explained so many times:

Data is not "stored in a form of a table" -- it can be stored in any number of physical formats, at the discretion of DBMS designers and DBAs. Physical independence is a core advantage of the RDM.
A table does not "make the data look organized". Data is by definition organized -- be it relationally or not -- otherwise it would be random noise not data. A database relation can be visualized as a R-table, but tables do not play any role in RDM.
While some "repetition" (i.e., redundancy) is prevented by constraints (e.g., uniqueness), others are avoided by database design (e.g., 5NF DB relations).

And now to constraints.

OBG: Don't Confuse Levels of Representation Part 1

Follow @DBDebunk Follow @ThePostWest

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is an email exchange with readers in response to my article Normalization and Performance: Never the Twain Shall Meet.

Normalization and Further Normalization Part 3: Understanding Database Design

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of two older posts, to bring them into line with McGoveran's formalization, re-interpretation, and extension[1] of Codd's RDM.

In Part 1 we explained that for a database to be relational, database design must adhere to three core principles, in which case it consists of relations that are by definition in both 1NF and 5NF. In Part 2 we showed that whether tables visualize relations (i.e., are R-tables) can be determined only with reference to the conceptual model that the database designer intended the database to represent (not what any users might think it does). This is obscured by the common and entrenched confusion/conflation of levels of representation and, consequently, of types of model -- conceptual, logical, physical, and data model -- that we have so often debunked[2].

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys

Follow @DBDebunk Follow @ThePostWest

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve hears it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com

Such questions, and ad-hoc terms like "identifying relationships"[1] come up because practice is driven by intuition and experience (if any), without the benefit of foundation knowledge[2]. Whether practitioners know/like it or not, a database is a formal computable representation of an informal conceptual model[3] and, therefore, data modeling (i.e., logical database design)[4] is impossible without (1) a well-defined and complete conceptual model and (2) a formal data model with which to formalize it as a logical model[5]and the two should not be confused[6]. Otherwise all bets are off.

Here's how foundation knowledge should have informed modeling and design.

Class, Type, Set, Relvar, and Relation

Follow @DBDebunk Follow @ThePostWest

Note: This is a rewrite of a part of an older post (now redirecting here), to bring into line with McGoveran's formalization, re-interpretation, and extension of Codd's RDM[1] (the rewrite of the other part was posted last week).

“[According to Date] relvar ≠ class. [But i]n simple terms, class applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a relation is a class (and trivially also a set), which contributes to confusion.”

“In modern programming parlance, class is generally distinguished from type only in that the latter refers to primitive (system-defined) data definitions, while class refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, type and class are actually synonymous.”

Class, type, and set are often used interchangeably in the industry. Relations are neither class, nor type, and Date's relvars must be placed properly in their formal context. While details regarding these concepts vary with the flavor of set theory, they are sufficiently well defined to be distinguishable in each of the three formal foundations of the RDM, simple set theory (SST), mathematical relation theory, and first order predicate logic (FOPL).

Understanding Domains and Attributes

Follow @DBDebunk Follow @ThePostWest

Note: This is a rewrite of one section of an older post (page thereof now links here), to bring it into line with McGoveran's formalization, re-interpretation, and extension of Codd's RDM[1]. The rewrite of the other part will be posted next.

“I don't understand the concepts of domain and attribute in relational database modeling. Can someone give me an effective example?”

“Domain is an overloaded word in the DB lexicon. It probably should also be avoided. When one refers to an attribute domain in practice it is only referring to columns that have a check constraint on them that limit the values. Reference tables with foreign key constraints in general also fulfill the spirit of what domain attributes do outside of an RDBMS.”

“A domain in most SQL usage is essentially an alias name for an existing type + restrictions on an existing type that can be used in a column. As for an attribute, it's essentially a COLUMN in SQL, a field in other types of databases, etc.”

To the extent that practitioners are familiar with domains, they equate them with programming data types (PDT), or, at best, with SQL data types.

Test your foundation knowledge -- are domains the same as PDTs or SQL data types?

Data and Meaning Part 4: Query and Result Correctness

Follow @DBDebunk Follow @ThePostWest

As we have seen in Parts 1, 2, and 3, the RDM is a formal theory adapted and applied to database management: database relations (1) preserve the formal properties of mathematical relations, but also (2) have interpretations -- carry a real world meaning assigned by a conceptual model: facts about entities, entity groups, and multigroups (i.e., their properties, some of which are relationships, specified by business rules (BR)). A relation is formally in 5NF and constrained for semantic consistency (i.e., to represent facts about an entity group).

“When we create specific domains, relations, and attributes we are constraining (restricting) an abstract logical system to a specific interpretation (meaning). Seen the other way around, an interpretation of the logical system is a representation of a specific segment of the world, and that is exactly the purpose of database design. For example, an attribute name created by the designer is assigned meaning intended by the modeler as representing an entity property, which is the very meaning of semantics. That is why full normalization cannot be achieved or assessed without reference to some conceptual model -- what attribute names mean, and how they are related to each other (i.e., their dependencies), and so on.” --David McGoveran

Yet requesting and giving design advice without a conceptual model is routine in the industry[1]. What is more, most practitioners are oblivious to the implications for correctness of queries and results[2].

Data and Meaning Part 3: Database Design

Follow @DBDebunk Follow @ThePostWest

We have seen in Part 2 that the meaning of data in a database is the conceptual model that the database is intended to represent, namely (1) the three types of objects -- entities of multiple types that form entity groups that form a multigroup -- and (2) the business rules (BR) that specify their properties:

Properties in context (PiC) shared by entities of each type;
Collective group properties (i.e., relationships among entity group members);
Multigroup properties (i.e., inter-group relationships).

Often somebody produces one or more tables and asks if there's "anything wrong" with them, or "if they are in some specific normal form and, if not, how to normalize them". This reflects lack of foundation knowledge.

Data and Meaning Part 2: Types of Business Rules

Follow @DBDebunk Follow @ThePostWest

Per Part 1, meaning is captured during conceptual modeling as information about objects of interest, specifically their properties (some of which are relationships), specified in business rules (BR). Because they are expressed informally in natural language, objects and BRs must be formalized into computable form. Data modeling (we prefer logical database design) uses a formal data model to formalize informal conceptual models as formal logical models for database representation: it assigns the meaning in the former to symbols and expressions in the latter[2]. Using the RDM:

Objects -- entities, entity groups, and multigroups -- formalize as tuples, relations, and databases, respectively;
Properties formalize as domains, and when associated with entities of specific types, as attributes;
Group and multigroup properties -- relationships among entities, and among groups[3] -- formalize as constraints on and among relations enforceable by the DBMS.

Understanding Conceptual vs. Data Modeling Part 3: Don't Conflate Reality and Data

Follow @DBDebunk Follow @ThePostWest

In Part 1 and Part 2 we explained that between 1975-81, when the E/RM and RDM were introduced, there was no distinction between an informal conceptual and a formal logical level. In 1980, however, Codd defined a formal data model and in the later 80s the conceptual-logical-physical levels of representation emerged. If applied to the two models:

Only the RDM satisfies the definition;
The E/RM can be used at the conceptual level to model reality, the latter can be used to model data at the logical level (i.e., formalize conceptual models as logical models for database representation).

Current practitioners, however, continue to confuse levels of representation and confuse/conflate types of model. So much so, that in my presentations I used to draw an imaginary line dividing the room into two sections, and move to the right section to discuss one level/model, and to the left section to discuss another.

Consider the question "does data modeling slow down an application development process?". I will set aside the notion of "speeding up" application development by skipping altogether "data modeling" (whichever way it is meant), and focus on the response.

RE-WRITE

Follow @DBDebunk Follow @ThePostWest

See: https://www.dbdebunk.com/2018/09/designation-property-and-assertion.html

Sunday, July 15, 2018

Understanding Relations Part 3: Debunking Conventional Wisdom

Follow @DBDebunk Follow @ThePostWest

(See Part 1 and Part 2)

“A common term used in database design is a "relational database" -- but a database relation is not the same thing and does not imply, as its name suggests, a relationship between tables. Rather, a database relation simply refers to an individual table in a relational database. In a relational database, the table is a relation because it stores the relation between data in its column-row format. The columns are the table's attributes, while the rows represent the data records. A single row is known as a tuple to database designers.”

“A relation, or table, in a relational database has certain properties.”

“First off, its name must be unique in the database, i.e. a database cannot contain multiple tables of the same name.”

“Next ... as with the table names, no attributes can have the same name.”

“Next, no tuple (or row) can be a duplicate. In practice, a database might actually contain duplicate rows, but there should be practices in place to avoid this, such as the use of unique primary keys (next up). Given that a tuple cannot be a duplicate, it follows that a relation must contain at least one attribute (or column) that identifies each tuple (or row) uniquely. This is usually the primary key. This primary key cannot be duplicated. This means that no tuple can have the same unique, primary key. The key cannot have a NULL value, which simply means that the value must be known.”

“Further, each cell, or field, must contain a single value. For example, you cannot enter something like "Tom Smith" and expect the database to understand that you have a first and last name; rather, the database will understand that the value of that cell is exactly what has been entered.”

“Finally, all attributes—or columns—must be of the same domain, meaning that they must have the same data type. You cannot mix a string and a number in a single cell.”

“All these properties, or constraints, serve to ensure data integrity, important to maintain the accuracy of data.” --Definition of Database Relation

It is easy to discern when explanations of relational features are not grounded in the formal foundations of the RDM[1], but in industry practices. Here are some further clarifications and corrections.

Understanding Relations Part 1: Tables? So What?

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of two older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation of Codd's real RDM, including his own refinements, corrections, and extensions[1]

“Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data.”

“In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc.”

“Practically, a "Relation" in relational model can be considered as a "Table" in actual RDBMS products(Oracle, SQL Server, MySQL, etc), and "Tuples" in a relation can also be considered as "Rows" or "Records" in a table.”

“In common usage, however, when someone refers to a "relation" in a database course, they are referring to a tabular set of data either permanently stored in the database (a table) or derived from tables according to a mathematical description (a view or a query result).”

“In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation. Then from a table, a query can return a different relation.”

“Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables.”

“I read [that] "Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." on the back cover of CJ Date's DATABASE IN DEPTH. Can anyone help how to visualize this multidimensional nature of relations?”

Because SQL DBMSs have been sold as relational databases (which they are not), and in SQL the data structure is the table, in the absence of foundation knowledge[2] most practitioners think that relational databases consist of tables, but do not ask themselves why and how is that significant for database practice. The subtitle of this post is a question I used to ask in presentations years ago that always got silence. I see no evidence of improvement -- in fact, it's gotten worse. To emulate Feynman, "Nobody understands the RDM".

That such a simple and commonly understood structure can visualize relations is an advantage of the RDM, but a table is not a relation and, SQL notwithstanding, confusing the two reflects a lack of understanding of the RDM, misses its significance for database practice, and prevents taking full advantage of its benefits.

Note: The table is the preferred way to picture relations, there are others (e.g., array).

First, the fundamentals.

The Importance of Understanding Classes, Sets, and Relations for Analytics

Follow @DBDebunk Follow @ThePostWest

One of the clearest indications of poor foundation knowledge in data management practice is misuse and abuse of terminology. Many data professionals are inducted into the industry without a formal education, via programming and software tools, and use terms indiscriminately, as jargon, without understanding them. This has produced weak DBMS implementations and poorly designed databases that put the correctness of databased analytics at risk).

Database Management: No Progress Without Data Fundamentals

Follow @DBDebunk Follow @ThePostWest

I have recently -- yet again -- been accused in a LinkedIn exchange of "gibberish without any evidence" and of claiming that "nobody know what they're doing" with databases. I will leave it to readers to judge whether (1) five decades worth of writings and teaching is "no evidence" and (2) my comments in the exchange are gibberish. Here I would like to dare anybody to find claims to that effect in any of my pronouncements. What I did, do and will say is that most data professionals do not know and understand data and relational fundamentals -- an incontrovertible fact proved not just by me[1], but also by others[2,3] and that this inhibits real progress in database management.

As I wrote two weeks ago:

"The RDM put database management on a formal, scientific foot. Consequently, tool experience and relational terminology are insufficient -- foundation knowledge is necessary. Unfortunately, most data professionals do not possess it, in part because they have been misled by the industry and in part because few go through an education -- as distinct from training -- program that teaches the RDM and teaches it correctly. Consequently, even those with the heart in the right place defend the RDM without a full understanding, their views distorted by what passes for it (stay tuned for a debunking of such a recent example)."

I will now fulfill the promise by debunking just such a "heart-in-the-right-place" defense of the RDM.

Don't Confuse/Conflate Database Consistency with Truth

Follow @DBDebunk Follow @ThePostWest

Disregard for foundation knowledge and failure to learn from past mistakes by even data professionals deemed experts inhibit progress in data management and bring back problems already resolved that should be of foremost concern to data analysts. Consider the following:

"Above all else, we count on databases to reflect the truth consistently, or at least to reflect the table data perfectly. The database cannot be blamed when an application (or the end users of an application) place inaccurate data in its tables, but a database must accurately report the data it holds. Therefore, bugs are not all created equal; there are bugs, and there are wrong-rows bugs, bugs that silently misrepresent the data that the tables hold. Even the craziest, most obscure corner case that potentially misrepresents your data should rightly bring a loud chorus: "The emperor has no clothes!" We depend on the database, above all, not to lie."

POSTS

Monday, June 19, 2023

Saturday, February 4, 2023

Sunday, January 22, 2023

Friday, November 5, 2021

Saturday, September 4, 2021

Thursday, March 25, 2021

Friday, June 14, 2019

Saturday, March 2, 2019

Saturday, February 16, 2019

Sunday, February 10, 2019

Saturday, January 19, 2019

Wednesday, January 9, 2019

Tuesday, January 1, 2019

Sunday, October 28, 2018

Tuesday, September 11, 2018

Sunday, July 15, 2018

Sunday, June 24, 2018

Monday, October 30, 2017

Sunday, September 17, 2017

Friday, September 1, 2017