
Sunday, April 30, 2023

RELATIONSHIPS AND THE RDM V2 Part 3: SEMANTIC CONSTRAINTS



Note: This is a multipart re-write of a previous series that, when completed, is intended to replace it.

In Part 1 we documented the differences between mathematical and database relations (see table in Part 1). We attributed the fallacy that the RDM can express only one type of relationship -- between relations using FKs -- to the industry being unaware of the adaptation of math relations for database management. We intimated that some of the additional features of database relations express relationships other than between relations.

In Part 2 we identified the intra-group c-relationships (and the corresponding within-relation l-relationships) in our approach to conceptual modeling:

  • Properties-entities relationships
      - general dependencies
  • Properties relationships
  • Entities relationships
      - entity uniqueness
      - functional dependencies (FD)
      - entity supertype-subtypes relationships

and used a simple conceptual model (CM) of five entity groups to illustrate them:

Customers (cID, cname, FICO, discount)
Products (pID, pname, price)
Salesmen (sID, sname, sales, salary, commission)
Orders (oID, pID, cID, sID, date, amount)
Order Items (oID, iID, pID, quantity)

Database design is the use of a data model (DM) -- here, the RDM -- to formalize conceptual models (CM), including c-relationships, as logical models (LM) for database representation. It must, therefore, be able to convert the business rules (BR) that express those relationships in specialized natural language at the conceptual level into formal constraints in a FOPL-based data sublanguage at the logical level.
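To make this concrete, here is a minimal sketch of such a conversion for the Customers group, using SQL merely as a stand-in for a FOPL-based data sublanguage (SQL is not truly relational, and the column types and rule values below are assumptions for illustration, not part of the CM):

-- Illustration only: two kinds of business rules declared as constraints.
CREATE TABLE customers (
  cid      INTEGER NOT NULL,
  cname    VARCHAR(50) NOT NULL,
  fico     INTEGER NOT NULL,
  discount DECIMAL(4,2) NOT NULL,
  CONSTRAINT customers_uniqueness PRIMARY KEY (cid),          -- entity uniqueness
  CONSTRAINT fico_range CHECK (fico BETWEEN 300 AND 850),     -- a property rule (range assumed)
  CONSTRAINT discount_range CHECK (discount BETWEEN 0 AND 1)  -- a property rule (range assumed)
);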

Our intention is to demonstrate that the RDM can express all these c-relationships, but we face a difficulty.

Sunday, April 16, 2023

RELATIONSHIPS & THE RDM V2 PART 2: INTRA-GROUP RELATIONSHIPS



 

Note: This is a multipart re-write of a previous series that, when completed, is intended to replace it.

In Part 1 we attributed the fallacy that the RDM can express only one type of relationship -- between relations, using FKs -- to practitioners being unaware of the adaptation of math relations to database management and missing the additional features of database relations. We documented the differences in features between math and database relations (see the table in Part 1) and intimated that some of the additional features express relationships other than those between relations using FKs (which we leave out in this discussion).

In this Part 2 we identify the c-relationships and use a simple conceptual model (CM) of five entity groups:

Customers (cID, cname, FICO, discount)
Products (pID, pname, price)
Salesmen (sID, sname, sales, salary, commission)
Orders (oID, pID, cID, sID, date, amount)
Order Items (oID, iID, pID, quantity)

to illustrate them (to recall, we prefix 'relationship' with c- and l- when we use the term at the conceptual and logical levels, respectively).

Friday, December 2, 2022

NOBODY UNDERSTANDS FURTHER NORMALIZATION 4 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

In Part 3 we set the matter straight about normalization to 1NF. In this part we do the same with respect to further normalization to 5NF. Non-1NF relations (i.e., those with relation-valued attributes) are no longer part of industry practice, so we focus on 2NF-5NF violations. The term further normalization originates with Codd, who initially thought 1NF sufficient; 2NF-5NF were discovered later (hence, further = beyond 1NF). The industry lumps both under normalization, but the two are distinct (e.g., only further normalization involves redundancy).
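As a reminder of the kind of repair further normalization performs (a hypothetical sketch using the Products/Order Items entities from other posts on this page, not an example from the quoted source): a table bundling two functional dependencies is projected losslessly into two tables, eliminating the redundancy.

-- Hypothetical 2NF violation: pname and price depend on pid alone, not on the
-- full key {oid, pid}, so each product's name and price repeat per order item.
CREATE TABLE order_items_bundled (
  oid      INTEGER NOT NULL,
  pid      INTEGER NOT NULL,
  quantity INTEGER NOT NULL,
  pname    VARCHAR(50) NOT NULL,
  price    DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (oid, pid)
);

-- Further normalization projects it into two tables, each representing facts
-- about a single entity type, with no redundancy.
CREATE TABLE products (
  pid   INTEGER PRIMARY KEY,
  pname VARCHAR(50) NOT NULL,
  price DECIMAL(10,2) NOT NULL
);

CREATE TABLE order_items (
  oid      INTEGER NOT NULL,
  pid      INTEGER NOT NULL REFERENCES products(pid),
  quantity INTEGER NOT NULL,
  PRIMARY KEY (oid, pid)
);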

What's right/wrong with the following?

“So, what is this theory of normal forms? It deals with the mathematical construct of relations (which are a little bit different from relational database tables). First, second, and third normal forms are the basic normal forms in database normalization. Normalization in relational databases is a design process that minimizes data redundancy and avoids update anomalies. Basically, you want each piece of information to be stored exactly once; if the information changes, you only have to update it in one place. The normalization process consists of modifying the design through different stages, going from an unnormalized set of relations (tables), to the first normal form, then to the second normal form, and then to the third normal form.”
--Vertabelo.com

Sunday, October 23, 2022

NOBODY UNDERSTANDS NORMALIZATION 2 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

(Continued from Part 1)

What's right/wrong about this database picture?

“So, what is this theory of normal forms? It deals with the mathematical construct of relations (which are a little bit different from relational database tables). The normalization process consists of modifying the design through different stages, going from an unnormalized set of relations (tables), to the first normal form, then to the second normal form, and then to the third normal form.”
--Vertabelo.com

Misconceptions

  • All database relations are, mathematically, relations, but not all mathematical relations are database relations.
  • The tabular structure plays practically no role in the RDM.
  • In practice there is no normalization (to 1NF) and there should not be further normalization (to 5NF).
  • Further normalization does not go from 2NF sequentially through 3NF and 4NF to 5NF.

Saturday, October 8, 2022

NOBODY UNDERSTANDS NORMALIZATION 1 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

What's right/wrong with this database picture?

“Normalization in relational databases is a design process that minimizes data redundancy and avoids update anomalies. Basically, you want each piece of information to be stored exactly once; if the information changes, you only have to update it in one place. The theory of normal forms gives rigorous meaning to these informal concepts. There are many normal forms. In this article, we’ll review the most basic:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
There are normal forms higher than 3NF, but in practice you usually normalize your database to the third normal form or to the Boyce-Codd normal form, which we won’t cover here.”

--Vertabelo.com

Monday, September 12, 2022

DATABASE DESIGN: THE STATE OF KNOWLEDGE IN THE INDUSTRY



Can you identify all the fallacies and misconceptions in the following online exchange? What is the elephant in the room?
Q: “I have done data normalization on dummy data and would like to know if I did it correctly. If it is done correctly, I would also like to ask two things below, because it is about 3NF.

1NF: This table should be 1NF. 

2NF: I selected composite key (userID and Doors) as they represent minimal candidate key and got three tables applying FD rule.

 

3NF: Applying the rule of transitive dependency on 1st table in 2NF, I got out 4 tables (showing only first two, because the last two remain unchanged).

Questions: Is this database normalisation correct? If not could you point me where I did mistake? If answer on first question is True: Should the last table in 3NF be transformed into two tables, given it is not in correct Third normal form. Two non-key atributes have FD keycode -> accessGroup.”

Sunday, August 28, 2022

NOBODY UNDERSTANDS DATABASE DESIGN 1 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

In a previous SMS post I debunked an attempt to express something important about database practice that was handicapped by lack of foundation knowledge. Here is another example.

“This Codd guy might have been onto something. Unfortunately, normalization is usually taught in a somewhat backwards, overly technical way. If you start with concepts, connections between them and details about them, you usually are already at a fairly high normal form without going through any formal normalization steps.”
--LinkedIn.com

Sunday, September 19, 2021

TYFK: Calculated Attributes -- Redundancy, Full Normalization and Relational Theory



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory...”

“Having a "TOTAL" field in your "order" table *might* violate relational theory, but if you make it so that only a trigger can update it based on what's in your "order_item" table, then I think it's fine. You still get data integrity and that is what matters.”

“I still fail to see what you mean by the "calculated TOTALS field" (attribute, really) violates the Relational Model.”

“The result of having the field ... is what is called a DELETE ANOMALY.”

“Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed.”

“There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:
* No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
--Database Programmer blog

“...I'm now working with IT to normalize part of the database to remove calculated fields...:
`lineitems`.`extended total` = `lineitems`.`units` * `biditems`.`price`.
`jobs`.`jobvalue` = the sum of related `lineitems`.`extended total` records
`orders`.`ordervalue` = the sum of related `jobs`.`jobvalue` records.”
--mySQL.com

Do calculated attributes (not fields!) violate relational theory, and must they be "normalized" away? Determining that requires foundation knowledge that is scarce in the industry, which has a poor and outdated understanding of the RDM.
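For concreteness, here is a minimal sketch -- with hypothetical table and column names loosely following the mySQL.com quote -- of the alternative the exchange argues about: deriving the totals on demand rather than storing them.

-- Hypothetical schema for illustration; not the quoted poster's actual design.
CREATE TABLE lineitems (
  oid   INTEGER NOT NULL,
  iid   INTEGER NOT NULL,
  units INTEGER NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (oid, iid)
);

-- The per-order total derived on demand, instead of being stored in a column
-- that triggers must keep consistent with the line items.
CREATE VIEW order_totals AS
SELECT oid, SUM(units * price) AS order_total
FROM lineitems
GROUP BY oid;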

Tuesday, August 31, 2021

TYFK: Normalized, Fully Normalized, Non-Normalized, Denormalized -- Clearing the Mess



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“A non-normalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.

A normalized database has been organized so that each fact is stored in exactly one place (2nf and greater) and no more than one fact is stored in each place (1nf). In a normalized database there is a place for everything and everything is in its place.

A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain. Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregrations are made between child and parent tables.”
--database-programmer.blogspot.com

Thursday, June 10, 2021

RE-WRITE



See: https://www.dbdebunk.com/2023/08/entities-properties-and-codds-sleight.html

Thursday, August 20, 2020

TYFK: Relations, Tables, Domains and Normalization



Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:

  • Data is stored in tables called relations.
  • Relations can be normalized. In normalized relations, values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”

Monday, July 20, 2020

OBG: Data Independence and "Physical Denormalization"




Note: I am re-publishing some of the articles and reader exchanges from the old DBDebunk (2000-06). How well do they hold up -- have industry knowledge and practice progressed? Judge for yourself and appreciate the difference between a sound foundation and the fad-driven cookbook approach.


January 2, 2001

  "... one of the "4 great lies" is "I denormalize for performance." You state that normalization is a logical concept and, since performance is a physical concept, denormalization for performance reasons is impossible (i.e., it doesn't make sense). What term would you use to describe changing the physical database design to be different from the logical design to enhance performance? Because normalization is a logical concept, you imply that this is not called denormalization."

Saturday, November 30, 2019

TYFK: 5NF, Association Relations and Join





Assume a conceptual model of a multigroup consisting of two related entity groups, Customers and Orders, where a customer can issue multiple orders. The conventional logical database design is:
CUSTOMERS
===============================================
| CID | NAME     | AGE | ADDRESS   | SALARY   |
-=====-----------------------------------------
|   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|   2 | Khilan   |  25 | Delhi     |  1500.00 |
|   3 | Kaushik  |  23 | Kota      |  2000.00 |
|   4 | Chaitali |  25 | Mumbai    |  6500.00 |
|   5 | Hardik   |  27 | Bhopal    |  8500.00 |
|   6 | Komal    |  22 | MP        |  4500.00 |
|   7 | Muffy    |  24 | Indore    | 10000.00 |
-----------------------------------------------

ORDERS
===================================
| OID | DATE       | CID | AMOUNT |
-=====-----------------------------
| 102 | 2009-10-08 |   3 |   3000 |
| 100 | 2009-10-08 |   3 |   1500 |
| 101 | 2009-11-20 |   2 |   1560 |
| 103 | 2008-05-20 |   4 |   2060 |
-----------------------------------
where ORDERS.CID is an "embedded" foreign key (FK) referencing the primary key (PK) CUSTOMERS.CID.
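For reference, a minimal SQL sketch of this conventional design (the column types are assumptions, and SQL is used only because it is the industry's data sublanguage, not because SQL tables are relations):

CREATE TABLE customers (
  cid     INTEGER PRIMARY KEY,
  name    VARCHAR(50) NOT NULL,
  age     INTEGER NOT NULL,
  address VARCHAR(50) NOT NULL,
  salary  DECIMAL(10,2) NOT NULL
);

CREATE TABLE orders (
  oid    INTEGER PRIMARY KEY,
  date   DATE NOT NULL,  -- some DBMSs require quoting this column name
  cid    INTEGER NOT NULL REFERENCES customers(cid),  -- the "embedded" FK
  amount DECIMAL(10,2) NOT NULL
);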

Consider the query "For all orders, find the CID, name, OID, amount, and date" that applies a join of the two relations on CID. In SQL:

SELECT c.cid,c.name,o.oid,o.amount,o.date
FROM customers c
INNER JOIN orders o
ON c.cid = o.cid;
with the result displayed by the table:
====================================================
| C.CID | C.NAME   | O.OID | O.AMOUNT | O.DATE     |
-=======------------=======-------------------------
|     2 | Khilan   |   101 |     1560 | 2009-11-20 |
|     3 | Kaushik  |   102 |     3000 | 2009-10-08 |
|     3 | Kaushik  |   100 |     1500 | 2009-10-08 |
|     4 | Chaitali |   103 |     2060 | 2008-05-20 |
----------------------------------------------------
Note: A table is just a tabular display of a relation and the two should not be confused[1,2]. Bear in mind that SQL tables are not relations.

It may surprise you to know that both the design and the result are problematic from a relational standpoint.

Sunday, August 25, 2019

Meaning Criteria and Entity Supertype-Subtypes Relationships




Note: This is a re-write of a previous post.
"I have a database for a school ... [with] numerous tables obviously, but consider these:
CONTACT - all contacts (students, faculty): has fields such as LAST, FIRST, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT."
"Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?"
How would somebody who "does not know past, or new requirements, modeling, and database design" and messes with a working database just because "he heard something about (insert your favorite fad here)" tell correct answers from bad ones? Particularly if the answers suffer from the same lack of foundation knowledge as the question?

Friday, May 31, 2019

Normalization and Further Normalization Part 1: Databases Representing ... What?




Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.
“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-ID,
  Customer Name,
  Invoice-Address-Line-1,
  Invoice-Address-Line-2,
  Invoice-Address-Line-3,
  Invoice-Address-Line-4,
  Invoice-Address-Postcode,
  Net-Amount,
  VAT,
  Total-Amount
 };
Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-Id
   REFERENCES Customer-Account,
  Net-Amount,
  VAT,
  Total-Amount
 };

CUSTOMER-ACCOUNT
 {Customer-Account-Id,
  Customer-Name,
  Invoice-Address
   REFERENCES Address
 };

ADDRESS
 {Address-Line-1,
  Address-Line-2,
  Address-Line-3,
  Address-Line-4,
  Postcode
 };”
One measure of the scarcity of foundation knowledge in the industry is the number of attempts to correct a plethora of common misconceptions[2] that themselves suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3], which takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and about the conceptual-logical conflation (CLC) that lumps conceptual modeling with data modeling[10,11,12], inhibiting the understanding that the latter is the formalization of the former.

Sunday, April 28, 2019

Understanding Data Modeling Part 3: OO/UML, and "Graph Data Models"




In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the industry's "data modeling" mess that Friesendal has tried to catalog. In Part 2 we applied this knowledge to the first two modeling approaches considered by Friesendal, the E/RM and RDM. We apply it here to the other two, OO/UML and "GDM".


Object Orientation and Unified Modeling Language


“A "counter revolution" against the relational movement was attempted in the 90’s. Graphical user interfaces came to dominate and they required advanced programming environments. Functionality like inheritance, sub-typing and instantiation helped programmers combat the complexities of highly interactive user dialogs. The corresponding Data Modeling tool is the Unified Modeling Language ...”

Saturday, April 20, 2019

Understanding Data Modeling Part 2: "E/RM" and "RDM"




In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the industry's modeling mess that Friesendal has tried to map. We now proceed to apply it to the various industry "data models" considered by Friesendal, and his understanding thereof. In this part, we apply this knowledge to the first two industry "data models" considered by Friesendal -- the E/RM and RDM.


"Entity-Relationship Model"


“One of the first formal attempts at a framework for Data Modeling was the Entity-Relationship data model paradigm proposed [in 1976] by Peter Chen. Notice that in the original Chen-style, the attributes are somewhat independent and the relationships between entities are named and carry cardinalities ("how many" participants in each end of the relationship) ... Attributes are related to their "owner" entity in what other people called "functional dependencies".”

Wednesday, January 9, 2019

Data and Meaning Part 3: Database Design




We have seen in Part 2 that the meaning of data in a database is the conceptual model that the database is intended to represent, namely (1) the three types of objects -- entities of multiple types that form entity groups that form a multigroup -- and (2) the business rules (BR) that specify their properties:
  • Properties in context (PiC) shared by entities of each type;
  • Collective group properties (i.e., relationships among entity group members);
  • Multigroup properties (i.e., inter-group relationships).
Often somebody produces one or more tables and asks if there's "anything wrong" with them,  or "if they are in some specific normal form and, if not, how to normalize them". This reflects lack of foundation knowledge. 

Tuesday, January 1, 2019

Data and Meaning Part 2: Types of Business Rules



 
Per Part 1, meaning is captured during conceptual modeling as information about objects of interest, specifically their properties (some of which are relationships), specified in business rules (BR). Because they are expressed informally in natural language, objects and BRs must be formalized into computable form. Data modeling (we prefer logical database design) uses a formal data model to formalize informal conceptual models as formal logical models for database representation: it assigns the meaning in the former to symbols and expressions in the latter[2]. Using the RDM:

  • Objects -- entities, entity groups, and multigroups -- formalize as tuples, relations, and databases, respectively;
  • Properties formalize as domains, and when associated with entities of specific types, as attributes;
  • Group and multigroup properties -- relationships among entities, and among groups[3] -- formalize as constraints on and among relations enforceable by the DBMS.
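A minimal sketch of what this mapping can look like when the resulting logical model is then expressed in SQL -- which only approximates a truly relational, FOPL-based sublanguage; the Customers/Orders fragment, types, and rule values are assumptions for illustration:

-- Property -> domain (CREATE DOMAIN is standard SQL, though not universally supported).
CREATE DOMAIN fico_score AS INTEGER CHECK (VALUE BETWEEN 300 AND 850);

-- Entity group -> relation (SQL table); each tuple (row) formalizes one entity.
CREATE TABLE customers (
  cid   INTEGER NOT NULL,                      -- property in context -> attribute
  cname VARCHAR(50) NOT NULL,
  fico  fico_score NOT NULL,
  CONSTRAINT customers_key PRIMARY KEY (cid)   -- collective group property -> constraint on the relation
);

-- Multigroup property (inter-group relationship) -> constraint among relations.
CREATE TABLE orders (
  oid INTEGER PRIMARY KEY,
  cid INTEGER NOT NULL REFERENCES customers (cid)
);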

Tuesday, September 18, 2018

Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes




“What is the difference between an index and a key? How are they related?”

“There seams to be some confusion between what a Primary Key is, and what an Index is and how they are used. The Primary Key is a logical object. By that I mean that is simply defines a set of properties on one column or a set of columns to require that the columns which make up the primary key are unique and that none of them are null. Because they are unique and not null, these values (or value if your primary key is a single column) can then be used to identify a single row in the table every time. In most if not all database platforms the Primary Key will have an index created on it. An index on the other hand doesn’t define uniqueness. An index is used to more quickly find rows in the table based on the values which are part of the index. When you create an index within the database, you are creating a physical object which is being saved to disk.”

“A primary key by default creates a clustered index. A unique constraint/key by default creates a non-clustered index.”

“An index is a (logically) ordered list of rows. For example, an index on LastName means all values are already sorted in LastName order. Usually index rows contain far fewer columns in them than the table itself (except the clustered index, which is the table). A key is a column or columns that defines the order of an index. For example, on an index ordered by (LastName,FirstName), then LastName and FirstName are the keys. Btw, a primary key is a physical object, not a logical one. The db engine needs physical rows in order to insure unique values in the index.”
--Difference between an index and a key?, SQLTeam.com
I have recently published a paper[1], and posted a multipart series[2] on relational keys. In the latter I stated as follows:
"As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL) adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL[3]."
I have also written extensively on the widespread logical-physical confusion (LPC)[4], most recently in the key-index context specifically[5]. The replies above are examples -- if any more were needed -- that validate my repeated claim of a lack of foundation knowledge in the industry. Can you tell what is wrong with, and what is correct in, them?
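To make the distinction the quotes blur concrete, here is a minimal generic-SQL sketch (names are hypothetical and syntax details vary by DBMS): a key is a logical constraint declared on a table, while an index is a separate physical access structure that a DBMS may -- or may not -- use to enforce it.

-- Logical: a key is a uniqueness constraint on the columns of a table.
CREATE TABLE employees (
  emp_id     INTEGER NOT NULL,
  last_name  VARCHAR(50) NOT NULL,
  first_name VARCHAR(50) NOT NULL,
  CONSTRAINT employees_pk PRIMARY KEY (emp_id)
);

-- Physical: an index is an access path; whether one backs the key above, and
-- whether it is clustered, is a DBMS implementation detail, not part of the model.
CREATE INDEX employees_name_ix ON employees (last_name, first_name);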