1. Database truth of the week
"Relvars introduce a concept of assignment, which has no counterpart in either FOPL or set theory. If you add it to those formalisms you introduce computational completeness, which destroys both decidability (the existence of a general algorithm by which you can determine if an expression is or is not logically valid) and the guarantee that there exists a (query) evaluation procedure that will halt (the existence of a general algorithm by which you can evaluate the truth or falsity of every instantiated predicate expression given those instantiations from any given database). Therefore we must forbid relvars." --David McGoveran
2. What's wrong with this database picture?
"Many
data and information modelers talk about all kinds of keys (or
identifiers. I'll forego the distinction for now). I hear them talk
about primary keys, alternate keys, surrogate keys, technical keys,
functional keys, intelligent keys, business keys (for a Data Vault),
human keys, natural keys, artificial keys, composite keys, warehouse
keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a
debate rises on the use (and misuse) of all these keys ... The foremost
question we should actually ask ourselves: can we formally disambiguate
kinds of keys (at all)? Of all kinds of key, the primary key and the
surrogate key gained the most discussion."
"If
we take a look at the relational model we only see of one or more
attributes that are unique for each tuple in a relation -- no other
formal distinction is possible. When we talk about different kinds of
keys we base our nomenclature on properties and behavior of the
candidate keys. We formally do not have a primary key, it is a choice we
make and as such we might treat this key slightly different from all
other available keys in a relation. The discussion around primary keys
stems more from SQL NULL problems, foreign key constraints and
implementing surrogate keys. --Martijn Evers, Kinds of Keys: On the Nature of Key Classifications, dm-unseen.blogspot.com
"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:
"1st:
NAME VALUE
=================
dept HR
dept fin
role engineer
role designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
NAME VALUE_JSON_CLOB
==================================
dept {["HR", "fin"]}
role {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com
Using a relational database to "maintain lists" probably does not merit attention and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice and SQL DBMSs and Codd's true RDM, as formalized and interpreted by McGoveran [1]. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".
Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].
1. Database truth of the week
"ALL
names are human created, either by non-algorithmic assignment, or via
some algorithm. We ONLY know that two types of objects are distinct
because they have different sets of defining properties and, for a given
object type, we ONLY know that two objects are distinct because the
values (observed or measured) of that object type's defining properties
are distinct. Names (of objects of some type) allow us to distinguish
two such entities ONLY when they are 1:1 with the values of the object
defining properties. Two sets of names (whether human assigned or
machine generated) consistently identify the same set of entities ONLY
when they are 1:1." --David McGoveran
2. What's wrong with this database picture?
"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:
"1st:
NAME VALUE
=================
dept HR
dept fin
role engineer
role designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
NAME VALUE_JSON_CLOB
==================================
dept {["HR", "fin"]}
role {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There
is no DELETE operation, only SELECT and INSERT/UPDATE. In first
advantage is only INSERT is required but SELECT (fetch all values for a
given NAME) will be slow. In second SELECT will be fast but UPDATE will
be slow. By considering there could be 10000s of such lists with 1000s
for possible values in the system with frequent SELECTs and less
INSERTs, which TABLE design will be good in terms of
select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com
Note:
This was originally posted at AllAnalytics, which no longer exists,
so some links to other posts there no longer work, but I left them in to
alert the reader that I have written on those specific subjects. Other
links work.
In this two-part series I alert analysts that correct interpretation and
assessment of media/industry claims without being misled requires a
good grasp of data fundamentals. In Part 1, I discussed
the logical-physical confusion and the erroneous missclassification of
DBMSs as relational and non-relational underlying the argument that the
latter are superior to the former for analytics applications. In Part 2,
I discuss a third misconception behind the claim.
The following posts have been re-written to bring in line with the McGoveran formalization and interpretation of Codd's true RDM. Re-reading is strongly recommended.
"But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values
of relation attributes defined on domains. The difference between
relation names is, thus, meaningful information, the representation of
which violates the IP and the RDM, for which reason it is inaccessible
to the DBMS: consider the candidate tuple {v1,v2}
-- it is impossible for the DBMS to know to which relation it belongs
based on the relation and attribute names because it does not understand
semantics!" Database Design Relation Predicates and “Identical Relations”
"Some
set defining properties are formed as the disjunction of two or more
properties (a kind of relationship between two common properties). These
disjuncts, taken together, are meaning criteria. Each meaning
criterion (an individual disjunct) induces a partitioning of a set into
two subsets, those that meet the criterion and those that do not.
Alternatively, we can say that each meaning criterion serves to
differentiate a possible subset of a set from other subsets of the set
(some of the possible subsets will be disjoint, while others not). Each
of the possible subsets of the set is then defined by (“inherits”):
The defining properties of the set conjoined with at least one meaning
criterion (that or those becoming the defining property, or properties,
respectively, specific to the proper subset)." Meaning Criteria and Entity Supertype-Subtypes
"Although they are no longer used, inquiries about them persist and with the current proliferation of non-relational products (e.g., NoSQL, graph DBMSs) there is value in understanding them. The closest the industry came to implementing the RDM is SQL which, despite its poor relational fidelity, proved much superior relative to the complexity and inflexibility of preceding DBMSs. But the rules still expose poor relational fidelity of SQL DBMS's that have not been addressed for four decades, while new RDM violations were introduced.
We offer here our clarifications on the rules. For each rule, we:
- Explain its intended objective;
- Offer clarifications, some of which reflect our current understanding of the RDM -- distinct from conventional wisdom -- based on its dual theoretical foundation and a careful analysis of Codd's work;" --Interpreting Codd's 12 Rules
1. Database truth of the week
"Within the database field, it is common to refer to three “level” of description: conceptual, logical, and physical. Both the logical level and the physical level are formal systems. By contrast, the conceptual level is typically an informal system and refers to the subject of the database.
The conceptual language is a subject language, in the terminology of formal systems. The conceptual level identifies the concepts to be formally represented by the logical and physical levels, and how users think and talk about those concepts. This level corresponds only informally to the so-called “conceptual schema” of earlier approaches to information management, which emphasized the capture of conceptual information using various techniques including diagrams and documentation having various degrees of formality, but not forming a strictly formal system themselves." -- David McGoveran
2. What's wrong with this database picture?
I re-wrote two older debunkings to bring them in line with the McGoveran formalization and interpretation of Codd's true RDM. Re-reads are recommended.
"Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other ..."
"I have a database for a school ... [with] are numerous tables obviously but consider these:
CONTACT - all contacts (students, faculty) has fields such as LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password for electronic timesheet login, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT.
Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? ..."
Here's what's wrong with last week's picture, namely:
"Per Date’s AN INTRODUCTION TO DATABASE SYSTEMS, Date & Darwen’s DATABASES, TYPES, AND THE RELATIONAL MODEL, and related references, the features of a relational database are values, types, attributes, tuples, relations, relation-valued variables, operators, and constraints.
- A type is a set of values and related operators.
- An attribute is a name, value, type triple.
- A tuple is a set of attributes.
- A relation is a set of tuples with a given heading.
- A relation-valued variable (known as a relvar) is a persistent variable whose time-varying value is a relation." --Dave Voorhis, Computer scientist; lead developer of Rel, a true relational database system, Quora.com
This is more or less the conventional wisdom, which is nothing like the true RDM envisioned by Codd [1].