Sunday, June 19, 2016

This Week



1. What's wrong with this picture?

This week's picture is the one of the state of knowledge about keys that Martijn Evers painted in Kinds of Keys: On the Nature of Key Classifications, that I had already commented on. As a result of discussions I've been having with David McGoveran in the context of our forthcoming books (his LOGIC FOR SERIOUS DATABASE FOLKS, my DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS), I've decided to rewrite my
comments On Kinds of Keys on the subject. I refer the reader to Martijn's article for a refresher--my rewrite will be posted next week.

2. Quote of the Week

There are no rules of normalization for non-relational databases. Effectively, you start out by denormalizing everything. Which means you're designing the data organization to serve specific queries. So follow the same principle in NoSQL databases as you would for denormalizing a relational database: design your queries first, then the structure of the database is derived from the queries. --Bill Karwin, What is a good way to design a NoSQL database

Sunday, June 12, 2016

Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation



From last week:

What's wrong with this picture? (Kinds of Data Models, LinkedIn.com)

David Hay: "Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business."
Nothing raises uncertainty whether to laugh or cry better than attempts to dispel confusion which suffer from the very confusion they purport to dispel.

Sunday, June 5, 2016

This Week



1. What's wrong with this picture?
David Hay: Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business.

Nigel Higgs: ... many folks do not get the difference between the Barker entity relationship style of modeling and the relational style of modeling ... [because] the modeling conventions are very similar and the former [is always] a precursor to RDBMS design.

Clifford Heath: Any terminology for models must project three aspects of intention: (a) audience, (b) level of detail and (c) purpose. These three variables are sufficient to discriminate all the main kinds of models in use. The traditional terms of "conceptual/logical/physical" are manifestly inadequate.

Remy Fannader: Models are meant to describe sets of instances (objects or behaviors).
--Kinds of Data Models, LinkedIn.com

2. Quote of the Week

The first consideration that needs to be made when selecting a database is the characteristics of the data you are looking to leverage. If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. --Jnan Dash, RDBMS vs. NoSQL: How do you pick?

Wednesday, May 25, 2016

Why Data Scientists Must Understand Normalization



My May post @All Analytics:

We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list.Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally (and erroneously) "for performance". They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.

Read it all. (Please comment there, not here)



 



Sunday, May 22, 2016

This Week (& a Change in Format)



I am changing the format of the posts on DBDebunk.

"This Week" posts every other week will continue to be followed by posts on data and relational fundamentals, but with a twist: each of the latter posts will explain the preceding week's "What's wrong with this picture?"Here's the first post in the new format

1. What's wrong with this picture?

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary

Sunday, May 15, 2016

Weak Entities, Referential Constraints and the Conceptual-Logical Conflation



Note: This is a rewrite of a 09/07/14 post.

A LinkedIn exchange initiated by the question What is a weak entity? diversified into various aspects of data modeling and database design, some of which was contaminated by the conceptual-logical conflation.

"Peter Chen (E/R Modeling, 1976) used the term "weak" entity to describe one which could not meaningfully have an independent existence. An example might be an Order which requires a customer (and a Product or set of products). A weak entity need not be a composite, as in your OrderItem example. The central issue here is dependency of one entity [of one] type on another [of another type]. Furthermore, there could be more that one such dependency. In your example, the OrderItem would be dependent on BOTH Order and Item.

It is important to note that this dependency will NOT be enforced by a simple RDBMS unless you also define referential integrity on both parts of the composite key, notwithstanding that some RDBMS's (e.g., SQL Server) couple the definition of the relationship with a referential integrity constraint. In other words, you cannot have a defined relationship without enforcing referential integrity.

This is not always desirable. Consider the task of bulk loading some records which have a relationship with some other entity type(s). The only practical way to do this is to first turn off referential integrity enforcement which means deleting the definition of the relationship entirely. After completing the loading of data, if you reinstate the defined relationship, will the system automatically ensure that referential integrity is not violated? Good question."
--Gordon Everest
Relationships in the E/RM are at the conceptual level. Dependencies of entities of one class/type on those of another are one kind of relationships defined as business rules in a conceptual model. Referential constraints are their formal representation at the logical level of a relational database. So it's not that "some" SQL DBMS's (true RDBMS's do not exist) "couple the definition of the relationship with a referential integrity constraint", the constraints are precisely how a RDBMS (what exactly is a "simple RDBMS"?) enforces business rules: a defined relationship is enforced as a referential constraint in a relational database. How does Everest propose to enforce business rules without constraints?

There are no circumstances in which FK constraints "are not desirable" and deactivating them is not the the solution to bulk loads, staging tables are. For example:

"In Oracle you can instruct the DBMS to log violating rows into an 'exceptions' table while enabling a constraint (be it, a primary key, unique key, foreign key, or plain check constraint). This fits the bill better, particularly if it's possible to INSERT...SELECT them after corrections." --Toon Koppelaars
"SQL Server will check on re-enabling the constraint that it is not violated by the new state of the database. If it is, it only reports that an error occurred. It has a real time violation catalog to easily map a constraint name to a specific violation and transaction id. In other words, doing bulk batch constraint validation is part of the product." --William Sisson and Racim Boudjakdji
Much of the criticism of the RDM is rooted in ignorance of data and relational fundamentals and the history of database management.
"How to implement“existence dependence” ... between two entities ...is a design decision and is dependent upon the architecture of the implementation technology. [Chen's E/R models] ... were completely technology agnostic and had no “relational” flavor at all. Why? With the rapid emergence and popularity of the NoSQL, NewSQL, Graph Databases, etc. no longer can relational implementation and FK be assumed and we again need to separate the “conceptual data model” from the “logical data model”. Whether this is evolution of de-evolution is a whole other subject (for those of us that remember WHY relational technology emerged in the 1980s--and as a warning to those who would forget their history--IDMS, IMS, etc.)"

The relational model has been so dominant for so long, we have forgotten our roots and the reason that there were originally three layers of data model, not just two. The rise of the post-relational technologies has again required their creation and reminds us of their utility.
--David Tryon
Data model agnosticism has nothing to do with the “emergence/popularity of non-relational technologies” (NoSQL, NewSQL did not exist when Chen introduced the E/RM), it is  is inherent in conceptual modeling--which is what E/RM is used for--by definition. Conceptual models are expressed in real world terms--facts about classes of property-sharing entities. To be represented in a database, a conceptual model must be formalized as a logical model expressed in database terms. That's when a data model comes into play, to provide abstract data structure, integrity and manipulation (relation, constraints and relational algebra in the relational case). The conceptual and  logical were always separate, but as Everest comments show, levels of representation are constantly confused. 

Incidentally
  • A logical model is implemented in specific hardware and DBMS software, a conceptual model is formalized as a logical model by means of a data model (e.g., the RDM). Using 'implementation' for both induces and reinforces the confusion.
  • Emergence, or popularity are not grounds for technology adoption: a superior generality-to-simplicity ratio (G2SR), soundness and flexibility are. 
If only there were relational dominance. Relational technology was treated by the industry as just any other "hot" fad and was never understood and implemented.  Instead we got SQL DBMS's, whose even poor relational fidelity proved superior to the hierarchic and network products that they replaced, but did not confer all the advantages of true RDBMS's.

Do away with the RDM and constraints and you're back to relying on application developers for database functions, particularly integrity enforcement and optimization, without guaranteed logical and semantic correctness--clearly de-evolution.







View My Stats