Sunday, September 19, 2021

TYFK: Calculated Attributes -- Redundancy, Full Normalization and Relational Theory



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory...”

“Having a "TOTAL" field in your "order" table *might* violate relational theory, but if you make it so that only a trigger can update it based on what's in your "order_item" table, then I think it's fine. You still get data integrity and that is what matters.”

“I still fail to see what you mean by the "calculated TOTALS field" (attribute, really) violates the Relational Model.”

“The result of having the field ... is what is called a DELETE ANOMALY.”

“Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed.”

“There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:
* No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
--Database Programmer blog

“...I'm now working with IT to normalize part of the database to remove calculated fields...:
`lineitems`.`extended total` = `lineitems`.`units` * `biditems`.`price`.
`jobs`.`jobvalue` = the sum of related `lineitems`.`extended total` records
`orders`.`ordervalue` = the sum of related `jobs`.`jobvalue` records.”
--mySQL.com

Do calculated attributes (not fields!) violate relational theory and must be "normalized" out of them? Determining that requires foundation knowledge that is scarce in the industry, which has a poor and outdated understanding of the RDM.

Saturday, September 11, 2021

OBG: Data Warehouses Are Non-Relational Application Views



Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), Judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references. You can acquire foundation knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, even better, organize one of our on-site SEMINARS, which can be customized to specific needs).


ON DATA WAREHOUSES

(originally posted 11/10/2001)

“It is dawning on me that data warehouse techniques (as advocated by Ralph Kimball) are a response to perceived SQL DBMS performance weaknesses and nothing more. Dimensional modeling is a physical implementation design to deal with what may already be out of date performance tests to back up said design. I agree that where data warehouse logical design deviates from the relational data model there will be trouble down the road. I can attest to the high cost involved in maintaining a data warehouse. I am now questioning three purported benefits of using the current popular data warehouse design techniques.

Physical DBMS designs like the star schema produce faster more predictable query results than a normalized one (I realize normalization is a strictly logical concept but it does appear to have its direct physical mappings in current DBMS systems). Well, I am about to find out. We will be performing some benchmarks. What really are the query times performed on our OLTP system vs. a Star schema for a few relevant reports.

Data warehouses offload query processing from the OLTP system. This is true, but may not be necessary. One needs to thoroughly analyze the traffic on the OLTP system to see if offloading is necessary. We are looking into simply replicating the database (or part of it) for reporting purposes. Replication is far simpler to maintain than a data warehouse.

Data warehouse designs are simpler to understand than a relational one, thereby query construction is easier. I think this is more due to the fact that designers have had a bad habit of throwing up on a wall a full ER chart of their systems. Saying in effect Look how great I am, no one will ever understand this!  Creating ER diagrams of subsystems to describe important characteristics of a database can also be simple to understand.

Item #3 brings up a question. What do you think of OLAP tools such a Microsoft s Analysis Services? We have discovered you can use the tool without redesigning your database. We have seen some pretty fast query times if we take the option of allowing the tool to store data extracted from the production database into its own proprietary format. The opposition to its adoption here is the learning curve to master the MDX query language."
 

Fabian Pascal: The fact is that so-called "dimensional modeling" is logical, not physical modeling. Kimball does not present his "techniques" as just performance-maximization and, what is more, they won’t necessarily yield better performance. He also seems to believe, erroneously, that star-schemas are fully normalized designs. The real solution to performance problems is true RDBMSs with better implementations, not ad-hoc logical designs. Like so many, Kimball simply does not understand relational technology and confuses levels of representation.

I consider warehouses a regression to the good old days of application-biased files -- which we discarded for application-neutral and DBMSs, because they did not prove cost-effective. They are application-specific views of databases that are not derived via relational algebra from relational databases, but are rather non-relational SQL tables. The industry has a long and profitable history of recycling relabeled old failures, witness XML and graph throwbacks to hierarchic databases.

Aside from being necessary for soundness, fully normalized relational databases are the easiest to understand if (1) one thoroughly knows and understands the segment of reality to be represented in the database -- the business, that is -- and (2) data fundamentals. Arbitrary designs such as star-schemas are cope-outs, due to poor knowledge and understanding of the latter.

I am not familiar with the product, but I am generally wary of what vendors do.


Note on re-publication: See also Data Warehouses and the Logical-Physical Confusion.

 

 

 

 

Saturday, September 4, 2021

Understanding Relational Constraints



“The data in a relational database is stored in form of a table. A table makes the data look organized. Yet in some cases we might face issues while working with the data like repetition. We might want enforce rules on the data to avoid such technical problems. Theses rules are called constraints. A constraint can be defined as a rule that has to enforced on the data to avoid faults. There are three kinds of constraints: entity, referential and semantic constraints. Listed below are the differences between these three constraints:
1. Entity constraints -- primary key, foreign key, unique, NULL -- are posed within a table and used to enforce uniqueness and to define no value [respectively].    
2. Referential constraints -- foreign key -- are enforced with more than one table for referring other tables for analysis of the data.
3. Semantic constraints -- datatypes -- are  enforced in a table on the values of a specific attribute and help the data segregate according to its type. Example: name varchar2(30).”
--GeeksforGeeks.com
Before we tackle the main subject, let's get some misconceptions out of the way. As we have explained so many times:

  • Data is not "stored in a form of a table" -- it can be stored in any number of physical formats, at the discretion of DBMS designers and DBAs. Physical independence is a core advantage of the RDM.
  • A table does not "make the data look organized". Data is by definition organized -- be it relationally or not -- otherwise it would be random noise not data.  A database relation can be visualized as a R-table, but tables do not play any role in RDM.
  • While some "repetition" (i.e., redundancy) is prevented by constraints (e.g., uniqueness), others are avoided by database design (e.g., 5NF DB relations).

And now to constraints.

Tuesday, August 31, 2021

TYFK: Normalized, Fully Normalized, Non-Normalized, Denormalized -- Clearing the Mess



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“A non-normalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.

A normalized database has been organized so that each fact is stored in exactly one place (2nf and greater) and no more than one fact is stored in each place (1nf). In a normalized database there is a place for everything and everything is in its place.

A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain. Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregrations are made between child and parent tables.”
--database-programmer.blogspot.com

Friday, August 13, 2021

OBG: The Myth of Market-Based Education



(Originally posted on 09/08/2001, slightly revised)

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

“In a world torn by every kind of fundamentalism -- religious, ethnic, nationalist and tribal -- we must grant first place to economic fundamentalism, with its religious conviction that the market, left to its own devices, is capable of resolving all our problems. This faith has its own ayatollahs. Its church is neo-liberalism; its creed is profit; its prayers are for monopolies.”
--Carlos Fuentes
"We as humans have an instinct for creativity and a moral instinct. A good educational system ought to nurture and encourage these aspects of human life and allow them to flourish. But of course that has problems. For one thing, it means that you will encourage challenge of authority and domination. It will encourage questioning of powerful institutions. So the way schools actually function, by and large, there's a very strong tendency that works its way out in the long run and on average, for the schools to have a kind of filtering effect. They filter out independence of thought, creativity, imagination, and in their place foster obedience and subordination."
--Noam Chomsky
"The educated person is not the person who can answer the questions, but the person who can question the answers"
--T. Schick Jr.

 

Thursday, August 5, 2021

TYFK: Facts, Properties, Relationships, Domains, Relations, Tuples



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

A statement from a 1986 book that "Data are facts represented by values -- numbers, character strings, or symbols -- which carry meaning in a certain context" triggered the following response on Linkedin:
“...In contrast, Date and Darwen (2000) say:
  • Domains are the things that we can talk about.
  • Relations are the truths we utter about those things.
Thus, the declarative sentence "Fred is in the kitchen." is a fact that links the domains Person[s] and Place[s] with the predicate "is in". The complete relation might be made up of three facts:
  • Fred is in the kitchen.
  • Mary is in the garden.
  • Arthur is in the garden.
This seems to be more precise than the 1986  statement.”
To which the book author responded:
“...back then we did not have the refinement, clarity, nor precision from people like Sjir Nijssen and Terry Halpin regarding facts, or elementary fact sentences, which today you and I know are the bedrock of data modeling. Facts are expressed in sentences (with domains and predicates).”

Unfortunately none of this is sufficiently clear and precise to prevent confusion and it inhibits  understanding of the RDM.

Thursday, July 22, 2021

Documents and Databases



'These new data technologies were developed because there are new usage scenarios for data — which do not fit into the relational model.'
--Reddit.com

Don't let the NoSQL label fool you. It's the relational model (RDM), not SQL, that its proponents are really dismissing. The main argument, as advanced in a recent LinkedIn exchange, is that lots of information "cannot be represented in rows and columns". IOW, the RDM is not general enough -- there are certain types of information that it is not suited for. Ignoring the tabular nonsense, the response from David McGoveran, is important enough to restate here.
“Information consists of facts (i.e., propositions asserted to be true) about objects, properties, and relationships among objects and properties. We have shown that a database relation -- which a R-table visualizes -- is constrained to represent a set of facts about (properties of) a group of entities with within-group relationships among properties and entities and cross-group relationships. Yet we are told that document information "do not fit" in a relational structure. They are referred to as "unstructured" (which, if they were, they would contain random noise, not information).

But documents don't lack structure. Rather, they are multistructured: have complex multi-level/type structures -- lots of content, metadata, interpretations, and internal relationships (formatting, semantic, structural or syntactic, and so on). At one level of analysis, they are just documents that have subject matter or content involving objects, properties and relationships. At another they might relate to that of other data (e.g., other documents). How we represent knowledge and in how much detail is determined by which of the structures we choose to represent and that always partially determines the class of queries we can express. This is precisely what Codd understood and tried to address via the RDM.
--David McGoveran

And there's the rub: which type of data (facts) at which document level is of interest? Take this post. There are facts about it (e.g., author, title, date and so on). There are facts in it (its content). Either can can be readily represented relationally, for example:

POSTS (AUTHOR,TITLE,DATE,CONTENT)

where CONTENT is a column defined on a text, PDF, or HTML domain with built-in operators applicable to values of either of those types (e.g., a substring operator for text). Facts at other levels (e.g., grammatical, or semantic) could be of interest and would require multi-table representation. One must choose the type/level of information of interest to represent relationally in a database. We can choose to not do the analysis and modeling of the content of documents, but that does not mean that they are unstructurable as facts. More often than not data professionals don’t know what type of facts are to be represented, or are unfamiliar with data modeling and relational fundamentals. Product advocates avoid to say that without investing time and effort in analysis and modeling one cannot ask the same questions of and produce results equivalent to those from relational databases (i.e., make precise inferences from data that are guaranteed to be correct -- logically valid and semantically consistent). In fact, the use of such products trades upfront structuring effort for subsequent prohibitive manipulation effort.

As David points out, "complaints about RDM are not about knowledge representation, but knowledge discovery -- the problem, for example, that Google Search, analytics and data integration face and attempt to solve. It's an expensive, imprecise, and difficult problem", but it is distinct from what database management does and the two should not be confused.

 

 

 

 

View My Stats