Monday, February 10, 2014

"Denormalization for Performance": Don't Blame the Relational Model



 REVISED: 10/18/16

Many common misconceptions are excellent indicators of a poor grasp, if any, of the relational data model (RDM). One of the most entrenched is the notion of "denormalization for performance".

I will not get into the first four of the 5 Claims About SQL, Explained. I do not disagree with the facts, except to point out that the problems are not due to the relational nature of SQL and its implementations -- quite the opposite: they are due to their poor relational fidelity. I will focus on the fifth, "Should everything be normalized?"


Note: I must again stress the distinction between:

  • Normalization (to 1NF): Elimination of attributes that are not defined on simple domains (i.e., domains without meaningful components, whose values are treated as atomic by the data language);
  • Further normalization (beyond 1NF up to 5NF): Replacement of relations that represent facts of multiple types with projections that each represent facts of a single type (sketched below).
They are qualitatively different, but the term 'normalization' is used interchangeably for both -- given the context, Bolenok uses it in the latter sense. We use 'further' explicitly when we refer specifically to the latter.
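For concreteness, a minimal sketch of further normalization (the tables and columns -- EMP_DEPT, EMP, DEPT, EMP_NO, DEPT_NO, CITY -- are invented for illustration): a relation that bundles facts of two types is replaced by projections that each represent facts of a single type.

    -- Not fully normalized: bundles two fact types, "employee E is
    -- assigned to department D" and "department D is located in city C".
    -- CITY depends on DEPT_NO, not on the key EMP_NO.
    CREATE TABLE EMP_DEPT (
      EMP_NO  INTEGER PRIMARY KEY,
      DEPT_NO INTEGER NOT NULL,
      CITY    VARCHAR(30) NOT NULL
    );

    -- Further normalization: two projections, each a single fact type.
    CREATE TABLE EMP (
      EMP_NO  INTEGER PRIMARY KEY,
      DEPT_NO INTEGER NOT NULL
    );

    CREATE TABLE DEPT (
      DEPT_NO INTEGER PRIMARY KEY,
      CITY    VARCHAR(30) NOT NULL
    );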

I do not know about "everything", but yes, for practical reasons, all relations should be fully normalized (in 5NF). If adhered to, the Principle of Full Normalization (POFN) ensures that the database complies with three principles of formal systems that give them desirable properties [1]:

  • The Principle of Orthogonal Design (POOD)
  • The Principle of Representational Minimality (PORM)
  • The Principle of Expressive Completeness (POEC)
Chief among the several benefits from compliance with the three principles via full normalization is guaranteed semantic correctness -- no undesirable side effects (i.e., anomalous results). One practical implication is view updatability and logical independence (LI) -- the insulation of applications from non-loss logical reorganizations of the database. Because SQL databases are not guaranteed to be fully normalized, multi-relation views are not updatable, with loss of LI.
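To illustrate the view updatability point, a sketch only (the names are invented and behavior varies by SQL DBMS), using the EMP and DEPT tables sketched above:

    -- A multi-relation (join) view; names are illustrative.
    CREATE VIEW EMP_CITY AS
      SELECT E.EMP_NO, E.DEPT_NO, D.CITY
      FROM   EMP E JOIN DEPT D ON E.DEPT_NO = D.DEPT_NO;

    -- Most SQL DBMSs reject or restrict updates through such a view,
    -- so an update like the following typically fails and applications
    -- cannot be insulated behind the view.
    UPDATE EMP_CITY SET CITY = 'Boston' WHERE EMP_NO = 100;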
"Database [full] normalization, to put it simply, is designing a database foolproof to incorrect input of data ... Basically, if you notice that a change you make to a database requires more than one operation, your database is not [fully] normalized. Say, a person is promoted and you need to update their position in more than one table. Or you want to insert a record for a new accounting year and find that you need to update the closing date of the previous one as well. Or before updating a field you have to parse its previous contents. If it's possible to break your database consistency by "forgetting" to do something, the database is not normalized (in some sense)."
This is simply incorrect. Full normalization means, formally, that the only dependencies that hold in each and every relation are those of the non-key attributes on the key (informally: each relation represents facts of a single type). Fully normalized databases contain more relations, but they:
  • Minimize the integrity risk due to redundancy;
  • Guarantee semantic correctness and LI;
  • Are less complex, easier to understand and work with;
  • Are less error prone;
  • Produce results easier to interpret;
  • Are easier to optimize. [2]
"The drawback of [full] normalization is, of course, performance. Denormalization allows you to store copies of data in the places where they are more readily available to the queries. Instead of joining two tables in a business-critical query you run often, you might want to consider just storing the joined records together in a third table. This comes at a price of course: you should pay attention that the data are consistent between the three tables, however, this might be worth it, provided that benefits from making the query faster would justify additional development costs."
Points arising:
  • Logical-physical confusion (LPC): Performance is determined exclusively at the physical level and logical design has nothing to do with it;
  • Database application-specific bias: Denormalization biases a database for some applications and against others, which defeats the core purpose of databases: to be application-neutral.

Practitioners often complain that the RDM is "not practical" because it "ignores" physical implementation and performance. But physical independence (PI) -- the insulation of applications from physical storage reorganizations of the database -- is a core, if not the chief, objective and advantage of the RDM, not a flaw: the RDM does not ignore physical implementation and performance, it relegates them to where they productively belong -- away from application developers and end users, to the DBMS (vendors) and DBAs. The RDM allows them to do whatever they deem physically appropriate to maximize performance, without invalidating existing applications. This includes the freedom to "store the joined records together in the same file", which even SQL DBMSs do. DBMSs that don't have only themselves to blame: it has nothing to do with the RDM and, in fact, is a violation of the relational spirit.
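As one sketch of such a physical option (availability and syntax vary by DBMS; materialized views are one mechanism some SQL DBMSs offer): the join result is stored physically, but EMP and DEPT remain the logical design and existing applications are not invalidated.

    -- A physical optimization, not a logical redesign (e.g., materialized
    -- views as in PostgreSQL or Oracle; syntax varies by DBMS).
    CREATE MATERIALIZED VIEW EMP_DEPT_MV AS
      SELECT E.EMP_NO, E.DEPT_NO, D.CITY
      FROM   EMP E JOIN DEPT D ON E.DEPT_NO = D.DEPT_NO;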

But let's assume, for the sake of argument, that you fully normalized your database, exhausted all the physical options provided by your SQL DBMS -- which is where the problems are -- and performance is still unsatisfactory. You denormalize and, as Bolenok recognizes, introduce redundancy. To ensure database consistency you must now have the DBMS control the redundancy (i.e., guarantee that all redundant instances are updated). This requires enforcement of special integrity constraints -- the "additional development costs" that Bolenok refers to -- but they can never be justified: those constraints enforce the very joins denormalization is supposed to avoid and would defeat its purpose [2]. In fact, many practitioners are either unaware of the need for such constraints, or do not know how to formulate them, or SQL DBMSs may not support them. So they go unenforced, which creates the illusion that denormalization improves performance: the gains, if any, come from trading consistency for performance (i.e., from the failure to enforce the constraints), not from logical denormalization [2].
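A rough sketch of what controlling the redundancy entails (hypothetical names; CREATE ASSERTION is in the SQL standard but almost no SQL DBMS supports it, so in practice it would have to be approximated with triggers):

    -- Denormalized copy of the EMP-DEPT join (hypothetical).
    CREATE TABLE EMP_DEPT_DENORM (
      EMP_NO  INTEGER PRIMARY KEY,
      DEPT_NO INTEGER NOT NULL,
      CITY    VARCHAR(30) NOT NULL
    );

    -- Keeping the copy consistent means checking, on every change, that
    -- it agrees with the join it duplicates, i.e., re-performing the very
    -- join denormalization was supposed to avoid. (Only one direction of
    -- the equality is shown; the reverse inclusion needs a similar check.)
    CREATE ASSERTION EMP_DEPT_DENORM_CONSISTENT CHECK (
      NOT EXISTS (
        SELECT * FROM EMP_DEPT_DENORM X
        WHERE NOT EXISTS (
          SELECT * FROM EMP E JOIN DEPT D ON E.DEPT_NO = D.DEPT_NO
          WHERE E.EMP_NO  = X.EMP_NO
            AND E.DEPT_NO = X.DEPT_NO
            AND D.CITY    = X.CITY
        )
      )
    );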

Well-implemented true RDBMSs would inherently demand, and perform well with, fully normalized databases, and would not impose the consistency-for-performance tradeoff and its costly implications that are inherent in denormalization. But as long as grasp of data fundamentals is weak [3], RDBMSs are confused with SQL DBMSs, and users tolerate the tradeoff as a solution, there is no reason to expect any such DBMSs.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (draft chapters).

[2] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance, PRACTICAL DATABASE FOUNDATION paper #2. http://dbdebunk.blogspot.com/p/papers_3.html

[3] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.




2 comments:

  1. What a nonsense conclusion. Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off.

  2. You're not even wrong. Speaking of nonsense.

    You just provided a nice "Quote of the week".

