You've probably heard the frequent argument that relational databases
(which, unfortunately, in practice, means SQL ones) do not serve the
performance, flexibility, and temporalization needs of analytical
applications satisfactorily. Indeed, Anchor, Data Vault, and Dimensional
Modeling techniques are promoted as solutions to the "problems" due to
normalized databases. All this is rooted in certain fundamental
misconceptions that can be costly for business intelligence, analytics,
and data science.
--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to
maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been
discontinued. I appeal to my readers, particularly regular ones: If you
deem this site worthy of continuing, please support its upkeep. A
regular monthly contribution will ensure this unique material
unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------
What the industry calls a warehouse is essentially a materialized
view of an operational database that typically doesn't get updated in
full sync with the underlying database -- i.e., a time-specific snapshot
database copy. Such views are produced by data manipulation -- applying
some operations to the database. In effect they are query results. Many
if not most data professionals overlook a unique and crucial feature of
truly relational DBMSs and databases -- their formal dual theoretical
foundation -- simple set theory (SST) and first order predicate logic
(FOPL). If relational set operations are properly applied to fully
normalized database relations, results are guaranteed to be logically
and semantically correct.
Thus, (1) given a true RDBMS and a fully normalized relational
database (2) a "transformation process" is triggered (by documented
circumstances) that applies relational algebra operations to the
database to produce the desired view, which (3) is materialized by
persisting it in storage. If either the design of the database or the
transformation applied by the DBMS violates the theory, correctness is
no longer guaranteed.
While warehousing is often undertaken to improve performance, the
transformation is purely logical and performance is determined
exclusively at the physical implementation level -- how the data is
represented and accessed in storage. The only physical implementation
aspect of data warehousing is the materialization of the view, which is
independent of warehousing techniques.
Warehouses are essentially databases biased for some data
applications (and against others) and are rooted in poor database
foundation knowledge and logical-physical confusion. Even when
warehouses consist of relations, warehouse developers often do not
understand their precise interpretation. The design is based on various
unwarranted or false assumptions about what the data means. The above
modeling methods do not allow documenting the transformation itself --
the relational algebra operations that comprise the transformation. But
more often than not warehouses do not consist of relations, which are
minimally required to be in first normal form (1NF), and are, therefore,
not just denormalized, but non-relational. Consequently, all bets are
off; sound derivations of correct analytical results are not guaranteed.
Neutral databases that serve different data views to multiple
applications were introduced as a solution to the prohibitive problems
caused by application-specific biased files. Application programs
needing different views of the same data required differently structured
files, proliferating redundancy and inconsistencies. Each and every
application program had to enforce data integrity and security, and
optimize performance -- functions now centralized in the DBMS.
Application-based integrity enforcement created a redundant, complex,
and error-prone maintenance burden that was so prohibitive it was mostly
foregone. Due to lack of familiarity with history, application-biased
databases are bringing those problems back. Those who forget the past
are doomed to repeat it.
Because warehouses are read-only, the risks to data integrity may be
smaller. They are confined and exclusive to the developer responsible
for the transformation and the warehouse load procedure, an environment
that is more controlled than the shared operational database that are
usually updated by many applications/users. But warehouses are populated
by SQL DBMSs that are not truly relational, from poorly designed
operational databases, so all bets are off.
It sometimes makes sense to offload data and analytics from
operational databases. But without the relational guarantee, caveat
emptor.
Today people have problem understandig science (based on mathematics - logic, sets) and some srbitrary concepts that are based on someone's authority or 'industry standard' devoid of any logical framewor. (Hello Object oriented and data warehousing ;-) )
ReplyDeleteData Warehousing 'science' suffers from lack of logic. It contradicts itself: it is supposed to be used for ad hoc querying, yet it requires aggregation and assumption about intended use. It is built for 'speed' yet cube queries take hours to execute - no kidding - this is from a Microsoft sponsored course. Data cleaning is the most laughable requirement - isn't data already clean if it comes from a relational database? What is the purpose of keeping data somewhere if it is not clean? I it was not clean at the moment of entry into database/file/spreadsheet, how are we going to clean i? Connecting data from heterogeneous sources. Like SQL databases, Access files, Excel tables and Word documents? Thank you, but no thank you. Analytic for business? Good luck with present skill level of executives and decision makers. Since executives are not capable doing simplest things in spreadsheets, we must invent mambo-jumbo arbitrary concepts, and make it look serious and scientific, to keep our jobs? Again, thank you but no thank you.