I am coming across, on the one hand, instructions on how to do "analytics with SQL" and, on the other, tools purporting to enable "analytics without SQL." They are an umpteenth iteration of essentially similar ideas during my 30-plus years in data management and reflect common and entrenched fundamental misconceptions that I have documented and analyzed the costly consequences of in my writings and teachings. They will keep repeating, inhibiting genuine progress, as long as data fundamentals are ignored or dismissed. One of the least understood is the distinction between DBMS and application functions.
--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------
DBMS functions are the responsibility of and performed by the DBMS 'for all applications'. There are several categories thereof, but here we will focus on the one including the three functions defined by a data model such as the RDM: structure, integrity and manipulation. A true RDBMS has a relational 'data sub-language' with which (1) relations and (2) integrity constraints are defined and data can be retrieved by (3) manipulating relations mathematically as sets -- i.e., applying the set operations of the relational algebra on them. Restricting the sub-language to the three data management functions makes first order predicate logic (FOPL) sufficient, with the following advantages:
- The language is declarative, decidable, and simpler;
- Logical validity and semantic correctness of query results are 'system-guaranteed';
- Physical and logical independence are supported.
- Is imperative (procedural), prone to undecidability and significantly more complex;
- Logical validity and semantic correctness are not system-guaranteed;
- Physical and logical independence are not supported.
Consider now the above two "solutions" for analytics. SQL was intended to be a relational data sub-language and, as such, should have expressed strictly the three data management DBMS functions. So, first, forcing SQL to perform analytics functions is a type (a) violation. Second, analytic functions are application-specific and, therefore, tools that purport to "perform analytics without SQL" is a trivial promise that should be the case 'by definition.' The correct perspective for 'databased analytics' is, of course, that applications use CCLs that (1) host SQL for data management DBMS functions -- data retrieval -- and (2) perform analytics functions that do not require, or rely on SQL.
This should not be taken as a SQL recommendation for data management functions. It is neither truly and fully relational, nor a well designed language, nor a strictly data sub-language, as it haphazardly includes some application-specific functions. But unfortunately, SQL is the industry's standard database language that has no superior alternative. This is in itself due to its authors lacking a good grasp of data fundamentals and the RDM and contributes to the persistent failure to understand and appreciate the optimal division of labor and to provide tools that enforce it.
Therefore, to avoid the traps that industry practices and products impose, it behooves the analyst to acquire foundation knowledge.
No comments:
Post a Comment