Monday, June 17, 2024

SQL AT 50, OR WHY THERE ARE NO RDBMS'S



In "Codd Almighty!  Has it been half a century of SQL already?" the Register's Lindsay Clark interviews "Donald Chamberlin, Michael Stonebraker and more" about the legendary programming [sic] language. Chamberlin with Raymond Boyce were the authors of "the 1974 paper SEQUEL: A structured English query language as a way of addressing data in IBM's newly proposed System R, the first database to embody Edgar Codd's paper describing the relational model for database management.”

C. J. Date, who worked at IBM at the time, has often stated that the designers of SQL never understood RDM, and I expressed a similar stance in If You Liked SQL, You'll love XQuery. This has had an extremely detrimental effect on database technology--regress rather than progress--none of which transpires in the interview. So here is my reality check take on what you would not know from the interview.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
SEARCH  page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, incl uding earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on X.

------------------------------------------------------------------------------------------------------------------

“In the fifty years since SQL was first proposed, it has become the dominant query language for working with relational databases. The clue is in the name. Although the acronym might stand for Structured Query Language.”

While SQL is much superior to what preceded it due to the little relational fidelity it has, it is  not--industry fallacy notwithstanding--not anywhere near what it should, and could have been, what it was intended to be--a relational data sublanguage.

  • It is not a structured (in the programming sense) language;
  • It is not just for queries;
  • It has been badly overloaded beyond a data sublanguage;

“IBMers Donald Chamberlin and Raymond Boyce wrote the paper ... in 1974 as a way of addressing data in IBM's newly proposed System R, the first database to embody Edgar Codd's paper [PDF] describing the relational model for database management. "Codd said, 'No that's all wrong, we should use a declarative approach and let the computer figure out how to find the answer.' That was the conflict that was raging in the industry around the early '70s," Chamberlin said.”

Codd (EFC) had very strong criticisms of IBM efforts in general, and SQL in particular--it would not be an exaggeration to say that he rejected it--due not just to lack of declarativity, but its many RDM violations (see Date and my writings on SQL). SQL is somewhat more declarative than computationally complete programming languages (CCPL), but suffers from a kind of procedurality with a major disadvantage--language redundancy (see below). For data sublanguages closer to the declarative mark see Codd's own Alpha, Zloof's Query-by-Example, and maybe Query-by-Forms.

“In their first attempt at creating a language for addressing the new concept of relational databases, Chamberlin and Raymond Boyce created SQUARE or Specifying Queries in A Relational Environment.But it relied on a finger-mangling combination of subscript and superscript notation, and by Chamberlin's own admission, it was difficult to type.”

I am not familiar with it, but it sounds like it confirms Date's and my stance.

“"The relational model was simple and elegant and powerful. But Codd presented his concepts in a lot of mathematical jargon. He's basically a mathematician, and the first time we read his paper, it wasn't really obvious how elegant and simple this concept was. He took a simple concept and made it complicated. The other problem was that it wasn't clear at first whether these ideas could be efficiently implemented within a program that would have adequate performance," Chamberlin said.”

It is true that Codd, being a genius, expressed RDM in a way difficult for mere mortals to understand. That is precisely why "database people"--competent in understanding the theoretical foundation of RDM and express it in a language that exploits its advantages, should have been assigned to the task of concretizing RDM in a data sublanguage. Sadly, SQL is clear evidence this was not the case.

“System R was IBM's attempt to demonstrate that the relational model could be executed practically by the computers available at the time, but because Codd's paper had been published in a journal, the  IBM team were not the only ones working on the problem. Up the coast near San Francisco, another team at University of California Berkeley was working on a similar project. At Berkeley in the early '70s, post-doctorate Michael Stonebraker and fellow computer scientist Eugene Wong began a project to build Ingres, another early relational database, at the same time as System R. With it, they developed their own query language, QUEL. "Chamberlin and Boyce basically turned SQUARE into a language you could type on a keyboard, but it had this nested structure to it," Stonebraker explained to The Register.”

Actually, IBM was trying to demonstrate the exact opposite. At the time it had a cash cow in IMS, its hierarchical DBMS, and Codd was visiting clients telling them how bad it was, and that they should migrate to a RDBMS. So IBM set up the System R research project in the hope that it'll prove (in the context of the hardware available at the time) that it won't fly. Chamberlin admits as much:

“"It was a frustration for me, that's for sure, and for other members of the System R team. This was a business decision that IBM was making. They had a successful commercial database product [in hierarchical IMS]. Why would they want to introduce a competitor for it? It took a few years for them to figure out. In the meantime, the System R group published papers both on SQL and on the optimizing compiler in the open technical literature. IBM was very generous allowing us to do that, mainly it was because they didn't take the technology too seriously from a business point of view: this was research.”

The crucial difference between Berkeley's QUEL and IBM's SQL is that former's designers understood RDM. QUEL is a declarative language, specifically, it lacks nested queries--precisely what makes SQL somewhatprocedural and redundant.

“However, there were queries that could not be expressed in a nested notation. The result was, Sequel added a flat notation to the nested notation, which made the language "unnecessarily complicated," Stonebraker said."On the other hand, QUEL started off with a flat notation from the very beginning. There's no nested notation. Sequel was hampered by the inclusion of the nested notation. But the trouble is, you can't take it out once you put it in," Stonebraker said.”

Per my above linked paper, the problem is much worse: nested queries were a work around invoking  relational operations (join, union, etc.) by name ("too mathematical"), which caused redundancy; but when it was realized that some (e.g., union) could not be expressed with nested queries, all of them were added anyway, rendering SQL doubly redundant!

“With its overwhelming dominance of the database market and backing of the world's international standards body, SQL can show a surprising degree of variation between implementations. The differences between versions of SQL can lie in efforts to absorb new ideas from the database world, explains Andy Pavlo, associate professor of databaseology at Carnegie Mellon University."What makes SQL resilient is its ability to absorb the best ideas for attempts to supplant and replace it over the years. As fashionable approaches in computing evolve over time, SQL is able to take the best ideas and incorporate them. A classic example would be object oriented databases in the 1980s and 1990s. Another would be XML. More recently document store MongoDB took off in the late 2000s, and then SQL adapted by adding JSON support in 2016" he said.”

Unfortunately, "resilience" had nothing to do with "intended absorption" and everything to do with how the industry operates, which is not how a language--let alone a standard--should be created. Neither should be designed and maintained by committee, let alone one of vendors with their own implementations, and--in the case of the SQL standard--certainly not by members who mostly did not know RDM (in fact, the ANSI SQL standard committee insisted that the word 'relational' should not appear once in the published standard; it also made PKs optional, even as they used unique identifiers to track their documents).

You first set a standard, before any implementations, and then vendors conform their implementations. But the industry operates in an upside down/backwards way: each vendor designed its own SQL dialect, which, once in use is hard to change, and came to the committee to push for as much of its own dialect to be adopted in the standard. At the time IBM was the big gorilla of the industry--almost totally dominant--such that nobody could afford to be "non-compliant with IBM", it imposed its SQL on the committee, and the rest is non-relational history.

“Stonebraker said the implementation of Quel in Ingres was substantially better than Oracle's adoption of SQL. Stonebraker said Oracle had a head start and was achieving strong growth using "sales tactics I would not condone." However, he said that Ingres might have overtaken Oracle in 1985 were it not for an intervention by IBM. In late 1983, Big Blue introduced Db2, which, although not its first commercial relational database, was soon to become the flagship of a new vanguard. Db2 used SQL, and with IBM's dominance of the mainframe market, it was game over for rival languages, whether they were technically superior or not. But for Chamberlin it was the approval of national and then international standards bodies, and the adoption by US government purchasing authorities, which helped win the day for SQL.”

With SQL being sold as relational to an industry lacking RDM knowledge and dominated by IBM, SQL (and IBM) bears primary responsibility for why are there no relational DBMSs.  The notion that the best somehow comes on top is propaganda to fool the masses.

“However, support for JSON data types--widely used for exchanging data on the worldwide web — illustrates an issue with SQL: namely, it is not as standardized as many would assume. Pavlo points out that while JSON became part of the SQL standard in 2016, PostgreSQL had JSON support in 2012. Oracle and MySQL also had their own variations of JSON support. "It's like everyone kind of does their own one off thing, and then you try to get to the lowest common denominator and say, 'This is what the standard should be.' Often some databases have their own proprietary version, or variation of how to do some kind of new operation, then by the time the SQL standard comes along, they're not going to go back and change things," Pavlo said. The effect means different vendors have different versions of SQL which can trip up developers and DBAs: Microsoft has T-SQL, Oracle PL/SQL and PostgreSQL PL/pgSQL, for example. “However, support for JSON data types--widely used for exchanging data on the worldwide web-- illustrates an issue with SQL: namely, it is not as standardized as many would assume. Pavlo points out that while JSON became part of the SQL standard in 2016, PostgreSQL had JSON support in 2012. Oracle and MySQL also had their own variations of JSON support.”

Exactly. Note, however, the example used: JSON data types. This is decisive evidence that neither Pavlo, nor the vendors to this day really understand RDM (the issues with complex data types--if they are not just "lazy design"--is orthogonal to RDM and will be discussed in myforthcoming paper on relational domains.)

There is one thing without which it is impossible to use SQL effectively--by which I mean take advantage of whatever relational fidelity it does have and avoid the consequences of the plethora of its flaws (see Date's and my writings on SQL), no matter how much experience you have: knowledge and understanding of RDM, and that is inexistent in the industry. Aside from that and SQL being a non-relational, poorly designed "language", there is also the way it is taught--syntactically, like a programming language: "this is a SELECT clause, this is a FROM clause,..." and so on. I used to ask audiences "What relational operations do SELECT, FROM and WHERE clauses express? and get blank looks. Do you know?

“Following the ANSI standard, the International Organization for Standardization (ISO) created its standard for SQL in 1986. The most recent update from 2023, includes new features related to JSON and property graph ... Despite SQL's success, Chamberlin sees the need for NoSQL databases and languages to help address data needs in modern applications.

I rest my case.

“"We wanted the language to look as closely as possible to natural language so they could read it and understand it, like it was an English sentence. But of course, English is not an orthogonal language, so we made some compromises there," he said.”

Computing languages that "look like natural language" is in itself a questionable proposition, but be that as it may, SQL may be a bit closer to NL than CCPL, but that is not saying much.

“Despite his language losing out, Stonebraker says that the database industry has benefited from having a single "intergalactic" standard in SQL. "I don't see anything supplanting it at this point, unless for some reason the relational database goes out of favor, but I don't see that happening. AI might allow people to use natural language to compile SQL, but that just puts another front end on it ... There are 20 or 30 popular programming languages in the software development world. In the database world, there's only one."”

Sure. But the fact remains that a huge opportunity was missed. Witness the absurdity it induced: RDM is blamed for SQL's flaws that are due to its violations of RDM. In other words, RDM is being blamed for having not been implemented!

I wouldn't be me if I did not debunk some of the comments. But when I proceeded to do it, I discovered perhaps the best evidence for my arguments in this post: not one mentioned RDM--the raison d'etre of SQL--all were about SQL implementations, all believing they have been using RDBMSs.

 

 

 

No comments:

Post a Comment

View My Stats