A measure of the scarcity of foundation knowledge in the industry is the number of attempts to correct a plethora of common misconceptions[2] that themselves suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3], which takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and about the conceptual-logical conflation (CLC) that lumps conceptual modeling together with data modeling[10,11,12], inhibiting the understanding that the latter is a formalization of the former.

Consider the claim that scientific research experiments "require assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational [read: SQL] databases." NoSQL databases are recommended in such cases. But what does "scientists do not know ahead of time what analysis to run" really mean?
“1. Data: Categorized sequences of values representing some properties of interest, but if and how they are related is unknown (e.g., research variables in scientific experiments);
2. Information: Properties further organized in named combinations -- "objects", but how they are related is unknown (e.g., "runs", or "cases" in scientific experiments);
3. Knowledge: Relationships among properties and among objects of different types are known.”
--David McGoveran
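To make McGoveran's distinction concrete, here is a minimal SQL sketch (my illustration, with hypothetical names, not from the quoted source): values at levels 1 and 2 can be recorded as relations before any relationships among the variables are known; an analysis chosen later is expressed as a query, not a schema redesign.

-- Measurements recorded before any analysis is chosen: each value is a
-- property of interest tagged by run and variable (McGoveran's levels 1-2).
CREATE TABLE measurement (
  run_id   INTEGER       NOT NULL,
  variable VARCHAR(30)   NOT NULL,  -- e.g., 'temperature', 'ph'
  taken_at TIMESTAMP     NOT NULL,
  value    DECIMAL(12,4) NOT NULL,
  PRIMARY KEY (run_id, variable, taken_at)
);

-- An analysis decided on later (do temperature and pH co-vary per run?)
-- needs no schema change -- it is just a query:
SELECT t.run_id, AVG(t.value) AS avg_temperature, AVG(p.value) AS avg_ph
FROM   measurement t
JOIN   measurement p ON p.run_id = t.run_id AND p.taken_at = t.taken_at
WHERE  t.variable = 'temperature' AND p.variable = 'ph'
GROUP  BY t.run_id;

Not knowing what analysis to run, in other words, is not the same as not knowing what the data represents, and it is the latter, not the former, that database design requires.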
“A particular bugbear, and a mistake that over 90% of "data modelers" make, is analyzing "point-in-time" views of the business data and "normalizing" those values, thereby failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
{Invoice-Date,
Customer-Account-Id,
Customer-Name,
Invoice-Address-Line-1,
Invoice-Address-Line-2,
Invoice-Address-Line-3,
Invoice-Address-Line-4,
Invoice-Address-Postcode,
Net-Amount,
VAT,
Total-Amount
};
Nearly every time, through the blind application of normalization, we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
{Invoice-Date,
Customer-Account-Id
REFERENCES Customer-Account,
Net-Amount,
VAT,
Total-Amount
};
CUSTOMER-ACCOUNT
{Customer-Account-Id,
Customer-Name,
Invoice-Address
REFERENCES Address
};
ADDRESS
{Address-Line-1,
Address-Line-2,
Address-Line-3,
Address-Line-4,
Postcode
};”
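The practical cost is easy to see. Here is a minimal SQL sketch of the over-normalized design (hypothetical table and column names; assuming the database holds only the customer's current address), showing why it cannot reproduce a historic invoice:

-- The over-normalized design: the customer's address lives only in
-- CUSTOMER-ACCOUNT, so only the CURRENT address exists in the database.
CREATE TABLE customer_account (
  customer_account_id INTEGER PRIMARY KEY,
  customer_name       VARCHAR(100) NOT NULL,
  invoice_address     VARCHAR(200) NOT NULL  -- current address only
);

CREATE TABLE sales_invoice (
  invoice_id          INTEGER PRIMARY KEY,
  invoice_date        DATE NOT NULL,
  customer_account_id INTEGER NOT NULL REFERENCES customer_account,
  net_amount          DECIMAL(10,2) NOT NULL,
  vat                 DECIMAL(10,2) NOT NULL,
  total_amount        DECIMAL(10,2) NOT NULL
);

-- Reprinting an old invoice: the join can only return the customer's
-- current name and address, not those the invoice was actually sent to.
SELECT i.invoice_date, c.customer_name, c.invoice_address,
       i.net_amount, i.vat, i.total_amount
FROM   sales_invoice i
JOIN   customer_account c USING (customer_account_id)
WHERE  i.invoice_id = 42;

Reproducing the invoice as sent requires recording the name and address as of the invoice date -- on the invoice itself, or in a history table with validity periods -- precisely the time dimension the quoted design discards.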
“I just wanted to drop a note of thanks for the website, especially the latest articles on understanding data modeling, which, among other things, explain very nicely the difference between the application of set theory and graph theory. It parallels the real world: the community (a set of data elements) and the individual (a node in a network) -- it is easier to connect communities (RDM), and much more complex to connect individuals directly (GDM) without going through such a community connection arrangement (e.g., e-mail, the postal system).”
“I'm currently working out the concept of what I call CMCs, or contextual metadata connectors. I'm sure such entities will depend heavily on the RDM to do their job. In the project I would like to use both approaches (RDM, GDM), given the power of set theory and graph theory respectively, but exactly where to use each is critical.”
“It's exciting to think of the endless potential for AI-based automation when one correctly leverages the underlying principles of data relationships. Since my discovery in 2004 about a much better way to approach test automation which I called data-centric (vs. the code-centric industry standard), I have found that it applies anywhere there is data, as long as one holds to a proper understanding of data and how to view it relationally.”
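One way to read "data-centric" here -- my illustrative sketch, not the commenter's actual system -- is that the test cases live in the database as rows, and a single generic query executes them all, so adding a case is an INSERT rather than new code (hypothetical names and a hypothetical 20% VAT rule throughout):

-- Test cases as data: one row per case.
CREATE TABLE vat_test_case (
  case_id      INTEGER PRIMARY KEY,
  net_amount   DECIMAL(10,2) NOT NULL,
  expected_vat DECIMAL(10,2) NOT NULL
);

INSERT INTO vat_test_case VALUES
  (1, 100.00, 20.00),
  (2,  50.00, 10.00),
  (3,   0.00,  0.00);

-- The generic runner: any row returned is a failing case.
SELECT case_id, net_amount, expected_vat,
       ROUND(net_amount * 0.20, 2) AS actual_vat
FROM   vat_test_case
WHERE  ROUND(net_amount * 0.20, 2) <> expected_vat;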
“What I find very surprising, though, is how rare a proper understanding of data is in the I.T. industry, especially of viewing it relationally. It is indeed one of the most massively misunderstood aspects of the I.T. industry to this day, as your website alludes to. Rather than being shied away from, the RDM should be the very first course taught in any program in computer science or information science. Maybe then I wouldn't always be losing people in technical conversations whenever I start talking about it. I see a diamond and they just see carbon.”
Codd was explicit that he introduced the set-based RDM to relieve what he called "non-network applications" -- those concerned with relationships among groups of entities -- of the complexity burden that directed graphs impose, a burden justified only for network applications concerned with relationships among individual entities. But this, like so many other aspects of his work, was missed or ignored. Witness the GDBMS revival and its promotion as "superior to RDBMSs" (which are confused with SQL DBMSs), without any reference to the two models' distinct application domains.
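A minimal sketch of Codd's distinction, using a hypothetical bill-of-materials table (standard SQL; assumes an acyclic structure): a question about groups of entities is one declarative set operation, while a question about one individual entity in a network is inherently node-by-node directed-graph traversal.

-- Each row is an edge between two individual parts -- the kind of
-- network application Codd had in mind.
CREATE TABLE part_structure (
  part_id          INTEGER NOT NULL,
  contains_part_id INTEGER NOT NULL,
  PRIMARY KEY (part_id, contains_part_id)
);

-- Non-network, set-based access: one declarative operation over whole
-- groups of rows (how many direct components each part has).
SELECT part_id, COUNT(*) AS component_count
FROM   part_structure
GROUP  BY part_id;

-- Network access: exploding the bill of materials for one individual
-- part is node-by-node traversal, here via a recursive CTE.
WITH RECURSIVE explosion AS (
  SELECT contains_part_id AS part_id, 1 AS depth
  FROM   part_structure
  WHERE  part_id = 1
  UNION ALL
  SELECT s.contains_part_id, e.depth + 1
  FROM   part_structure s
  JOIN   explosion e ON s.part_id = e.part_id
)
SELECT part_id, depth FROM explosion;

Relieving the first kind of application of the traversal machinery needed only by the second was exactly Codd's stated motivation.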