“The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload ... You would be surprised how many times I had to explain the importance of primary keys and how many debates I had around the topic as often people have preconceived ideas that translate into doing things a certain way without further thinking.”
--Yves Trudeau, Principal architect, Percona.com
I will be labeled "pedantic" and a "purist" for saying this, but there is no such thing as "PK tuning". "Choice of good PK for performance" reflects logical-physical confusion (LPC), which, in turn, is rooted in lack of foundation knowledge, and failure to grasp the RDM -- not just by database practitioners, but even by DBMS designers.
14 years ago I published an editorial with this title in response to a Slashdot.com exchange triggered by one of my articles at DBAzine.com. The objective was to illustrate some of the damage done by the collapse of education to the IT industry, reflecting a trend in the society as a whole. Things have only gotten worse since then, and for quite a while I considered a re-write. A recent Linkedin exchange provided the impetus, so here it is.
“A short time ago a colleague asked me where he could find a "Databases 101" guide for the non-technical professional. As it turns out, the internet is littered with information, and mis-information, regarding data and databases. This makes it difficult for someone entering the field of data and databases to understand the bigger picture. He wanted something to help make sense of data and databases, specifically relational versus NoSQL. So, that’s what I decided to work on. This is the start of my Databases 101 guide for the non-database professional; something to help anyone understand why the word “database” is an overloaded term. We use the term "database" to describe a great many things ... I use the terms database and database engine interchangeably in this article.”
--Databases 101, ThomasLaRock.com
To “understand the bigger picture”, and “make sense of data and databases” requires foundation knowledge. In its absence, efforts to educate defeat their own purpose, because they are grounded in the very misinformation they should be correcting. This is the mechanism by which misconceptions[1] are perpetuated and reinforced.
(See Part 1 and Part 2)
“A common term used in database design is a "relational database" -- but a database relation is not the same thing and does not imply, as its name suggests, a relationship between tables. Rather, a database relation simply refers to an individual table in a relational database. In a relational database, the table is a relation because it stores the relation between data in its column-row format. The columns are the table's attributes, while the rows represent the data records. A single row is known as a tuple to database designers.”
“A relation, or table, in a relational database has certain properties.”
“First off, its name must be unique in the database, i.e. a database cannot contain multiple tables of the same name.”
“Next ... as with the table names, no attributes can have the same name.”
“Next, no tuple (or row) can be a duplicate. In practice, a database might actually contain duplicate rows, but there should be practices in place to avoid this, such as the use of unique primary keys (next up). Given that a tuple cannot be a duplicate, it follows that a relation must contain at least one attribute (or column) that identifies each tuple (or row) uniquely. This is usually the primary key. This primary key cannot be duplicated. This means that no tuple can have the same unique, primary key. The key cannot have a NULL value, which simply means that the value must be known.”
“Further, each cell, or field, must contain a single value. For example, you cannot enter something like "Tom Smith" and expect the database to understand that you have a first and last name; rather, the database will understand that the value of that cell is exactly what has been entered.”
“Finally, all attributes—or columns—must be of the same domain, meaning that they must have the same data type. You cannot mix a string and a number in a single cell.”
“All these properties, or constraints, serve to ensure data integrity, important to maintain the accuracy of data.” --Definition of Database Relation
It is easy to discern when explanations of relational features are not grounded in the formal foundations of the RDM[1], but in industry practices. Here are some further clarifications and corrections.
Note: This is a re-write of an older post, to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM, which includes refinements, corrections, and extensions of his own.
“Codd has based his model on n-ary relations and that is the key mistake he has made; that leads to complex structure (absolutely not necessary) and situations where there are no values known and as a consequence the need of the concept we know too well -> the null pointers; binary relations (smallest possible) are sufficient to express any predicate/sententional formula and there is no possibility to have something like null; if a value is unknown then we do not know it thus it is not a fact for us thus it is not in our database.”
We suggest you test your foundation knowledge by figuring what's wrong with this picture before you proceed.
Note:
This is a re-write of two older posts (which now link here), to bring
them into line with the McGoveran formalization and interpretation of
Codd's real RDM, including his own refinements, corrections, and
extensions[1].
(Continued from Part 1)
I started Part 1
with comments exposing confusion (among the many misconceptions rampant
in the industry[2]) of database relations with tables (reinforced by
SQL), and of levels of representation. Debunking them requires
foundation knowledge lacking in the industry, so I provided some: I
explained the difference between mathematical relations, database
relations, and relational tables. We are now in a position to debunk the
comments.
Note: This is a re-write of two older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation of Codd's real RDM, including his own refinements, corrections, and extensions[1]
“Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data.”
“In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc.”
“Practically, a "Relation" in relational model can be considered as a "Table" in actual RDBMS products(Oracle, SQL Server, MySQL, etc), and "Tuples" in a relation can also be considered as "Rows" or "Records" in a table.”
“In common usage, however, when someone refers to a "relation" in a database course, they are referring to a tabular set of data either permanently stored in the database (a table) or derived from tables according to a mathematical description (a view or a query result).”
“In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation. Then from a table, a query can return a different relation.”
“Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables.”
“I read [that] "Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." on the back cover of CJ Date's DATABASE IN DEPTH. Can anyone help how to visualize this multidimensional nature of relations?”
Because SQL DBMSs have been sold as relational databases (which they are not), and in SQL the data structure is the table, in the absence of foundation knowledge[2] most practitioners think that relational databases consist of tables, but do not ask themselves why and how is that significant for database practice. The subtitle of this post is a question I used to ask in presentations years ago that always got silence. I see no evidence of improvement -- in fact, it's gotten worse. To emulate Feynman, "Nobody understands the RDM".
That such a simple and commonly understood structure can visualize relations is an advantage of the RDM, but a table is not a relation and, SQL notwithstanding, confusing the two reflects a lack of understanding of the RDM, misses its significance for database practice, and prevents taking full advantage of its benefits.
Note: The table is the preferred way to picture relations, there are others (e.g., array).
First, the fundamentals.