Monday, December 5, 2016

Prediction, Explanation and the November Surprise



Note: This is my November post @All Analytics, reposted here.

Given the overhyped promise of "data science", the "shock" at the broad failure to predict the election outcome was inevitable. A skim through the media and technical accounts suggests that a better understanding of prediction and explanation is necessary for fewer surprises and sounder analytics. Let's take two examples (somewhat oversimplified to make the point).

First, a game-theoretic account derived from observed behavior in a 2-player game in which one player gets a sum of money and decides how to share it with another, who can only accept or reject the offer: even though accepting any offer as better than nothing is rational, "we don’t behave rationally ... [but] emotionally ... we reject offers we consider unfair".

"… there’s been plenty of economic growth inside the U.S.--vastly increasing the pile of money to be divided. But ... The first player consists of those people who have benefitted from globalization and trade: the “elites”, derisively referred to as “the 1%”. And the second player ... everyone ... who aren’t in those upper income echelons ... are seeing the pile of money in the game growing ever bigger. And ... the other player keeps an ever-larger share of that pile for themselves ... Trump allowed them to channel their feelings into a rejection of the proposal that has been made--on trade, immigration, and globalisation, and dividing up those spoils ...[and they threw] everything out". (What voters do when they feel screwed--the economic theory)

Second, a complex algorithm that runs a multitude of sophisticated simulations on a "raft of carefully collected public and private polling numbers, as well as ground-level voter and early voting data". Assume that "the raft" consists of vote predictors, i.e., vote correlates discovered by computers (What didn't Clinton’s data-driven campaign's algorithm named Ada see?).

Suppose that (1) in the former case an appropriate hypothesis had been derived, in the form of an aggregate-level correlation between the vote and variables measuring affinity to the first and second player, and it proved accurate; and (2) in the latter case the algorithm produced an equally accurate prediction. Is there any difference between the two approaches?

For those who equate prediction with explanation, the answer is no. For those for whom explanation is about the past and prediction about the future, the question does not come up. But these are views that obscure rather than enlighten.

In both cases there is a data pattern in the form of predictive correlations. In the first case, a theory of individual behavior specifies the causal mechanism, i.e., the individual behavior that explains how the pattern is produced and why it exists at the aggregate level. In the second case, the mechanism is of no particular interest and is not specified. In general, behavioral predictions backed by an explanation are more reliable than those without one.

Data patterns discovered by computers can, once explained, produce insights (causal mechanisms) for theory development; this is what data mining should be about. That's the context of discovery in science, which requires that predictions from the theory developed from the discovered patterns be tested, in the context of validation, on different data. But because, unlike in natural science, human behavior is not governed by unchanging universal laws, it is easier to explain post hoc than to predict. Given the pressure for prediction in industry and politics, the temptation not to bother with the second context is too strong.

In this age of "big data", "data mining", "data lakes" and machine learning, the important difference between prediction and explanation should be understood and kept firmly in mind when performing analytics and assessing their results.

See also Unthinking Machines.

 




Re-write



See the rewrite
Class, Type, Relation and Domain in Database Management



Monday, November 28, 2016

This Week



THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.



1. What's wrong with this picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases:
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" *is* a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" *only* in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous." --Comment @dbdebunk.com

Sunday, November 20, 2016

The Principle of Orthogonal Database Design Part III




Note: This is an 11/24/17 re-write of Part III of a three-part series that replaced several previous posts (the pages of which redirect here), to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
 
(Continued from Part II)

POOD and SQL


As we have seen, if relations are uniquely constrained, with a true RDBMS supporting logical independence (LI) and constraint inheritance, database design can adhere to the POOD and enable DBMS-enforced consistency. An RDBMS can also support ESS explicitly.

Industry misconceptions notwithstanding, SQL DBMSs are, of course, not relational. They have weak declarative integrity support, which, coupled with bad database designs, makes adherence to the POOD (as well as the other design principles) difficult. While even its weak relational fidelity was sufficient to render SQL superior to what preceded it, this is but one example of the many advantages of the RDM that SQL has failed to concretize.

Monday, November 14, 2016

This Week



I have revised:
THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. Quotes of the Week
"Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!" --Blog.Jooq.org
"I am looking for database schemas with many tables (>100 tables). Where can I find them? I am currently using mysql and haven't done serious database design. So interested in looking at samples with ER diagrams." --YCombinator.com
2. To Laugh or Cry?

Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II




Note: This is an 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. This design requires at least:
  • A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
  • A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
  • Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) into the correct subtype relation.
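
The derivability and the first two constraints can be illustrated with a minimal Python sketch that models each relation as a set of tuples. The sample rows, the function names and the particular formulation of the redundancy control constraint (EMPS must equal the union of the two projections) are assumptions made for illustration only; they are not taken from the series and are not a DBMS implementation.

```python
# Relations modeled as sets of tuples. All rows below are hypothetical sample data.
SAL_EMPS = {
    (1, "Ann", "2015-01-10", 50000),   # (EMP#, ENAME, HIREDATE, SALARY)
    (2, "Bob", "2016-03-01", 42000),
}
COMM_EMPS = {
    (3, "Cy", "2014-07-15", 0.05),     # (EMP#, ENAME, HIREDATE, COMMISSION)
}
EMPS = {
    (1, "Ann", "2015-01-10"),          # (EMP#, ENAME, HIREDATE)
    (2, "Bob", "2016-03-01"),
    (3, "Cy", "2014-07-15"),
}


def project_common(relation):
    """Project a subtype relation onto (EMP#, ENAME, HIREDATE)."""
    return {row[:3] for row in relation}


def derived_emps(sal_emps, comm_emps):
    """EMPS as the union of the projections of the two subtype relations."""
    return project_common(sal_emps) | project_common(comm_emps)


def mutually_exclusive(sal_emps, comm_emps):
    """True if no EMP# appears in both subtype relations."""
    sal_ids = {row[0] for row in sal_emps}
    comm_ids = {row[0] for row in comm_emps}
    return sal_ids.isdisjoint(comm_ids)


def redundancy_controlled(emps, sal_emps, comm_emps):
    """One possible (assumed) redundancy control check:
    the stored EMPS must equal the relation derivable from the subtypes."""
    return emps == derived_emps(sal_emps, comm_emps)


# A partial update: a new employee is added to SAL_EMPS but not to EMPS.
sal_after_partial_update = SAL_EMPS | {(4, "Di", "2016-11-01", 39000)}

checks = {
    "derivable": derived_emps(SAL_EMPS, COMM_EMPS) == EMPS,
    "exclusive": mutually_exclusive(SAL_EMPS, COMM_EMPS),
    "consistent_before": redundancy_controlled(EMPS, SAL_EMPS, COMM_EMPS),
    "consistent_after": redundancy_controlled(EMPS, sal_after_partial_update, COMM_EMPS),
}
```

With this sample data, the partial update leaves the stored EMPS out of step with the subtype relations, which is the kind of inconsistency the second and third requirements are meant to prevent.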

Monday, October 31, 2016

This Week



1. Quote of the Week

"Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off." --YCombinator.com

2. To Laugh or Cry?

Comments on my "Denormalization for Performance: Don't Blame the Relational Model"

3. THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS 

is available. Order here.
