Note: This is my November post at All Analytics, reposted here.
Given the overhyped promise of "data science", the "shock" at the broad
failure to predict the election outcome was inevitable. Skimming through the
media and technical accounts, it looks like a better understanding of
prediction and explanation is necessary for fewer surprises and sounder
analytics. Let's take two examples (oversimplified somewhat to make the point).
First, a game-theoretic
account derived from observed behavior in a 2-player game in which one player
gets a sum of money and decides how to share it with another, who can only
accept or reject the offer: even though accepting any offer as better than
nothing is rational, "we don’t behave rationally ... [but] emotionally ...
we reject offers we consider unfair".
"… there’s been plenty of economic growth inside the U.S.--vastly
increasing the pile of money to be divided. But ... The first player
consists of those people who have benefitted from globalization and trade: the
“elites”, derisively referred to as “the 1%”. And the second player ...
everyone ... who aren’t in those upper income echelons ... are seeing the pile
of money in the game growing ever bigger. And ... the other player keeps an
ever-larger share of that pile for themselves ... Trump allowed them to channel
their feelings into a rejection of the proposal that has been made—on trade,
immigration, and globalisation, and dividing up those spoils ...[and they
threw] everything out". (What voters do when they feel screwed--the economic theory)
Second, a complex algorithm that runs a multitude of sophisticated simulations on a
"raft of carefully collected public and private polling numbers, as well
as ground-level voter and early voting data". Assume that "the raft" consists
of vote predictors, that is, vote correlates discovered by computers (What didn't Clinton’s data-driven campaign's algorithm named Ada see?).
Suppose that (1) in the former case, an appropriate hypothesis had been derived,
in the form of an aggregate-level correlation between the vote and variables
measuring affinity to the first and second player, and it proved accurate, and
(2) in the latter case, the algorithm produced an equally accurate prediction.
Is there any difference between the two approaches?
For those who equate prediction with explanation, the answer is no. For those for
whom explanation is about the past and prediction about the future, the
question does not come up. But these are views that obscure rather than
enlighten.
In both cases there is a data pattern in the form of predictive correlations. In
the first case, a theory of individual behavior specifies the causal
mechanism: the individual behavior that explains how the pattern is
produced, that is, why it exists at the aggregate level. In the second case, the
mechanism is of no particular interest and is not specified. In general,
behavioral predictions backed by an explanation are more reliable than those
without one.
Data patterns discovered by computers can, once explained, produce insights
(causal mechanisms) for theory development; this is what data
mining should be about. That's the context of discovery in science,
which requires predictions from the theory developed from the discovered
patterns to be tested in the context of validation on different data. But
because, unlike in natural science, human behavior is not governed by
unchanging universal laws, it is easier to explain post-hoc than to predict.
Given the pressure for prediction in industry and politics, the temptation not
to bother with the second context is too strong.
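The two contexts can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`pearson`, `discover`, `validate`), the feature names, the 0.5 threshold, and the toy numbers, which are made up and are not election data. It only shows the mechanics: a pattern is found on one data set, then checked on a different one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


def discover(features, outcome):
    """Context of discovery: name of the feature most correlated with the outcome."""
    return max(features, key=lambda name: abs(pearson(features[name], outcome)))


def validate(values, outcome, threshold=0.5):
    """Context of validation: does the pattern hold on data not used to find it?"""
    r = pearson(values, outcome)
    return abs(r) >= threshold, r


# Made-up toy numbers, for illustration only (1 = voted for the candidate).
discovery_outcome = [0, 0, 1, 1]
discovery_features = {"feature_a": [1, 2, 3, 4], "feature_b": [0, 0, 1, 1]}

validation_outcome = [0, 1, 0, 1]
validation_features = {"feature_a": [1, 4, 2, 3], "feature_b": [1, 1, 0, 0]}

candidate = discover(discovery_features, discovery_outcome)
holds, r = validate(validation_features[candidate], validation_outcome)
print(candidate, holds, round(r, 3))
```

With these toy numbers, the correlate that looks strongest in the discovery data does not survive the check on the second data set. Stopping after `discover` would never reveal that.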
In this age of "big data", "data mining", "data lakes" and machine learning, the
important difference between prediction and explanation should be understood
and kept firmly in mind when performing analytics and assessing their results.
See also Unthinking Machines.