A review of “Statistical Inference as Severe Testing”

[Posted as a comment on Andrew Gelman’s blog, under a post of “Reviews of Deborah Mayo’s new book”, Statistical Inference as Severe Testing.]

Very late to this party, but let me add my two cents to the discussion. To keep this review relatively short, I should preface it by saying that I found SIST to be a very mixed bag. It contains some welcome passages pointing out context, especially in Fisher and Popper, that is usually ignored. Unfortunately, that is completely undermined by Mayo’s all-out inductivism or, to give it its full name, “the whole discredited farrago of inductivism” (Medawar).

Mayo’s launching-off point for SIST is this question: “How do humans learn about the world despite threats of error due to incomplete and variable data?” (SIST, xi) Her answer, in short, is: by severely testing our claims. This is, at least in part, based on the philosophy of Karl Popper, arguably the 20th century’s most important philosopher of science: “The term ‘severity’ is Popper’s, though he never adequately defined it.” (SIST, 9) Mayo proposes that she has found the adequate definition that was previously missing, and it is this: “If [a claim] C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.” (SIST, 14) Mayo does not want this to be misunderstood as saying that, using this method, we find true or even probable statements: “I say inference C may be detached as indicated or warranted, having passed a severe test” (SIST, 65). And this, per Mayo, is explicitly an example of “ampliative or inductive reasoning” (SIST, 64), for which a variety of statistical methods (depending on context) can be used.

About those statistical methods, it is to Mayo’s credit that she stresses, right out of the gate, that one of the most pernicious uses of statistical methods had been anticipated and explicitly denounced by Fisher himself as early as 1935: “Fisher…denied that an isolated statistically significant result counts” (SIST, 4), going on to quote him saying that “[i]n relation to the test of significance” we need to “know how to conduct an experiment which will rarely fail to give us a statistically significant result”. That admonishment alone should have been enough, one would have thought, to preclude basing any far-reaching conclusions on a single study’s outcome (e.g. its p-value), such as “claiming a discovery”.

Similarly, it is very welcome for Mayo to point out (SIST, 82-3) that Popperian falsification is not achieved by noting one observation that contradicts a theory but only with the help of something that Popper called a “falsifying hypothesis”. It would have been helpful, however, to actually quote the relevant passage from Popper’s Logic of Scientific Discovery, as Mayo (partially) did in her earlier book (EGEK, 14):

We must clearly distinguish between falsifiability and falsification. …

We say that a theory is falsified only if we have accepted basic statements which contradict it. This condition is necessary, but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and corroborated. This kind of hypothesis may be called a falsifying hypothesis. The requirement that the falsifying hypothesis must be empirical, and so falsifiable, only means that it must stand in a certain logical relationship to possible basic statements….

… If accepted basic statements contradict a theory, then we take them as providing sufficient grounds for its falsification only if they corroborate a falsifying hypothesis at the same time.

All well and good. But then, unfortunately, Mayo’s train of argument veers dangerously off course, eventually missing its intended target completely. One would certainly have to concur with her when she says, “The disagreements often grow out of hidden assumptions about the nature of scientific inference”. In Mayo’s case, at least, the assumptions aren’t all hidden. She is brazenly upfront about championing induction, for a start. “In getting new knowledge, in ampliative or inductive reasoning, the conclusion should go beyond the premises” (SIST, 64), she claims, and: “Statistical inference goes beyond the data – by definition that makes it an inductive inference.” (SIST, 7-8) Now, she is perfectly aware that induction has a bit of a problem: “It’s invalid, as is so for any inductive argument.” (SIST, 61) But that isn’t a problem, according to Mayo; on the contrary: “We must have strictly deductively invalid args to learn” (Twitter). Indeed, Mayo has confirmed in a separate conversation that she is “not talking of a logic of induction”. What, then, is she talking about when she talks about “inductive inference”? The answer: “error probabilities”, in line with her definition of severe tests. In fact, Mayo thinks that Popper foundered because “he never made the error probability turn” (SIST, 73).

One other assumption, however, is a little removed from plain view. This is Mayo’s assumption about the aim of science. This remains surprisingly vague, with “craving truth” (SIST, 7), “learn[ing] about the world” (xi), “getting new knowledge” (64), and “distinguish[ing] approximately correct and incorrect interpretations of data” (80) being our only hints as to what, in Mayo’s view, it is all about. Even more surprisingly for a professional philosopher, she never defines, or otherwise talks about, the terms ‘truth’, ‘learning’, and ‘knowledge’, as if they were self-explanatory and everybody was in agreement about their meaning—which they emphatically aren’t and obviously everybody isn’t. Other crucial terms, such as ‘inference’ and ‘hypothesis’, fare only a little better, getting a casually hand-wavy one-sentence definition each.

Not quite incidentally, these terms, and the concepts behind them, are crucial to any understanding of the philosophy of science—and especially Popper’s version of it. For all her professed sympathy for Popper’s philosophy, Mayo unfortunately either misunderstands or flat-out ignores some of the most central concepts in Popper’s philosophy. These, in fact, turn out to be the keys to a solution to the current crisis in the social sciences.

It all starts with Mayo’s ill-considered faith in induction. Popper emphatically denied that there was any kind of induction—not just that as a logical process it was invalid but also that any kind of inductive reasoning was used either for theory formation or in the production of knowledge. Mayo variously claims that Popper only rejected “enumerative induction”, that corroboration via falsifying hypotheses necessitates “an evidence-transcending (inductive) statistical inference” (SIST, 83), and even (without, I should add, being able to provide any evidence) that Popper actually “doesn’t object” to calling such an inference ‘inductive’—claims that range from the wildly mistaken to the outright preposterous. Compare, for example, Popper’s treatment of Baconian induction, which goes far beyond the enumerative kind (Logic, passim and 438); his footnote in § 22 of Logic explaining the concept of a ‘falsifying hypothesis’; and this almost derisive put-down: “It is clear that, if one uses the word ‘induction’ widely and vaguely enough, any tentative acceptance of the result of any investigation can be called ‘induction’.” This last reply could just as well have been directed at Mayo, who certainly uses ‘induction’ vaguely enough to warrant it.

Just as weirdly, Mayo seems to be unaware that Popper had a subtly but completely different aim in mind with respect to science. For her, it is about how “humans learn about the world” and how we “get new knowledge”. For him, the “central problem” is “the problem of the growth of knowledge” (Logic, Preface 1959). Popper’s aim is not to find new knowledge but ever better knowledge; the difference should be obvious after a moment’s thought: “new knowledge” doesn’t even so much as imply any coherence, let alone improvement. Popper understood very well that it’s impossible to judge whether a theory is per se near or far from some absolute truth; that’s why everything in his methodology is about making it possible to judge whether some theory is at least better than some other(s). Popper’s view—entirely correct, in my estimation—is that induction is not just useless, it is not even needed.

When she dismisses deductive logic, Mayo not-so-subtly shifts the goalposts from a critic’s observation that induction is not even valid to some variation of, ‘Oh but then deductive arguments don’t ensure soundness’ (i.e. truth). Well, that’s actually not what any argument does. What logic can do (iff we accept the principle of non-contradiction) is to let us force ourselves to make a choice; in the logician Mark Notturno’s phrase: “No argument can force us to accept the truth of any belief. But a valid deductive argument can force us to choose between the truth of its conclusion on the one hand and the falsity of its premises on the other.” In a methodology that is about deciding which of two ideas is better, that is in fact all you need; again, Notturno:

If the purpose of an argument is to prove its conclusion, then it is difficult to see the point of falsifiability. For deductive arguments cannot prove their conclusions any more than inductive ones can.

But if the purpose of the argument is to force us to choose, then the point of falsifiability becomes clear.

Deductive arguments force us to question, and to reexamine, and, ultimately, to deny their premises if we want to deny their conclusions. Inductive arguments simply do not.

This is the real meaning of Popper’s Logic of Scientific Discovery—and it is the reason, perhaps, why so many readers have misunderstood its title and its intent. The logic of discovery is not the logic of discovering theories, and it is not the logic of discovering that they are true.

Neither deduction nor induction can serve as a logic for that.

The logic of discovery is the logic of discovering our errors. We simply cannot deny the conclusion of a deductive argument without discovering that we were in error about its premises. Modus tollens can help us to do this if we use it to set problems for our theories. But while inductive arguments may persuade or induce us to believe things, they cannot help us discover that we are in error about their premises.

Consequently, Mayo is similarly off the mark when she thinks science is about marking out “approximately correct” ideas (SIST, 80). By what standard? We don’t know, because Mayo didn’t bother to say what she takes ‘truth’ to mean. She also doesn’t mention that Popper had a different idea. In Notturno’s words: “The primary task of science is not to differentiate the true from the false—it is to solve scientific problems.” For Popper, scientific theories (and hypotheses, which are substantially the same thing) are about explanation; if there is no explanatory theory, there are no hypotheses and there is no knowledge. Mayo effectively turns all that completely on its head (EGEK, 11-2):

I want to claim for my own account that through severely testing hypotheses we can learn about the (actual or hypothetical) future performance of experimental processes—that is, about outcomes that would occur with specified probability if certain experiments were carried out. This is experimental knowledge. In using this phrase, I mean to identify knowledge of experimental effects (that which would be reliably produced by carrying out an appropriate experiment)—whether or not they are part of any scientific theory.

In this way, she empties all relevant terms of any possibly helpful meaning. “Inferences” are said to be “detached” by “induction”—but that is in no way meant to even imply any application of actual logic. As Notturno remarked: “Popper used to call a guess ‘a guess’. But inductivists prefer to call a guess ‘the conclusion of an inductive argument’. This, no doubt, adds an air of authority to it.” The same is, unfortunately, true for ‘hypothesis’, “or just ‘claim’”, which Mayo “will use…for any conjecture we wish to entertain” (SIST, 9)—explicitly, as she said earlier, “whether or not they are part of any scientific theory”. If you think that usage of ‘hypothesis’ carries rather strongly “the connotation of the wantonly fanciful”, Mayo specifically rules that in; Medawar, whose phrase that is, rather optimistically thought it was the bad old days when there was no “thought that a hypothesis need do more than explain the phenomena it was expressly formulated to explain. The element of responsibility that goes with the formulation of a hypothesis today was altogether lacking.” (Schilpp: The Philosophy of Karl Popper, 279) With respect to science’s being grounded in theories, Mayo is working mightily to resurrect an irresponsibility that was presumed happily dead long ago. Medawar quotes Claude Bernard with a prescient passage that seems all too fitting a description for what’s wrong with today’s social sciences: “A hypothesis is…the obligatory starting point of all experimental reasoning. Without it no investigation would be possible, and one would learn nothing: one could only pile up barren observations.” (Schilpp, 288)

To Mayo, though, that isn’t worth a single word. At least she is in good (or rather: numerous) company. Anything and everything to do with ‘theory’ is a huge blind spot in current social science. Everybody seems to be focused on bad, misunderstood, and allegedly broken statistics—and boy, is there a lot of bad and misunderstood statistics around. There are precious few voices in the wilderness, among them Denny Borsboom, who bluntly states: “It is a sad but, in my view, inescapable conclusion: we don’t have much in the way of scientific theory in psychology.” This “Theoretical Amnesia”, as he calls it, is the elephant in the room of the replication crisis. It even explains why some of the methodological suggestions to stem the tide of bad science, like preregistration, spectacularly miss the point:

And that’s why psychology is so hyper-ultra-mega empirical. We never know how our interventions will pan out, because we have no theory that says how they will pan out (incidentally, that’s also why we need preregistration: in psychology, predictions are made by individual researchers rather than by standing theory, and you can’t trust people the way you can trust theory).

Mind you, Borsboom himself thinks that some disciplines just “are low on theory”. Why that should be an inescapable fact, however, he does not say. And it goes without saying that any discipline we might care to think of today was at some point in its history “low on theory”. Biology and chemistry had no theory to speak of as recently as roughly 150 years ago. But it takes not just lots and lots of work (which people are obviously willing to put in) but also the good fortune to be working on problems that are ripe for yielding answers, and an idea of what a unifying, explanatory theory actually looks like, which is no different in the social sciences than in any others. As it is, even Mayo’s book-length attempt to “get beyond the statistics wars” is hardly even a baby step. Even the “severe tester” of Mayo’s imagination remains condemned, in Bernard’s sadly apt phrase, “to wander aimlessly”.