23 July 2017 – Paul Meehl: “The Problem is Epistemology, not Statistics”
All papers etc. referred to on this page can be found in PDF format on bit.ly/Notturno-Seminar.
A paper co-authored by me highlights the need for some Popperian analysis of the recent replication crisis in the social sciences: “Falsificationism is not just ‘potential’ falsifiability, but requires ‘actual’ falsification: Social psychology, critical rationalism, and progress in science”
In 1961, at a conference of the German Sociological Association, Popper was asked to give a paper on “The Logic of the Social Sciences”. Popper’s main thesis was: “The method of the social sciences, like that of the natural sciences, consists in trying out tentative solutions to certain problems”. Rather predictably, Popper was accused by Adorno of ‘scientism’, an accusation Popper had laboriously tried to preempt by pointing out all the fallacious assumptions usually made by those using this term:
There is, for instance, the misguided and erroneous methodological approach of naturalism or scientism which urges that it is high time that the social sciences learn from the natural sciences what scientific method is. This misguided naturalism establishes such demands as: begin with observations and measurements; this means, for instance, begin by collecting statistical data; proceed, next, by induction to generalizations and to the formation of theories. It is suggested that in this way you will approach the ideal of scientific objectivity, so far as this is at all possible in the social sciences. In so doing, however, you ought to be conscious of the fact that objectivity in the social sciences is much more difficult to achieve (if it can be achieved at all) than in the natural sciences. For an objective science must be ‘value-free’; that is, independent of any value judgment. But only in the rarest cases can the social scientist free himself from the value system of his own social class and so achieve even a limited degree of ‘value freedom’ and ‘objectivity’.
Every single one of the theses which I have here attributed to this misguided naturalism is in my opinion totally mistaken: all these theses are based on a misunderstanding of the methods of the natural sciences, and actually on a myth—a myth, unfortunately all too widely accepted and all too influential. It is the myth of the inductive character of the methods of the natural sciences, and of the character of the objectivity of the natural sciences.
What Popper was saying, in short, was this: if you want to understand how any science works, you have to give up induction (and its corollary, certain knowledge as the aim of science) and recognize that objectivity does exist, though it does not flow from a “well-purged mind” but results from a social process of criticism.
Hayek, in his Nobel Prize Lecture of 1974, said much the same thing about ‘scientism’ as Popper, in a way that reminds one of the current predicament in the social sciences: “in the social sciences often that is treated as important which happens to be accessible to measurement”, condemning “the superstition that only measurable magnitudes can be important”. And even further: “I confess that I prefer true but imperfect knowledge, even if it leaves much indetermined and unpredictable, to a pretence of exact knowledge that is likely to be false.”
The social sciences, meanwhile, are dealing with what has been termed a replication crisis. One of the first high-profile articles to highlight a problem in the social sciences was Ioannidis’s “Why Most Published Research Findings Are False”, in which he said:
Several methodologists have pointed out that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance […].
Ioannidis specifically blames over-reliance on p-values for the crisis: “Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values.” He goes on to point out that prior probability, statistical power, and effect size can influence a test in such a way that even a result of p ≤ 0.05 may have a probability of being true of less than 50 %.
Other authors offer similar objections. Button et al., for example, make a point about statistical power: “A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect.” Colquhoun adds that the 5 % threshold for significance tests should be drastically lowered: “if you wish to keep your false discovery rate below 5 %, you need to use a three-sigma rule, or to insist on p ≤ 0.001.”
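The arithmetic behind these points can be sketched in a few lines. The following is a minimal illustration, not a reproduction of any of the cited authors' calculations: the function name `ppv` and all input values (a prior of 1 in 10, the three power levels) are assumptions chosen for the example, and the bias term that Ioannidis includes is left out. It is also not Colquhoun's full argument for p ≤ 0.001, which rests on a more detailed analysis.

```python
def ppv(prior: float, power: float, alpha: float) -> float:
    """Probability that a statistically significant result reflects a true effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: probability of a significant result when the effect is real
    alpha: probability of a significant result when there is no effect
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative assumption only: 1 in 10 tested hypotheses is true.
PRIOR = 0.10

for power in (0.80, 0.50, 0.20):
    for alpha in (0.05, 0.001):
        value = ppv(PRIOR, power, alpha)
        # The false discovery rate is the complement of the PPV.
        print(f"power={power:.2f} alpha={alpha:<5} PPV={value:.3f} FDR={1 - value:.3f}")
```

Under these assumed inputs, a significant result at the 5 % level with power 0.20 is more likely to be a false positive than a true one, while lowering either the threshold or the share of false hypotheses tested raises the PPV.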
Krantz puts his finger on a widespread misconception: “A common error in this type of test is to confuse the significance level actually attained (for rejecting the straw-person null) with the confirmation level attained for the original theory.” Better statistical tools might help overcome this problem: “Statistics could help researchers avoid this error by providing a good alternative measure of degree of confirmation.”
Trafimow, who banned p-values altogether in Basic and Applied Social Psychology, the journal he edits, takes this line of reasoning further than anybody else by declaring null hypothesis significance testing (NHST) invalid as such:
As has been pointed out […] but ignored by the majority of quantitative psychologists […], the probability of the finding, given that the null hypothesis is true (again, this is p), is not the same as the probability of the null hypothesis being true, given that one has obtained the finding. …
Remember that the goal of the significance test is to reject the null hypothesis, which means it needs to be demonstrated to have a low probability of being true. …
It is now widely accepted by those quantitative researchers who are mathematically sophisticated and have expertise about the null hypothesis significance testing procedure, that it is invalid.
Unfortunately, the first two statements are factually wrong, so his conclusion shows nothing more than that badly executed and wrongly interpreted methods are invalid.
But even more importantly, it is the supposed aim of social scientific research that is problematic. Button et al., for example, have this to say:
[T]he lower the power of a study, the lower the probability that an observed effect that passes the required threshold of claiming its discovery (that is, reaching nominal statistical significance, such as p < 0.05) actually reflects a true effect. This probability is called the PPV of a claimed discovery.
This focus on being able to claim a discovery, on finding an effect, on confirming a theory is unambiguously and unabashedly defended by Colquhoun, who seems to think that not relying on inductive reasoning in science would be utterly absurd:
The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). …
Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.
This totally misunderstands the problem of induction, wrongly assumes that the probability of a hypothesis being true is of any import, and absurdly claims that Bayes’s theorem (if applied correctly) has anything to do with induction.
Now, Meehl pointed out pretty much all of this 50 years ago. In his “The Problem is Epistemology, not Statistics”, he says:
Significance tests have a role to play in social science research but their current widespread use in appraising theories is often harmful. The reason for this lies not in the mathematics but in social scientists’ poor understanding of the logical relation between theory and fact, that is, a methodological or epistemological unclarity.
And in his 1967 paper:
The writing of behavior scientists often reads as though they assumed—what it is hard to believe anyone would explicitly assert if challenged—that successful and unsuccessful predictions are practically on all fours in arguing for and against a substantive theory. …
Inadequate appreciation of the extreme weakness of the test to which a substantive theory T is subjected by merely predicting a directional statistical difference d > 0 is then compounded by a truly remarkable failure to recognize the logical asymmetry between, on the one hand, (formally invalid) “confirmation” of a theory via affirming the consequent in an argument of form: [T ⊃ H1, H1, infer T], and on the other hand the deductively tight refutation of the theory modus tollens by a falsified prediction, the logical form being: [T ⊃ H1, ~H1, infer ~T].
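The logical asymmetry Meehl describes can be checked mechanically with a truth table. The sketch below is only an illustration; the helper names `implies` and `is_valid` are made up for this example. It treats T and H1 as plain propositions and tests both argument forms against every truth assignment.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional: false only when a is true and b is false."""
    return (not a) or b


def is_valid(premises, conclusion) -> bool:
    """An argument form is valid if no truth assignment makes every
    premise true while the conclusion is false."""
    for t, h in product([True, False], repeat=2):
        if all(p(t, h) for p in premises) and not conclusion(t, h):
            return False
    return True


# Affirming the consequent: T ⊃ H1, H1, infer T
affirming_the_consequent = is_valid(
    premises=[lambda t, h: implies(t, h), lambda t, h: h],
    conclusion=lambda t, h: t,
)

# Modus tollens: T ⊃ H1, ~H1, infer ~T
modus_tollens = is_valid(
    premises=[lambda t, h: implies(t, h), lambda t, h: not h],
    conclusion=lambda t, h: not t,
)

print("affirming the consequent valid:", affirming_the_consequent)  # False
print("modus tollens valid:", modus_tollens)  # True
```

The counterexample for the first form is the case where the theory is false but the prediction comes out true anyway, which is exactly the situation a weak directional prediction makes likely.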
Finally, it is almost comical that the inventor of NHSTs, R. A. Fisher, had for the most part preempted both the misuse of NHSTs and the inductive thinking that is so prevalent in today’s social sciences. In his The Design of Experiments, Fisher puts this in almost Popperian terms:
In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.
And, of course, Fisher also made sure to underline (in “The Arrangement of Field Experiments”) that no single test or study should be taken as grounds to claim anything, much less a “discovery”: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” The reason for this is so obvious that it really should not need to be said:
No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly “significant,” in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.
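Fisher's point about isolated experiments can be put in numbers. The following is a rough sketch under stated assumptions: the experiments are independent, there is no real effect in any of them, and each has the 1-in-70 chance of a "significant" result that Fisher mentions. The function names are invented for the illustration.

```python
def prob_at_least_one(p_single: float, n_experiments: int) -> float:
    """Chance that at least one of n independent experiments comes out
    'significant' when there is no real effect in any of them."""
    return 1.0 - (1.0 - p_single) ** n_experiments


def prob_all_by_chance(p_single: float, n_replications: int) -> float:
    """Chance that n independent experiments are all 'significant'
    when there is no real effect."""
    return p_single ** n_replications


P_CHANCE = 1 / 70  # Fisher's "once in 70 trials"

for n in (1, 10, 70, 200):
    print(f"{n:>3} experiments: P(at least one significant) = "
          f"{prob_at_least_one(P_CHANCE, n):.3f}")

for k in (1, 2, 3):
    print(f"{k} replications: P(all significant by chance) = "
          f"{prob_all_by_chance(P_CHANCE, k):.6f}")
```

The first loop shows why a single significant result among many attempts demonstrates little; the second shows why an experiment that "rarely fails to give this level of significance" is a much stronger standard.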
The problem in the social sciences, it turns out, is really both epistemology and statistics. Fisher was more than explicit in pointing out the pitfalls. Meehl tried to remind his profession not to disregard them. Both have had less than stellar success.