In choosing the grounds upon which a general hypothesis should be rejected, the experimenter will rightly consider all points on which, in the light of current knowledge, the hypothesis may be imperfectly accurate, and will select tests, so far as possible, sensitive to these possible faults, rather than to others. [47]

# Tag Archive: statistics

## The misuse of significance tests

The examples elaborated in the foregoing sections of numerical discrepancies arising from the rigid formulation of a rule, which at first acquaintance it seemed natural to apply to all tests of significance, constitute only one aspect of the deep-seated difference in point of view which arises when Tests of Significance are reinterpreted on the analogy of Acceptance Decisions. It is indeed not only numerically erroneous conclusions, serious as these are, that are to be feared from an uncritical acceptance of this analogy.

An important difference is that Decisions are final, while the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision. An acceptance procedure is devised for a whole class of cases. No particular thought is given to each case as it arises, nor is the tester’s capacity for learning exercised. A test of significance on the other hand is intended to aid the process of learning by observational experience. [100]

## Unnatural science

If a broad line of demarcation is drawn between the natural sciences and what can only be described as the unnatural sciences, it will at once be recognized as a distinguishing mark of the latter that their practitioners try most painstakingly to imitate what they believe—quite wrongly, alas for them—to be the distinctive manners and observances of the natural sciences. Among these are:

(a) the belief that measurement and numeration are intrinsically praiseworthy activities (the worship, indeed, of what Ernst Gombrich calls *idola quantitatis*);

(b) the whole discredited farrago of inductivism—especially the belief that facts are prior to ideas and that a sufficiently voluminous compilation of facts can be processed by a calculus of discovery in such a way as to yield general principles and natural-seeming laws;

(c) another distinguishing mark of unnatural scientists is their faith in the efficacy of statistical formulas, particularly when processed by a computer—the use of which is in itself interpreted as a mark of scientific manhood. There is no need to cause offense by specifying the unnatural sciences, for their practitioners will recognize themselves easily: the shoe belongs where it fits. [167]

## Fisher on Bayesianism

[A]dvocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observable frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes. [6-7]

## Fisher on significance tests

In considering the appropriateness of any proposed experimental design, it is always needful to forecast all possible results of the experiment, and to have decided without ambiguity what interpretation shall be placed upon each one of them. Further, we must know by what argument this interpretation is to be sustained. …

It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20—the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent … . It is usual and convenient for the experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly “significant”, in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to *us*. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. [12-4]
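As an illustration (not part of Fisher's text), the idea of an experiment that "will rarely fail to give us a statistically significant result" can be sketched as a small simulation. Everything here is an assumption made for the example: a two-sided z-test with known unit variance, a hypothetical true effect of 0.5, and the sample sizes 20 and 100. The helper names `significant` and `rate_significant` are invented for this sketch.

```python
import random
from statistics import NormalDist, fmean


def significant(sample, sigma=1.0, alpha=0.05):
    """Two-sided z-test of 'true mean is 0', assuming sigma is known."""
    n = len(sample)
    z = fmean(sample) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rate_significant(true_mean, n, runs=2000, seed=1):
    """Fraction of simulated experiments that reach the 5 per cent level."""
    rng = random.Random(seed)
    hits = sum(
        significant([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(runs)
    )
    return hits / runs


# Phenomenon absent: roughly 1 experiment in 20 is "significant" by chance.
null_rate = rate_significant(0.0, n=20)
# Phenomenon present (hypothetical effect 0.5): small vs. larger experiment.
small_design = rate_significant(0.5, n=20)
large_design = rate_significant(0.5, n=100)

print(null_rate, small_design, large_design)
```

Under these assumptions the small design misses the effect in a sizeable share of runs, while the larger one almost never does. Only the latter would count as a "reliable method of procedure" in the sense of the passage above.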

## So you did one study? Do some more.

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment *rarely fails* to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen. [504-5]
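The closing claim, that inaccurate error estimates matter more than the chosen significance level, can be checked with a short calculation. This is a sketch under stated assumptions, not Fisher's own computation: the test statistic is taken to be normal, and `error_ratio` (true standard error divided by estimated standard error) is a hypothetical parameter introduced for the example.

```python
from statistics import NormalDist


def actual_false_positive_rate(nominal_alpha, error_ratio):
    """False-positive rate of a two-sided z-test when the standard error
    is misjudged. error_ratio = true SE / estimated SE (1.0 means the
    error estimate is accurate)."""
    nd = NormalDist()
    critical = nd.inv_cdf(1 - nominal_alpha / 2)
    # The statistic is really 'error_ratio' times more variable than assumed.
    return 2 * (1 - nd.cdf(critical / error_ratio))


# Honest error estimate at the 5 per cent point.
honest_5 = actual_false_positive_rate(0.05, 1.0)
# Nominal 1 per cent point, but the true error is 1.5 times the estimate.
sloppy_1 = actual_false_positive_rate(0.01, 1.5)

print(round(honest_5, 3), round(sloppy_1, 3))
```

In this hypothetical case a nominal 1 per cent claim resting on an underestimated error has a higher real false-positive rate than an honest 5 per cent test.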

## Weak statistical tests

The distinction between the strong and the weak use of significance tests is logical or epistemological; it is not a statistical issue. The weak use of significance tests asks merely whether the observations are attributable to “chance” (i.e., no relation exists) when a weak theory can only predict some sort of relation, but not what or how much. The strong use of significance tests asks whether observations differ significantly from the numerical values that a strong theory predicts, and it leads to the fourth figure of the syllogism—*p* ⊃ q, ~q , infer ~p—which is formally valid, the logician’s *modus tollens* (“destroying mode”). Psychologists should work hard to formulate theories that, even if somewhat weak, permit derivation of numerical point values or narrow ranges, yielding the possibility of *modus tollens* refutations. [422]

## Induction, philosophy’s toughest zombie

Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.

## The problem is epistemology, not statistics

Significance tests have a role to play in social science research but their current widespread use in appraising theories is often harmful. The reason for this lies not in the mathematics but in social scientists’ poor understanding of the logical relation between theory and fact, that is, a methodological or epistemological unclarity. Theories entail observations, not conversely. Although a theory’s success in deriving a fact tends to corroborate it, this corroboration is weak unless the fact has a very low prior probability and there are few possible alternative theories. The fact of a nonzero difference or correlation, such as we infer by refuting the null hypothesis, does not have such a low probability because in social science everything correlates with almost everything else, theory aside. In the “strong” use of significance tests, the theory predicts a numerical point value, or narrow range, so the hypothesis test subjects the theory to a grave risk of being falsified if it is objectively incorrect. In general, setting up a confidence interval is preferable, being more informative and entailing null hypothesis refutation if a difference falls outside the interval. Significance tests are usually more defensible in technological contexts (e.g., evaluating an intervention) than for theory appraisal. [393]
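To make the contrast between null-hypothesis refutation and a confidence interval concrete, here is a minimal sketch using the Fisher z-transformation for a correlation coefficient. The numbers are invented for illustration: an observed correlation of 0.03 in a hypothetical sample of 50,000, and a hypothetical strong theory predicting a correlation of 0.30. The function name `correlation_summary` is likewise an assumption of the example.

```python
import math
from statistics import NormalDist


def correlation_summary(r, n, level=0.95):
    """p-value for 'true correlation is 0' and an approximate confidence
    interval, both via the Fisher z-transformation."""
    nd = NormalDist()
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    p_value = 2 * (1 - nd.cdf(abs(z) / se))
    crit = nd.inv_cdf(0.5 + level / 2)
    interval = (math.tanh(z - crit * se), math.tanh(z + crit * se))
    return p_value, interval


p_value, (low, high) = correlation_summary(r=0.03, n=50_000)

# Weak use: the null hypothesis of zero correlation is refuted.
null_refuted = p_value < 0.05
# Strong use: a hypothetical point prediction of 0.30 lies outside the interval.
predicted_r = 0.30
theory_refuted = not (low <= predicted_r <= high)

print(p_value, (low, high), null_refuted, theory_refuted)
```

With a sample this large, the tiny correlation is "significant", which says little given that almost everything correlates with almost everything else. The interval carries the more useful information: the correlation is small, and a theory that predicted 0.30 would be refuted.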

## Inductive psychology vs deductive physics

Contrast this bizarre state of affairs with the state of affairs in physics. While there are of course a few exceptions, the usual situation in the experimental testing of a physical theory at least involves the prediction of a *form* of function (with parameters to be fitted); or, more commonly, the prediction of a quantitative magnitude (point-value). Improvements in the accuracy of determining this experimental function-form or point-value, whether by better instrumentation for control and making observations, or by the gathering of a larger number of measurements, has the effect of *narrowing* the band of tolerance about the theoretically predicted value. What does this mean in terms of the significance-testing model? It means: *In physics, that which corresponds, in the logical structure of statistical inference, to the old-fashioned point-null hypothesis $H_0$ is the value which flows as a consequence of the substantive theory T;* so that an increase in what the statistician would call “power” or “precision” has the methodological effect of stiffening the experimental test, of setting up a more difficult observational hurdle for the theory T to surmount. Hence, in physics the effect of improving precision or power is that of *decreasing* the prior probability of a successful experimental outcome if the theory lacks verisimilitude, that is, precisely the reverse of the situation obtaining in the social sciences.

As techniques of control and measurement improve or the number of observations increases, the methodological effect in physics is that a successful passing of the hurdle will mean a greater increment in corroboration of the substantive theory; whereas in psychology, comparable improvements at the experimental level result in an empirical test which can provide only a progressively weaker corroboration of the substantive theory.

In physics, the substantive theory predicts a point-value, and when physicists employ “significance tests,” their mode of employment is to compare the theoretically predicted value $x_0$ with the observed mean $\bar{x}_0$, asking whether they differ (in either direction!) by more than the “probable error” of determination of the latter. Hence $H : H_0 = \mu_x$ functions as a point-[null hypothesis. … As the probable error of] $\bar{x}_0$ shrinks, values of $\bar{x}_0$ consistent with $x_0$ (and hence, compatible with its implicans T) must lie within a narrow range. In the limit (zero probable error, corresponding to “perfect power” in the significance test) any non-zero difference $(\bar{x}_0 - x_0)$ provides a *modus tollens* refutation of T. If the theory has negligible verisimilitude, the logical probability of its surviving such a test is negligible. Whereas in psychology, the result of perfect power (i.e., certain detection of any non-zero difference in the predicted direction) is to yield a prior probability *p* = ½ of getting experimental results compatible with T, because perfect power would mean guaranteed detection of whatever difference exists; and a difference [quasi] always exists, being in the “theoretically expected direction” half the time if our substantive theories were all of negligible verisimilitude (two-urn model). [112-3]
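The asymmetry described in this passage can be sketched as a simulation. This is only an illustration under assumptions that are not in the source: true effects are drawn from a normal distribution with standard deviation 0.2, so that some difference always exists; theories have no verisimilitude, so their predictions are drawn independently of the truth; and a normal approximation with a 1.96 standard-error criterion stands in for the test. The function name `pass_rates` is invented for the example.

```python
import random


def pass_rates(n, runs=4000, seed=0):
    """Share of worthless theories that survive testing at sample size n.

    Returns (weak_use, strong_use):
      weak_use   - theory predicts only a direction (the psychology case)
      strong_use - theory predicts a point value (the physics case)
    """
    rng = random.Random(seed)
    weak = strong = 0
    se = 1.0 / n ** 0.5  # standard error of the observed mean
    for _ in range(runs):
        true_effect = rng.gauss(0.0, 0.2)  # a difference always exists
        observed = rng.gauss(true_effect, se)

        # Weak use: significant difference in the predicted direction.
        predicted_sign = rng.choice((-1, 1))  # unrelated to the truth
        if abs(observed) > 1.96 * se and observed * predicted_sign > 0:
            weak += 1

        # Strong use: observation within tolerance of the predicted value.
        predicted_value = rng.gauss(0.0, 0.2)  # unrelated to the truth
        if abs(observed - predicted_value) <= 1.96 * se:
            strong += 1
    return weak / runs, strong / runs


weak_small, strong_small = pass_rates(n=10)
weak_large, strong_large = pass_rates(n=100_000)

print(weak_small, strong_small)
print(weak_large, strong_large)
```

As precision increases, the directional test passes a worthless theory in close to half of the runs, while the point-value test passes it in very few. That is the contrast between the weak and the strong use of significance tests drawn in the quotation.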
