Recently, some psi skeptics (Wagenmakers et al) have written a technical article disputing the validity of Bem's analyses of his data.
In this blog post I'll give my reaction to the Wagenmakers et al (WM from here on) paper.
It's a frustrating paper, because it makes some valid points -- yet it also confuses the matter by inappropriately accusing Bem of committing "fallacies" and by arguing that the authors' preconceptions against psi should be used to bias the data analysis.
The paper makes three key points, which I will quote in the summarized form given here and then respond to one by one:
Bem has published his own research methodology and encourages the formulation of hypotheses after data analysis. This form of post-hoc analysis makes it very difficult to determine accurate statistical significance. It also explains why Bem offers specific hypotheses that seem odd a priori, such as erotic images having a greater precognitive effect. Constructing hypotheses from the same data range used to test those hypotheses is a classic example of the Texas sharpshooter fallacy.
As WM note in their paper, this is actually how science is ordinarily done; Bem is just being honest and direct about it. Scientists typically run many exploratory experiments before finding the ones with results interesting enough to publish.
It's a meaningful point, and a reminder that science as typically practiced does not match some of the more naive notions of "scientific methodology". But it would also be impossibly cumbersome and expensive to follow the naive notion of scientific methodology and avoid exploratory work altogether, in psi or any other domain.
Ultimately this complaint against Bem's results is just another version of the "file drawer effect" hypothesis, which has been analyzed in great detail in the psi literature via meta-analyses across many experiments. The file drawer effect argument seems somewhat compelling when you look at a single experiment-set like Bem's, and becomes much less compelling when you look across the scope of all psi experiments reported, because the conclusion becomes that you'd need a huge number of carefully-run, unreported experiments to explain the total body of data.
BTW, the finding that erotic pictures give more precognitive response than other random pictures, doesn't seem terribly surprising, given the large role that sexuality plays in human psychology and evolution. If the finding were that pictures of cheese give more precognitive response than anything else, that would be more strange and surprising to me.
The paper uses the fallacy of the transposed conditional to make the case for psi powers -- essentially, mixing up the probability of the data given a hypothesis with the probability of the hypothesis given the data.
This is a pretty silly criticism, much less worthy than the other points raised in the WM paper. Basically, when you read the discussion backing up this claim, the authors are saying that one should take into account the low a priori probability of psi in analyzing the data. OK, well ... one could just as well argue for taking into account the high a priori probability of psi given the results of prior meta-analyses or anecdotal reports of psi. Blehh.
Using the term "fallacy" here makes it seem, to people who just skim the WM paper or read only the abstract, as if Bem made some basic reasoning mistake. Yet when you actually read the WM paper, that is not what is being claimed. Rather they admit that he is following ordinary scientific methodology.
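To make the "transposed conditional" point concrete: the distinction WM invoke is just Bayes' rule, and a toy calculation (all probabilities below are invented for illustration) shows that the prior does all the work:

```python
# Toy illustration of the "transposed conditional": P(D | H) is not
# P(H | D).  All probabilities below are invented for illustration.
p_h = 0.01              # prior probability of the hypothesis (psi is real)
p_d_given_h = 0.95      # chance of seeing this data if H is true
p_d_given_not_h = 0.05  # chance of seeing this data if H is false

# Bayes' rule: P(H | D) = P(D | H) * P(H) / P(D)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d

print(round(p_h_given_d, 3))  # about 0.161: a 95% likelihood shrinks
                              # to a ~16% posterior under a 1% prior
```

The catch is exactly the one noted above: the 1% prior is itself a choice. A psi proponent citing prior meta-analyses could just as well plug in a much higher prior and get a high posterior from the very same data.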
Wagenmakers' analysis of the data using a Bayesian t-test removes the significant effects claimed by Bem.
This is the most worthwhile point raised in the Wagenmakers et al paper.
Using a different sort of statistical test than Bem used, they re-analyze Bem's data and find that, while the results are positive, they are not positive enough to pass the threshold of "statistical significance." They conclude that a somewhat larger sample size would be needed to establish statistical significance using their test.
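For the curious: the "default" Bayesian t-test in the style of Rouder and colleagues, which is the kind of test WM applied, compares the likelihood of the observed t-value under the null against its likelihood averaged over a Cauchy prior on effect size. Here is a rough sketch (my own illustrative implementation, not WM's code; the t and N values at the bottom are made up):

```python
import math
from scipy.integrate import quad

def jzs_bayes_factor_01(t, n):
    """BF01: evidence for the null over the alternative, using the
    JZS (Cauchy) prior on effect size, Rouder et al.-style."""
    nu = n - 1
    # Marginal likelihood of t under the null (zero effect size)
    null = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Under the alternative, average over the Cauchy prior on effect
    # size, written as a scale mixture of normals with an
    # inverse-gamma(1/2, 1/2) mixing density on g
    def integrand(g):
        if g <= 0:
            return 0.0
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * math.pi) ** -0.5 * g ** -1.5
                * math.exp(-1 / (2 * g)))
    alt, _ = quad(integrand, 0, math.inf)
    return null / alt

# t = 2.0 with N = 100 is roughly "just significant" classically
# (p < .05), but on this measure the evidence is weak and may even
# mildly favor the null -- the flavor of WM's re-analysis
bf01 = jzs_bayes_factor_01(t=2.0, n=100)
```

A Bayes factor near 1 means the data barely discriminate between the hypotheses, which is why, on this test, moderately positive results call for a larger sample rather than counting as significant.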
The question then becomes why one would choose one statistical test over another. Indeed, it's common scientific practice to choose a statistical test that makes one's results appear significant, rather than other tests that do not. This is not peculiar to psi research; it's simply how science is typically done.
Near the end of their paper, WM point out that Bem's methodology is quite typical of scientific psychology research, and in fact more rigorous than most psychology papers published in good journals. What they don't note, but could have, is that the same sort of methodology is used in pretty much every area of science.
They then make a series of suggestions regarding how psi research should be conducted, which would indeed increase the rigor of the research, but which a) are not followed in any branch of science, and b) would make psi research sufficiently cumbersome and expensive as to be almost impossible to conduct.
I didn't dig into the statistics deeply enough to assess the appropriateness of the particular test that WM applied (leading to their conclusion that Bem's results don't show statistical significance, for most of his experiments).
However, I am quite sure that if one applied this same Bayesian t-test to a meta-analysis over the large body of published psi experiments, one would get highly significant results. But then WM would likely raise other issues with the meta-analysis (e.g. the file drawer effect again).
I'll be curious to see the next part of the discussion, in which a psi-friendly statistician like Jessica Utts (or a statistician with no bias on the matter, but unbiased individuals seem very hard to come by where psi is concerned) discusses the appropriateness of WM's re-analysis of the data.
But until then, let's be clear on what WM have done. Basically, they've:
- raised the tired old, oft-refuted spectre of the file drawer effect, using different verbiage than usual
- argued that one should analyze psi data using an a priori bias against it (and accused Bem of "fallacious" reasoning for not doing so)
- pointed out that if one uses a different statistical test than Bem did [though not questioning the validity of the statistical test Bem did use], one finds that his results, while positive, fall below the standard of statistical significance in most of his experiments
The practical consequence of their last point is that, if Bem's experiments were repeated with the same sort of results as obtained so far, then eventually a sufficient sample size would be accumulated to demonstrate significance according to WM's suggested test.
So when you peel away the rhetoric, what the WM critique really comes down to is: "Yes, his results look positive, but to pass the stricter statistical tests we suggest, one would need a larger sample size."
Of course, there is plenty of arbitrariness in our conventional criteria of significance anyway -- why do we like .05 so much, instead of .03 or .07?
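A quick toy example of how little separates "significant" from "not significant" (the t and df values here are invented):

```python
from scipy import stats

# Invented example: t = 1.90 with 99 degrees of freedom
p = 2 * stats.t.sf(1.90, df=99)  # two-tailed p-value

# p comes out around .06: "not significant" under the .05 convention,
# "significant" if the convention had been .07 -- yet the data, and
# the evidence they carry, are identical either way
```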
So I really don't see too much meat in WM's criticism. Everyone wants to see replications of the experiments anyway, and no real invalidity in Bem's experiments, results or analyses was demonstrated. The point made is merely that a stricter measure of significance would render these results (and an awful lot of other scientific results) insignificant until replication on a larger sample size was demonstrated. Which is an OK point -- but I'm still sorta curious to see a more careful, less obviously biased analysis of which significance test is best to use in this case.