
Tuesday, November 23, 2010

The Psi Debate Continues (Goertzel on Wagenmakers et al on Bem on precognition)

A few weeks ago I wrote an article for H+ Magazine about the exciting precognition results obtained by Daryl Bem at Cornell University.

Recently, some psi skeptics (Wagenmakers et al) have written a technical article disputing the validity of Bem's analyses of his data.

In this blog post I'll give my reaction to the Wagenmakers et al (WM from here on) paper.

It's a frustrating paper, because it makes some valid points -- yet it also confuses the matter by inappropriately accusing Bem of committing "fallacies" and by arguing that the authors' preconceptions against psi should be used to bias the data analysis.

The paper makes three key points, which I'll quote in the summarized form given here and then respond to one by one.

POINT 1

"
Bem has published his own research methodology and encourages the formulation of hypotheses after data analysis. This form of post-hoc analysis makes it very difficult to determine accurate statistical significance. It also explains why Bem offers specific hypotheses that seem odd a priori, such as erotic images having a greater precognitive effect. Constructing hypotheses from the same data range used to test those hypotheses is a classic example of the Texas sharpshooter fallacy
"

MY RESPONSE

As WM note in their paper, this is actually how science is ordinarily done; Bem is just being honest and direct about it. Scientists typically run many exploratory experiments before finding the ones with results interesting enough to publish.

It's a meaningful point, and a reminder that science as typically practiced does not match some of the more naive notions of "scientific methodology". But it would also be impossibly cumbersome and expensive to follow the naive notion of scientific methodology and avoid exploratory work altogether, in psi or any other domain.

Ultimately this complaint against Bem's results is just another version of the "file drawer effect" hypothesis, which has been analyzed in great detail in the psi literature via meta-analyses across many experiments. The file drawer effect argument seems somewhat compelling when you look at a single experiment-set like Bem's, and becomes much less compelling when you look across the scope of all psi experiments reported, because the conclusion becomes that you'd need a huge number of carefully-run, unreported experiments to explain the total body of data.
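For readers who want to see how that "huge number of unreported experiments" calculation works: Rosenthal's fail-safe N asks how many unpublished null studies it would take to wash out a combined result. Here's a minimal sketch; the 30-study, z = 1.2 figures are invented for illustration and are not taken from any actual psi meta-analysis.

```python
import math

def fail_safe_n(z_scores, alpha_z=1.645):
    """Rosenthal's fail-safe N: how many unpublished null (z = 0) studies
    would be needed to drag the combined Stouffer z below the one-sided
    .05 significance threshold (z = 1.645)."""
    k = len(z_scores)
    sum_z = sum(z_scores)
    # Combined z with X extra null studies is sum_z / sqrt(k + X);
    # solve sum_z / sqrt(k + X) = alpha_z for X.
    x = (sum_z / alpha_z) ** 2 - k
    return max(0, math.ceil(x))

# Hypothetical meta-analysis: 30 published studies averaging z = 1.2 each
print(fail_safe_n([1.2] * 30))  # file-drawer studies needed to nullify it
```

Even with individually weak studies, the number of hidden null results required grows quickly with the number of published positive ones, which is the crux of the meta-analytic counterargument.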

BTW, the finding that erotic pictures give more precognitive response than other random pictures doesn't seem terribly surprising, given the large role that sexuality plays in human psychology and evolution. If the finding were that pictures of cheese give more precognitive response than anything else, that would be more strange and surprising to me.


POINT 2

"
The paper uses the fallacy of the transposed conditional to make the case for psi powers. Essentially mixing up the difference between the probability of data given a hypothesis versus the probability of a hypothesis given data.
"

MY RESPONSE

This is a pretty silly criticism, much less worthy than the other points raised in the WM paper. Basically, when you read the discussion backing up this claim, the authors are saying that one should take into account the low a priori probability of psi in analyzing the data. OK, well ... one could just as well argue for taking into account the high a priori probability of psi given the results of prior meta-analyses or anecdotal reports of psi. Blehh.

Using the term "fallacy" here makes it seem, to people who just skim the WM paper or read only the abstract, as if Bem made some basic reasoning mistake. Yet when you actually read the WM paper, that is not what is being claimed. Rather they admit that he is following ordinary scientific methodology.


POINT 3

"
Wagenmakers' analysis of the data using a Bayesian t-test removes the significant effects claimed by Bem.
"

MY RESPONSE

This is the most worthwhile point raised in the Wagenmakers et al paper.

Using a different sort of statistical test than Bem used, they re-analyze Bem's data and find that, while the results are positive, they are not positive enough to pass the threshold of statistical significance. They conclude that a somewhat larger sample size would be needed to establish statistical significance using their test.

The question then becomes why one would choose one statistical test over another. Indeed, it's common scientific practice to choose a statistical test that makes one's results appear significant, rather than another that does not. This is not peculiar to psi research; it's simply how science is typically done.
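To make the contrast between the two kinds of test concrete, here's a rough sketch using the BIC approximation to the Bayes factor for a one-sample t-test (a simpler stand-in for the JZS Bayes factor WM actually used). The t and n values below are invented for illustration; they are not Bem's numbers.

```python
from math import sqrt
from statistics import NormalDist

def bic_bayes_factor_01(t, n):
    """BIC approximation to the Bayes factor BF01 (evidence for the null
    over the alternative) in a one-sample t-test; see Wagenmakers (2007),
    'A practical solution to the pervasive problems of p values'."""
    df = n - 1
    return sqrt(n) * (1 + t * t / df) ** (-n / 2)

# Invented illustrative numbers: t = 2.05 with n = 100 subjects
t, n = 2.05, 100
p = 2 * (1 - NormalDist().cdf(t))  # two-sided p, normal approx (df is large)
bf01 = bic_bayes_factor_01(t, n)

print(f"p = {p:.3f}")        # comfortably below .05
print(f"BF01 = {bf01:.2f}")  # yet above 1: this Bayes factor leans to the null
```

The same data can thus be "significant" by the classical criterion while a Bayes-factor analysis calls the evidence weak, which is essentially the disagreement at issue here.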

Near the end of their paper, WM point out that Bem's methodology is quite typical of scientific psychology research, and in fact more rigorous than most psychology papers published in good journals. What they don't note, but could have, is that the same sort of methodology is used in pretty much every area of science.

They then make a series of suggestions regarding how psi research should be conducted, which would indeed increase the rigor of the research, but which a) are not followed in any branch of science, and b) would make psi research sufficiently cumbersome and expensive as to be almost impossible to conduct.

I didn't dig into the statistics deeply enough to assess the appropriateness of the particular test that WM applied (leading to their conclusion that Bem's results don't show statistical significance, for most of his experiments).

However, I am quite sure that if one applied this same Bayesian t-test to a meta-analysis over the large body of published psi experiments, one would get highly significant results. But then WM would likely raise other issues with the meta-analysis (e.g. the file drawer effect again).

Conclusion

I'll be curious to see the next part of the discussion, in which a psi-friendly statistician like Jessica Utts (or a statistician with no bias on the matter, but unbiased individuals seem very hard to come by where psi is concerned) discusses the appropriateness of WM's re-analysis of the data.

But until that, let's be clear on what WM have done. Basically, they've

  • raised the tired old, oft-refuted spectre of the file drawer effect, using different verbiage than usual
  • argued that one should analyze psi data using an a priori bias against it (and accused Bem of "fallacious" reasoning for not doing so)
  • pointed out that if one uses a different statistical test than Bem did [though not questioning the validity of the statistical test Bem did use], one finds that his results, while positive, fall below the standard of statistical significance in most of his experiments

The practical consequence of their third point is that, if Bem's experiments were repeated with the same sort of results as obtained so far, then eventually a sufficient sample size would accumulate to demonstrate significance according to WM's suggested test.
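To get a rough sense of the sample sizes involved: with hit rates in the vicinity of 53% against a 50% chance baseline (the general magnitude discussed for these experiments; treat the numbers as illustrative), a standard normal-approximation power calculation looks like this:

```python
from math import ceil, sqrt
from statistics import NormalDist

def trials_needed(p1, p0=0.5, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a one-sided test that a
    hit rate p1 exceeds the chance level p0, at the given alpha and power."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # ~1.645 for alpha = .05
    z_b = NormalDist().inv_cdf(power)      # ~0.842 for 80% power
    numerator = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)

# A 53% hit rate against 50% chance: on the order of ~1700 trials
print(trials_needed(0.53))
```

Small effects demand large trial counts, which is why accumulating replications (rather than any single experiment) is what would eventually settle the significance question under a stricter test.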

So when you peel away the rhetoric, what the WM critique really comes down to is: "Yes, his results look positive, but to pass the stricter statistical tests we suggest, one would need a larger sample size."

Of course, there is plenty of arbitrariness in our conventional criteria of significance anyway -- why do we like .05 so much, instead of .03 or .07?

So I really don't see too much meat in WM's criticism. Everyone wants to see replications of the experiments anyway, and no real invalidity in Bem's experiments, results or analyses was demonstrated.... The point made is merely that a stricter measure of significance would render these results (and an awful lot of other scientific results) insignificant until replication on a larger sample size was demonstrated. Which is an OK point -- but I'm still sorta curious to see a more careful, less obviously biased analysis of which significance test is best to use in this case.

15 comments:

Matthew Fuller said...

Okay, so if you believe in psi anomalies, which is rational in my opinion given the evidence, what is your opinion of remote viewing? In particular:
http://www.farsight.org/demo/Multiple_Universes/MUP_Session_Download_Page_November2009.html

tl;dr The remote viewers sketch a missile launch consistent with the Russian report of the spiral anomaly. This was done by giving them a number corresponding to an event that had not yet been chosen. After the remote viewing session is over, an event is chosen (in this case the spiral anomaly), which is obviously outside the control of the organization, so no leaking of info was possible.

jimrandomh said...

The fact that other scientific fields are also abusing statistics does not make it okay, because it does not make the conclusions that result from statistical abuses true. The choice of which statistical test to use is not arbitrary, and using the wrong one is as bad as writing down the wrong value for a low-order digit; you can get away with it when the effect size is large, but not here.

The problem with meta-analyses is that they're biased, not just by the file drawer effect, but also by incorporating the occasional incorrect procedures and falsified datasets. Selection effects can also occur at granularities other than the single published paper, with researchers conducting a series of runs, discarding some of the negative ones, and combining the remaining results into a single published study. Since all the biasing effects are cumulative, the number of unpublished negative studies necessary to explain the data is much smaller than you think.

Matthew Fuller said...

That's just an assertion, Jim: that the file drawer effect matters more than supporters of psi think it does.

Ben Goertzel said...

jimrandomh wrote


The fact that other scientific fields are also abusing statistics does not make it okay, because it does not make the conclusions that result from statistical abuses true. The choice of which statistical test to use is not arbitrary, and using the wrong one is as bad as writing down the wrong value for a low-order digit; you can get away with it when the effect size is large, but not here.



Of course it's not OK to use the wrong statistical technique.

Please note, though, that Wagenmakers et al did NOT claim that Bem used the "wrong statistical technique" in his paper.

They suggested a different statistical technique, but they didn't dispute Bem's choice of technique, or claim that it was "wrong."

Matthew Fuller asked:


Okay, so if you believe in psi anomalies, which is rational in my opinion given the evidence, what is your opinion of remote viewing?


Based on what I've read, I would guess it probably is a real phenomenon, though I haven't studied it as carefully as precognition or ganzfeld experiments.

EJ said...

Hi Ben,

As you mention throughout your discussion of our paper, we did not intend to critique Bem's work specifically -- instead, we feel that this work highlights some of the more general problems in the field.

Unfortunately, these problems are especially damaging to people who want to make a claim for psi. Check for yourself -- how many tests did Bem compute for the data from Experiment 1?

Yes, other work in science sometimes also proceeds along these lines -- but that should hardly increase our confidence in Bem's results.

I disagree that the solutions we propose at the end of the paper are too strict; it is easy to find a skeptic to work with, and it *should* be easy to determine in advance what test you will do. This seems a small price to pay for increased credibility.

Cheers,
E.J. (Wagenmakers)

Anonymous said...

When considering the possibility of a claim, we can ask the question "how could this be true" but also "what if this is true?". The answer to the second question seems pretty easy (as the WM paper points out) -- if precognition existed, then we would see it used everywhere for personal gain.

Just one example -- even with a few-second window into the future, one could make a huge fortune trading S&P futures if one were well connected with Wall Street insiders with fast trade executions. And if one person could do this, then others would soon also be using it, which would fundamentally alter the stock market. I don't think this is happening, so by this fact alone we can conclude that precognition does not exist in the proportions that Bem reports.

Ben Goertzel said...

A reply to Anonymous about psi and trading....

Actually, in practice it's not that easy to make a winning trading system from a predictive algorithm with, say, 53% odds of predicting the correct market direction, and a high volatility. You can do it but you need careful attention to money management rules and other details of trading system construction.

So hypothetically, if a person had a slight edge over the stock market via psi, they could make $$ from this if A) they were *also* good traders with good trading instincts and a knowledge of how to execute a trading system, or B) they used their psi-driven signals within a well-tuned black-box trading system of some kind.

If there are also machine-learning or statistics-based trading systems out there that can predict price direction 53% of the time (but often with high volatility), which I do believe there are ... then the psi-based systems wouldn't necessarily be distinguishable from the ML-based systems in terms of their market impact.

So I don't agree with your claim that, if psi of the magnitude Bem's experiments suggest were real, the financial markets would look different than they do today.
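The money-management point above can be illustrated with a toy simulation (all numbers hypothetical; real trading involves costs, slippage, and non-even payoffs that this ignores). A 53% edge on even-money bets compounds nicely when each trade risks the Kelly fraction of equity, but the same edge destroys the account when the stake is oversized:

```python
import random

def simulate(win_prob, stake_frac, n_trades, seed=0):
    """Equity curve for a 1:1-payoff strategy: each trade risks
    stake_frac of current equity and wins with probability win_prob."""
    rng = random.Random(seed)
    equity = 1.0
    for _ in range(n_trades):
        if rng.random() < win_prob:
            equity *= 1 + stake_frac
        else:
            equity *= 1 - stake_frac
    return equity

p = 0.53
kelly = 2 * p - 1  # Kelly fraction for even-money bets: p - (1 - p) = 6%
print(simulate(p, kelly, 50_000))  # modest sizing lets the edge compound
print(simulate(p, 0.50, 50_000))   # oversized bets grind the account to ~0
```

The second run loses despite the identical 53% win rate, because the expected *log* growth per trade turns negative once the stake is too large -- which is exactly why a slight psi (or ML) edge wouldn't automatically translate into visible market distortions.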

Ben Goertzel said...

A reply to EJ.

Thanks for taking the trouble to comment on my blog.

I want to add that I think your paper falls on the side of "intelligent, responsible criticism of psi research" rather than "knee-jerk, intellectually dishonest pseudo-skepticism" -- so, thanks for that!

Intelligent criticism of the type made in your paper, on balance, moves psi research forward rather than just creating noise and trouble like some of the crappier pseudo-skeptical criticism does...

Having said that, I still don't think your recommendations regarding psi research are terribly practical, though I agree that's how things would work in an ideal world. In practice, packaging up one's research software for others to use takes a lot of work, and finding serious-minded, properly-educated, not-egregiously-biased skeptics to monitor one's work requires plenty of effort too.... It's hard enough to find time and funding for psi research, without adding all these other requirements on.

I think what Bem did was OK, and I think that to be taken really seriously his results will need to be replicated by others using the software he provided. A couple of successful replications would lead to many more, I suppose, which would eventually provide the larger sample sizes needed to achieve significance according to your preferred significance test.

I hope to find time to dig into the statistical analysis of Bem's data myself at some point ... my PhD is in math so I have the chops to do it, but currently I just don't have the time. And I'm kinda hoping Jessica Utts will do it first, as I have a lot of faith in her insight into this kind of issue.

Robin Zimmermann said...

I'm not really qualified to comment on most of this, but a point: You said that the paper found the results were positive, but of lower significance ... but I believe the results of the Bayesian t-test were summarized in Table 2, and of the ten tests:

* One returned a 'substantial' result in favor of psi,
* Three returned 'anecdotal' results in favor of psi,
* Three returned 'anecdotal' results in favor of chance - that is, against psi, and
* Three returned 'substantial' results in favor of chance.

Contrary to your summary, the average result measured by that method is outright negative, although not significantly so.

Ben Goertzel said...

Robin Zimmermann:

OK, I should have worded it more carefully.....

This gets down to the difficulty of translating mathematics into English in an effective way.

According to the verbiage in the WM paper, "anecdotal evidence in favor of chance" means that the incidence of psi effects in the data -- though greater than the *average* one would expect if psi did not exist -- is not enough greater than the average to allow statistical confirmation that psi exists according to their chosen test (but it's *almost* enough greater...).

What I meant to say is: WM are not disputing that the Bem results show psi effects more often than one would expect, on average, if psi did not exist. But they dispute that the difference between the actual results and the "expected in the absence of psi" results are sufficient to conclude psi exists... except in the case of one of the experiments.
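For reference, the "anecdotal"/"substantial" wording comes from Jeffreys' conventional scale for interpreting Bayes factors (1-3 anecdotal, 3-10 substantial, 10-30 strong, 30-100 very strong, above 100 decisive). Here's a small helper applying those standard cut-offs; this is my own sketch, not code from the WM paper:

```python
def jeffreys_label(bf):
    """Label the strength of evidence for a Bayes factor bf > 0
    using Jeffreys' conventional categories."""
    if bf < 1:
        # Evidence favors the other hypothesis: classify the reciprocal
        return jeffreys_label(1 / bf) + " (for the other hypothesis)"
    for bound, label in [(3, "anecdotal"), (10, "substantial"),
                         (30, "strong"), (100, "very strong")]:
        if bf < bound:
            return label
    return "decisive"

print(jeffreys_label(1.8))  # anecdotal
print(jeffreys_label(5.0))  # substantial
```

Note that "anecdotal evidence in favor of chance" on this scale just means a Bayes factor slightly above 1 in the null's favor, consistent with the reading above.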

gregory said...

You folks all need to meet more mystics.

Scesses said...

Thanks for an interesting analysis, and it was good also to have a comment from one of the authors of the paper.

One thing that bothered me in the paper was a footnote arguing that the possible compatibility of psi and modern physics doesn't prove the truth of psi.

The problem is that nobody claims that, so it's an utterly irrelevant point. When people attempt to show that psi is consistent with quantum physics, it's either to try to find a mechanism explaining the data, or it is in answer to the common objection that psi can't exist since it is against the laws of physics (or both). Now, the lack of a mechanism for psi is a serious shortcoming (though much more common in science than people tend to believe), so it's not bad to look for one. And to answer the objection that x is impossible due to its incompatibility with y by trying to show its compatibility with y also seems quite rational. So that footnote was really bad (which of course says nothing about the more interesting statistical points made in the article).

Mark Waser said...

Really, Ben?!?? Shame on you! The scientific method is *very* clear about how SCIENCE is done. You create a hypothesis, you design an experiment and how it is going to be analyzed, and then you perform the experiment and analysis EXACTLY how it was designed. Trawling data for hypotheses is EXPLORATION, also a very worthwhile pursuit but NOT science. To DO science, you take the results of (hypothesis suggested by) that EXPLORATION and design a NEW experiment and analysis (or even simply redo an old one) and see if the new results confirm your hypothesis. Scientists do indeed do both exploration and science -- but the good ones understand the difference and don't fill the literature with confused garbage that mistakes one for the other. The key to good science is reproducibility. If the hypothesis is there, just do an experiment to prove it. Until then, it's just a hypothesis drawn post hoc from data and NOT science at all (i.e. just because something is done by a self-proclaimed "scientist" or even a "real" scientist does not mean that it is science).

Ben Goertzel said...

A very good and detailed response to Wagenmakers et al is here...

http://dl.dropbox.com/u/8290411/ResponsetoWagenmakers.pdf

It addresses the details of the Bayes factor calculations (showing that Wagenmakers et al used a fucked-up prior, and if you use a more sensible one, you get highly significant results according to the Bayes factors).

It also addresses philosophy of science issues like Mark Waser raises, making clear that the experiments Bem posed were based on hypotheses formed by study of prior psi and psych experiments. It's not true that he collected a bunch of data aimlessly and then trawled it for promising-looking results.
