Sunday, 3 February 2013

The Honorton meta-analysis of Ganzfeld experiments 1985

Following from my post about the 1999 ganzfeld meta-analysis of Milton and Wiseman, I thought I'd write about another meta-analysis of the same type of experiments. In 1985 Charles Honorton wrote a meta-analysis of ganzfeld experiments from 1974-1982.

"The composite (Stouffer) Z score for the 28 studies is 6.6 (p < 10^ -9), and 43% of the studies were independently significant at the 5% level."

This Stouffer (unweighted) z of 6.6 is the one most commonly quoted in articles commenting on Honorton's findings. But as we saw with Milton and Wiseman's paper, choices regarding statistical measure and inclusion criteria can alter this figure quite radically.
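For readers who haven't met it, the unweighted Stouffer method just adds up the z-scores of the individual studies and divides by the square root of the number of studies. Here is a minimal sketch of that calculation. The z-scores in it are invented for illustration and are not taken from Honorton's database.

```python
import math

def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z-scores over sqrt(k)."""
    k = len(z_scores)
    return sum(z_scores) / math.sqrt(k)

# Hypothetical z-scores for four studies (not real ganzfeld data).
example_z = [2.0, 1.0, 0.0, -1.0]
combined = stouffer_z(example_z)  # (2 + 1 + 0 - 1) / sqrt(4)
print(combined)
```

Note that every study counts equally here, whether it ran 10 trials or 100.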

The ganzfeld experiments have enough different aspects that, like the sliders on a graphic equalizer, you can adjust them to get the desired result. It illustrates a sort of Heisenberg's Principle for statistics, where the more someone knows about a particular subject, the less able they are to quantify it with any accuracy. And I should emphasise that this is the point I am trying to make: the subjective element of meta-analyses can be considerable.

The Inclusion Criteria

Honorton based his figures on a sub-set of 28 experiments taken from the 42 experiments discussed in Hyman's 1985 paper. Among these 42 experiments, a number of different scoring systems were used. This, claimed Hyman, could lead to a problem where an experiment that initially used one method of scoring later found that a different method gave a better result, and reported that instead. As Stanford (1984) summarized:

“For whatever reason, many ganzfeld researchers have, historically speaking, seemed very unsure of what method to use to evaluate overall ESP performance. Many have used at least two, and sometimes more, methods of analysis. This common failure to settle, on logical, a priori grounds, upon a single method of analysis makes it difficult to decide whether ESP has occurred in any study where multiple analyses have been used with divergent outcomes.”

Honorton agreed with Hyman that the issue of multiple analysis was a problem, and so he decided to conduct his meta-analysis using only one scoring method. Namely, the Direct Hit method, which was the most common.

However, did this really address the issue? The experiment that Hyman used to illustrate the problem is still included in the database: York (1980) used Order Ranking as its main measure of success and Direct Hit as a secondary one, but reported only the statistically significant Direct Hit results. So I think the problem remains, especially when you consider that the Direct Hit score can be derived from the data for other measures such as Binary Hits, Sum Of Ranks or Z-score Ratings. It may be too much of a temptation for an experimenter to report a significant or more positive result on this scale alongside the other measures. Honorton included any experiment that reported Direct Hits, whether they were the primary measure or not.

Honorton's choice of Direct Hits may make sense at first glance since it includes the results from the majority of experiments (ie, 28 out of 42). However, it does not include the majority of the data (835 trials out of 2,567) and it is worth looking at the data that Honorton removed.

As a whole, the missing 14 experiments contain 1,612 trials with a Stouffer z of -0.01 (ie, fractionally below chance). Eleven of the fourteen reported results in a numerical form, the other three simply said the experiment was unsuccessful so in my calculations, a z-score of zero was awarded.

If we combine these fourteen with Honorton's database, the unweighted z-score falls from 6.6 to 5.2.

Statistical issues

The Milton and Wiseman meta-analysis was criticised for using a method of combining scores that did not take into account the size of each experiment. Since Honorton uses the same method, it seems valid to apply the same adjustment here. Once we use a weighted z-score, the result drops to 2.72 (odds of around 1 in 303).
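To show why weighting matters, here is a rough sketch of a weighted Stouffer combination, plus the conversion from a z-score to "odds of 1 in N". Two assumptions to flag: I weight each study by its number of trials, which is one common choice and not necessarily what the Meta-Analysis software does, and I treat the odds as one-tailed. The two studies are made up.

```python
import math
from statistics import NormalDist

def weighted_stouffer_z(z_scores, weights):
    """Weighted Stouffer: sum(w * z) / sqrt(sum(w^2))."""
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def one_tailed_odds(z):
    """Return N such that the one-tailed p-value is 1 in N."""
    p = 1.0 - NormalDist().cdf(z)
    return 1.0 / p

# Hypothetical data: a large study at chance, a small study with a high z.
z_scores = [0.0, 3.0]
trials = [100, 10]  # assumed weights: number of trials per study

unweighted = weighted_stouffer_z(z_scores, [1, 1])  # equal weights
weighted = weighted_stouffer_z(z_scores, trials)
print(unweighted, weighted)
print(one_tailed_odds(2.72))
```

With equal weights the small study dominates the combined score. Once the weights reflect the number of trials, the large chance-level study pulls the result back down.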

So with these two really quite uncontroversial decisions (include all data, and choose a more appropriate statistical measure) the result has fallen quite dramatically.

And once you have a certain amount of knowledge about the database, it's very easy to find more ways to push the result down even further. Now, I should reiterate that this isn't about the evidence for psi per se, but it does indicate that there is no single correct answer.

Methodological issues

A set of results from Cambridge were famously criticised by Blackmore (as well as Parker & Wiklund and C.E.M. Hansel) and as a result were removed by Jessica Utts in her analyses of the ganzfeld data (Utts, 1999, 2010). So if you take the example of Utts and remove the data from Cambridge then the weighted z-score falls even further, down to 2.18 (odds of around 1 in 69).

Small scale experiments

In calculating each z-score, the normal approximation to the binomial distribution is used. Since this approximation is not reliable for experiments with small numbers of trials (Wikipedia suggests the cut-off is where trials multiplied by chance probability, mostly 0.25 in this case, is less than 5, so I'll use that) we can remove all experiments with fewer than 20 trials. This reduces the weighted z-score to 2.09 (1 in 54).

[note: changed the wording of the above paragraph after some comments indicated it wasn't clear. Hope it is now. I can't get blogspot to deal with even the simplest algebraic symbols]
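Since the symbols are awkward to type here, the small-sample rule can be written out as a short sketch instead. It assumes the usual z-score for a hit count (observed hits minus expected hits, divided by the binomial standard deviation, with no continuity correction) and the rule of thumb that trials times chance probability should be at least 5. The function names and the example figures are mine, not from any of the papers.

```python
import math

def binomial_z(hits, trials, p=0.25):
    """z-score for a hit count under chance probability p."""
    expected = trials * p
    sd = math.sqrt(trials * p * (1.0 - p))
    return (hits - expected) / sd

def large_enough(trials, p=0.25, threshold=5):
    """Rule of thumb: trials * p should be at least the threshold."""
    return trials * p >= threshold

# Hypothetical experiment: 10 direct hits in 20 trials, chance = 0.25.
print(binomial_z(10, 20))
print(large_enough(20))  # 20 * 0.25 = 5, passes
print(large_enough(19))  # 19 * 0.25 = 4.75, fails
```

With a chance expectation of 0.25, the cut-off falls at exactly 20 trials, which is where the figure in the paragraph above comes from.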

In fact, it would be quite simple to write up a meta-analysis using these criteria as if they were perfectly sensible choices made by an impartial observer before any calculations were attempted. The truth is that sometimes I would try excluding a class of experiments, only to find that it pushed the result up again. I simply ignored that, and tried something else. In fact, this exercise has made me far more skeptical of meta-analyses than I am of the existence of ESP.

Towards non-significance

So, what hoops would a skeptic need to jump through in order to reduce the results to chance (or near chance)? Despite such a considerable drop so far, it is actually quite difficult to get the result down much more.

It is necessary to include all the experiments up until 1984 (ie, up to the year before the publication of Honorton's meta-analysis) and then take out two experiments by Honorton and Terry which had been criticised on methodological grounds by Kennedy.

This puts the weighted z-score at 1.78 (1 in 27) although the unweighted z-score is now, for once, lower than the weighted at 0.61 (approximately 1 in 4) so a really cheeky skeptic could reinstate the statistical measure they'd abandoned at the start because it inflated the score!

BLACKMORE, S., (1987) "A Report of a Visit to Carl Sargent's Laboratory", Journal of the Society for Psychical Research, 54, pp 186-198
HANSEL, C.E.M., (1985) "The Search for a Demonstration of ESP", "A Skeptic's Handbook of Parapsychology", ed. Paul Kurtz, pp 97-128
HONORTON, C., (1985) "Meta-Analysis of Psi Ganzfeld Research: A Response to Hyman", Journal of Parapsychology 49, pp 51-91
HYMAN, R., (1985) "The Ganzfeld Psi Experiment: A Critical Appraisal", Journal of Parapsychology 49, pp 3-50
KENNEDY, J.E., (1979) "Methodological Problems in Free-Response ESP Experiments", Journal of the American Society for Psychical Research, vol 73, pp 1-15
MURRAY, A. L., (2011) "The Validity Of The Meta-Analytic Method In Addressing The Issue Of Psi Replicability", Journal of Parapsychology, vol 75:2
PARKER, A., WIKLUND, N. (1987) "The ganzfeld experiments: towards an assessment", Journal of the Society for Psychical Research, 54, pp 261-265
STANFORD, R.G., (1984) "Recent Ganzfeld-ESP Research: A Survey and Critical Analysis", Advances in Parapsychology 4, pp 83-111
UTTS, J. (1999) "The Significance of Statistics in Mind-Matter Research", Journal of Scientific Exploration, Vol. 13, No. 4, pp 615-638
UTTS, J. (2010) "The Strength of Evidence Versus the Power of Belief: Are We All Bayesians?"
YORK, M. (1977). "The defense mechanism test (DMT) as an indicator of psychic performance as measured by a free-response clairvoyance test using a ganzfeld technique", Research in parapsychology, 1976, pp 48-49

Software used for statistics was Meta-Analysis 5.3 by Ralf Schwarzer


Johann said...

You say: "Eleven of the fourteen reported results in a numerical form, the other three simply said the experiment was unsuccessful so in my calculations, a z-score of zero was awarded."

Awarding a z-score of 0 to unsuccessful experiments ignores the potential that they were positive but not significant; perhaps you should merely have excluded the studies that reported no numerical values.

Max said...

Hi Ersby,

You wrote,

"In calculating each z-score, a binomial distribution is used. Since this is not applicable to experiments with small numbers of trials (Wikipedia suggests np less than 5, so I'll use that) remove all experiments with less that 20 trials. This reduces the unweighted z-score to 2.09 (1 in 54)."

But if you accept that the binomial distribution is valid for N > 5, then it is not valid to arbitrarily remove studies with less than 20 trials. How many studies among the 42 had N < 5?

Ersby said...

Hello, Johann.

There is a danger in ignoring experiments that give less detail about their results. Namely: less successful experiments tend to be reported in less detail than successful ones. To exclude experiments because they did not give enough information will almost certainly bias the result of a meta-analysis upwards.

Ersby said...

And hello Max.

I should apologise: my writing wasn't clear. The trouble was, in trying to use the mathematical "less than" sign, blogspot kept trying to parse it as an html tag. After a lot of trouble with that one sentence, my patience ran out and I left it a little less clear than it could be.

np < 5 means "number of trials" multiplied by "probability expected by chance". In the case of the ganzfeld, most experiments had a chance expectation of 0.25, so any experiment with n less than 20 would've been too small for binomial distribution.

Johann said...

Hi Ersby,

You wrote: "less successful experiments tend to be reported in less detail than successful ones. To exclude experiments because they did not give enough information will almost certainly bias the result of a meta-analysis upwards."

The experiments report that they were unsuccessful. Therefore, we know that they can each have one of three possible values: z < 0, z = 0, or z > 0. You may argue that z = 0 is the most balanced of these choices, but I personally don't see the rationale of this decision. Under the alternative hypothesis, for an effect with generally low power, it is expected that many experiments will "fail", but if you exclude these "failed" experiments (i.e. experiments which went in the predicted direction, but failed to reach significance), you defeat one of the main purposes of meta-analysis: to heighten the power to detect small effects which are not sufficiently robust to reach significance in individual experiments.

As for your contention that excluding the three studies would potentially bias the meta-analysis, I think it unlikely. Already unsuccessful experiments are not more likely to be reported in less detail if they have a z < 0; in fact, if they are significantly negative (the worst exclusion scenario), they would probably be analyzed in MORE detail, with several post hoc findings to boot.

Anyway, just some thoughts.

Ersby said...

Thanks for your comments, but I do think that the risk of excluding studies which do not fully report their work is far greater than the risk of researchers having a positive result but reporting it as "unsuccessful".

With the policy of reporting all results, good or bad, that parapsychologists generally follow, it is only human nature that negative results (even significantly so) are not written up with the same enthusiasm as positive results. And this is the pattern that I've seen in the literature.

I went back and looked at the three studies in question, just to find their exact wording.

Parker, Miller, Beloff (1977) wrote "Overall results in terms of ESP scores [...] were close to chance"

Stanford (1979) "The mean ESP score was close to chance"

Palmer, Whitson, Bogart (1980) "The overall mean ESP score was below chance"

Given this, I think that giving three experiments a z-score of zero is the most accurate option.

jt512 said...

Ersby, nice analysis demonstrating the sensitivity of the meta-analysis to the inclusion/exclusion criteria.

I just found your blog, and look forward to reading more articles.

jt512 said...

Regarding studies reported only as "unsuccessful," imputing a z-score of 0 tacitly assumes that the alternative hypothesis is true, because if the null hypothesis were true, the average unsuccessful trial would be the mean of the portion of the normal density below z=1.645, which would be negative. Any z-score above this value chosen to represent the average z-score of the unsuccessful studies (including some negative z-scores) therefore implies a true alternative hypothesis. Thus, the choice of 0 seems reasonable.
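The figure referred to above can be checked with a short sketch. It assumes that "unsuccessful" means a z-score below the one-tailed 5% cut-off of 1.645 and that z-scores follow a standard normal distribution under the null hypothesis. The mean of a standard normal truncated from above at a cut-off c is minus the density at c divided by the cumulative probability at c.

```python
from statistics import NormalDist

def mean_below_cutoff(cutoff):
    """Mean of a standard normal variable given that it is below cutoff."""
    std_normal = NormalDist()
    return -std_normal.pdf(cutoff) / std_normal.cdf(cutoff)

# Expected z of a non-significant study if the null hypothesis is true.
mean_unsuccessful = mean_below_cutoff(1.645)
print(mean_unsuccessful)
```

The result is slightly negative, roughly -0.11, which is consistent with the point that a z-score of 0 sits a little above the null expectation.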