Monday, June 01, 2009

Lies, damned lies and statistics

Matt Taibbi picks up the story about the study noting:
a highly positive correlation between dealer survival and Clinton donors[.] Granted, that P-Value (0.125) isn’t enough to reject the null hypothesis at 95% confidence intervals (our null hypothesis being that the effect is due to random chance), but a 12.5% chance of a Type I error in rejecting a null hypothesis (false rejection of a true hypothesis) is at least eyebrow raising.

(Taibbi's article is noteworthy for the quip, "Hell, if you want to punish a Chrysler dealer, it seems to me that the best thing to do is force him to keep trying to sell Chryslers.")

Finding such a p-value "eyebrow raising" reveals an inexcusable ignorance of statistics. First of all, the 95% confidence level is not a high bar or a "gold standard". It is, rather, a relatively low standard, a rule of thumb to indicate whether some correlation is worth investigating further. One in twenty studies where the null hypothesis is actually true will find results good to 95% by chance. A p-value of one in eight, which fails even this low bar, indicates the correlation is not worth investigating further.

Second, the original authors admit they "matched dealer data against several variables including" (but presumably not limited to) seven specific criteria (party affiliation, donations to three candidates and "other", donation amount and zip code). When you compare several variables, you are doing several studies. Even assuming they calculated only seven different possible correlations, the probability that one of them would have achieved a p-value of 12.5% by chance is extremely high. There are statistical tests, such as Tukey's test, that correctly account for doing multiple comparisons. The authors do not report the results of any multiple comparison analyses.
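To make the multiple-comparisons point concrete, here is a minimal simulation sketch. It assumes (purely for illustration; the study's actual variables are surely not independent) seven independent comparisons on data where the null hypothesis is true, and shows how often at least one of them reaches p ≤ 0.125 by chance, along with the Bonferroni-corrected threshold the authors would have needed:

```python
import random

random.seed(42)

trials = 100_000
comparisons = 7   # the seven criteria the authors matched against
alpha = 0.125     # the p-value called "eyebrow raising"

# Under a true null hypothesis a p-value is uniformly distributed
# on [0, 1], so random.random() simulates one comparison's p-value.
hits = sum(
    any(random.random() <= alpha for _ in range(comparisons))
    for _ in range(trials)
)
print(f"P(at least one p <= {alpha} across {comparisons} tests): {hits / trials:.3f}")
# Analytically: 1 - (1 - 0.125)**7 ≈ 0.607

# The Bonferroni correction divides the threshold by the number of
# comparisons to restore the intended family-wise error rate.
print(f"Bonferroni-corrected threshold: {alpha / comparisons:.4f}")
```

In other words, running seven comparisons makes a "one in eight" result more likely than not, which is why uncorrected sub-group p-values are uninformative.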

I realize that even my two-week tutelage under a statistician gives me a better understanding of statistics than most scientists (and perhaps many professional statisticians), but really, it's indefensible, and evidence of nothing but statistical illiteracy, to see this study as having anything but a completely negative result.

Update: The hypothesis that the Obama administration would favor Clinton donors (p = 0.125) more strongly than Obama donors (p = 0.509), and treat Republican (p = 0.636) and Democratic (p = 0.676) donors equally, is wildly implausible. It's hard to interpret drawing a causal conclusion as anything but incompetence or dishonesty.

1. My grad thesis adviser insisted we set our confidence level at something like 99%; otherwise we had to assume there was too much chance involved to reject the null. She was fucking hardcore.

2. Well said. A complete and utter ignorance of statistical significance (and what it means) is one of the things that irks me to no end.

3. A statistical side note: A p-value is not a measure of effect size. It's determined by multiple variables and does not by itself tell us anything at all about how large a putative effect might be.

4. I was taught that the p-value is the probability of some statistic occurring by chance in choosing the sample, assuming a normal distribution of the population.

Assuming (by pure charity) the authors have calculated the p-value both meaningfully and accurately, by definition it's very likely that one of their seven tests would have shown a p-value of 0.125 (1/8) by chance regardless of any other considerations.
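The "very likely" claim can be made precise. Assuming (again, purely for illustration) seven independent tests, each with a 1/8 chance of hitting p ≤ 0.125 under a true null:

```python
# Seven independent tests, each with a 1/8 chance of showing
# p <= 0.125 when the null hypothesis is true. The chance that
# at least one of them does so:
p_single = 1 / 8
p_at_least_one = 1 - (1 - p_single) ** 7
print(f"{p_at_least_one:.3f}")  # prints 0.607 -- better than even odds
```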

5. Note that my use of "more strongly" was pure metaphor, without any statistical meaning implied.

6. Assuming (by pure charity) the authors have calculated the p-value both meaningfully and accurately, by definition it's very likely that one of their seven tests would have shown a p-value of 0.125 (1/8) by chance regardless of any other considerations.
Yup, that's the issue of multiple tests on the same dataset you rightly identified in the OP. It's a common and insidious error.

7. Steve Jones, 6/11/09, 1:38 PM

Attempting to find correlations on several different hypotheses after the event is called sub-group analysis and is completely erroneous (at least at the same confidence level). Come up with your theory in advance, test it, and come up with a 95% confidence level, and you might, just might, have something. Test the same data against 20 different theories, come up with a 95% confidence level on one, and very likely you've found nothing at all, as you would, on average, expect one such result by chance. Look hard enough, try enough theories, and you can match any level of confidence.

So it is most likely complete rubbish.
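Steve's arithmetic checks out. A quick sketch, again assuming the 20 tests are independent:

```python
# Twenty independent tests, each at the 95% confidence level.
n_tests = 20
alpha = 0.05

# On average you expect one spurious "significant" result...
expected_false_positives = n_tests * alpha
print(expected_false_positives)  # prints 1.0

# ...and the chance of getting at least one is nearly two in three.
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"{p_at_least_one:.2f}")   # prints 0.64
```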

Please pick a handle or moniker for your comment. It's much easier to address someone by a name or pseudonym than simply "hey you". I have the option of requiring a "hard" identity, but I don't want to turn that on... yet.

With few exceptions, I will not respond or reply to anonymous comments, and I may delete them. I keep a copy of all comments; if you want the text of your comment to repost with something vaguely resembling an identity, email me.

No spam, pr0n, commercial advertising, insanity, lies, repetition or off-topic comments. Creationists, Global Warming deniers, anti-vaxers, Randians, and Libertarians are automatically presumed to be idiots; Christians and Muslims might get the benefit of the doubt, if I'm in a good mood.

See the Debate Flowchart for some basic rules.

Sourced factual corrections are always published and acknowledged.

I will respond or not respond to comments as the mood takes me. See my latest comment policy for details. I am not a pseudonymous-American: my real name is Larry.

Comments may be moderated from time to time. When I do moderate comments, anonymous comments are far more likely to be rejected.