**Pop Quiz:** What do women who eat cereal for breakfast each morning have in common?

A. They get a full day’s supply of 11 essential vitamins and minerals.
B. They enjoy better sex lives than women who don’t eat cereal for breakfast.
C. They are more likely to give birth to male children.
D. They make buying decisions based on the advice of cartoon leprechauns, silly rabbits and mustachioed mariners.

Actually, we don’t know anything about answers A, B and D. But we do know about C. When British scientists announced last year in the Proceedings of the Royal Society B that their research suggested, among other things, that eating breakfast cereal increases a woman’s chance of having a male child, they may have fallen prey to a statistical error that is a constant concern in the kinds of studies 23andMe uses to provide information about the genetic components of disease risk. This post explains the nature of that misstep, and how we go about avoiding it.

Where did the researchers go wrong? Their research looked at the consumption of 133 different foods by 749 English mothers before and during the course of their pregnancies, and found that among all of these foods, only breakfast cereal was associated strongly with having sons. Nothing wrong with that. The problem arose, a pair of statisticians writes in the current online issue of the Proceedings of the Royal Society B, when the original researchers neglected to account properly for the number of opportunities they had had^{1} to find a significant result.

In statistics, this is called the ‘multiple testing’ or ‘multiple comparisons’ problem, and it’s a very important concern for us at 23andMe. It’s pretty easy to get a feel for what can go wrong: suppose you were to flip a quarter five times, each time noting whether it came up heads. The odds that you would see heads all five times are not good: 31 to one against, or about a 3% chance. But if you repeated this ‘experiment’ 133 times, you would be very likely to see all-heads at least once, and most likely three to five times. So if you see all-heads a few times, it doesn’t really make sense to regard this outcome as unusual. In fact, it would be very unlikely not to see all-heads at least once^{2}. It’s something like the mathematical version of Tom Petty’s wistful refrain: *Even the losers get lucky sometimes.*
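To make that intuition concrete, here is a quick simulation of the coin-flip thought experiment. This is a sketch of ours, not part of the original analysis; the function name and seed are arbitrary.

```python
import random

random.seed(23)  # fixed seed so the sketch is reproducible

def run_experiments(n_experiments=133, flips=5):
    """Count experiments in which every flip of a fair coin lands heads."""
    all_heads = 0
    for _ in range(n_experiments):
        if all(random.random() < 0.5 for _ in range(flips)):
            all_heads += 1
    return all_heads

# With a per-experiment chance of 1/32, we expect about 133/32 ≈ 4
# 'lucky' all-heads experiments; repeat the whole thing to see it vary.
for trial in range(3):
    print(f"trial {trial}: {run_experiments()} all-heads experiments")
```

Run it a few times (or change the seed) and the all-heads experiments keep showing up, purely by chance.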

At 23andMe, the genetic association studies underlying our Health and Traits reports are saddled with the mother of all multiple testing problems. The basic idea behind these studies is pretty simple. They look at single-letter DNA variations known as SNPs in, say, 1,000 people who have a disease. Then they look at those same SNPs in another 1,000 people who are similar to the first thousand in as many ways as possible, except that they do not have that disease. If researchers were to look at a SNP and find, for example, that 80% of the people with the disease have the AA version, and only 30% without the disease do, then they’d be justified in suspecting that they’d found something significant.

But here’s the rub. These studies don’t look at just one SNP; they typically look at half a million to a million. A million is a lot more than 133; if association studies didn’t correct for multiple testing, every one of them would be expected to find tens of thousands of falsely associated SNPs mixed in with the really associated ones. That would not do.

Fortunately, there are many ways to correct for the multiple testing problem, ranging from dead-simple to math-degree-required. The top genetics journals will accept an association study only if the analysis has corrected for multiple testing using a standard method. Since we at 23andMe will consider a paper for use in Health and Traits only if it has come from a top journal (with some exceptions), we usually don’t have to worry about whether the studies we rely on are statistically sound. And we go a step further by insisting that any finding used in Health and Traits must have been replicated by at least two independent groups of researchers. This policy helps guard against both a possible failure of multiple testing correction and unintended error on the part of the researchers conducting a study.
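As an example of the dead-simple end of that range, here is a sketch of the Bonferroni correction, which just divides the per-test significance threshold by the number of tests. The SNP count and p-values below are made up for illustration; this is not a description of 23andMe’s actual pipeline.

```python
n_snps = 1_000_000
alpha = 0.05

# Without any correction, ~5% of truly unassociated SNPs would pass
# the conventional p < 0.05 bar by chance alone:
print(f"expected false positives, uncorrected: {n_snps * alpha:,.0f}")

# Bonferroni: require p < alpha / (number of tests) for any single SNP.
bonferroni_threshold = alpha / n_snps
print(f"Bonferroni threshold: {bonferroni_threshold:.0e}")

def significant_after_bonferroni(p_values, alpha=0.05):
    """Return indices of tests that survive the Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]

# Three hypothetical p-values from three tests; only the first and
# third clear the corrected threshold of 0.05 / 3.
print(significant_after_bonferroni([0.001, 0.04, 0.0001]))
```

The resulting per-SNP threshold of about 5 × 10⁻⁸ is, in fact, close to the conventional “genome-wide significance” cutoff used in association studies.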

So how did the breakfast cereal-boy baby study make it into print? That remains a bit of a mystery to this reporter-slash-scientist. But anyone who is dreaming of having their own baby boy will just have to rely on alternative types of lucky charms.

1. The original paper looked at the women’s diets before pregnancy and in both early and later pregnancy, and tested each food in each time interval, so they appear to have conducted 399 (133 × 3) separate tests. They report a significant association between consumption of breakfast cereal and bearing sons only for the ‘before pregnancy’ and ‘later pregnancy’ periods, so it is unclear whether they also tested the ‘early pregnancy’ period. If they tested only ‘before’ and ‘later’, which seems odd, that would mean they conducted 266 separate tests. Either number of tests, 266 or 399, suffices to cast serious doubt on the soundness of their result.

2. How unlikely? We want to know: what is the chance that in 133 five-coin-flip ‘experiments’, we never see all-heads? Given a fair coin, the probability of getting five heads in five flips is 1/32 (= (1/2)^5), about 3.1%. Therefore, the probability of *not* seeing all-heads in one experiment is one minus that, or 31/32, about 96.9%. This is exactly the chance of seeing at least one tail in five flips. The chance that you see at least one tail in each of 133 experiments is then (31/32)^133, or about 1.47%. If you doubled the number of experiments to 266, the chance of never seeing all-heads (that is, seeing at least one tail in every experiment) would be (31/32)^266, or about 0.021%. This fits with your intuition: as the number of experiments goes up, the chance of never seeing an all-heads outcome goes down, because there are more opportunities for it to occur.
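For the curious, the footnote’s arithmetic can be checked in a few lines, assuming a fair coin and independent flips and experiments:

```python
p_all_heads = 0.5 ** 5                 # 1/32, about 3.1%
p_at_least_one_tail = 1 - p_all_heads  # 31/32, about 96.9%

# Chance of never seeing an all-heads experiment:
p_never_133 = p_at_least_one_tail ** 133
p_never_266 = p_at_least_one_tail ** 266

print(f"P(all heads in one experiment):     {p_all_heads:.1%}")
print(f"P(no all-heads in 133 experiments): {p_never_133:.2%}")
print(f"P(no all-heads in 266 experiments): {p_never_266:.3%}")
```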