[Update: This study was formally retracted by Science on July 22, 2011]
A genome-wide association study of extreme longevity published last week in the journal Science has been receiving a lot of press attention. The results are quite extraordinary: the authors identify 70 loci with genome-wide significant evidence for association with living past the age of 100, and they construct a SNP-based model for predicting exceptional longevity (EL) that has 77% accuracy in an independent set of individuals. We were initially very excited by the article and thought it would be of great interest to the 23andMe community. However, after a closer reading of the article and supporting materials, we think this study inadvertently illustrates some of the pitfalls in analyses of genome-wide datasets.
There are several reasons for skepticism about these new results. Another recent genome-wide study has reported no significant associations with longevity. There is suggestive evidence of genotyping quality control problems in the new results, and some routine quality control checks do not appear to have been done. The design of the study is particularly susceptible to introducing biases into the results. And a preliminary analysis of the proposed 150-SNP model for predicting longevity indicates that it is not predictive in the 23andMe community.
We expect that most of the results of this study will not have the same longevity as its participants. In genetics, as with most things in life, if a result seems too good to be true, it probably is. That said, this study does contain some interesting tidbits, such as the association of a SNP near the ApoE gene with longevity. This SNP has previously been shown to be associated with Alzheimer’s disease. For the time being we won’t be incorporating the data from this new study into our Personal Genome Service or putting information about any of the other SNPs here in the Blog. But we will be on the lookout for other attempts to replicate the study’s findings. If and when such a replication is published, we’ll scrutinize it like all of the papers we cover, and we’ll let you know what we find.
(A more technical discussion of the issues follows)
– A large study combining results of four genome-wide association studies of longevity was published in May in the Journals of Gerontology. That study found no associations meeting their pre-specified criteria for genome-wide significance. While they used a more inclusive phenotype (age 90 or older), it is surprising that there could be so many loci associated with survival to age 100 in the new study, some with very large effect sizes, yet none were found in the larger study from earlier this year.
– An important part of designing an experiment is choosing the criteria that will constitute convincing evidence of a positive finding. We have to make a trade-off between setting the bar so high that we miss many interesting and true results (false negatives) and setting the bar so low that we get many spurious findings (false positives). In the new study, the authors use an unusually permissive standard for genome-wide significance that appears to allow a false positive rate of 6 per 100,000 SNPs tested, or 18 expected false positives across the whole study (which implies roughly 300,000 SNPs tested). A more conventional standard is to control the probability of any false positive at 0.05 genome-wide, meaning that we expect only a 5% chance of even one spurious finding across all of the SNPs tested.
– None of the strong associations appear to be supported by any evidence from nearby SNPs. Each of the reported associations stands alone, but typically, we expect that nearby SNPs will show some intermediate evidence for association, because nearby SNPs tend to be correlated. This is a red flag because genotyping quality issues can produce these kinds of uncorrelated association signals.
– Many of the associated loci have high rates of missing genotypes in the EL individuals compared to the controls. For the 70 genome-wide-associated SNPs, the median missing data rate in EL samples was 9%, compared to 3% in controls. Some of the SNPs with high EL call rates appear to have large deviations from Hardy-Weinberg equilibrium in the EL group. For instance, the two SNPs with the strongest evidence for association are far out of equilibrium (P < 10^-20). Both the differential missingness and the Hardy-Weinberg deviations are suggestive of data quality problems that can produce false associations, though there can be other valid explanations for the Hardy-Weinberg results.
– While the authors performed a replication of their results in an independent set of EL samples, the replication may share some biases with their initial genome scan. The authors drew most of their controls for their initial scan from a reference dataset, and used this same source of controls in their replication. And both sets of EL samples were genotyped with the same method. They show that substituting an alternate set of controls makes little difference, but this does not rule out a genotyping issue in the EL data.
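To make the Hardy-Weinberg check mentioned above concrete, here is a minimal sketch of the standard one-degree-of-freedom chi-square test for a single SNP. The function name and the genotype counts are hypothetical and chosen only for illustration; they are not taken from the study, and real pipelines often use an exact test instead.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi-square statistic, p-value) for a 1-df Hardy-Weinberg test.

    n_aa, n_ab, n_bb are observed counts of the three genotypes at one SNP.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg equilibrium
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a SNP with no heterozygotes called at all,
# the kind of pattern that poor genotype clustering can produce.
print(hwe_chi_square(50, 0, 50))
# Hypothetical counts that match equilibrium exactly.
print(hwe_chi_square(25, 50, 25))
```

A very small p-value in one group only (for example, in cases but not controls) is a hint to look at the raw genotype clusters before trusting an association at that SNP.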
We took a preliminary look in our customer data to see if the proposed SNP-based model described in Sebastiani et al. is predictive of exceptional longevity. A commonly used measure of test discrimination is to calculate how often, for a randomly selected case and control, a test correctly assigns a higher score to the case. This is known as the “c statistic” or “area under the curve”. The authors of the new study say their model scored a 0.93 for this statistic. But when we compared 134 23andMe customers with age ≥ 95 to more than 50,000 controls, we obtained a test statistic of 0.532, with a 95% confidence interval from 0.485 to 0.579. Using 27 customers with age ≥ 100, we get a value of 0.540, with a 95% confidence interval from 0.434 to 0.645. A random predictor of longevity would give a 0.5 on this scale, so based on our data, performance of this model is not significantly better than random. Even with our small sample size, we can also clearly exclude values as high as the published result of 0.93.
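The c statistic described above can be computed directly from its definition. The sketch below is only an illustration of that definition, not a description of how the numbers in this post were produced; the function name and the scores are hypothetical, and ties are counted as half a win, which is a common convention.

```python
def c_statistic(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case gets the higher score.

    Ties count as half a correct ordering. A value of 0.5 means the score
    orders cases and controls no better than chance.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical model scores for three cases and four controls.
cases = [0.9, 0.6, 0.4]
controls = [0.5, 0.3, 0.7, 0.2]
print(c_statistic(cases, controls))  # 9 of 12 pairs are ordered correctly: 0.75
```

The double loop is fine for a few hundred cases against tens of thousands of controls; for much larger sets, a rank-based calculation gives the same value more quickly.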
Study designs that use independently collected control genotype data require extra attention to quality control to rule out the possibility of systematic differences in genotyping between cases and controls. In any experiment that tests hundreds of thousands of SNPs, a small proportion of SNPs can be expected to have problems with automated genotype assignment. This may not be a problem if cases and controls are affected equally, but if cases and controls are genotyped separately, then errors or missing genotypes may be concentrated in just the cases or just the controls, and these can give the appearance of a relationship between genotypes and phenotypes that is actually an experimental artifact.
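One routine check for this kind of artifact is to test, SNP by SNP, whether the missing-genotype rate differs between cases and controls. The sketch below uses a simple two-proportion z-test as one possible way to do that; the function name and the sample sizes are hypothetical, with the 9% and 3% rates chosen to echo the medians quoted earlier.

```python
import math

def missingness_z_test(miss_cases, n_cases, miss_controls, n_controls):
    """Two-sided two-proportion z-test comparing missing-genotype rates.

    Returns (z, p_value). A positive z means more missing calls in cases.
    """
    rate_cases = miss_cases / n_cases
    rate_controls = miss_controls / n_controls
    pooled = (miss_cases + miss_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    if se == 0:
        # No missing calls (or all missing) in both groups: nothing to compare.
        return 0.0, 1.0
    z = (rate_cases - rate_controls) / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical SNP: 9% missing in 1,000 cases versus 3% missing in 1,000 controls.
z, p = missingness_z_test(90, 1000, 30, 1000)
print(z, p)
```

SNPs that fail a check like this are usually filtered out or inspected by hand before association testing, since the missing calls are rarely missing at random with respect to genotype.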
The results could be strengthened if the authors could inspect the raw data for the associated SNPs to verify that genotypes are being assigned consistently. Differential missingness in the EL samples is a major issue because if it is the result of poor clustering, it will almost always tilt the apparent genotype frequencies of affected SNPs. If the clustering does suggest that there is a “batch effect”, then there are some strategies that can be used to rescue the analysis. One approach is to aggressively filter the data using strict quality criteria and manual inspection of putative associations. Another approach is to directly model the batch effect and then test for association conditional on the batch structure of the data. This only works if the batch effect is not perfectly confounded with the phenotype. If it is perfectly confounded, the only resolution may be to use another technology to verify genotypes of associated SNPs and fill in missing values.
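As an illustration of testing for association conditional on batch, the sketch below uses a Cochran-Mantel-Haenszel test (without continuity correction) stratified by genotyping batch. This is one simple choice among several, logistic regression with a batch covariate being another, and nothing here is taken from the study: the function name and allele counts are hypothetical. It also shows why perfect confounding is fatal: a batch that contains only cases or only controls contributes no information, so if every batch is like that the test cannot be computed.

```python
import math

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test over a list of 2x2 tables, one per batch.

    Each table is (a, b, c, d): a, b = counts of allele A and allele B in
    cases; c, d = counts of allele A and allele B in controls.
    Returns (statistic, p_value), or None if no batch is informative.
    """
    diff = 0.0  # sum over batches of observed minus expected count in cell a
    var = 0.0   # sum over batches of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue
        row_cases, row_controls = a + b, c + d
        col_a, col_b = a + c, b + d
        diff += a - row_cases * col_a / n
        var += row_cases * row_controls * col_a * col_b / (n * n * (n - 1))
    if var == 0:
        # Every batch holds only cases or only controls (or only one allele).
        return None
    stat = diff * diff / var
    # Upper tail of a chi-square with 1 df
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical: two batches, each containing both cases and controls.
print(cmh_test([(30, 20, 25, 25), (28, 22, 24, 26)]))
# Hypothetical: cases and controls genotyped in entirely separate batches.
print(cmh_test([(60, 40, 0, 0), (0, 0, 50, 50)]))  # None
```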
Addendum:
We repeated our analysis restricted to individuals with European ancestry. The results were similar: for 129 customers with age ≥ 95 and more than 43,000 controls, we got a test statistic of 0.534, and for 26 customers with age ≥ 100, we got a value of 0.558. In both cases, the 95% confidence interval includes 0.5.