On Finding Needles in Genetic Haystacks


Nearly a year ago, the Spittoon reported that researchers had figured out how to pick a known person’s DNA out of a database containing genetic information for up to 200 unidentified individuals. The feat made it possible to determine, for example, whether a particular suspect’s DNA was present in a mix of several people’s DNA collected at a crime scene. It also raised the possibility, at least theoretically, that somebody could figure out whether a particular person’s DNA was part of a research database or some other pool of anonymous data.

Now another team of researchers has calculated the limits of this needle-in-a-haystack technique. It turns out that if the number of data points revealed is small enough, an individual’s DNA can remain undetectable.

Before UCLA geneticist Stan Nelson and his colleagues published their paper last year, it had been common practice to publish summary data from research DNA collections so members of the genetics community could build on each other’s research. (Note: 23andMe does not share individual-level data without explicit consent, and is working on ways of distributing summary data without compromising privacy.) It was assumed that aggregation would make it impossible to tell whether a particular individual had contributed to a collection. But in the wake of the announcement of the new detection method, major genome centers swiftly removed summary data from public view to avoid the possibility of compromising the identity of any research participants.

In a paper published today in Nature Genetics, researchers from UC Berkeley take the first step towards a less extreme route to protecting individuals’ identities than simply withholding all information. The researchers developed methods to calculate precisely how much genetic information from a research DNA collection may be revealed without risk of exposing the identities of the study participants.

At the heart of the PLoS Genetics paper that started the hubbub was a statistical test that the authors used to show that they could reliably tell whether someone was present in a DNA mixture. The new study not only provides an improved version of this test, but proves that the new approach is the best of all possible detection methods. Since no better test could be devised, if theirs is unable to detect whether an individual was part of a mixture of DNA, they can be confident that no test would be able to do it.
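
For readers who like to see the idea in code, here is a minimal sketch of how a membership test of this general kind can work. It is not the test from either paper. It is a simplified illustration that asks whether a person’s alleles sit closer to the pool’s published allele frequencies than to those of a reference population. The function name membership_score, its parameters, and the toy numbers are all assumptions made up for this example.

```python
from math import copysign, sqrt


def membership_score(genotype, pool_freq, ref_freq):
    """Illustrative score (hypothetical, not the published test).

    genotype  : per-SNP allele fraction for one person (0.0, 0.5 or 1.0)
    pool_freq : per-SNP allele frequency published for the pooled data
    ref_freq  : per-SNP allele frequency in a reference population

    Large positive values suggest the person looks more like the pool
    than like the reference population.
    """
    if not (len(genotype) == len(pool_freq) == len(ref_freq)):
        raise ValueError("all inputs must cover the same SNPs")
    n = len(genotype)
    if n < 2:
        raise ValueError("need at least two SNPs")

    # Per-SNP evidence: positive when the person is closer to the pool.
    diffs = [
        abs(g - r) - abs(g - p)
        for g, p, r in zip(genotype, pool_freq, ref_freq)
    ]

    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        # Every SNP gave identical evidence; no spread to scale by.
        return 0.0 if mean == 0 else copysign(float("inf"), mean)

    # Scale the average evidence by its standard error (a t-like statistic).
    return mean / sqrt(var / n)


# Toy data (made up): the pool frequencies lean toward this person's alleles.
person = [1.0, 0.0, 0.5, 1.0]
pool = [0.6, 0.4, 0.5, 0.7]
reference = [0.5, 0.5, 0.5, 0.5]
score = membership_score(person, pool, reference)
```

Each SNP contributes only a sliver of evidence, and the score grows as more SNPs are added. That is why the number of data points released matters so much for whether an individual can be detected.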

Armed with this certainty, the authors explore just how much information from a research DNA collection may be reported without risk of divulging who was in the study. It turns out that for a collection of 1,000 individuals, up to 10,000 individual data points can be revealed with little risk of exposing anyone to identification.

This makes it somewhat easier for geneticists to share their data. But these days an individual’s genetic data typically consists of a genome-wide panel of 500,000 to a million single-letter DNA variations known as SNPs. Ten thousand data points is only 1 to 2 percent of such a panel, so the problem still isn’t solved completely.
