Helping Expand Diversity in Genetic Health Research

Scientists with 23andMe and Google have collaborated to develop an imputation reference panel using 23andMe’s African American Sequencing Project to help expand diversity in genetic health research. This new reference panel enables researchers to more accurately identify associations between genetic variants and traits or health conditions within Black Americans and other non-European populations.

“This work will help address disparities in genetic research, as researchers outside of 23andMe will be able to use this data to inform their studies,” said Jared O’Connell, a 23andMe senior scientist and lead author of the paper.

Expanding Analysis

All participating individuals were contacted and specifically consented to sharing their sequencing data via dbGaP with qualified researchers.

An imputation reference panel consists of genome-wide variation discovered through whole-genome sequencing in a set of individuals. Researchers often use such reference panels to perform genotype imputation. Unlike whole-genome sequencing, genotyping microarrays do not read the entire genome; they look only at hundreds of thousands of sites that are known to vary between individuals. That leaves hundreds of millions of variable sites that microarrays do not assay.

Genotype imputation is the process of inferring genotypes at many of these missing sites. Much like a code breaker filling in missing letters in a message, scientists can use algorithms and data from a whole-genome sequenced reference panel to predict, or impute, the missing letters of genetic data. This allows large-scale genetic studies to expand the number of genetic variants that can be analyzed. But currently, most publicly available reference panels consist predominantly of people of European ancestry, which limits their applicability to other populations.
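To make the code-breaker analogy concrete, here is a deliberately simplified sketch in Python. It is not the method used in the paper or in any production imputation tool, which rely on statistical models over much larger panels. The function name `impute_from_panel` and the toy sequences are made up for illustration. The sketch picks the reference haplotype that agrees with the sample at the most typed sites and copies its letters into the gaps.

```python
from typing import List, Optional


def impute_from_panel(target: List[Optional[str]], panel: List[str]) -> str:
    """Fill untyped sites (None) by copying from the best-matching reference haplotype.

    Toy illustration only: real imputation software models the sample as a
    mosaic of many reference haplotypes rather than copying a single one.
    """

    def matches(hap: str) -> int:
        # Count typed sites where the reference haplotype agrees with the sample.
        return sum(1 for t, r in zip(target, hap) if t is not None and t == r)

    # Choose the reference haplotype with the most agreements at typed sites.
    best = max(panel, key=matches)

    # Keep the sample's own alleles; borrow the reference allele where untyped.
    return "".join(t if t is not None else r for t, r in zip(target, best))


# Hypothetical reference panel of three fully sequenced haplotypes.
panel = ["ACGTAC", "ACTTGC", "GCGTAA"]

# A microarray-style sample: only sites 1, 3 and 5 were assayed.
target = ["A", None, "T", None, "G", None]

imputed = impute_from_panel(target, panel)
print(imputed)  # ACTTGC
```

The sketch also hints at why panel composition matters: if no haplotype in the panel resembles the sample, the best available match is still a poor one, and the copied letters are more likely to be wrong.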

A Reference Panel for Research

In this effort, described in “A population-specific reference panel for improved genotype imputation in African Americans” published this month, the teams built an imputation reference panel with high levels of African ancestry. It is the first peer-reviewed publication derived from 23andMe’s African American Sequencing Project. The panel includes genetic sequence data from more than 2,000 volunteers residing in the US who identified as Black. With their consent, 23andMe created a reference dataset for research that consists of a de-identified reference panel and associated raw sequencing data. Including more individuals with non-European ancestry in the panel improves imputation in non-Europeans. For comparison, the commonly used Haplotype Reference Consortium panel consists of 27,166 individuals who are predominantly of European descent. That panel includes a subset of 2,001 individuals from the 1000 Genomes Project, and just 661 of these have substantial African ancestry.

“This biased reference panel composition generally leads to substantially poorer imputation quality for non-Europeans than for Europeans,” according to the paper.

DeepVariant

Additionally, the team leveraged Google’s state-of-the-art variant caller, DeepVariant. This allowed the researchers to further improve both the accuracy and richness of genetic variation in the reference panel. The team compared DeepVariant with two versions of the popular GATK caller. DeepVariant was more accurate at single-sample calling and also produced better-performing imputation panels. The team also ran extensive comparisons of genotype refinement and phasing software to establish a useful pipeline for creating imputation panels with DeepVariant.

“The best practices for creating imputation panels we outline here are broadly applicable for improving imputation quality in non-European populations,” said senior author and Google staff software engineer Cory McLean.

This reference dataset was uploaded to the National Institutes of Health’s dbGaP. It has already helped researchers studying conditions such as lymphoma, neurodegenerative disease, and cardiovascular disease. The reference dataset is also available to other qualified and vetted genetic scientists. It is meant to help expand research in Black Americans and other populations.

The team also published a blog post about the research here.