Machine Learning and Genetics

Editor’s note: This post first appeared in the quarterly magazine the DNA Decoder, a publication written by students, for students. The magazine is designed to help students across the nation connect with each other and share interesting ideas around genetics.

By Mahir Jethanandani

The human genome contains nearly three billion base pairs of genetic material, which, if written out, would fill over 200 New York City telephone books (averaging 1,000 pages each).^[1]
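To get a feel for the scale of that comparison, here is a quick back-of-the-envelope check in Python. It assumes one printed character per base pair and exactly 200 books of 1,000 pages; both are simplifying assumptions for illustration, not figures taken from the cited source.

```python
# Rough check of the phone-book comparison, assuming one printed
# character per base pair (an illustrative assumption).
BASE_PAIRS = 3_000_000_000   # approximate size of the human genome
BOOKS = 200                  # number of phone books
PAGES_PER_BOOK = 1_000       # average pages per book

total_pages = BOOKS * PAGES_PER_BOOK
chars_per_page = BASE_PAIRS // total_pages

print(total_pages)      # 200000
print(chars_per_page)   # 15000
```

Under these assumptions, every one of the 200,000 pages would need to hold about 15,000 letters of A, C, G, and T.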

Working with datasets as large as the human genome requires scientists to use cutting-edge technology, both to sequence the data and to analyze what makes them so interesting.

The human genome is not only extremely large but also remarkably complex: there are roughly 20,000 genes and even more regions that control how these genes are expressed.

Small variations in these genes and regulatory regions are ultimately what make each of us unique (and, unfortunately, sometimes result in disease).

The effects of these small variations, especially when they occur in combination with one another, are often difficult to identify.

Although the Human Genome Project provided a wealth of information about the genetic material that makes up humans, more than a decade later scientists are still working to identify the connections between genotype and phenotype.

WHAT IS MACHINE LEARNING?

Machine learning is a modern tool that has become increasingly popular for identifying patterns and connections in large datasets. Broadly speaking, machine learning is a type of artificial intelligence in which computers are programmed to improve their performance on a task, or to “learn” on their own, from a starting dataset that they use to recognize important patterns.

One popular example of machine learning is IBM’s Watson computer, which outperformed even the best human contestants on the quiz show Jeopardy!^[2] Machine learning has many applications in the modern world, and one very exciting application is finding patterns in personal genomic data.

HOW CAN MACHINE LEARNING BE APPLIED TO EXPLORING THE HUMAN GENOME?

In the context of personal genomics (the study of an individual’s unique set of DNA), machine learning can help find, in a more automated fashion, patterns in how small variations in genes and regulatory regions result in phenotypic changes (traits, wellness, and health).

For example, knowing which genetic variants are commonly shared in individuals with traits of interest, like diabetes or hemophilia, allows computer scientists to leverage machine learning to more efficiently pinpoint where in the genome (and potentially why) these disorders may occur.

Entire companies and research departments across the globe dive into machine learning in the hopes of finding common patterns between people’s DNA and traits or disease.

Machine learning can help us identify underlying genetic factors for certain diseases by looking for genetic patterns amongst people with similar medical issues.
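As a very simplified sketch of this idea, the Python snippet below compares how often each variant appears in people with a condition versus people without it. The variant names, the two groups, and the scoring rule (a plain difference in carrier frequency) are all hypothetical and chosen only for illustration; real studies use far larger cohorts and proper statistical tests.

```python
from collections import Counter

# Hypothetical toy data: each person is the set of variant IDs they carry.
# The variant names and group memberships are invented for illustration.
cases = [
    {"varA", "varB"},
    {"varA", "varC"},
    {"varA", "varB", "varD"},
    {"varA"},
]
controls = [
    {"varB"},
    {"varC", "varD"},
    {"varB", "varC"},
    {"varA", "varD"},
]

def carrier_frequency(people):
    """Return the fraction of people carrying each variant."""
    counts = Counter(v for person in people for v in person)
    return {variant: n / len(people) for variant, n in counts.items()}

def rank_by_enrichment(cases, controls):
    """Sort variants by (frequency in cases - frequency in controls)."""
    case_freq = carrier_frequency(cases)
    control_freq = carrier_frequency(controls)
    variants = set(case_freq) | set(control_freq)
    diffs = {
        v: case_freq.get(v, 0.0) - control_freq.get(v, 0.0)
        for v in variants
    }
    # Largest difference first; ties are broken alphabetically.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_enrichment(cases, controls)
for variant, diff in ranking:
    print(f"{variant}: {diff:+.2f}")
```

In this toy dataset, varA ranks first because every person in the case group carries it, while only one person in the control group does.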

Many new insights into the human genome can be attributed to machine learning. For example, unsupervised learning, a type of machine learning, can cluster genes by their expression in cells and tissues, which helps reveal connections between genotypic and phenotypic patterns.
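To make the clustering idea concrete, here is a minimal sketch of k-means, one common unsupervised learning method, written in plain Python. The gene names and expression values are invented, and the starting centroids are picked by hand; this is a toy under those assumptions, not the method used in any particular study.

```python
import math

# Hypothetical expression levels for six genes across three tissues.
# Gene names and numbers are invented for illustration.
expression = {
    "gene1": [9.0, 1.0, 1.2],
    "gene2": [8.5, 0.8, 1.0],
    "gene3": [9.2, 1.1, 0.9],
    "gene4": [1.0, 7.9, 8.1],
    "gene5": [0.9, 8.2, 7.8],
    "gene6": [1.2, 8.0, 8.3],
}

def mean_point(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(data, centroids, iterations=10):
    """Plain k-means: assign each gene to its nearest centroid, then update."""
    labels = {}
    for _ in range(iterations):
        # Assignment step: label each gene with the index of the closest centroid.
        for name, vec in data.items():
            distances = [math.dist(vec, c) for c in centroids]
            labels[name] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [data[n] for n, lab in labels.items() if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean_point(members)
    return labels

# Use two of the genes as starting centroids (a simple, fixed choice).
start = [list(expression["gene1"]), list(expression["gene4"])]
labels = kmeans(expression, start)
print(labels)
```

No labels are given to the algorithm, yet it places gene1 through gene3 (high in the first tissue) in one cluster and gene4 through gene6 (high in the other two tissues) in the other.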

Machine learning can also be applied to improve sequencing methodologies. One such project is called DeepVariant.

DEEPVARIANT AS A RISING STAR

As our understanding of genetics grows, new tasks emerge to be solved.

Next-generation sequencing aims to reduce the time and resources required to read and digitize a person’s genomic sequence.

The repetitive task of genome sequencing can be optimized through the application of machine learning, especially when used for next-generation sequencing. Current sequencing methods are error-prone: they can misread parts of the DNA and make other crucial mistakes, which complicates our ability to connect genotype to phenotype.
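One simple way to see how misreads can be corrected is to sequence the same region several times and take a majority vote at each position. The sketch below does exactly that on made-up reads; it is a toy illustration, and tools such as DeepVariant rely on far more sophisticated models than a plain vote.

```python
from collections import Counter

# Hypothetical aligned reads covering the same short region.
# Two of the reads contain a simulated sequencing error.
reads = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",  # misread at index 2 (C instead of G)
    "ACGTTAGA",  # misread at index 7 (A instead of C)
    "ACGTTAGC",
]

def consensus(aligned_reads):
    """Pick the most common base at each position across the reads."""
    result = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)

print(consensus(reads))  # ACGTTAGC
```

Because each error shows up in only one of the five reads, the vote recovers the underlying sequence. The approach breaks down when coverage is low or errors are systematic, which is part of why better methods are needed.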

The Food and Drug Administration hosted the PrecisionFDA Truth Challenge in April 2016, which aimed to reduce the impact of errors in human genome sequencing.^[3]

Google Research presented DeepVariant, its solution for analyzing next-generation sequencing data.

DeepVariant went on to win the top awards for the advancement of next-generation sequencing.

DeepVariant outperformed the Genome Analysis Toolkit (GATK), a popular genomic tool, by applying deep learning to the analysis of sequencing data.^[4]

Without getting bogged down in the technical details, deep learning frameworks like TensorFlow and PyTorch allow companies like Google and Verily Life Sciences (whose researchers created DeepVariant) to improve the speed and accuracy of sequencing.

DeepVariant uses a subfield of machine learning called deep learning, in which multi-layered neural networks learn to recognize patterns in data directly from examples.

Such computation is difficult to understand, and the task of training the computer systems to learn “properly” is all the more difficult.

DEFINITIONS

Machine learning – a type of artificial intelligence that can be used to find patterns in data.

Unsupervised learning – a discipline of machine learning that learns from data without explicit labelling.

Genotype – the unique heritable genetic material of an individual (the usage of this term can refer to a single base pair all the way up to the entire genome or the entire set of DNA in a human).

Phenotype – the observable, physical characteristic(s) of an individual (can be trait, wellness, or health).

Human Genome Project – an international genomics project aimed at determining the first complete sequence of human DNA.

LOOKING FORWARD

DeepVariant and advancements in popularizing personal genomics come together to expand the applications of machine learning. Moreover, companies are opening “app stores” that let other scientists and genetics enthusiasts explore their own genomes in relation to health and livelihood.

As a scientific community, we are taking steps forward to connect genotype to phenotype, despite many challenges. Many patterns that help form genetic traits remain undiscovered, and machine learning specializes in pattern recognition that pushes the boundaries of human skill and knowledge.

[2] IBM’s Watson computer takes the Jeopardy! Challenge.

[3] Chin J. Simple Convolutional Neural Network for Genomic Variant Calling with TensorFlow. July 16, 2017; 1-3.

[4] Poplin R, Newburger D, Dijamco J, Nguyen N, Loy D, Gross S, McLean CY, DePristo MA. Creating a universal SNP and small indel variant caller with deep neural networks. bioRxiv. Dec. 14, 2016.

Mahir Jethanandani is a junior studying Computer Science, Statistics, and Economics at the University of California, Berkeley. He previously worked at the University of California, San Francisco Department of Neurology and Bioinformatics as a Machine Learning and Bioinformatics Research Intern. Mahir also interned at 23andMe as an Engineering Intern, exploring the world of personal genomics and its application to computer science, machine learning, and bioinformatics. Mahir graduated from UC Berkeley where he triple majored in Computer Science, Statistics and Economics. He is the author of “The Immaculate Investor” and “The Balance Sheet of Earth.” Mahir has worked with Benetech where he did volunteer work with Google for the United Nations. He is originally from Saratoga, California, and became inspired to explore genetics and bioinformatics after the passing of his grandfather.