When Whole Genome Sequencing Doesn’t Give Us the Whole Genome

Sunday, March 1, 2015


If you’re an avid follower of popular science in today’s news media, you might have noticed a recurring theme. Genomics is everywhere. On an almost weekly basis, the New York Times, the New Yorker, Forbes and a myriad of other outlets publish stories with overly optimistic ledes about doctors and gene sequencers being replaced by apps and iPhone accessories. You would be forgiven for thinking that genomics was “solved” and that we’re 5 years out from creating a Star Trek-inspired “tricorder” that near-instantly sequences your genome and tells you, without equivocation, what malady is afflicting you and exactly how to overcome it. The fact of the matter is that we’re not there yet, not by a long shot.

Precision medicine, personalized genomics, genomic medicine (and any of the other hundred permutations) are the “It” kids of 2015. This has been underscored by Obama’s recently announced Precision Medicine Initiative, but we’ve been leading up to this over the past few years with major public and private initiatives in genomics. The idealized goal here is to unify electronic medical records and phenotypes with sequence-derived genotypes to ease diagnostic decision-making in the clinic, provide better care to patients, and develop targeted treatments that minimize or eliminate negative side effects. However, the timetable for realizing these innovations is up for debate. And while being a popular topic in the media is generally a good thing, the ideal we’re trying to reach with precision medicine initiatives is exceptionally complicated, and short-sighted visions of the future do our efforts more harm than good.

As an example, it is incredibly frustrating for me to read opinion articles like the one recently run in the New York Times entitled ‘Moonshot’ Medicine Will Let Us Down. In this piece, Dr. Michael A. Joyner excoriates genomics and expounds on its failure to deliver the medical revolution the scientific community promised back when we started the laborious process of sequencing the human genome in 1990. In the end, Dr. Joyner says we’d be better off investing money in studying social and environmental determinants of disease because genomics has little chance of providing actionable medical information for a majority of the population. I definitely agree with him about needing to focus more attention on social and environmental factors, but by no means does that eliminate the need to continue funding large projects in genomics. It is clear that both genes and the environment play significant roles in disease etiology, so favoring the study of one over the other would be a misstep. I also agree with Dr. Joyner on the limitations of personalized medicine today, because we really are only able to provide answers to the small fraction of the population that carries a genetic predisposition to very rare diseases. However, I do not believe that this will be true in the long term, and I couldn’t help but be reminded of Clifford Stoll’s epic Newsweek article from 1995, Why the Web Won't Be Nirvana, wherein he predicted that the internet was just a fad that would never amount to anything. It’s hard not to see a parallel here with Dr. Joyner’s article. Sure, like the internet pundits in the ’80s, we’ve made some bold claims over the years, but to get to the genomic Nirvana we need more data, better data, and better algorithms for analyzing that data. All of this will take time and come at a substantial cost. Unfortunately, genomics doesn’t have the same luxury of virtually cost-free advancement that the internet has benefitted from over the past three decades.
A 16-year-old can’t pick up a genome sequencer and develop the next killer app over a high school lunch break. I think most of us in the field now recognize that the personalized medicine revolution is not going to occur on a 5-year time scale, and that it is not going to produce broadly applicable, significant therapies for every human on this planet in the short term.

Common types of sequence and structural variants

This, however, in no way means that we have not made significant advancements. To the contrary, we have made important progress in detecting single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) that negatively affect the genome. Because these types of mutations are readily detected by the current iteration of genome sequencers, we have seen an explosion of papers in the genomics community that causally associate these mutations with disease. Recent surveys of clinical applications of exome and genome sequencing show that sequencing patients who present with likely genetic disorders results in a diagnosis in 30-50% of cases (depending on the specifics of the technique used), and in many cases this results in a change to the care given to those patients. In even rarer circumstances, we can actually provide care that significantly improves a patient’s standard of living (see Cara Green, KCNT1). Sequencing has also been shown to be helpful in the development and targeting of drugs in cancer patients, and even more recently, studies of human “knockouts” have found druggable targets that could have significant implications for human health. As an example, a recent survey discovered that people who lacked a working copy of the gene PCSK9 had much lower cholesterol levels, and a number of companies have now made a mad rush to develop therapies that specifically inhibit the protein it encodes. There are many other examples of important discoveries like this in pharmacogenomics, which seeks to determine who will or will not respond to targeted drug therapies (see IL28B, warfarin, etc.). Although we have made what I would consider enormous strides since the sequencing of the human genome, we still don’t obtain the complete genomic picture, and the predominant technique that we use for sequencing the genome today is unable to detect major, likely important, mutations.

A. An example of a repeat expansion with a visualization of short read pile-up. Because we don't have long sequences to span repetitive gaps, it is impossible to know the length of this expansion. B. An example of an inversion. The only way we can detect this with short reads is by scanning for specific inversion points. Otherwise sequences will align to the reference as if no inversion has occurred. C. Similar to an inversion, a translocation only provides clues to this variant at the break point. Otherwise all reads will align normally.

Fortunately, we know some of the limitations of the current crop of sequencing technologies, although you’d be hard pressed to find any mention of these limitations in popular scientific reporting. While we are good at looking at single nucleotide polymorphisms, indels, and large copy number variations (duplications or deletions of large areas of sequence), we cannot reliably detect other large structural variations such as repeat expansions (Huntington’s, Fragile X, etc.), most inversions (regions that flip their orientation), and most translocations (regions from one chromosome that get stuck onto another chromosome). We also can’t easily determine whole genome haplotype phase (you have two of each chromosome, one from mom and one from dad, and each contains different sequence!) or accurately distinguish sequence originating from pseudogenes (non-coding pieces of DNA that look very much like genes but generally aren’t expressed as functional proteins). This is largely due to the fact that the cheapest and most widely used sequencing method is short read sequencing. Instead of sequencing the fully intact DNA strands from each chromosome, we break each chromosome up into smaller, manageable pieces. We sequence these pieces and then reassemble the overlapping parts using computers and a reference genome (the product of the Human Genome Project!). The problem with performing sequencing this way is that the genome contains many stretches that are repetitive, and if mutations occur in these regions, we do not have long enough sequences to accurately determine their location or orientation. This is especially problematic when we consider how the DNA surrounding a gene controls the expression of that gene (orientation and location can matter!), and particularly in cancer, where translocations commonly create fusion proteins that have novel cancer-promoting functions.
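To make the orientation problem a little more concrete, here is a minimal Python sketch. It is a toy model under loud assumptions, not how a real aligner works: reads are error-free, every possible read is observed, and a read is treated as identical to its reverse complement because sequencers sample both strands. The sequences, the 8-base read length, and the helper names (`revcomp`, `strand_neutral_reads`) are all made up for illustration.

```python
# Toy model: which short reads can reveal an inversion?
# All sequences and names below are hypothetical, chosen only for illustration.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement: the same DNA as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_neutral_reads(seq, length):
    """All distinct error-free reads of `length` bases, stored strand-neutrally.

    A read and its reverse complement are stored as one entry (the
    alphabetically smaller of the two), since either strand may be sequenced.
    """
    reads = set()
    for i in range(len(seq) - length + 1):
        read = seq[i:i + length]
        reads.add(min(read, revcomp(read)))
    return reads


READ_LENGTH = 8
LEFT_FLANK = "GATTACAGGC"
REGION = "TTTTCCCCAAAAGGGG"  # the segment that gets inverted
RIGHT_FLANK = "TCAGTCGATG"

reference = LEFT_FLANK + REGION + RIGHT_FLANK
inverted = LEFT_FLANK + revcomp(REGION) + RIGHT_FLANK

# Reads that lie entirely inside the inverted segment look like reference reads.
inside_same = (strand_neutral_reads(revcomp(REGION), READ_LENGTH)
               == strand_neutral_reads(REGION, READ_LENGTH))

# Reads from the inverted genome that never occur in the reference.
# In this model they can only come from the two breakpoints.
novel_reads = (strand_neutral_reads(inverted, READ_LENGTH)
               - strand_neutral_reads(reference, READ_LENGTH))

print("reads inside the inversion match the reference:", inside_same)
print("reads that hint at the inversion:", len(novel_reads))
```

In this sketch, at most 2 × (READ_LENGTH − 1) reads can differ from the reference, no matter how large the inverted segment is. Everything else lines up as if nothing had happened, which is why short read pipelines have to go hunting for breakpoints specifically.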

So the solution to our problem seems to be that we just need to sequence larger pieces of DNA or the chromosomes directly. Why didn’t we do that in the first place?!?! Well, because it’s hard. DNA likes to bend, and curl, and stick to itself. It’s tiny, so it’s extremely hard to manipulate, but we do have some very smart people working on the problem, and there are two new long read sequencers on the market that are capable of sequencing much longer fragments of DNA. The major problem is that sequencing a genome using these methods is very expensive and time consuming, and, because the technology is new, it is highly error prone. All of these issues are being worked out, but because of them, and because we aren’t yet very good at analyzing and understanding the longer data, it is very hard to justify the additional cost, which is sometimes as much as 50x the cost of short read sequencing. As prices for long read sequencing fall and the technology improves, we will undoubtedly produce more accurate and more widely relevant genomic data. And you don’t have to take my word for it. A major theme of this year’s Advances in Genome Biology and Technology conference was the need for longer read data, along with discussions of how large structural variations could play an important role in undiagnosed and common disease.
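For a sense of why read length matters so much for repeat expansions, here is a small Python sketch. Again, this is a toy model and not a real pipeline: reads are error-free, the numbers are scaled way down (30-base "short" reads, 200-base "long" reads), and it ignores read depth and paired-end information, which real tools do use to make rough estimates. The flanking sequences, the CAG repeat counts, and the helper names (`genome`, `reads`, `spanning_repeat_counts`) are hypothetical.

```python
# Toy model: short reads vs. long reads across a repeat expansion.
# Flanking sequences, repeat counts, and read lengths are hypothetical.

LEFT_FLANK = "ACGTTGCAAGGCTTACGATC"
RIGHT_FLANK = "TTGACCGATAGGCTAACGGT"
UNIT = "CAG"


def genome(n_repeats):
    """A tiny 'genome': unique flank, n copies of the repeat unit, unique flank."""
    return LEFT_FLANK + UNIT * n_repeats + RIGHT_FLANK


def reads(seq, length):
    """All distinct error-free reads of up to `length` bases."""
    if length >= len(seq):
        return {seq}  # the whole fragment fits in a single read
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def spanning_repeat_counts(read_set, anchor_len=8):
    """Repeat counts measured from reads that contain BOTH flanking anchors."""
    left_anchor = LEFT_FLANK[-anchor_len:]
    right_anchor = RIGHT_FLANK[:anchor_len]
    counts = set()
    for read in read_set:
        start = read.find(left_anchor)
        end = read.find(right_anchor)
        if start != -1 and end > start:
            tract = read[start + anchor_len:end]
            counts.add(len(tract) // len(UNIT))
    return counts


SHORT_READ = 30
LONG_READ = 200

normal = genome(20)    # 60 bases of repeat
expanded = genome(60)  # 180 bases of repeat

# With short reads, both genomes yield exactly the same set of distinct reads.
same_short_reads = reads(normal, SHORT_READ) == reads(expanded, SHORT_READ)

short_counts = spanning_repeat_counts(reads(expanded, SHORT_READ))
long_counts_normal = spanning_repeat_counts(reads(normal, LONG_READ))
long_counts_expanded = spanning_repeat_counts(reads(expanded, LONG_READ))

print("short reads identical for both genomes:", same_short_reads)
print("repeat counts from short reads:", short_counts)
print("repeat counts from long reads:", long_counts_normal, long_counts_expanded)
```

The point of the sketch is that no short read reaches from one unique flank to the other, so nothing in the short read data pins down the repeat length. Once a read is long enough to span the whole tract plus a bit of flank on each side, the length can simply be counted.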

Finally, beyond the importance of the raw sequence information, which codes for the proteins that make up your cells and the machinery that keeps you alive, there are even more cryptic signals in the DNA that we think are important contributors to disease. These are called epigenetic changes: changes to the DNA, or to the proteins bound to the DNA, that occur outside of the coding DNA sequence itself. Epigenetic changes have been shown to affect the spatial (in which cell types) and temporal (at what time) expression of specific genes or groups of genes, which could contribute significantly to developmental disorders. We can profile these changes directly using a variety of techniques, or look at them indirectly by surveying how the messenger form of the genomic information, RNA, changes in cells. However, teasing out cause and effect from epigenetic data is complex, as is understanding how insults to multiple genes contribute to the full spectrum of health and disease that we see in people.

If you remember, I mentioned earlier that only 30-50% of the people who present in a clinic with what we think is a genetic disorder receive an answer from genome sequencing. Part of the reason we can't assign a cause in many of these patients is a phenomenon called missing heritability. Actually, that's a bit of a misnomer, because we know the information is in the genome somewhere, much like global warming was a misnomer for global climate change. The term "missing heritability" comes from the days of genome wide association studies (GWAS). Soon after the sequencing of the human genome, geneticists split into two camps regarding the likely causes of disease. One camp said most diseases were caused by common variants in the genome that are passed down through the generations. These groups put all of these common variants (but only single nucleotide polymorphisms) on GWAS chips and then went fishing in thousands of people's genomes. Unfortunately, the fishing expedition showed that these common variants were only associated with a small fraction of diseases; hence, missing heritability. At the same time, however, other groups were hypothesizing that the majority of diseases were due to rare or de novo variants that aren't common in the population. The thought was that diseases are generally selected against, so it doesn't make complete sense that disease would be the result of common human variation. Today we know that both groups were right. Common variation is associated with risk factors for developing disease, while rare and de novo variants are causally associated with disease in essentially all known Mendelian diseases. However, there is still a large portion of genetic disease that isn't explained by common variation or rare variation. Now, this could be because we just haven't discovered all of the common or rare variants yet (see the section above on structural variation and sequencing limitations).
Or it could be that the variants we have discovered haven't been associated with disease yet because they're involved in more complex genetic interactions. For example, epigenetic changes, epistasis (one gene determines the phenotype, while another masks that gene by controlling how it is expressed), or polygenes (multiple genes interact to contribute to a phenotype) could play a significant role in disease. We’re really counting on major advances in sequencing technology and sequence data analysis to determine how these genomic unknowns contribute to disease and to help us bring that diagnostic percentage closer to 100%. Although I have highlighted many of the things we don’t know about genomics, I do think that these limitations and challenges represent a strong call to action, one that asks us to continue working hard on translating what we know into more personalized care. We are clearly on the cusp of making transformational changes in how genomics is used to empower clinical care, with clear examples already in the clinic for some diseases. Just please don’t expect this to apply to everyone overnight.