Brian Krueger is the owner, creator and coder of LabSpaces by night and Next Generation Sequencer by day. He is currently the Director of Genomic Analysis and Technical Operations for the Institute of Genomic Medicine at Columbia University Medical Center. In his blog you will find articles about technology, molecular biology, and editorial comments on the current state of science on the internet.
My posts are presented as opinion and commentary and do not represent the views of LabSpaces Productions, LLC, my employer, or my educational institution.
Aside from the dubstep pumping out of the Roche and Agilent booths, the volume of AGBT has been somewhat muted. There was no grand offering of new hardware or over-the-top promises of sequencing genomes on what now appear to be vaporware USB thumb drives. This is my first in-person experience of AGBT, so as a virgin it seems for the most part to be rooted in the science despite the ridiculous parties and “showgirl” casino nights. The atmosphere here is unlike any other science conference I’ve attended. It’s like the bastard child of a Gordon Conference and a Las Vegas Porn Convention. I really hope that the deep pockets of Sequencing Centers are more influenced by the science than the free dinner parties and alcohol, but I have pretty low confidence in humanity. Regardless, I think everyone in attendance today was overwhelmed by a stunning talk from PacBio and the dramatic advancements of their long read technology.
The PacBio talk came on the heels of what felt like a warm-up opening act from Jeremy Schmutz of the HudsonAlpha Institute. Schmutz has been working with Moleculo, a start-up recently acquired by Illumina, which promises highly accurate long reads using short read sequencers. I sat through two other talks, went to the Moleculo poster, and still do not have a clear idea of how the technology actually works. From what I could deduce from the talks and the questions I asked at the poster session, Moleculo technology works by fragmenting DNA into 10 kilobase fragments. These fragments are then diluted and clonally amplified in separate volumes. How Moleculo does this clonal amplification of what have to be tens of thousands of small volume PCR reactions is beyond me. When I asked this question at the poster session, Dmitry Pushkarev of Moleculo said that it was simple: they have a $4 piece of plastic about the size of a shoe box that does all of the hard work. Right. So the first step involves magic. Once the 10kb fragments are clonally amplified, they are again fragmented into even smaller pieces and Nextera tagged. These Nextera tagged sequences are then sequenced on a HiSeq and the data output is sent back to Moleculo where they do the de-multiplexing and sequence assembly. As before, when asked how the Moleculo assembler can accurately span long repeat regions greater than the 100bp short read output, Pushkarev offered, “We wrote the best assembler ever.” Ok, I get it, you guys are going to be coy about how everything actually works. From an early business perspective, I guess that’s a great idea, but some of us like to know the details of how techniques work so that we can do our own validation and be positive that the data we’re getting out is representative of what has actually been sequenced. The sample to black box to data pipeline is not ideal to me, but others, like Schmutz, seem to be ok with this for now.
Hopefully, as the Moleculo technology matures, we get more information from them about how the technique actually works.
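Since Moleculo isn't sharing details, the best I can do is sketch the general idea behind the de-multiplexing step: if every short read carries a tag identifying the volume it was amplified in, the reads can be sorted back into per-fragment piles before any assembly happens. The toy Python below is only an illustration of that idea, not Moleculo's pipeline. The function name `demultiplex`, the 8-base tag length, and the assumption that the tag sits at the start of each read are all made up for the example.

```python
from collections import defaultdict


def demultiplex(reads, barcode_len=8):
    """Group reads by their leading barcode (hypothetical layout).

    Returns a dict mapping each barcode to the list of read sequences
    with the barcode stripped off.
    """
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        groups[barcode].append(insert)
    return dict(groups)


# Three made-up reads from two made-up wells.
reads = ["AAAAAAAAGATTACA", "CCCCCCCCTTGACCA", "AAAAAAAACATTAGG"]
by_well = demultiplex(reads)
for barcode, inserts in by_well.items():
    print(barcode, inserts)
```

Each pile would then, presumably, be assembled on its own, which turns one very hard whole-genome problem into many small 10kb problems. How repeats inside a single 10kb fragment get resolved is exactly the part nobody would explain.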
Fortunately, Schmutz was at least able to show us some of his downstream data using Moleculo and also how it compared to other long and longer read sequencers such as Roche 454 and PacBio. Schmutz works mostly with plants, and plants are tricky organisms to sequence because they’re full of repetitive information that cannot be accurately sequenced using the current short read technologies. Schmutz showed that Moleculo reads appear to be highly accurate with an error rate of 1.2bp per 10kb of sequence; however, the Moleculo reads created a similar number of sequence gaps as shorter 800bp 454 reads and these regions had to be fixed with secondary MiSeq runs. In contrast, Schmutz showed that PacBio RS reads provided the least “gappy” data, which is important from a cost perspective. It seemed like Schmutz’s final conclusion was that he liked the Moleculo technology the best because they did all of the hard work: they made the libraries and performed the read assembly without having to train staff on new techniques or machines.
After Schmutz had thoroughly fluffed the crowd over long read sequencing, Jonas Korlach from Pacific Biosciences took the stage to give a talk on the progress of assembling genomes using their long read sequencing technology. If you’ve been in the sequencing game for any amount of time you’ve definitely heard of PacBio and their early promise of accurate, realtime, single molecule, long read sequencing technology and their subsequent absolute failure in rapidly delivering those promises. PacBio has spent the last 2 years trying to dig itself out of a hole by aggressively working with early adopters to fix their original lapses in quality. In a talk earlier in the week (Slides!), Michael Schatz, another plant geneticist and PacBio early adopter, declared, “You don’t have to worry about the errors anymore.” This is in part because of major improvements in PacBio chemistry and the introduction of a new more accurate polymerase. PacBio has consistently doubled its average read length over the last two years and has made gains on the error correction front. Last year, Schatz published a paper in Nature Biotechnology which showed that highly accurate Illumina or 454 short reads could be used to error correct PacBio long reads to generate the most accurate long read data available today. Korlach followed up on this short read correction scheme by showing that the Illumina/454 step can now be eliminated completely and researchers can use the short reads generated during the PacBio run to error correct the much longer reads in a process called the Hierarchical Genome Assembly Process (HGAP), with a base accuracy of 99.99965%. That’s a far cry from the early real world production numbers of 85%! Korlach then supported these tech numbers with an astounding amount of real world data. Granted, all of the de novo sequencing presented was done in pathogenic organisms or bacteria with small circular genomes, but no one can dispute how impressive the data looked during the presentation.
The big question now is: has PacBio finally weathered the storm and can it overcome its previous reputation as a failed sequencing company? Only time will tell. It’s hard to predict winners and losers in this space, especially in light of the coming sequestration and dwindling research budgets. PacBio may be peaking at the wrong time. With the threat of Moleculo long read technology on the horizon, major sequencing labs may hold out on purchasing PacBio RS systems. Why invest $700,000 in a sequencer if you can get “good enough” long reads off of your HiSeqs?
I agree Brian, Illumina could have been a bit more forthcoming about Moleculo technology. I was disappointed with their claim of "10kb reads" when I found out they weren't all read as one piece.
I wrote a little more on the issue here: www.checkmatescientist.net/2013/02/illumina-unveils-little-about-long-read.html
Thanks Paul! I totally agree with your commentary on your site. I also read over that Bio-IT world interview. Great stuff there.
Brian, thanks for the review of AGBT - it's especially helpful for those of us who couldn't attend. But I think you're giving PacBio slightly too much credit on improving their error rate. They have made tremendous strides on the analysis front, but their raw error rate is still quite high, with an accuracy that ranges between 83% and 87% (depending on which chemistry is used). The very impressive sounding 99.99965% comes from 20-50X consensus sequencing - all of the platforms improve when you look at consensus. However, PacBio does have the distinct advantage of having a stochastic error model, so it doesn't get stuck on 'difficult' regions and can plow through high or low GC content.
Shawn, thanks for the added insight. Jeremy also sent me an email about this. I asked him to post his comment here to clear up some of my own hand waving!
Hey Jeremy, thanks for the clarification! I really enjoyed your talk and seeing Moleculo in action. I just wish we knew more details about how it worked. One question though, why do you think Schatz told everyone to not worry about the errors anymore if the PacBio data still isn't optimal in your dealings with it?
Great stuff Jeremy! You bring up good points about gaps and bioinformatics. Despite the advances in technology, there is a lot of extra work that goes into assembling a de novo genome on the back end. Prior to my work at Duke I ran a de novo virus sequencing core lab at the University of Florida and even with small non-repetitive genomes we still had to do a lot of gap filling with Sanger runs. Thanks for the additional information regarding long read sequencing. I asked Schatz to stop by and join the discussion. Would it be possible for you to make the slides from your talk available?
I am certainly interested in both continuity and accuracy of PacBio sequencing. However, I no longer fear the 15% error rate like I first did, because we have more-or-less worked out the error model and algorithms that will let us "polish out" the vast majority of those errors. From a purely theoretical model, the consensus accuracy from even 20% random sequencing errors nevertheless approaches 100% accuracy with 15x coverage or higher (http://schatzlab.cshl.edu/presentations/2012-05-10.CSHL%20BOG.PacBio.pdf slide 7). When we measure the accuracy of real assemblies we see almost exactly this result, and indeed, this is what enables PacBio to sequence entire microbial chromosomes into single contigs at 99.99965% accuracy as you describe above.
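To give a rough feel for why consensus washes out random errors, here is a minimal Python sketch of a simple majority-vote model. It is an illustration under stated assumptions, not the model from the linked slides or the algorithm used in any real assembler: errors are assumed independent between reads, every read is either right or wrong at a position, and (pessimistically) all wrong reads are assumed to agree with each other. The function name `consensus_accuracy` is made up for the example.

```python
from math import comb


def consensus_accuracy(error_rate: float, coverage: int) -> float:
    """Probability that a strict majority of reads is correct at one base.

    Assumes independent errors and a worst case in which every
    erroneous read votes for the same wrong base.
    """
    p_correct = 1.0 - error_rate
    needed = coverage // 2 + 1  # strict majority of the reads
    return sum(
        comb(coverage, k) * p_correct**k * error_rate ** (coverage - k)
        for k in range(needed, coverage + 1)
    )


# Per-base consensus accuracy at a 20% raw error rate for a few depths.
for cov in (5, 15, 25, 51):
    print(f"{cov}x coverage: {consensus_accuracy(0.20, cov):.6f}")
```

Even in this deliberately pessimistic toy model, the per-base consensus accuracy at 15x climbs above 99% and keeps rising with depth. Real errors are spread across insertions, deletions and several possible wrong bases, so wrong reads rarely agree with one another, which is one reason real consensus can do better than this bound suggests.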
For larger genomes, it will probably be more practical to use our hybrid assembly algorithm (Koren, 2012) to combine high coverage of short reads with the long reads. Interestingly, the steps for this are surprisingly similar to Moleculo sequencing. Moleculo, at a very high level, consists of: (1) isolate long molecules, (2) clever amplification and barcoding, (3) Illumina sequencing, (4) clever local assembly, and (5) easy long read assembly. For our hybrid approach, the steps are: (1) isolate long molecules, (2) clever PacBio sequencing, (3) Illumina sequencing, (4) clever local assembly, and (5) easy long read assembly. The two localization strategies have different strengths and weaknesses, and both are limited by our ability to select long molecules, among other shared limitations, but already we are seeing either approach can generate extremely high quality reads and extremely high quality assemblies.
I can't predict which technology will "win" or when it will happen, but I am extremely optimistic for the future. Historical improvements are no guarantees for the rate of advance we will see in the future, of course, and we will certainly have to work hard to research new algorithms to effectively use the data and correct for the biases and complications that will inevitably arise. Furthermore, I was entirely speculating when I said whole chromosome assemblies might be available next year (in fact that was the title of the slide!), but if future advances are anything like past advances, I don't think it will be too far off the mark before we can translate nearly perfect single contig per chromosome assemblies, now achievable in microbes, into more complex eukaryotes. Considering it was just a few years from the publication of the first microbe to the publications of the first eukaryotes, there is certainly precedent for quickly scaling up assembly technologies along these lines.