A coming of age for PacBio and long read sequencing? #AGBT13

Saturday, February 23, 2013

Aside from the dubstep pumping out of the Roche and Agilent booths, the volume of AGBT has been somewhat muted. There was no grand offering of new hardware, and there were no over-the-top promises of sequencing genomes on what now appear to be vaporware USB thumb drives. This is my first in-person experience of AGBT, so as a virgin I can say it seems, for the most part, to be rooted in the science despite the ridiculous parties and “showgirl” casino nights. The atmosphere here is unlike any other science conference I’ve attended. It’s like the bastard child of a Gordon Conference and a Las Vegas Porn Convention. I really hope that the deep pockets of Sequencing Centers are more influenced by the science than by the free dinner parties and alcohol, but I have pretty low confidence in humanity. Regardless, I think everyone in attendance today was overwhelmed by a stunning talk from PacBio and the dramatic advancements of their long read technology.

The PacBio talk came on the heels of what felt like a warm-up opening act from Jeremy Schmutz of the HudsonAlpha Institute. Schmutz has been working with Moleculo, a start-up recently acquired by Illumina that promises highly accurate long reads using short read sequencers. I sat through two other talks, went to the Moleculo poster, and still do not have a clear idea of how the technology actually works. From what I could deduce from the talks and the questions I asked at the poster session, Moleculo technology works by fragmenting DNA into 10 kilobase fragments. These fragments are then diluted and clonally amplified in separate volumes. How Moleculo does this clonal amplification of what have to be tens of thousands of small volume PCR reactions is beyond me. When I asked this question at the poster session, Dmitry Pushkarev of Moleculo said that it was simple: they have a $4 piece of plastic about the size of a shoe box that does all of the hard work. Right. So the first step involves magic. Once the 10kb fragments are clonally amplified, they are again fragmented into even smaller pieces and Nextera tagged. These Nextera tagged sequences are then sequenced on a HiSeq and the data output is sent back to Moleculo, where they do the de-multiplexing and sequence assembly. Likewise, when asked how the Moleculo assembler can accurately span long repeat regions greater than the 100bp short read output, Pushkarev offered, “We wrote the best assembler ever.” Ok, I get it, you guys are going to be coy about how everything actually works. From an early business perspective, I guess that’s a great idea, but some of us like to know the details of how techniques work so that we can do our own validation and be positive that the data we’re getting out is representative of what has actually been sequenced. The sample to black box to data pipeline is not ideal to me, but others, like Schmutz, seem to be ok with this for now. Hopefully, as the Moleculo technology matures, we will get more information from them about how the technique actually works.

Fortunately, Schmutz was at least able to show us some of his downstream data using Moleculo and how it compared to other long and longer read sequencers such as Roche 454 and PacBio. Schmutz works mostly with plants, and plants are tricky organisms to sequence because they’re full of repetitive sequence that cannot be accurately resolved using the current short read technologies. Schmutz showed that Moleculo reads appear to be highly accurate, with an error rate of 1.2bp per 10kb of sequence; however, the Moleculo reads created a similar number of sequence gaps as shorter 800bp 454 reads, and these regions had to be fixed with secondary MiSeq runs. In contrast, Schmutz showed that PacBio RS reads provided the least “gappy” data, which is important from a cost perspective. It seemed like Schmutz’s final conclusion was that he liked the Moleculo technology the best because they did all of the hard work: they made the libraries and performed the read assembly, so he did not have to train staff on new techniques or machines.

After Schmutz had thoroughly fluffed the crowd over long read sequencing, Jonas Korlach from Pacific Biosciences took the stage to give a talk on the progress of assembling genomes using their long read sequencing technology. If you’ve been in the sequencing game for any amount of time, you’ve definitely heard of PacBio and their early promise of accurate, real-time, single molecule, long read sequencing technology and their subsequent absolute failure to rapidly deliver on those promises. PacBio has spent the last 2 years trying to dig itself out of a hole by aggressively working with early adopters to fix their original lapses in quality. In a talk earlier in the week (Slides!), Michael Schatz, another plant geneticist and PacBio early adopter, declared, “You don’t have to worry about the errors anymore.” This is in part because of major improvements in PacBio chemistry and the introduction of a new, more accurate polymerase. PacBio has consistently doubled its average read length over the last two years and has made gains on the error correction front. Last year, Schatz published a paper in Nature Biotechnology which showed that highly accurate Illumina or 454 short reads could be used to error correct PacBio long reads to generate the most accurate long read data available today. Korlach followed up on this short read correction scheme by showing that the Illumina/454 step can now be eliminated completely: researchers can use the shorter reads generated during the PacBio run to error correct the much longer reads in a process called the Hierarchical Genome Assembly Process (HGAP), with a base accuracy of 99.99965%. That’s a far cry from the early real world production numbers of 85%! Korlach then supported these tech numbers with an astounding amount of real world data. Granted, all of the de novo sequencing presented was done in pathogenic organisms or bacteria with small circular genomes, but no one can dispute how impressive the data looked during the presentation.
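
To put the accuracy figures quoted above on one scale, here is a minimal Python sketch that converts each of them into a Phred-style quality score, Q = -10 * log10(error rate). The Phred scale is a standard sequencing convention, not something taken from either talk, and the function name and labels are my own. Keep in mind that the numbers are not strictly comparable: the 85% figure describes raw reads, while the 99.99965% figure describes error-corrected output.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled quality score."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


# Error rates derived from the figures quoted in this post.
# The labels are illustrative only.
quoted_error_rates = {
    "Moleculo (1.2bp per 10kb)": 1.2 / 10_000,
    "PacBio early production (85% accurate)": 1 - 0.85,
    "PacBio HGAP (99.99965% accurate)": 1 - 0.9999965,
}

for label, rate in quoted_error_rates.items():
    print(f"{label}: error rate {rate:.2e}, about Q{phred_quality(rate):.1f}")
```

Every 10 points on this scale is a tenfold drop in error rate, which makes the gap between raw and corrected reads easier to appreciate than a string of nines.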

The big question now is: has PacBio finally weathered the storm, and can it overcome its previous reputation as a failed sequencing company? Only time will tell. It’s hard to predict winners and losers in this space, especially in light of the coming sequestration and dwindling research budgets. PacBio may be peaking at the wrong time. With the threat of Moleculo long read technology on the horizon, major sequencing labs may hold off on purchasing PacBio RS systems. Why invest $700,000 in a sequencer if you can get “good enough” long reads off of your HiSeqs?

Blog Comments

Paul

Guest Comment

Sat, Feb 23, 2013, 5:35 pm CST

I agree Brian, Illumina could have been a bit more forthcoming about Moleculo technology. I was disappointed with their claim of "10kb reads" when I found out they weren't all read as one piece.

I wrote a little more on the issue here: www.checkmatescientist.net/2013/02/illumina-unveils-little-about-long-read.html

Paul

Brian Krueger, PhD
Columbia University Medical Center

Sat, Feb 23, 2013, 9:10 pm CST

Thanks Paul! I totally agree with your commentary on your site. I also read over that Bio-IT World interview. Great stuff there.

http://www.bio-itworld.com/2013/1/18/moleculo-man-mickey-kertesz-illumina-long-read-acquisition.html

Shawn Baker

Guest Comment

Sat, Feb 23, 2013, 9:17 pm CST

Brian, thanks for the review of AGBT - it's especially helpful for those of us who couldn't attend. But I think you're giving PacBio slightly too much credit on improving their error rate. They have made tremendous strides on the analysis front, but their raw error rate is still quite high, with an accuracy that ranges between 83% and 87% (depending on which chemistry is used). The very impressive sounding 99.99965% comes from 20-50X consensus sequencing - all of the platforms improve when you look at consensus. However, PacBio does have the distinct advantage of having a stochastic error model, so it doesn't get stuck on 'difficult' regions and can plow through high or low GC content.

Brian Krueger, PhD

Sat, Feb 23, 2013, 9:27 pm CST

Shawn, thanks for the added insight. Jeremy also sent me an email about this. I asked him to post his comment here to clear up some of my own hand waving!

Jeremy Schmutz

Guest Comment

Sun, Feb 24, 2013, 1:01 pm CST

Hey Brian, good post. I don't like Moleculo just because they do the work (I don't have a problem reconstructing sub-assemblies of millions of reads :), although many people do appreciate the ease of data collection. I like the technology because it does something with a plastic shoebox that we need to do by handling freezers' worth of clones: bin very small amounts of target DNA into a local assembly. Assemblies aside, this gives a very clear advantage which seems to be lost on most folks who don't do this kind of work: if you can trust the subreads, you can do all kinds of stuff with them that you can't do with assembled WGS data. Anyway, PacBio looks promising, but reserve judgement until you work with a lot of 80 percent accurate reads or try to generate enough coverage. I'm excited about the developments, but reconstructed consensus is not the same as directly sequenced high accuracy bases. Hopefully, improved yields are in the future for PacBio.

Brian Krueger, PhD

Sun, Feb 24, 2013, 2:13 pm CST

Hey Jeremy, thanks for the clarification! I really enjoyed your talk and seeing Moleculo in action. I just wish we knew more details about how it worked. One question though, why do you think Schatz told everyone to not worry about the errors anymore if the PacBio data still isn't optimal in your dealings with it?

Jeremy Schmutz

Mon, Feb 25, 2013, 8:01 am CST

Brian,

I don't know why Schatz doesn't appear to be concerned about the accuracy of PacBio for plant applications. You would have to ask him. We operate in different spaces: Schatz is concerned about linking and increasing contiguity for drafty, draft-quality genome space, while I'm concerned about going beyond everything that is already doable now with Illumina and 454 with a bit of effort, and about what PacBio and Moleculo can add to this. Quality, contiguity, and large scale organization are extremely important to downstream functional work, but this has to extend to accuracy of consensus, and I'm looking for technologies to get us back to the days of accurate clone-by-clone sequencing combined with the science that can be done with the amazing amount of data from NGS systems.

On the question of how Moleculo works, I know very little beyond what is obvious. Anyone who has been around this field for a while could explain how to replicate it without knowing much about how it works. The innovation appears to be going from a microfluidic setup to a "plastic box", as you call it. In doing so you lose access to individual micro bubbles but gain massive increases in speed and decreases in the cost of library prep. I'd prefer not losing access to individual sheared molecules, as the Moleculo approach then becomes a pooled approach; however, the apparent small size of the pools helps to offset the loss of this direct access. We can replicate Moleculo on a simple scale, and we do every day by directly sequencing tagged pooled fragments amplified with E. coli or PCR. But it's a lot harder to do so, and so I see Moleculo as a mid-level attempt at improvement, something to do after an Illumina draft and before targeted, directed improvement from clones.

Also, I'd remind people that de novo genomes are not produced these days from one technology alone. Rather, you can produce a genome, but a complete, mostly complete, or reference quality genome can't be produced without multiple stages of data collection. The myth that any one of these platforms can produce a complete genome that rivals a finished genome is one perpetuated by these sequencing technology companies, each of which is trying to show what can be done with their technology. Yes, yes, even with Sanger sequencing this isn't possible. For example, we routinely use directed sequence from Illumina and 454 to improve our Sanger-based flagships. But each of these data collection tools has its own biases and issues. I mentioned AT/GC bias in my talk, and these are serious issues for Illumina and 454, because E. coli was better at amplifying and Sanger was better at detection, as it averaged the outcomes of millions of replications rather than bottlenecking down to a single molecule and then averaging the signals of artificial replication by a single polymerase. Would you trade the couple of 100 bp cloning holes every million bases for a gap every 10kb from an NGS system? I would (because I know as you get close to completing the puzzle it gets much easier to place the pieces in the holes) and so would the 10s of 1000s of scientists who work with reference genomes and presumably the billions in research dollars spent to work with these organisms. There are problems where having massive data collection greatly simplifies them, and much exciting science can be done with NGS platforms, such as large scale genotyping for natural and mutant populations, that impacts nearly all of biological science. We've seen an amazing convergence around DNA across fields beyond human and medical genetics, into ecology, evolution, agronomy, and crop science, and it is leading us to a finer and finer scale understanding of how things work on a fundamental biochemical level intersected with spatial and temporal scales! Which is very cool, even if I need to use a series of run-on sentences to explain it.

And do always try to remember, all of these sequencing platforms and algorithmic analysis pipelines are just tools. I wouldn't want you to build my house if you just showed up to the job site with a screwdriver, even if it was a very, very expensive screwdriver!

Brian Krueger, PhD

Mon, Feb 25, 2013, 10:20 am CST

Great stuff Jeremy! You bring up good points about gaps and bioinformatics. Despite the advances in technology, there is a lot of extra work that goes into assembling a de novo genome on the back end. Prior to my work at Duke I ran a de novo virus sequencing core lab at the University of Florida, and even with small non-repetitive genomes we still had to do a lot of gap filling with Sanger runs. Thanks for the additional information regarding long read sequencing. I asked Schatz to stop by and join the discussion. Would it be possible for you to make the slides from your talk available?

Michael Schatz

Guest Comment

Tue, Feb 26, 2013, 12:13 am CST

Hi Brian,

I am certainly interested in both continuity and accuracy of PacBio sequencing. However, I no longer fear the 15% error rate like I first did, because we have more-or-less worked out the error model and algorithms that will let us "polish out" the vast majority of those errors. From a purely theoretical model, the consensus accuracy from even 20% random sequencing errors nevertheless approaches 100% accuracy with 15x coverage or higher (http://schatzlab.cshl.edu/presentations/2012-05-10.CSHL%20BOG.PacBio.pdf slide 7). When we measure the accuracy of real assemblies we see almost exactly this result, and indeed, this is what enables PacBio to sequence entire microbial chromosomes into single contigs at 99.99965% accuracy as you describe above.
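
The theoretical argument above can be illustrated with a toy binomial model. The Python sketch below is not the model from the linked slides or PacBio's actual consensus algorithm; it assumes every read is wrong at a given position independently with the same probability, and it counts the consensus as wrong whenever at least half of the reads are wrong. The function name and the coverage values in the loop are illustrative choices.

```python
from math import comb


def consensus_accuracy(per_read_error: float, coverage: int) -> float:
    """Probability that a majority vote over `coverage` reads is correct.

    Toy model (an assumption for illustration): each read is wrong at
    this position independently with probability `per_read_error`, and
    the consensus fails when at least half of the reads are wrong.
    Treating all wrong reads as agreeing on the same wrong base makes
    this a pessimistic estimate.
    """
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must be in [0, 1]")
    p_fail = sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(coverage + 1)
        if 2 * k >= coverage  # at least half of the reads are wrong
    )
    return 1.0 - p_fail


# Odd coverage values avoid ties in the vote.
for coverage in (1, 5, 15, 25, 51):
    accuracy = consensus_accuracy(0.20, coverage)
    print(f"{coverage:>2}x coverage: consensus accuracy {accuracy:.6%}")
```

Even this crude vote shows the consensus error shrinking quickly as coverage grows. The same reasoning applies to any platform, which is why raw read accuracy and consensus accuracy need to be reported separately.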

For larger genomes, it will probably be more practical to use our hybrid assembly algorithm (Koren, 2012) to combine high coverage of short reads with the long reads. Interestingly, the steps for this are surprisingly similar to Moleculo sequencing. Moleculo, at a very high level, consists of: (1) isolate long molecules, (2) clever amplification and barcoding, (3) Illumina sequencing, (4) clever local assembly, and (5) easy long read assembly. For our hybrid approach, the steps are: (1) isolate long molecules, (2) clever PacBio sequencing, (3) Illumina sequencing, (4) clever local assembly, and (5) easy long read assembly. The two localization strategies have different strengths and weaknesses, and both are limited by our ability to select long molecules, among other shared limitations, but already we are seeing either approach can generate extremely high quality reads and extremely high quality assemblies.

I can't predict which technology will "win" or when it will happen, but I am extremely optimistic for the future. Historical improvements are no guarantee of the rate of advance we will see in the future, of course, and we will certainly have to work hard to research new algorithms to effectively use the data and correct for the biases and complications that will inevitably arise. Furthermore, I was entirely speculating when I said whole chromosome assemblies might be available next year (in fact that was the title of the slide!), but if future advances are anything like past advances, I don't think it will be too far off the mark before we can translate nearly perfect single contig per chromosome assemblies, now achievable in microbes, into more complex eukaryotes. Considering it was just a few years from the publication of the first microbe to the publications of the first eukaryotes, there is certainly precedent for quickly scaling up assembly technologies along these lines.

Thank you,

Michael Schatz
