Webinar: Understanding and Controlling for Sample and Platform Biases in NGS Assays
Category: Reference Standards
Dr. Jonathan Frampton, Associate Director, Product Management & Marketing, Horizon Discovery
Dr. Natalie LaFranzo, US Customer Support & Technical Support Scientist, Horizon Discovery
In this webinar we will focus on next generation sequencing and discuss: the effect of formalin treatment on molecular protocols and how it can be controlled, how to select the sequencing platform most appropriate for your sample, evaluation of whether structural variants are being detected by your assay and how Reference Standards can be used in each validation parameter.
JF: So what I want to talk to you about today, with Natalie as co-presenter getting into the real nitty-gritty details and the really exciting data, is NGS and applying NGS. One question I like to ask at the start of my presentation, and it’s a question for you to consider as we go through the different slides, as we will keep coming back to it, is: what is the impact of assay failure in your laboratory, and how do you monitor for it? This is a question that really embodies the passion and drive at Horizon. We do want to answer it, and there are lots of different things that can impact your laboratory. By developing different types of reference standards we hope to be able to support you and help you to answer that question.
Now, next generation sequencing has a number of attractive qualities, and one which resonates well with people like yourselves is that you take your tumour sample, you extract the DNA, you do a library prep for your NGS workflow, and it then allows you to test for the status of different genes such as AKT1, BRAF, EGFR, KRAS, NRAS and many more. So, switching from a qPCR-based method or a Sanger sequencing-based method, you can start to extract a lot of information from just one sample. Where this could be going, certainly in molecular diagnostic assays, is cell-free DNA samples, where you may have very limited material from blood to run such assays. Likewise, if you’re doing fine needle aspirations, the sample is going to be very precious to you, so being able to apply a workflow that gets a lot of information out of it is an attractive proposition.
So what does any new technology bring?
Any new technology is there to bring a lot of good things, and the challenge with any new technology is the variability, and understanding the full workflow so you can mitigate any problems that arise at each step. This is the workflow: you’ve got your tumour sample, you extract the DNA, you then need to quantify the DNA, go through library prep, sequence it, and then go through bioinformatics, where you analyse the data and act on it. What we have developed at Horizon, and this isn’t what we are going to talk about in today’s webinar, but there are other webinars you can listen to, is a whole range of reference materials that can help you to answer the question about variability at each step. While we are looking to develop one universal reference standard, we currently have a nice panel that answers the question at each stage.
Before I hand over to Natalie I want to get some data into my first half of the talk. This is really nice data from an AmpliSeq panel using one of our most popular reference materials, our quantitative multiplex. What we did was send this quantitative multiplex to three different partners and ask them to run it and look for these specific variants, so there are 11 different onco-specific mutations, each at a given allele frequency. The blue bars are Horizon, so that’s us: that’s what we defined the reference standard to be, and we qualified that using digital PCR. What’s nice is that the majority of the partners can identify all of the mutations when they are above 5%. I’ve highlighted a couple of places where there are some unusual results from our partners. The EGFR deletion, the third mutation along: two of the partners just couldn’t identify that, and it’s supposed to be around 2%. The EGFR T790M, which is two more along: again a couple of partners struggled to identify that in the sample with the panel, and so on. The key take-home message, or the building block for the rest of the talk, is that while next generation sequencing opens up a number of exciting opportunities, it’s critical we all understand where the limitations may be. What we are keen on and very passionate about is providing samples that allow you to really tease apart how good the workflow is, to help identify where any variability could creep in. Before I hand over to Natalie: we are always very open to questions, so if you want to ask any during the talk you can type them into the panel on the right-hand side, and we will look to answer them at the end of the session.
It is my great pleasure to introduce Natalie, who is our US customer and technical support scientist. She has a great background in next generation sequencing and a PhD in chemistry from Washington University. So, without carrying on talking too much, as I do enjoy doing that a lot, I’ll pass you over to Natalie. Enjoy Natalie’s slides. Thank you.
NL: Thank you Jonathan. So I wanted to start by providing a brief introduction to next generation sequencing. When we think about the advancement of next generation sequencing, you can consider what an invaluable resource it has provided to researchers in multiple industries and multiple disciplines. It’s really going to be a major driver of the personalised medicine revolution we’re in the middle of. However, when we look at the cost of sequencing that’s plotted here from the National Institutes of Health, it shows the decrease in cost, but it’s not taking into consideration the significant cost associated with the infrastructure and expertise required for a robust and routine NGS pipeline.
And so if we consider what was predicted by Sboner in 2011, the cost of the sequencing portion of the experiment continues to decrease, but the cost associated with upfront experimental design and downstream analysis is now what dominates the cost of each assay. This is true when you’re performing pre-clinical R&D projects, but even more so when we’re talking about clinical assays. In this paper the authors note that the unpredictable and considerable human time that’s spent on the upstream design and downstream analysis is really the important part of this. And as Jonathan mentioned, here at Horizon we’re aiming to develop tools that help researchers and clinicians optimise their workflow so that NGS can be more reliable and also ultimately more affordable, because you’ve streamlined those resource-intensive areas.
And so depending on the genetic material you’re starting from and the experimental workflow that you follow, you can apply next generation sequencing to many applications, as I’ve summarised here. But for today’s webinar we’re focussing specifically on DNA resequencing, which implies that the data generated is going to be compared to an existing sequence such as the human genome, and we’re aiming to identify onco-relevant mutations including single nucleotide variants, insertions and deletions, copy number variants and translocations.
And as Jonathan showed earlier when we think about the full NGS workflow for resequencing applications there are many molecular manipulations that occur to generate the fully analysed data set, and in each of these steps you can have bias and error introduced but today we're just going to focus on the variability and sources of error in three of those points. Firstly the biological sample itself. The second is the molecular steps in preparing a sequencing library and finally we’ll dive into the specific bias associated with some of the commonly used sequencing platforms. We won’t go into detail today – if you’re interested in learning more about the bioinformatics and analysis, we have a recording of a previous webinar on our website and you can access that at your leisure.
So a great reference that I found, and that I’ll go back to along the way today, is this fantastic perspective article in Nature Reviews Genetics titled “The role of replicates for error mitigation in next-generation sequencing”. I’ll refer back to the workflow that they presented throughout the webinar. As Jonathan mentioned, in each laboratory, as a researcher or a clinician, you should really be aware of the impact bias and error have on your own assay. And so, as was requested in our previous webinar, I will highlight a few points throughout the webinar where reference standards that we offer here at Horizon have been shown to be really vital in better understanding and combating this, to ensure that your assays are accurate and reproducible.
These points are particularly important when you’re considering patient samples of low quality or low quantity, and especially when you’re trying to detect low allelic frequency variants. Really, for NGS technology to evolve, and for us to better understand things like haplotype phasing or determining the difference between two dysfunctional gene copies, we have to have a good understanding of the error rates and limitations of the pipelines we are using.
And so first we are going to focus on the sources of error associated with the biological sample itself. Some of these errors, such as mislabelling or sample mix-ups, are due to human error and may be difficult to predict. Sometimes you’ll have some unknown unknowns, if you will; running a known, predictable standard alongside your own samples can at least help introduce a checkpoint in your data. You also have inherent biological variability, such as tumour heterogeneity or mixed cell populations, which can add additional variability to your sample. In this case biological replicates can help identify true positive variants in a cell population, and we’ll discuss this in a bit more detail later.
When we’re thinking of cases where limited sample is available, this will require us to amplify the DNA by PCR in order to achieve sufficient material for sequencing. The degree of amplification required will depend on the library preparation approach that you select. And finally you have the degree of fragmentation, whether that’s inherent to the sample, such as in cfDNA, or a result of the preservation method that has been used, such as in FFPE samples.
So when we think about the offering that Horizon has for FFPE or formalin compromised materials we’re really considering how these perform differently in an NGS assay with regard to variability in primer binding, amplification, or even exclusion during size selection as a result of high fragmentation. So what we have are these genetically equivalent multiplex reference standards.
These are offered in several formats, from our HD701, which is high molecular weight with a high DNA integrity number, to formalin-compromised and fragmented DNA with low DNA integrity numbers. These allow researchers to evaluate how fragmentation affects detection. Specifically on this slide I’ve focussed on HD-C749, which is a mild formalin treatment, so it has been treated with formalin but still has a high molecular weight and high DNA integrity numbers, and then HD-C751, which has been subjected to a harsh formalin treatment, so you’ll see there that the DIN values for those are much lower. This quantitative multiplex range, which includes the formalin-compromised DNA format, can be used together as a set to investigate the effect of formalin on your workflow, starting from the very beginning with quantification all the way through the library preparation stage. These standards have a characterised fragmentation pattern, as you can see on the slide, and they’ve been quantified using the Qubit assay, which prevents the overestimation that has been demonstrated to happen with the NanoDrop assay. We’ve targeted defined allelic frequencies, and by comparing the mildly formalin-treated sample with the severely treated sample you can investigate the effect of formalin on the allelic frequency determination of your assay.
When we look at something like amplification, we compare HD-C749 and HD-C751, and again these standards are derived from genetically equivalent cell lines; really the only variable we’re changing is the intensity of the formalin treatment. You can see on the plot here, collected during our droplet digital PCR validation, that the Ct values for these two standards are different, both in their quantitative value and in their reproducibility. This doesn’t necessarily mean that HD-C751 is performing poorly; rather, it’s a reference for better understanding the expectations one should have for one’s own biological samples that have been treated with formalin. Importantly, the impact of this variability in the amplification will be dependent on the library approach that you’re using.
This is further exemplified when we consider the allelic frequencies that we observe during our droplet digital PCR validations. When we compare these three matched lots of HD-C749 and HD-C751, where batch one of each is derived from an identical cell line mixture, we can observe quantitatively the effect of formalin treatment on variability. Here we observe that for all lots of HD-C749 the detected allelic frequency fell within the acceptable ranges that we would use for normal, non-formalin-treated samples. However, in their matched lots, where the allelic frequencies are biologically identical but the cells have been subjected to a more severe formalin treatment, some of the variants fall outside of the benchmark for non-formalin-treated samples. It’s interesting that not all of the allelic frequencies are outliers; rather, this is likely dependent on the genomic context of that region. So to emphasise, this is not indicating that HD-C751 is performing poorly; rather, HD-C751 provides us with realistic expectations and allows us to better understand the variability that we will observe in our own formalin-treated samples.
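As an editorial illustration of the comparison described here, the following is a minimal Python sketch of checking observed allelic frequencies against the expected values of a reference standard. The 20% relative tolerance, the variant names and all numbers are assumptions made up for the example; they are not Horizon's acceptance criteria or data.

```python
def within_tolerance(expected_af, observed_af, rel_tol=0.2):
    """True if the observed AF lies within expected +/- rel_tol * expected.

    rel_tol=0.2 is an illustrative assumption, not a published benchmark.
    """
    return abs(observed_af - expected_af) <= rel_tol * expected_af


def flag_outliers(expected, observed, rel_tol=0.2):
    """Return names of variants whose observed AF falls outside the window."""
    return [
        name
        for name, exp_af in expected.items()
        if not within_tolerance(exp_af, observed[name], rel_tol)
    ]


# Hypothetical allelic frequencies in percent, for illustration only.
expected = {"variant_A": 5.0, "variant_B": 2.0, "variant_C": 1.0}
observed = {"variant_A": 5.4, "variant_B": 1.2, "variant_C": 1.1}

print(flag_outliers(expected, observed))
```

Running the same check on a mildly treated and a harshly treated lot would show which variants drift outside the window only after severe formalin treatment.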
As I mentioned earlier, the impact of your sample variability is really affected by the approach that you take in preparing your sample for sequencing, so now we’re going to dive a little bit deeper into that library preparation step. While our knowledge of the human genome has been significantly advanced following the completion of the Human Genome Project, there are still many regions that are either not well understood or very repetitive. As a result it’s preferred to interrogate a reduced representation of the genome, focussing specifically on the exonic regions or on smaller targeted regions of interest, with short 100-300 base pair sequencing reads. And given the high-throughput nature of NGS, you can introduce multiple samples into a single sequencing run, which makes each assay more affordable and more efficient. If we think about the primary ways of approaching targeted sequencing, we can consider firstly solution-based enrichment, which uses panels to capture specific regions, and secondly targeted PCR to generate short amplicons. Reference standards can play an integral role during your protocol optimisation, especially when focussing on sequencing library preparation, because you are able to have a standard that mimics the concentration, the genomic diversity and the fragmentation level of your actual clinical sample. You want to ensure that there are no competing regions for probe or primer binding, and that the input quantity, quality and concentration yield the optimal library output. If we again consider the formalin-compromised reference standards, we can compare whether it would be beneficial to use a PCR-free library, for example, to eliminate amplification bias, or really just better understand how formalin treatment affects the amplicon generation step. Horizon reference standards are all derived from genomic DNA from cell lines, which makes them much more realistic than a plasmid or a synthetic standard.
Parameters you may want to evaluate at the library stage of your assay optimisation include how your library protocol performs in providing adequate sequencing coverage over your region of interest, whether your PCR amplification steps are introducing unwanted sequence changes, whether certain regions are preferentially amplified, and also the many potential effects of primer bias. You may have the option of moving to a PCR-free library kit, and an alternative is introducing molecular barcodes, such as in NuGEN’s Single Primer Enrichment Technology.
Importantly, if you want to detect some of those more difficult or rarer mutations, such as larger insertions and deletions, copy number variants or translocations, you really want to evaluate whether this is feasible with your library protocol and your overall pipeline. So what we’ve done is combine each of these different types of variants into our early access HD-753, which we call our Structural Multiplex DNA Reference Standard. There you can use a single sample with well-defined genomic complexity to evaluate that workflow. Importantly, this standard also allows you to evaluate the genomic context with regard to GC content, which is important in terms of both the library protocol and the sequencing platform itself.
To move onto the next stage in the NGS process we also have to consider the bias associated with the platform itself. Once the sample has been prepared for sequencing there are additional molecular steps that occur on the machine in the actual data generation step of the assay.
Now we’re going to dive into the three platforms. There are three platform manufacturers most commonly used in laboratories around the world. These include the Illumina series, which includes the MiSeq, NextSeq and HiSeq, the X Ten and most recently the X Five; then the Ion Torrent Personal Genome Machine, also known as the PGM, and the Ion Proton; and finally the PacBio platform. Each of these has advantages in terms of read length, output, multiplexing and overall versatility, but across the board there are platform biases that may occur. Some of these come from user error, where the platform is overloaded and doesn’t perform properly, which results in poor quality data. You also have machine failure and reagent issues, which are ubiquitous across platforms, but there are also platform-specific biases that occur due to differences in the chemistry that each one uses. Illumina’s chemistry makes use of clonal amplification to generate clusters of short fragments, which is shown in the figure on the left, and these are fluorescently detected using sequencing by synthesis. As the terminal base is deprotected, all four fluorescently labelled nucleotides flow over the flow cell along with the polymerase. Following a single base incorporation, the flow cell is imaged to detect which base was incorporated into each cluster. A hypothetical image of this is on the top right, and an actual image from the machine is on the bottom. The process then repeats itself for a number of cycles; the number of cycles determines the read length, and the data from each cluster is grouped together and considered one read. In this approach there are a few places where error can occur. First, you may have substitution errors that arise when incorrect bases are incorporated during that clonal amplification, which could then be propagated throughout the cluster.
Additionally, Illumina sequencing has been found to have a sequence-specific error profile, which is thought to be a result of single-stranded DNA folding or a sequence-specific difference in enzyme preferences.
For more information on this I would recommend checking out the paper by Nakamura which goes through this in a little bit more detail.
When we consider Ion Torrent sequencing, where again you see sequencing by synthesis, here fragments are prepared using emulsion PCR, again generating a clonal set of fragments, which are then deposited into microwells on the sequencing chip. Rather than making use of labelled nucleotides and optics during synthesis and detection, here nucleotides are flowed over the chip one at a time, again with the polymerase, and as positively charged protons are released on incorporation, they are electronically detected using a sensor in each of the wells. So the number of hydrogen ions released, and therefore the signal, is in proportion to the number of nucleotides incorporated. With this platform, if repeats of the same nucleotide, known as homopolymers, are being sequenced, you have multiple hydrogen ions released in a single cycle. This results in an increased electronic signal, which can often be difficult to quantify for long repeats. This limitation is shared by other technologies that detect single nucleotide incorporations, such as pyrosequencing on the Roche 454 platform. This homopolymer bias results in insertion and deletion errors in the sequencing data.
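To make the homopolymer point concrete, here is a small Python sketch that lists homopolymer runs in a sequence, which is one way a reader could flag regions of a target panel where insertion and deletion calls deserve extra scrutiny. The minimum run length of 4 and the example sequence are arbitrary assumptions for illustration, not platform specifications.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of identical bases.

    Only runs of at least min_len bases are reported; start is 0-based.
    min_len=4 is an illustrative threshold, not a platform limit.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Hypothetical sequence: ACG, six T, GC, four A, C.
example = "ACG" + "T" * 6 + "GC" + "A" * 4 + "C"
print(homopolymer_runs(example))
```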
So one of the newest technologies, the PacBio platform, also uses a modified version of sequencing by synthesis. However, the difference here lies in there being a single polymerase enzyme in each well, so single molecules are sequenced, and this is why PacBio is also called Single Molecule Real Time, or SMRT, sequencing. Here the polymerase is immobilised at the bottom of what is known as a zero-mode waveguide, or ZMW. As fluorescently labelled nucleotides diffuse into the ZMW chamber, illumination is directed to the bottom of the ZMW, and the nucleotide held by the polymerase prior to incorporation gives an extended signal. That allows the base that is being incorporated to be identified. When that nucleotide is incorporated by the polymerase, the fluorophore is cleaved and diffuses away. The specific base that is called is based on the fluorescent tag that is detected, and so the software generates a set of reads for each molecule. PacBio SMRT sequencing gets impressive read lengths, averaging 10-15 kb. However, if the fluorescence signal is not detected, due to quenching or non-labelled nucleotides, false insertions and deletions may be incorporated into those reads. Due to the single-molecule nature of this technology, these errors are more prevalent than in other technologies such as Illumina or Ion Torrent, where clusters or droplets of molecules are sequenced. This results in the error rate of the PacBio machine being inappropriate for the detection of low allelic frequencies, as they may be very difficult to disambiguate from noise. Furthermore, the output of the PacBio machine is not yet at the same level as the Ion Torrent or Illumina sequencers, but this does continue to improve.
I will note though that the PacBio does show impressive resistance to GC bias. In the graph on the right we see the relative coverage, that is the GC bias, for the MiSeq, Ion Torrent and PacBio in a couple of different genomes with varying GC content. Here unbiased coverage would be represented by a horizontal line with a relative coverage equal to 1, which is shown as the dashed flat line. So PacBio does have significant advantages in this area.
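The relative coverage described here can be sketched in a few lines of Python: compute the GC fraction of each window and divide each window's depth by the mean depth, so that unbiased coverage sits at 1. This is only an illustration under simple assumptions (fixed windows, mean normalisation); the window sequences and depths below are invented, and the study behind the graph may have normalised differently.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def relative_coverage(depths):
    """Normalise per-window depth by the mean depth; unbiased coverage is 1.0."""
    mean_depth = sum(depths) / len(depths)
    return [depth / mean_depth for depth in depths]


# Hypothetical windows and read depths, for illustration only.
windows = ["ATATATAT", "GGCCAATT", "GCGCGCGC"]
depths = [50, 100, 150]

for seq, rel in zip(windows, relative_coverage(depths)):
    print(f"GC={gc_fraction(seq):.2f} relative coverage={rel:.2f}")
```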
While we see that the PacBio coverage levels are the least biased, and second to that would be the Illumina, all technologies exhibit error-rate biases in high- and low-GC regions, and especially in long homopolymer runs. By combining data from two technologies you can reduce the coverage bias if the biases in the component technologies are complementary and similar in magnitude. In the data I’ve shown here, cross-platform replicates were compared: these DNA samples were from blood and saliva and were sequenced on two different platforms, the Illumina and the Complete Genomics platform. This resulted in an 88.1% concordance of SNPs across these replicates. In another study that considered the Illumina, the Roche 454 and SOLiD sequencing, they observed 64.7% concordance. I think these two studies demonstrate the importance of considering the platform bias that you have in the pipeline that you’ve designed.
Aside from cross-platform replicates, in what other ways can replicates help mitigate the errors associated with our NGS assay? There are biological replicates, which in this figure are represented by R, samples that are independently prepared under the same conditions from the same host, as well as technical replicates, which are usually considered a repeat analysis of the exact same sample, so in the case of NGS, repeat runs of the same sequencing library would be your technical replicates. Both of these have significant utility in protocol optimisation, protocol validation and even later in your sample analysis. Biological and technical replicates can be used to assess the specificity and sensitivity of your sequence variant calling methods, really in a manner that is independent of the algorithms and the chemistry used to call the variants, which helps you understand your quality score thresholds or the parameters that you’re choosing.
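One simple way to put a number on cross-platform agreement is sketched below in Python: concordance is taken as the number of SNP calls shared by both platforms divided by the number of calls made by either. This definition (intersection over union) is an assumption for illustration; the studies quoted above may have defined concordance differently, and the calls shown are invented.

```python
def concordance(calls_a, calls_b):
    """Shared calls divided by all calls made by either platform.

    Each call is a (chrom, pos, ref, alt) tuple. Intersection over union
    is one possible definition, assumed here for illustration.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 1.0  # no calls on either platform: trivially concordant
    return len(a & b) / len(union)


# Hypothetical SNP calls from two platforms on the same sample.
platform_a = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
platform_b = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")]

print(f"{concordance(platform_a, platform_b):.1%}")
```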
Technical replicates are usually considered a repeat analysis of the exact same sample, and here what we’re using are the same sequencing libraries sequenced on the same machine, using the same parameters. When you run these technical replicates during your initial pipeline development, or when you’re evaluating new protocols or new sample types, you’ll obtain information on the degree of noise in your pipeline. However, most laboratories would prefer not to use precious sample for these types of upfront experiments. Using Horizon’s multiplex reference standards allows you to perform these experiments with a renewable, cost-effective resource. With these standards you can determine the sensitivity, specificity and limit of detection of your particular protocol, which helps you avoid false variant calls and also assists with data interpretation and the confidence that you have. Furthermore, as we mentioned in the Q&A session, Horizon standards are validated using a completely orthogonal technology, droplet digital PCR, and that’s really crucial for understanding the bias in your sequencing protocol.
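As a rough illustration of how sensitivity and specificity could be computed when a reference standard supplies the truth set, here is a minimal Python sketch. It assumes you have a set of positions known to carry a variant in the standard and a set of interrogated positions known to be wild type; the identifiers and the calls are hypothetical.

```python
def sensitivity_specificity(truth_positive, truth_negative, called):
    """Return (sensitivity, specificity) for a set of variant calls.

    truth_positive: variants known to be present in the standard.
    truth_negative: interrogated sites known to be wild type.
    called: variants reported by the pipeline.
    """
    tp = len(truth_positive & called)
    fn = len(truth_positive - called)
    fp = len(truth_negative & called)
    tn = len(truth_negative - called)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical truth sets and pipeline calls, for illustration only.
truth_positive = {"v1", "v2", "v3", "v4"}
truth_negative = {"n1", "n2", "n3", "n4", "n5", "n6"}
called = {"v1", "v2", "v3", "n1"}

sens, spec = sensitivity_specificity(truth_positive, truth_negative, called)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Repeating the calculation on standards with progressively lower allelic frequencies gives a simple view of where sensitivity starts to drop, which is one way to approach a limit of detection.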
To wrap up the NGS workflow I just wanted to briefly mention again the role bioinformatics analysis can play in introducing variability. I would recommend checking out under Scientific Support our archived webinars where you’ll find our bioinformatics webinar that was held last month.
To come back around to the figure I showed at the very beginning of our discussion: if we make use of reference standards to refine both the upfront molecular protocols and the downstream analytical pipeline, we can help improve the cost and utility of next generation sequencing. Once we’ve accomplished that, it’s really only the downstream analysis that remains, and as the authors note in this paper, linking a genetic variant to a phenotype, particularly a clinically relevant one, requires a significant amount of expertise and effort: first in identifying these highly complex variants, then in estimating their functional impact, and finally in being able to select, from the functional ones, those that are correlated with the phenotype. What you need here is an interdisciplinary team of bioinformaticians, technicians, geneticists, biologists and physicians, all on hand to translate the information in the primary data into useful knowledge about the impact of the genomic variants in the biological system. This can take weeks and months of extensive experimental validation using things like animal models or cell lines. Here at Horizon we are a multidisciplinary team with experts spanning the range of translational genomics.
So at this point I want to hand things back to Jonathan, to briefly introduce how our team fits in to the Horizon offering and the other resources that we have available.
JF: Brilliant, thank you Natalie, that’s a fantastic set of slides. You almost enjoyed that. We’ll start wrapping up with 4 or 5 slides now, just to bring it all back together. At Horizon we are a fully translational company, and we’ve got services and products that range all the way from genomics to translational genomics, through to personalised medicine. The different areas we support and work in are basic research, drug discovery, drug development, drug manufacture and molecular diagnostics. Typically we’ve always focussed these webinars on the molecular diagnostic space; I should point out that our products are for research use only and should be used appropriately. However, because of the nature of this webinar and what is possible with next generation sequencing, there may actually be other interesting areas that we can support you in, and you may want to dive in and ask us specific questions on cell lines for basic research, or in vivo models, generation of knockout mice and all of that stuff.
So that brings me round to the question that I asked at the start, and that is: what is the impact of assay failure in your laboratory, and how do you monitor for it? On the molecular diagnostic and molecular assay side of things, there are three questions that we are commonly asked about, and these are questions we are very passionate about, and very good at, helping laboratories to answer. They are: what extraction and quantification methods are you using? What is the true limit of detection of your workflow? And is the impact of formalin treatment interesting to you? What I’ve got here is a nice summary slide of the 5 different groups of reference standards that we’ve developed at Horizon. It’s a bit of a teaser slide, so if you want to know more about it then please do get in touch with us. On the y-axis we’ve got the sample features, so for us that’s how many mutations we have included in the sample, and on the x-axis you have sample complexity, so that’s going into how well plated that sample is, or whether we’ve got a lot of indels, translocations or large chromosomal deletions, that sort of thing. The product we’ve talked a lot about is the quantitative multiplex reference standard, which is the circle second from the left, labelled QMRS. That’s got a good mix of complexity, with an EGFR deletion and a known frame shift within the cell lines, which are sometimes useful for testing the complexity of the next generation sequencing workflow. For people looking to dive into limit of detection and identify how good their assay is, we have gene-specific multiplexes, and they’re available as DNA or FFPE. We’ve got our Tru-Q DNA range, which has a lot of features: that range covers 40 different mutations and there are 3 different tiers to it, a 5% tier, a 2.5% tier and a 1% tier. We are very proud to be working within the Genome in a Bottle Consortium running out of NIST, we have relea