How to get the most out of your NGS data with OncoSpan

The OncoSpan product range offers the world’s largest oncology reference standard set with a companion high-coverage, batch-specific in silico NGS data set.

We provide a list of high-confidence variants across a defined set of genes as well as BAM files containing sequence coverage for the entire exome, so you can use the data to suit your needs and additional applications.

In this article we explore how to examine your data to investigate false positives and negatives, and potential downstream applications such as analyzing tumor mutational burden and microsatellite instability.

Identifying true positives

OncoSpan’s list of over 380 high-confidence variants can be used to identify true positives in your own NGS data. High-confidence variants are those observed by whole exome sequencing (WES) in three batches of the OncoSpan reference standard during product development. Twenty-five of these variants are verified by ddPCR in each batch during manufacturing and quality control.

Identifying false positives

A false positive is a variant which is detected in your data but is not actually present in OncoSpan.

If you detect a variant that is not on the high-confidence variant list, it is not necessarily a false positive. Only variants in genes on the OncoSpan gene list were considered for the high-confidence variant list. This means a variant could have been present in all three OncoSpan batches tested but was never considered for the list. As our reference material is cell line-derived, it naturally carries many variants in addition to those in genes on the OncoSpan gene list.

To identify false positives, check whether the variant is in the batch-specific BAM file or in the low-confidence VCF file, which is the raw output from the variant caller. The exome-level BAM contains the filtered, aligned reads for your batch of OncoSpan and can be used with tools like Integrative Genomics Viewer (IGV)^1,2 to investigate variants at a deeper level.

If the variant is not present in either of these files, you can use the troubleshooting tips below.
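The triage described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the OncoSpan deliverables: the function name `triage_call`, the category labels, and the toy variants are all hypothetical, and variants are assumed to be represented as `(chrom, pos, ref, alt)` tuples that have already been read from the relevant VCF files.

```python
def triage_call(variant, high_confidence, low_confidence):
    """Sort one called variant into a rough category (labels are illustrative)."""
    if variant in high_confidence:
        return "true positive"
    if variant in low_confidence:
        # Seen in the raw variant-caller output for the batch, but not on
        # the curated list (for example, a gene outside the gene list).
        return "in batch data, not on high-confidence list"
    return "candidate false positive"


# Toy data invented for illustration; these are not real OncoSpan variants.
high_conf = {("chr1", 100, "A", "G")}
low_conf = {("chr1", 100, "A", "G"), ("chr3", 300, "G", "C")}

for call in [("chr1", 100, "A", "G"), ("chr3", 300, "G", "C"), ("chr5", 500, "T", "A")]:
    print(call, "->", triage_call(call, high_conf, low_conf))
```

A call that lands in the last category is only a candidate: it is still worth inspecting the aligned reads in the BAM file before treating it as a false positive.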

Investigating false negatives

A false negative occurs when a variant is on the OncoSpan high-confidence list but is not detected in your data.

The simplest way to identify false negatives is to use a tool like BEDTools^3,4 to compare the high-confidence variants list to your own VCF-formatted file. OncoSpan VCF files contain these variants along with the batch-specific allelic frequency and read depth.
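If you prefer to script the comparison, the same idea can be sketched in plain Python. This is a simplified stand-in for a dedicated tool, assuming that both files use the same reference build and chromosome naming and that indels are represented identically in both; the function names and the toy records are hypothetical.

```python
def parse_vcf_keys(lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF text lines."""
    keys = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, alt))
    return keys


def find_false_negatives(reference_lines, sample_lines):
    """Variants on the reference list that are absent from the sample calls."""
    return sorted(parse_vcf_keys(reference_lines) - parse_vcf_keys(sample_lines))


# Toy records invented for illustration; these are not real OncoSpan variants.
reference = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr2\t200\t.\tC\tT\t.\tPASS\t.",
]
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
]
missing = find_false_negatives(reference, sample)
print(missing)  # the chr2 variant is the only one not found in the sample
```

With real files, you would pass an open file handle for each VCF instead of the in-memory lists. Exact matching is deliberately strict, so a variant reported with a different but equivalent indel representation would show up here as missing.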

Troubleshooting false positives, negatives and differing allelic frequencies

Sequencing coverage and read depth

Lower read depths may not be sufficient to detect variants present at low allelic frequencies. Example: at 30x coverage, observing allelic frequencies below 10% will be difficult, since a call at 10% would require at least 3 of the 30 reads to contain the variant.

Low read depth at the locus of interest can also lead to higher or lower allelic frequencies than expected. Example: if only 10 reads align to a given region, the difference between 1 and 2 observations of a specific variant changes the allelic frequency by 10 percentage points.

You might also encounter GC bias during sequencing, where GC-balanced regions are favored and GC-poor or GC-rich regions are disfavored. This can lead to uneven coverage, or even no coverage, of some regions. To resolve this, you may need to increase sequencing depth.
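The read-depth arithmetic in these examples can be made concrete with a short Python sketch. It assumes a simple binomial model in which each read independently carries the variant with probability equal to the true allelic frequency, and it ignores sequencing error and mapping artifacts; the function names and the choice of a 3-read minimum are illustrative only.

```python
from math import comb


def min_detectable_af(depth, min_alt_reads):
    """Lowest observable allelic frequency if a call needs `min_alt_reads` reads."""
    return min_alt_reads / depth


def prob_at_least(depth, true_af, min_alt_reads):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, true_af)."""
    p_fewer = sum(
        comb(depth, k) * true_af**k * (1 - true_af) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# At 30x with a 3-read minimum, the lowest callable frequency is 3/30.
print(min_detectable_af(30, 3))

# Chance of actually seeing >= 3 variant reads for a true 10% variant,
# at 30x and at a deeper 300x, under the binomial assumption.
print(round(prob_at_least(30, 0.10, 3), 3))
print(round(prob_at_least(300, 0.10, 3), 3))

# At 10x, one extra variant read moves the observed frequency by 0.1.
print(2 / 10 - 1 / 10)
```

Under this model, a variant at a true 10% frequency is far from guaranteed to reach three supporting reads at 30x, which is one reason a listed variant can be missed at low depth.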

Variant caller and parameters

For all OncoSpan products, we use the Illumina DRAGEN Somatic Pipeline. Each variant caller calculates read depth and allelic frequencies differently, and has different parameters and minimum thresholds built in. Insertion or deletion (indel) size can cause problems for variant callers as large indels can be difficult to call.

Follow any of these steps to investigate further:

• Load the BAM-formatted file into a tool like IGV to confirm the presence or absence of an allele

• Use the OncoSpan BAM file with your own variant caller for a more direct comparison between the OncoSpan data and your own

• Run the raw FASTQ files through your variant caller workflow

Sequencing kit

All OncoSpan samples were sequenced using the Agilent SureSelect Human All Exon V6 kit. Depending on the sequencing kit used, it is possible that a particular gene region is not covered. You can use a tool like IGV to look up reads in a particular region in your BAM file.
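If you have the target regions for your own kit as a BED file, a quick scripted check can tell you whether a variant position is targeted at all. The sketch below assumes standard BED coordinates (0-based start, end exclusive) and 1-based variant positions as used in VCF; the function names, interval values, and chromosome are made up for illustration and do not come from any real kit design.

```python
def load_bed(lines):
    """Read BED text lines into {chrom: [(start, end), ...]}."""
    targets = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments and browser/track header lines
        chrom, start, end = line.split()[:3]
        targets.setdefault(chrom, []).append((int(start), int(end)))
    return targets


def is_targeted(chrom, pos, targets):
    """True if 1-based position `pos` falls inside any target interval.

    BED intervals are 0-based with an exclusive end, so the interval
    (start, end) covers 1-based positions start + 1 through end.
    """
    return any(start < pos <= end for start, end in targets.get(chrom, []))


# Hypothetical target intervals, for illustration only.
kit_targets = load_bed([
    "track name=example_targets",
    "chr7\t1000\t1200",
    "chr7\t5000\t5300",
])
print(is_targeted("chr7", 1100, kit_targets))  # inside the first interval
print(is_targeted("chr7", 3000, kit_targets))  # between intervals
```

A position that is not targeted by your kit cannot be reported as a variant in your data, so it should be set aside before counting false negatives.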

Strand bias

This occurs when the genotype inferred from the forward strand differs from the genotype inferred from the reverse strand, and it can lead to false positives. Example: the forward strand genotype is heterozygous, and the reverse strand genotype is homozygous. When the genotype is heterozygous, you expect to see this in both forward and reverse strands. If the alternate allele is only seen on one strand, it may be a false positive.

Suspected strand bias can be investigated with tools like IGV.
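One common way to put a number on suspected strand bias is Fisher's exact test on the per-strand counts of reference and alternate reads. The sketch below is a generic, standard-library implementation offered as an illustration; it is not the method used to produce the OncoSpan data, and the read counts are invented. In practice you would collect the four counts yourself, for example by inspecting the reads at the locus in IGV.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact p-value for the 2x2 strand table.

    Rows are alleles (ref, alt); columns are strands (forward, reverse).
    """
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def prob(x):
        # Hypergeometric probability of x forward-strand reference reads.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / comb(total, col_fwd)

    p_observed = prob(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Invented counts: alternate allele seen on the forward strand only.
p_biased = fisher_exact_two_sided(ref_fwd=20, ref_rev=20, alt_fwd=10, alt_rev=0)
# Invented counts: alternate allele split evenly across strands.
p_balanced = fisher_exact_two_sided(ref_fwd=20, ref_rev=20, alt_fwd=5, alt_rev=5)
print(p_biased, p_balanced)
```

A small p-value means the alternate allele is distributed across strands very differently from the reference allele, which supports treating the call with suspicion. Any cut-off you apply is a judgment call and depends on depth at the locus.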

For both false positives and negatives, if these suggestions do not resolve your issue, it may be necessary to perform a follow-up analysis using ddPCR and Sanger sequencing.

Calculating MSI and TMB

Microsatellite instability (MSI)

The OncoSpan sequence data covers the entire exome, so you can use the BAM-formatted file for calculating both MSI and TMB. However, calculating MSI can be challenging, as software tools typically require a normal sample for comparison. One option is to use the sequence data from a normal sample in the 1000 Genomes Project (e.g., NA12878). You can also rely on your own in-house controls and software, or focus on software tools that work with tumor-only samples.

Horizon does not currently offer MSI values for the OncoSpan products, but we do provide both MSI and MSS (microsatellite stability) reference standards in our catalog. The microsatellite status of these cells is confirmed using PCR and probe-based fluorescence (not NGS).

Tumor mutational burden (TMB)

As the OncoSpan reference standard is cell line-derived, you can use it for analyzing TMB. There are currently few tools available for calculating TMB, but it is recommended to use the entire exome with no fewer than 500 genes – which the OncoSpan sequence files would cover.

Note: we do not provide TMB values within the OncoSpan products.
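For orientation, TMB is usually reported as the number of qualifying mutations per megabase of sequence assessed. The sketch below shows only that final arithmetic step. Which variants qualify (for example, whether synonymous or likely germline variants are excluded) and how the assessed region is defined differ between tools, so both inputs are assumptions you would need to set for your own pipeline, and the numbers shown are invented.

```python
def tumor_mutational_burden(mutation_count, covered_bases):
    """Mutations per megabase of adequately covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)


# Invented example: 350 qualifying mutations across 35 Mb of covered exome.
tmb = tumor_mutational_burden(mutation_count=350, covered_bases=35_000_000)
print(tmb, "mutations/Mb")
```

Because the denominator is the region that was actually assessed, the same mutation count yields a different TMB on a small panel than on a whole exome.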

References

1. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178-192. doi:10.1093/bib/bbs017
2. Robinson JT, Thorvaldsdóttir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24-26. doi:10.1038/nbt.1754
3. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842. doi:10.1093/bioinformatics/btq033
4. Quinlan AR. BEDTools: The Swiss-Army tool for genome feature analysis. Curr Protoc Bioinforma. 2014;2014:11.12.1-11.12.34. doi:10.1002/0471250953.bi1112s47