NGS Quality Concerns
By Hans Karten, CEO/CTO at GENALICE
We see a consistent wave of innovations, ideas and new products in the genomics NGS space. It means that this is serious business and, more importantly, it means that when people shift their minds into gear, there seems to be no limit to what can be mastered.
A main theme that keeps coming back is quality, with a number of interesting angles and approaches. Time and space do not allow traversing the full quality spectrum, including user interfaces and point devices. In this article we will concentrate on quality of content, and not on deployment and application.
Why so important?
For research, quality means one can make solid new discoveries. Better sensitivity and precision of variant detection, better coverage, lower noise levels at the data source and larger cohort sizes improve the ability to find biomarkers and disease causes. This is very important, as the clinical application of new, shelved and approved drugs can be much improved by finding more precise qualifying or disqualifying criteria (predictive biomarkers).
For the clinical application itself, quality means a clear and reliable exposure of clinically relevant mutations. Excellent data quality is a required foundation, yet in general the focus is not on new finds but on finding known mutations with high certainty. A physician can only prescribe registered or patient-consented drugs, and these are related to ‘known’ DNA profiles.
So, coming from different perspectives, both fields are well served by improved quality of data, the foundation for decision-making. Initiatives to reach higher quality levels are manifold. I would like to touch briefly on new chemicals, a ‘precision’ reference, long reads and population calling.
New chemicals can have a very interesting effect. We have seen quite a jump in quality, resulting in improved concordance and precision (GIAB 2.18/2.19), for the Illumina HiSeq X samples we processed after November 2015 (even at 30x coverage!). What is interesting in these samples is that we see more INDELs being picked up by multiple pipelines.
This could mean that these INDELs are not in the ‘truth’ set, as it is clear that the truth set has been compiled with previous versions of the chemicals. This could provide a new pool of mutations to investigate. It could also mean that more non-random errors are introduced in e.g. repeat areas. That would introduce false positives, which will blur the picture.
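To make the comparison against a truth set concrete, here is a minimal sketch in Python. It assumes variants are represented as simple (chromosome, position, ref, alt) tuples and that calls match the truth set only when identical; real benchmarking tools also normalize INDEL representations, which this sketch does not. The function names, pipeline names and example variants are hypothetical and only serve as illustration.

```python
from collections import Counter


def benchmark(calls, truth):
    """Return (precision, sensitivity) of a call set against a truth set."""
    calls, truth = set(calls), set(truth)
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


def recurrent_novel_calls(pipelines, truth, min_pipelines=2):
    """Variants absent from the truth set but reported by several pipelines.

    These are the candidates discussed above: either real variants missing
    from the truth set, or non-random errors shared across pipelines.
    """
    truth = set(truth)
    counts = Counter(v for calls in pipelines.values() for v in set(calls))
    return sorted(
        v for v, n in counts.items() if n >= min_pipelines and v not in truth
    )


# Hypothetical example data (not taken from any real sample).
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "CT", "C")}
pipelines = {
    "pipeline_a": {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "CT", "C"),
        ("chr1", 300, "G", "GAA"),
    },
    "pipeline_b": {
        ("chr1", 100, "A", "G"),
        ("chr1", 300, "G", "GAA"),
        ("chr1", 400, "T", "C"),
    },
}

if __name__ == "__main__":
    for name, calls in pipelines.items():
        print(name, benchmark(calls, truth))
    print(recurrent_novel_calls(pipelines, truth))
```

In this toy data, the insertion at position 300 is reported by both pipelines but is not in the truth set. The counts alone cannot tell whether it is a real variant or a shared artifact, which is exactly the ambiguity described above.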
A precision reference is another method to improve quality. This is a reference that is ‘closer’ to the target data than a general reference. GRCh38, for example, provides alternative haplo-blocks of both human and non-human sequences. These blocks can be used as decoys for reads that might otherwise cause a mapper decision conflict. A precision reference would ultimately be generated from a graph-based reference that is designed to unify all species in one genetic ‘Tree of Life’.
By using the reference that has the shortest genetic distance to the sample, mapping short reads to the reference will be less sensitive to errors, and the delta will give better signals. Combining ‘better’ chemicals with this approach would not only give much clearer information but could also reduce the coverage requirement, which makes the process quicker, cheaper and more predictable.
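As a rough illustration of what ‘shortest genetic distance’ could mean in practice, the sketch below picks the candidate reference whose k-mer content is closest to the sample. The k-mer Jaccard distance is an assumption chosen for simplicity, not the measure any particular product uses, and the sequences and names are hypothetical toy data.

```python
def kmers(seq, k=4):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=4):
    """1 minus the Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def closest_reference(sample, references, k=4):
    """Name of the reference with the smallest k-mer distance to the sample."""
    return min(
        references,
        key=lambda name: jaccard_distance(sample, references[name], k),
    )


# Hypothetical toy sequences; real references are billions of bases long.
sample = "ACGTACGTTAGC"
references = {
    "general": "ACGTTCGATAGC",
    "precision": "ACGTACGTTAGG",
}

if __name__ == "__main__":
    print(closest_reference(sample, references))
```

With real data the comparison would be made on sketches or on variant profiles rather than on full k-mer sets, but the selection principle is the same: the smaller the distance, the fewer differences the mapper has to explain.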
Then we find ourselves cornered by the difficulty of mapping short reads to repeat areas. Chemicals and precision references fall short in solving that issue. People have put effort into defining, for example, a Genome Mappability Score (GMS) to guide mappers. The goal of such efforts is to prevent errors. In itself that is a good thing, yet the effects of pseudogenes and the function of repeat areas upstream of coding areas remain under-investigated.
For these areas, long reads are the answer. They provide ‘context’, allowing better-informed choices. Products from Oxford Nanopore, PacBio and 10X Genomics are in the spotlight to fill this gap. Some may even challenge the dominance of Illumina in the sequencing space. Of course that will not happen overnight, but then again, in an exploding and very fast-moving market anything goes. I still remember the early 1990s, when Oracle, after 13 years in business, almost went bankrupt in a fast-growing market that hit an unexpected slowdown. A perfect storm is often predicted retrospectively.
Back to our ‘quality’ topic: long reads do improve the mapping quality and contribute to the overall quality of the foundation for analysis. Right now, I think a blend of short and long reads with the same ‘signature’ will work best. The 10X Genomics approach could work well using pseudo long reads, provided they overcome the hurdles from concept to stable product (as any company in this space must) and prove capable of data processing that is fast and accurate enough to compete with the likes of PacBio and Oxford Nanopore. Right now, the mainstream application for long reads lies in research and plant genomics. Having said that, PacBio is making a huge effort to gain market share in the clinical space. The main theme is increased certainty in detecting clinically relevant mutations.
The power of Population Calling
We close with Population Calling. This is a method that encapsulates all of the above. It is both the cherry and the bottom of the pie. The method takes multiple samples and blends them together in a single Variant Map. The Variant Map provides a context against which each sample can be compared. This allows quality and certainty to be raised, independent of the source of the data, the chemicals, the reference or the choice between short and long reads.
When used for same-source samples, it will be easy to detect and filter out artifacts, for example those introduced by new chemicals as explained earlier. It also allows filtering out systematic errors stemming from a sequencing platform (e.g. blended long reads) or a pipeline. Population calling is a positive double-edged sword. It improves signal quality by context-driven signal amplification and suppression. In doing so, it also highlights unnatural patterns. These could be caused by errors in the reference, or by artifacts of the data production or data processing system. The Population Caller will provide both a genetic profile and a filter, which can be used for single-sample calling. It provides a direct and specific quality improvement for any pipeline.
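The idea of a shared Variant Map that doubles as a filter can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm of any actual Population Caller: it assumes variants are (chromosome, position, ref, alt) tuples, treats any variant carried by nearly every same-source sample as a suspected systematic pattern, and uses a hypothetical threshold of 95%. All names and data are made up for the example.

```python
from collections import Counter


def build_variant_map(samples):
    """Map each variant to the fraction of samples that carry it."""
    counts = Counter(v for calls in samples.values() for v in set(calls))
    n = len(samples)
    return {v: c / n for v, c in counts.items()}


def suspect_filter(variant_map, max_fraction=0.95):
    """Variants seen in (nearly) all samples: candidate systematic patterns.

    The threshold is an assumption; such variants may be reference errors,
    platform artifacts, or simply very common true variants.
    """
    return {v for v, f in variant_map.items() if f >= max_fraction}


def apply_filter(calls, suspects):
    """Single-sample call set with the suspected variants removed."""
    return sorted(set(calls) - suspects)


# Hypothetical same-source cohort.
A = ("chr2", 500, "C", "T")
B = ("chr2", 900, "G", "A")
X = ("chr2", 1234, "TA", "T")  # present in every sample
samples = {
    "s1": {A, X},
    "s2": {B, X},
    "s3": {A, X},
    "s4": {X},
}

if __name__ == "__main__":
    variant_map = build_variant_map(samples)
    suspects = suspect_filter(variant_map)
    print(apply_filter(samples["s1"], suspects))
```

A production system would weigh genotype likelihoods and read evidence rather than simple presence or absence, but even this toy version shows the two outputs named above: a population-level profile (the map) and a filter that can be applied to a single sample.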
When dealing with the constant wave of innovations in the genomics space, of which we highlighted a few obvious ones, one needs a safeguard to catch the unexpected. Population Calling provides the means to measure, improve and assure the quality of the end result. To me it is a no-brainer that every quality lab, every quality research center (human, plant and animal) and every quality hospital should use Population Calling for quality assurance of their NGS data processing pipeline, or have it embedded in a purchased service.
Agree or disagree? I would love to hear your opinion …