Genomics at Scale – Building Reproducible, Scalable Workflows for Genome-Based Medicine
When considering the clinical applications of computational genomics, the true challenge stems from both the nature of the processed genomic data and the purpose behind processing it. The clinical context renders analyses useful only if their results meet rigorously defined standards and can be provided on time.
Scalability to tens of thousands of samples is a key issue in genetics. It is important to automate the process that includes variant prioritisation, variant description, and reporting following recommendations from bodies like the Association for Molecular Pathology and the American College of Medical Genetics and Genomics.
Reproducibility and precise versioning are crucial requirements of the College of American Pathologists (CAP). A pipeline that is precisely versioned and complete can run in the cloud without supervision by a bioinformatician. Versioning also lets the user manage upgrades and documentation, both of which are covered by the CAP Next Generation Sequencing (NGS) Laboratory Requirements.
In recent years, several languages have appeared that are designed specifically for defining bioinformatic workflows in a platform-agnostic manner, such as WDL, CWL or Nextflow DSL. Pipelines defined in those languages can now be run on cloud computing platforms including AWS, Google Cloud Platform, Alibaba Cloud and more, thanks to engines like Cromwell or Nextflow. At Intelliseq we use WDL, which allows our pipelines to be run not only on generic public clouds but also on genomics-specific clouds like DNAnexus or DNAstack.
One of the crucial ideas that pushed the field of computational genomics forward was containerisation. All tools can now be packaged inside lightweight, isolated environments called Docker containers. This allows tools to be run without installation on any platform that has Docker Engine installed. It is enough to download a Docker image containing all the tools required for the computation and run it in a local or cloud environment. At Intelliseq, we include in Docker containers not only the tools, but also the data sources required for computation, such as the reference genome or variant annotations. This allows us to achieve precise versioning of our pipelines and thus meet CAP requirements for bioinformatic pipelines.
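To illustrate why bundling tools and data sources together makes precise versioning possible, here is a minimal Python sketch. It assumes a pipeline is described by a manifest that pins every container image and data source, and derives a single fingerprint from it. The manifest layout, the image names and the `pipeline_fingerprint` helper are hypothetical and are not taken from Intelliseq's pipelines.

```python
import hashlib
import json


def pipeline_fingerprint(manifest: dict) -> str:
    """Return a deterministic SHA-256 fingerprint of a pipeline manifest.

    Keys are sorted before hashing, so the same set of pinned tools and
    data sources always yields the same fingerprint.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: all names and version strings are placeholders.
manifest = {
    "pipeline": "example-germline",
    "version": "1.2.0",
    "tasks": {
        "align": {
            "image": "example/aligner:0.7.17",
            "reference_genome": "GRCh38-build-1",
        },
        "annotate": {
            "image": "example/annotator:2.1.0",
            "annotations": "example-annotations-2021-01",
        },
    },
}

fingerprint = pipeline_fingerprint(manifest)
print(fingerprint[:12])  # short identifier that could be stored with each report
```

Recording such a fingerprint alongside every result is one possible way to show which exact combination of tools and data produced it. Any change to a tool or data version changes the fingerprint.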
How do you prepare for scaling in advance? Rather than developing pipelines from scratch each time, use already-developed ones. Once the base pipeline is established, it can be customised for a client at a fraction of the cost of development. Building a specialised lab team of bioinformaticians to develop genomic pipelines can be compared to building a team of electrical engineers to set up a power generator for a lab. It’s much more cost-efficient to cooperate with companies like Intelliseq that have already established a portfolio of workflows. Those workflows can be easily customised according to laboratory requirements and then integrated with Laboratory Information Systems.
What does the process of workflow development look like? It begins with a specification of requirements. Then, the team of Intelliseq scientists proposes an outline of the workflow, including already developed computational tasks as well as those that need to be developed. Intelliseq offers a large collection of tasks, such as quality checks, alignment, variant calling, variant annotation, imputation, and polygenic score computation. Tasks are connected into fully functional workflows. Report generation can be included in the pipeline, or reports can be produced by other vendors based on the results coming out of the pipeline.
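The idea of connecting tasks into a workflow can be sketched in Python as a dependency graph from which a valid execution order is derived. The task names follow the list above, but the dependencies between them are assumptions made for illustration. In practice a workflow engine such as Cromwell or Nextflow resolves this ordering from the workflow definition.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each task maps to the set of tasks it needs first.
dependencies = {
    "quality_check": set(),
    "alignment": {"quality_check"},
    "variant_calling": {"alignment"},
    "variant_annotation": {"variant_calling"},
    "imputation": {"variant_calling"},
    "polygenic_score": {"imputation"},
    "report": {"variant_annotation", "polygenic_score"},
}


def execution_order(deps: dict) -> list:
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Dropping the `report` entry from the graph would model the case where reporting is handled by another vendor, without touching the upstream tasks.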
At Intelliseq, we specialise in building complete workflows, from FASTQ files to reports, that combine recognised software under public and commercial licenses with proprietary software (POIR.01.01.01-00-0213/15 “GeneTraps – genome sequencing interpretation system dedicated to precision medicine”). We offer expert consultancy on the optimal implementation model. You can reach us at www.intelliseq.com.