
Developed by the Data Science and Data Engineering group at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping / Broad Institute

The Broad Institute of MIT and Harvard has announced that they’re releasing Version 4 of their Genome Analysis Toolkit (GATK4) under an open source software license. An alpha preview is currently available through the Broad’s GATK website and a beta release is expected in mid-June. The package contains new tools to improve genomic analysis and a rebuilt architecture for smoother performance.

GATK4’s new framework enables a number of performance improvements. These include better parallelisation, optimisation for cloud deployment, and easier, faster, and more efficient data analysis. Previous versions of the toolkit have been able to identify single nucleotide polymorphisms (SNPs) and indels; GATK4 can now also detect and label copy number and structural variations in samples.

“We wanted to remove traditional barriers of scale while offering the same high level of data quality our users expect,” said Eric Banks, Ph.D., Senior Director of Data Sciences and Data Engineering at the Broad and one of the creators of the original GATK software package. “Thanks to the rapid adoption of cloud computing, researchers can finally do away with many of the infrastructure-related complications that have hampered progress, especially at smaller institutions and start-ups.”

One of the contributing factors that has enabled the Broad to release the software as open source is a partnership with the Intel Corporation. The collaboration led to improvements in high-performance analytics, which allow researchers to study the massive datasets that are central to genomic studies.

In the same announcement, the Broad’s Data Sciences Platform (DSP), the team behind GATK4, has revealed that all future software packages will also be open source. In doing so, the team hopes to help accelerate progress in biomedical research for scientists around the world.

The majority of DSP software packages were open source in the past, but developing GATK was a learning experience for the team. After initially struggling to provide sufficient support for such a large number of users, they tried licensing the package to an outside vendor for additional support. When that approach was found to be unsuitable, they enabled users to work directly with the team at the Broad instead of a third party. Now, many of the user support problems the team encountered have been solved by advances in technology, so it makes sense to return to the open source format the DSP team are used to.

“Releasing GATK4 as open source was the obvious next step for our team,” said Geraldine Van der Auwera, Associate Director of Outreach and Communications within the Data Science and Data Engineering group at the Broad Institute. “We believe it’s the most effective way to support the community, and we hope it continues to grow, innovate, and help researchers make insights that are essential for future human health breakthroughs.”

The news has been well received by researchers worldwide, including the more than 45,000 users of GATK.

“It is critical for progress in biomedicine that the software we use for analysing the genomes of millions of people is robust and well understood,” said Ewan Birney, Director of EMBL-EBI and Chair of the Global Alliance for Genomics and Health (GA4GH). “Releasing GATK software with an open-source license directly supports open innovation, data re-use and data re-analysis in the global biomedical community.”