Dr Ben Langmead is a computational biologist and assistant professor in the Computer Science Department at Johns Hopkins University. He is best known for creating the Bowtie and Bowtie 2 read alignment tools, which map sequencing reads to a reference genome quickly and memory-efficiently. FLG spoke to Dr Langmead about his lab, his recent work using the Stampede2 supercomputer cluster to optimise sequencing data analysis software, and the future of DNA sequencing as a whole.

 

FLG: Could You Tell Us About Your Lab, and How You Got into Genomics?

BL: Everyone in my lab is a computer scientist with different interests – some lean more towards high-performance computing, some towards applied algorithms, and so on. Our interest in genomics stems from the revolution that occurred when second-generation DNA sequencers started coming out. For my part, I first became interested in this area in 2007, when I was starting graduate school – right around when the UK company Solexa was making next-generation sequencing instruments.

People were just starting to realise that the algorithms they’d been using for a long time, including older tools like BLAST, were simply unable to keep up with the rate at which the new sequencers were generating data. So there was a new push, and a lot of new blood started entering genomics from computer science, once people realised we needed much faster software and ways to deploy it in settings where many computers can be used simultaneously to analyse these datasets.

This was around the time when I started working on Bowtie and Bowtie 2, which tackled the problem of placing every read correctly in the overall puzzle of the genome. What Bowtie did was use the Human Genome Project’s reference genome as a template for aligning these reads correctly – using the reference genome almost as the picture we can look at while assembling the jigsaw puzzle of a genome.
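To make the jigsaw analogy concrete, here is a minimal sketch of reference-guided placement: build an index of the reference, then look up where a read’s seed occurs and verify the full match. This is a deliberately simplified k-mer index for illustration only, not Bowtie’s actual FM-index-based method, and the sequences are made up.

```python
# Minimal sketch of reference-guided read placement using a k-mer index.
# Simplified for illustration; not Bowtie's actual FM-index approach.
from collections import defaultdict

def build_kmer_index(reference, k=8):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i+k]].append(i)
    return index

def place_read(read, reference, index, k=8):
    """Look up the read's first k bases (its seed) and verify each candidate."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos+len(read)]
        if candidate == read:   # exact match only; real aligners tolerate mismatches
            hits.append(pos)
    return hits

reference = "ACGTACGTTAGCCGATTACAGGCATCGTACG"
index = build_kmer_index(reference)
print(place_read("TAGCCGATTACA", reference, index))   # -> [8]
```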

Bowtie 2 came next, and it was better suited to the longer reads that were by then standard on the more modern second-generation sequencers. It became very widely used. Both tools were fast and efficient, and both drew on fairly recent computer science advances in text indexing.
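The text-indexing advance in question is the FM Index, which Bowtie builds from the Burrows-Wheeler transform of the reference genome. As a flavour of the idea, here is a deliberately naive sketch of the transform itself; real implementations derive it from a suffix array rather than sorting all rotations explicitly.

```python
# Naive Burrows-Wheeler transform: the string index underlying the FM Index
# that Bowtie builds over the reference genome. Shown here by sorting all
# rotations, which is simple but far too slow for a real genome.

def bwt(text):
    text = text + "$"  # unique end-of-string marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACAACG"))   # -> 'GC$AAAC'
```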

These were the tools we were recently working with on the Stampede2 cluster, where our goal was to make the software exploit the huge number of simultaneous threads of execution that are now possible, both on modern Xeon systems and, more radically, on many-core systems like Knights Landing.
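As a rough illustration of what that looks like in practice, the thread count is just a parameter to the aligner; the sketch below sweeps Bowtie 2’s -p option across thread counts and times each run. The index and read file names are placeholders, and the largest count simply reflects that a Knights Landing node exposes a few hundred hardware threads.

```python
# Sketch: sweep Bowtie 2's thread count (-p) on a many-core node and time each run.
# Index name and read filename below are placeholders; adjust to your own data.
import subprocess, time

for threads in (1, 4, 16, 64, 272):   # a KNL node exposes ~272 hardware threads
    start = time.time()
    subprocess.run(
        ["bowtie2", "-p", str(threads),
         "-x", "hg38_index",                # prebuilt Bowtie 2 index (placeholder)
         "-U", "reads.fastq",               # unpaired input reads (placeholder)
         "-S", f"aligned_p{threads}.sam"],  # SAM output
        check=True,
    )
    print(f"{threads:>3} threads: {time.time() - start:.1f} s")
```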

 

FLG: Your Work with the Stampede2 Cluster Was Devoted to Improving the Utility of Read Alignment Tools. Is There Still Room for Improvement? To What Extent?

BL: There’s still a lot of work to do! Read alignment tools are pretty good at being fast and memory-efficient, but where they suffer is on the interpretability side. There are two big problems we’re currently facing.

First, read aligners make mistakes and can place reads in the wrong location. This is a problem because genomes are very repetitive, sequencers themselves make mistakes, and every human is different – no one is exactly like the reference genome. All of these factors can confuse the aligner and cause it to put a read in the wrong place.

Because of this, aligners attempt to characterise the quality of each read alignment. This procedure is not very well studied, so the way aligners calculate this quality is ad hoc. There’s not a lot of literature on how to calculate the mapping quality, and it’s complex because these tools rely on heuristics – they’re trying to find where reads correctly go, but they’re also taking shortcuts to speed up the process. Somehow the mapping quality calculation has to account for those shortcuts to produce an accurate characterisation of the uncertainty. That’s hard, and while there are hundreds of papers describing how to do read alignment quickly and efficiently, only a few address how to calculate this alignment accuracy.
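For concreteness, the mapping quality (MAPQ) reported in SAM output is Phred-scaled: it encodes an estimate of the probability that the reported position is wrong. Converting between the two is trivial, as in the sketch below; producing a well-calibrated probability in the first place is the hard, heuristic part described above.

```python
# Mapping quality (MAPQ) in SAM output is Phred-scaled:
#   MAPQ = -10 * log10(P(reported position is wrong))
# Converting between the two is trivial; estimating the probability well is not.
import math

def mapq_to_error_prob(mapq):
    return 10 ** (-mapq / 10)

def error_prob_to_mapq(p):
    return round(-10 * math.log10(p))

print(mapq_to_error_prob(30))    # 0.001 -> a 1-in-1000 chance the placement is wrong
print(error_prob_to_mapq(0.05))  # 13
```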

Second, the notion of a reference genome is already flawed, because we know individuals are different. But the problem goes beyond that arbitrariness. The human reference genome has just one copy of every chromosome, whereas each human has two. This is particularly a problem when scientists want to study the difference between the two copies, for example whether a gene is expressed more from the paternal or the maternal copy. If one of the copies is a closer match to the reference than the other, the aligner has an easier time getting an answer for the closer match, so it can be strongly biased towards one copy.

There are also hypervariable regions of the genome, for example areas relating to immunity, which can be significantly different from person to person. Because of this, the reference genome alone is not representative enough if you want to understand where reads should go.

As a result, there have been discussions about having a panel of several genomes instead of just one reference. These could be kept separate, with researchers studying alignments to each genome individually, or they could be combined into one structure that is no longer a string, or just one sequence, but instead a graph, with paths that diverge and rejoin in the places where individuals differ or agree. That way, all the different variants could be represented on the graph.
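As a toy illustration of the idea, and not the data model of any particular graph-genome tool, the sketch below stores sequence on nodes and lets edges diverge at a single variant site and rejoin afterwards; walking every path enumerates the haplotypes the graph represents.

```python
# Toy variation graph: nodes carry sequence; edges let paths diverge where
# individuals differ (here a single-base variant) and rejoin afterwards.
# Illustration only, not the data model of any particular tool.

nodes = {
    1: "ACGTAC",   # shared prefix
    2: "G",        # reference allele at the variant site
    3: "T",        # alternate allele at the same site
    4: "TTAGCA",   # shared suffix
}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

def enumerate_haplotypes(node=1, prefix=""):
    """Walk every path through the graph, yielding one sequence per haplotype."""
    prefix += nodes[node]
    if not edges[node]:
        yield prefix
    for nxt in edges[node]:
        yield from enumerate_haplotypes(nxt, prefix)

for hap in enumerate_haplotypes():
    print(hap)   # ACGTACGTTAGCA and ACGTACTTTAGCA
```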

It’s hard, however, to pin down exactly what this graph should be, and to adapt all our algorithms, which are used to dealing with strings, to deal with a graph instead. There’s a lot of work on how to do this. It’s a whole new generation of the read alignment problem, and a lot of the problems we thought were solved around memory efficiency and speed will have to be revisited in the new paradigm.

 

FLG: Do You Think These Issues are Solvable in the Next Five Years? In Ten?

BL: The technology is always shifting. Besides, some of the above shouldn’t be thought of as problems so much as opportunities – only recently did we have enough population genetics information to figure out what the graph should even be. We needed to know the genetic variants in enough people before we could entertain the idea of moving beyond the single reference genome. Data will keep getting larger, and the set of genomes used in read alignment will keep growing – for example, there’s a huge sequencing effort going on in the UK, and similarly large efforts in the US covering veterans and the population as a whole. Sequencers themselves will also keep getting better, which will have a big impact on read alignment – if a sequencer were designed that could read whole chromosomes end to end without mistakes, the read alignment problem would be gone!

I don’t think that will happen, but certainly sequencers are getting better at generating higher-quality data and longer reads. As that continues, some problems will become easier to handle. To summarise, I think the technology will make things harder for us by giving us more prior information and input data to take advantage of, but also easier by making the computational problem we’re solving less ambiguous, because there are fewer mistakes.

I seriously doubt these problems will be solved in five years. Over a longer term… it depends on trends that are too hard to perceive, particularly on the biological side. There are also computational trends that could interact with this – for example, the prevalence of non-volatile RAM could make it possible to build much larger index data structures, which could alleviate some of the computational blow-up that happens when we have to deal with many reference genomes instead of just one. So it’s pretty hard to predict. I think it’ll depend a lot on where biotech and computing trends go in the future.

 

FLG: What Is Your Lab Working on Now?

BL: We’re dedicated to solving whatever computational problems are immediately downstream of the sequencer. Read alignment is the dominant one, since we have so many good, high-quality reference genomes, and because it’s much easier to align to a reference genome than to assemble a genome from scratch. So we’ll keep working on that part of the processing pipeline for sequencing data.

We’re also very interested in making public sequencing data easier to use: there’s a lot of data from completed sequencing studies available in public archives in the US, UK, and Japan, among others. These datasets are potentially very valuable to researchers who might not be able to obtain the same biospecimens, have no access to sequencers, or lack the laboratory skillset to generate that data themselves. So there’s a lot of valuable data, but right now it’s not very usable.

Another major thread in the lab is building the sort of software systems and resources that let us take public data, analyse it, summarise it, and make those summaries queryable, so people can more easily use public datasets.

So we’re interested in both what we do immediately with sequencing data straight from the sequencer, and how we can make archived sequencing data easier for everyone to use.

For data scientists looking to understand the importance of their work in the context of life sciences, Dr Langmead runs a class on algorithms for DNA sequencing on the Coursera platform. His lab also freely distributes teaching materials, including lecture videos, screencasts and programming notebooks.