Paul Agapow, Researcher in Translational Biology, Data Science Institute, Imperial College London

Beginning his career as a biochemist, Dr Paul Agapow has worked in Australia, South Africa and the UK on the application of informatic approaches to tough problems in biomedicine. He is now based at Imperial College London, focusing on the possibilities of big data in biomedicine, using computationally intensive approaches including deep learning and graph databases.

Paul will be speaking at Festival of Genomics London on day one on Stage 4. He will be the chairperson for the talk entitled ‘Integrating Rich Biomedical Data Sets in the Era of Machine Learning & Advanced Analytics’.

This is what a typical day looks like for him in his own words. 

Medical science is young. Germ theory and the very idea of infection hail from the mid- to late 19th century. Blood groups and antibiotics were discovered in the early 20th century, and the workings of the immune system only started to be unpicked in the 1940s and 1950s. Genetic sequencing emerged in the 1970s, while the first genome-wide association studies (linking genetic variation to a visible trait or behaviour) took place only in the early 21st century. Which is to say: medicine is a boon, but we still have a lot to learn about disease and how best to diagnose and treat it.

Answering these questions can be difficult, and the problems are often ones of big data and statistics. Many traditional treatments and ideas that were “obviously” true have been proven wrong under systematic investigation. The causes of disease can be obscured by the differences between patients – on the genetic level, in their life history, in their environment – or just by sheer random chance. Blatant smoking guns are rare in biomedicine, so we need to assemble evidence by combining many types of data across many, many patients, with rigorous use of mathematics and statistics. Biomedical science is now data science. This is where I come in.


I try to sneak into work before the peak-hour rush, partly to avoid the crowds, partly so I can get in a few hours of focus before the turmoil of the day settles in. Today, I’m using those hours to play with some maths, exploring algorithms for clustering. The aim is to work out the best way of grouping patients together for further medical investigation or treatment.

When asked what I do for work, I sometimes answer “Attend teleconferences and answer email.” If I must be serious, I’ll say “Data- and computation-intensive biomedical research.” When asked what for, I reply “Precision medicine.” This is why I don’t get invited to many parties.

A group of patients with the same set of symptoms might have entirely different diseases, or the same disease at different stages. Conversely, the same disease might manifest differently in different people. Precision medicine is about better recognition and definition of diseases, moving from ambiguous clinical presentations to what is happening on a molecular level. The aim is to deliver “the right treatment to the right patient at the right time”. For example, asthma appears to consist of three or more sub-diseases: typical childhood asthma, more severe adult-onset asthma, and asthma triggered by smoking or the environment. Each of these sub-diseases looks and behaves very differently on the molecular level. If we use genetic and other molecular data to better diagnose apparently similar patients, they can be treated with specific, targeted treatments rather than best-guess, one-size-fits-all regimes.

Grouping similar patients seems like a simple problem, but it quickly becomes complicated. How do you measure similarity? What sort of data do we use? What happens with missing values? What do we make of patients who fall between two clusters? There’s a lot of literature and prior art on the subject, so you could spend a lifetime working on just this problem. But it’s only the first step to better treatment and there are other things that need working on, so I must be content with finding a decent, not perfect, solution. As I’ve commented to colleagues, we have to be experts at being “just good enough”.
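In spirit, the morning’s tinkering looks something like the minimal sketch below: a random stand-in for a patient-by-measurement matrix, two common clustering algorithms, and a crude score for comparing the groupings they produce. None of this is the actual analysis or data, just an illustration of the kind of exploration involved.

```python
# A toy comparison of two clustering algorithms on a synthetic "patient" matrix.
# Real analyses use curated clinical and molecular features; these are random stand-ins.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
patients = rng.normal(size=(200, 12))                 # 200 patients, 12 measurements each
patients = StandardScaler().fit_transform(patients)   # put features on a common scale

for name, model in [
    ("k-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("hierarchical", AgglomerativeClustering(n_clusters=3)),
]:
    labels = model.fit_predict(patients)
    # Silhouette score: one crude way to ask "how cleanly do these groups separate?"
    print(name, round(silhouette_score(patients, labels), 3))
```

The interesting questions, of course, are not in the code but in the choices around it: which measurements go into the matrix, how many groups to look for, and what counts as a “good” grouping in the first place.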


Reading about algorithms and doing the maths has made me itch to try some of these methods on real data. I’m particularly intrigued by “multi-omic” methods: approaches that use several types of data (e.g. genetic, protein, clinical signs, metabolic) in a single analysis. There’s a growing consensus that we need this sort of approach, because different levels or compartments of physiology capture different processes and timescales. If you look at only one, you’re only seeing part of the story.
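The simplest (and most naive) way to combine data types is “early integration”: put each block of data on a common scale and glue the blocks together into one big matrix before clustering. Real multi-omic methods are considerably subtler, but a toy sketch of the idea, with random stand-in data, looks like this:

```python
# A naive "early integration" sketch: scale each omic block separately, then
# concatenate into one matrix before clustering. Real multi-omic methods
# (similarity network fusion, factor models, etc.) are considerably subtler.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
genomic = rng.normal(size=(100, 50))      # e.g. variant or expression features
metabolic = rng.normal(size=(100, 20))    # e.g. metabolite levels
clinical = rng.normal(size=(100, 8))      # e.g. clinical signs and measurements

blocks = [StandardScaler().fit_transform(b) for b in (genomic, metabolic, clinical)]
combined = np.hstack(blocks)              # one row per patient, all data types side by side
print(combined.shape)                     # (100, 78)
```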

As I start to assemble a dataset for analysis, problems appear almost immediately. I’m trying to merge two datasets, but there are issues with the units they use. Different tools have different expectations or assumptions about what data will look like, yet data can be collected or marked up in a whole lot of different ways. For example, blood pressure can be measured in millimetres of mercury (mm Hg) or kilopascals (kPa). The number of white cells in blood can be measured as cells per millilitre (or microlitre, or litre), or grams, or percent. Seemingly simple issues like this can be huge stumbling blocks in the way of analysing and comparing data. Fortunately, my workplace has a lot of experience in this “data harmonisation”, so I use some of the tools we’ve built to wrangle the data into a usable format. But the medical field is only just waking up to the issue of data standards. A lot more work needs to be done to standardise biomedical data and put it in a form where it can be fully used.
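To give a flavour of the problem, here is a tiny, hypothetical example of unit harmonisation: two made-up tables record blood pressure in different units, and one has to be converted before they can be combined. The conversion factor (1 kPa ≈ 7.5 mm Hg) is real; the tables and column names are invented for illustration.

```python
# A tiny illustration of unit harmonisation with pandas. The datasets and
# column names are made up; the conversion factor (1 kPa ~= 7.50062 mm Hg) is not.
import pandas as pd

KPA_TO_MMHG = 7.50062

site_a = pd.DataFrame({"patient_id": [1, 2], "systolic_bp_mmHg": [120.0, 135.0]})
site_b = pd.DataFrame({"patient_id": [3, 4], "systolic_bp_kPa": [16.0, 17.5]})

# Convert site B to mm Hg so both tables speak the same language, then stack them.
site_b = site_b.assign(systolic_bp_mmHg=site_b["systolic_bp_kPa"] * KPA_TO_MMHG)
site_b = site_b.drop(columns="systolic_bp_kPa")

harmonised = pd.concat([site_a, site_b], ignore_index=True)
print(harmonised)
```

Multiply that by hundreds of variables, each with its own quirks of coding and convention, and you have a sense of why harmonisation eats so much of a project’s time.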

The next step is to run the analyses. Fortunately, the college and institute have good, well-maintained computing clusters. Some of the algorithms we use would be impractical without not only powerful computers but also specialised computing architectures that let huge problems be broken up and analysed in parallel across a cluster of machines. While programming for these environments used to be incredibly complex – I’ve wasted months mastering arcane programming idioms only to see my resultant code crash and misbehave in random ways – new approaches like MapReduce and Spark have made it much more accessible. I set the code running and make a note to look at the results tomorrow.
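For the curious, a minimal sketch of what that accessibility looks like in Spark is below: a large table of patient records is read in, split across the cluster, summarised in parallel, and the partial results gathered back together. The file path and column names are placeholders, not our actual data or pipeline.

```python
# A minimal PySpark sketch of the map-reduce idea: records are partitioned across
# the cluster, summarised in parallel, and the partial results are combined.
# The file path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("patient-summaries").getOrCreate()

records = spark.read.csv("patients.csv", header=True, inferSchema=True)

# Group millions of rows by assigned cluster and average a measurement per group;
# Spark distributes the work and collects the results for us.
summary = (records
           .groupBy("cluster_label")
           .agg(F.avg("biomarker_level").alias("mean_level")))
summary.show()
```

The point is not that the code is clever; it is that none of it mentions machines, threads or failures. That bookkeeping is exactly what used to eat those months.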


It’s a bad idea to eat at your desk, but I end up doing it anyway, paging through the day’s emails. Thankfully there’s good news in there. A large trans-European grant that we’re part of, focused on prostate cancer, has been approved. It’s an exciting project for a number of reasons: despite the prevalence of the disease (in Europe, it’s the most common cancer in males, and the third most common overall, with nearly half a million new diagnoses a year), it’s poorly understood. What determines outcomes is opaque: why does one person develop a benign case and another a severe one? This project will unify data on prostate cancer across dozens of separate labs and systems to build a single resource for mining and investigating the disease.

A huge part of my funding has come through EU sources, which seem more open to broad and fundamental, but costly, projects like these. What will happen in the future is unclear. There’s no doubt that Brexit caught academia by surprise. UK research groups are heavily poly-national and we’re accustomed to thinking of a borderless world, or at worst of borders as things to be circumvented. The full impact is yet to be felt, but there’s a lot of FUD (fear, uncertainty, doubt) in the air. Job applications from overseas have started to slow down, and foreign scientists in the UK are wondering if they should leave. Only time will tell.


I’m interviewing a prospective student via Skype. Experience forces me to contain my expectations: the bottleneck in this sort of research isn’t lab work, it isn’t computation time, it’s getting the right sort of people. To work in this area you need to be comfortable with maths and stats, able to program, and familiar with the biological domain under investigation. You don’t have to be an expert in all of these, but you need at least a working knowledge of each.

I was fortunate that my career started in immunology before I stumbled into computing. It’s a powerful mix of skills that has been very useful, but that mix is rare in the wider population. The problem is exacerbated by industry drawing away a lot of talent, and by universities being poor at supporting and promoting technical careers.

Fortunately, the interview goes well: the candidate has an engineering background, with postgraduate courses in systems biology and medicine, and they’re keenly interested in the biology. I send them a clutch of relevant papers and start the paperwork to get them on board, before discussing possible projects with a colleague.


I look at what’s new in the preprint archives. A hoary cliché of scientific life has academics practically living in the library, poring over journals and dusty tomes to keep up with the latest research. In reality, the college library has a decreasing number of books and I haven’t set foot in there for years: almost all my reading is done electronically. Further, given the explosive growth of academic publications and scientific knowledge in general, keeping up is only possible in the loosest sense. It’s a good reason for having a wide and diverse network of collaborators: together we can cover far more than we could apart.

Many of the interesting publications are coming through preprint archives like arXiv and bioRxiv. Founded in response to the slow and inefficient process of traditional academic publication, they host a diverse mix of draft manuscripts, incomplete thoughts and frankly unpublishable work. Which is what makes them so interesting: I get to find out what’s happening now (rather than in a year or more when a manuscript has wound its way through a painful publishing process) and to see ideas that won’t appear in a journal, because they are odd, unfinished or unglamorous.

Downloading a few papers to read, I wonder when computers are going to take over and start finding and reading papers for us. It can’t be long.


One advantage of London is that there’s an endless – even excessive – number of talks and meetings to attend. Tonight, it’s the turn of my local tech group, Bioinformatics London. Set up a few years ago by Nathan Lau (from Queen Mary’s) and Stephen Newhouse (from King’s), it’s had great success at drawing together a set of people who often work in isolation, feeling that none of their colleagues appreciate what they do.

Tonight, we have a talk from the startup Seven Bridges Genomics. They’re pioneering an approach called “the graphical genome”. When we traditionally talk about the human genome and any location within it, we mean a “reference”: a sort of average of a set of genomes. It’s a useful model, but genomes are ever-evolving, ever-changing things, with chunks of genetic material constantly being copied, inserted, deleted and shifted. A simple consensus genome isn’t wrong, but it hides much complexity and variety. Seven Bridges is building tools to treat the genome more correctly as a mesh of pieces and relationships. This could be a huge game-changer: every tool that uses this improved graphical reference will be more accurate. There’s a constant flow of tools and software that promise to revolutionise our work, so everyone is initially sceptical. But Seven Bridges shows graphs and performance figures that look very promising. I see some possible applications in infectious disease work, and during the talk I fire off a quick email to a colleague who might be interested.
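To illustrate the idea (and only the idea: this is a toy, not Seven Bridges’ actual tooling), a stretch of sequence with a single known variant can be stored as a tiny directed graph, where the reference and the variant are simply two different paths through the same structure:

```python
# A toy illustration of a "graph" reference: a short stretch of sequence with one
# known variant, stored as a directed graph where each allele is a different path.
graph = {
    "node1": {"seq": "ACGT", "next": ["node2", "node3"]},
    "node2": {"seq": "A",    "next": ["node4"]},   # reference allele
    "node3": {"seq": "G",    "next": ["node4"]},   # alternative allele
    "node4": {"seq": "TTCA", "next": []},
}

def walk(graph, start, choose_alt=False):
    """Spell out one haplotype by walking a path through the graph."""
    sequence, node = "", start
    while node:
        sequence += graph[node]["seq"]
        nxt = graph[node]["next"]
        node = (nxt[1] if choose_alt and len(nxt) > 1 else nxt[0]) if nxt else None
    return sequence

print(walk(graph, "node1"))                   # ACGTATTCA  (reference path)
print(walk(graph, "node1", choose_alt=True))  # ACGTGTTCA  (variant path)
```

A flat reference has to pick one of those two spellings and pretend the other doesn’t exist; the graph keeps both.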

As usual, we adjourn to the pub afterwards. There’s lots of talk of job and funding opportunities, industry gossip, and cathartic complaints about difficult collaborators and oblivious bosses. More than once we’ve joked that the talks are just an excuse for the pub. It’s a good way to end the day, before heading off home.
