“We Need to be Able to Handle Large Scale Datasets to Prepare for This ‘Big Data’ Era” – Mark Gerstein
Welcome to The Short Read, our weekly peek behind the curtain at the people who make this amazing community tick. Make sure to check back every Tuesday for the latest installment.
Professor Mark Gerstein is co-director of the Yale Computational Biology and Bioinformatics Program.
Gerstein has published extensively in the scientific literature, with more than 400 publications in total, a number of them in prominent venues such as Science, Nature, and Scientific American. His research is focused on bioinformatics, and he is particularly interested in data science & data mining, macromolecular geometry & simulation, and human genome annotation & cancer genomics.
Originally, Gerstein wanted to be an architect, but he was drawn to work in genomics because that would mean he would be “dealing with large data sets of fundamental value”.
What are you working on right now?
Right now I’m focusing on bioinformatics. Broadly, I define this as applying computational approaches to problems in molecular biology. This includes large-scale analyses of genome sequences, macromolecular structures, and functional-genomics datasets. We hope that these will allow us to address a number of overall statistical questions about macromolecules, relating to their physical properties, cellular function, interactions, and phylogenetic distribution. My lab is especially focused on the human genome and proteome. Our research involves a number of quantitative techniques, including database design, systematic data mining, machine learning and visualization of high-dimensional data. More specifically, we focus on three questions. First, we are interested in annotating the raw human genome sequence, especially in characterizing the vast intergenic regions. Next, we are trying to determine the function of all the genes encoded by the genome. Here, we try to characterize function on a large scale through the use of molecular networks. Finally, for the group of protein-coding genes that have known 3D structures, we are trying to see how their function is carried out through motion and how motion can be predicted from packing geometry.
What’s the biggest challenge you face in your work at the moment?
The biggest challenge is the dramatic increase in the scale of omics data. With the decreasing cost of sequencing, the explosion of sequencing data poses a need for efficient methods of data storage, processing and analysis. It is crucially important that as the amount of sequencing data continues to increase, these data are not simply stored but organized in a manner that is both scalable and easily and intuitively accessible to the larger research community.
Name one big development that you would like to see in your field in the next 18 months.
I would like to see a practical approach to preventing personal information leakage from large-scale sequencing data while still letting us compute on it. One idea that has been floated is to store all the data in a protected National Enclave… Soon, sequencing one’s genome may become as commonplace as getting an X-ray, which makes privacy protection even more important and urgent.
What are you most proud of in your career?
I am most proud of what we have accomplished in the ENCODE and modENCODE projects. We have developed many analytical tools to interpret sequencing data from different sources and integrated them to annotate the genomes of human and model organisms, for example by identifying regulatory elements, annotating variants and visualizing allele-specific events.
Which scientists, living, dead, or fictional, would you invite to dinner, and why?
The first person I want to invite is James Watson, the co-discoverer of the structure of DNA. I would like to know his opinions on the current development of genomics. Then I would invite Josiah Willard Gibbs. He might have brilliant ideas regarding molecular motions and how they relate to function. The final person I want to invite is Richard Feynman. He was curious about how to read all the information contained in a genome, so he would surely be excited about the current progress in genomics and what has been achieved so far.
What advice do you wish someone had given to you at the start of your career?
I wish that someone had told me how to handle large-scale datasets so that I could have been better prepared for this “big-data” era. I also wish I had learned more statistics, since it is extremely useful for dealing with the current and upcoming challenges in our work.
Opinions and views expressed in The Short Read are the interviewee’s and not those of the home institution.