A Witness to the Birth of Cloud Computing
Technology has come a long way in the last four decades. Dr. Peter Tonellato has been able to watch how it has changed computational genomics for the better.
Peter Tonellato has been working in the field of computational genomics since the early ‘80s, so he already had nearly two decades’ worth of experience before cloud computing started to appear. In 2006, when he moved to the Laboratory for Personalised Medicine at Harvard Medical School, he decided to trial cloud computing in his lab to see whether it could lower their costs, an experiment that proved a great success. In the decade since then, he’s been able to observe how cloud computing has developed and what effect it has had on the field of genomics. We caught up with him to find out what he’s learned.
FLG: What was the driving force behind wanting to move to cloud computing? What was your team working on at the time?
PT: I have a long history of working with computers, going all the way back to time-share computers, through to early supercomputers and massively parallel processing in the early ‘80s. Then in the mid-’80s, I became involved with a number of NSF initiatives aimed at establishing supercomputer resources that could be used by investigators all over the world.
By the time I started using cloud computing, I had 20 years’ experience of working with early versions of distributed computing hardware, including purchasing, installing, and configuring two separate compute farms. Those were truly excruciatingly painful exercises that were very expensive, very time-consuming, and always required significant administrative support. We constantly had one or two full-time administrators, who not only administered our systems but also had to manage hardware maintenance.
With that history behind me, particularly building the two farms using NIH and NSF funding, I was always on the lookout for more commodity-based computing solutions. In the latter part of the ‘90s, as an applied mathematician and an early investigator of high-capacity computing for large biomedical and clinical data analysis, I constantly had to deal with these costs and complications.
For those early virtual environments, we’d typically have a virtual Unix engine running on an Intel chip, on top of an operating system that was generally Microsoft. That was the initial concept. On top of that, I was using Oracle, which would create its own virtual environment to conduct relational database management activities and transactions. At the time, you didn’t know which computers your transactions were taking place on, because Oracle was clever enough to take care of all of those administrative chores for you. You just had to install Oracle, tell it which resources were available, give it permission to use them, and then it could take them over as needed.
In 2001, I started a company and was faced with the same issues again. As a result, I was attuned to what was happening in the space as a purchaser and user of such resources, and I’d been following technological advances since the ‘80s. When virtual engines started appearing as a commodity, offered at commodity pricing, I heard about it quickly and looked into it very carefully for my company. At that time, most of the services weren’t so much virtual engines as companies with large compute farms who would lease customers a portion of that farm. If you wanted to lease ten compute servers long term, for example, they’d give you a lower price, and if you then wanted another six servers temporarily, they’d make them available at higher short-term pricing. That was really the early model of flexible pricing.
By the early 2000s, both the business and the technological models of cloud computing were starting to emerge, and I think that’s about when AWS was first formed. Because I was in the market, I was aware of it at the time, but I decided that, for my company, it was a little too early to adopt. It was still unclear where it was going to go and whether it would be a stable version of virtual environments or more of a flash-in-the-pan scenario. Then, a few years later, I was recruited to Harvard. I’d been watching how cloud computing had evolved, and I’d decided that I wanted to establish my lab on a virtual platform. At the time, AWS were selling those resources but they didn’t quite have their business model together yet, which kept it interesting.
To answer the question explicitly, the driving force was the constant desire to have a flexible computing resource environment at minimal cost. In particular, I wanted to avoid paying full price for large compute resources that spent most of their time sitting around doing nothing. Anything that came along which gave me that flexibility, I was willing and eager to look at.
As for what my team were working on, they were building very high-capacity databases with interfaces on the early version of the web, allowing scientific investigators to select the subjects that they wanted to examine further. For the most part, it dealt with sequence data and the equivalent of modern-day electronic medical record data. The interface would let investigators find the information, pull the data down, and then execute the computing on either their local platform or the platform we made available to them. That was a pretty classic resource-as-a-service type platform. I was always working on that, including integrating data with that virtual environment to make it available to investigators worldwide and providing them with the storage and compute resources to conduct their analysis.
The other thing that we were working on, and are still working on, involves large computations of biomedical phenomena. We create large, synthetic populations of patients and then conduct simulations on them, such as simulating their participation in a clinical trial, their response to a drug, or their predicted clinical outcome. Those are very high-intensity computing simulations, but they’re also very acute ones. You can spend months developing a model, testing the code, and doing small validation simulations, before scaling up to a simulation with 1,000,000 subjects that uses 1,000 nodes over the space of a full week. The early development can be done on a desktop or a single-node server, but the full-scale simulation itself needs vast quantities of compute power. My lab always had that dramatically changing need on an ongoing basis: small, background compute, then massive spikes during simulations, for multiple projects across any given year. That was just the nature of the work we were doing. The problem was that, even though our needs were acute, we had to pay for the maximum level of resources all the time, because there was no other model available to expand and contract those resources. So that was the primary driving force: that flexible need driving up costs.
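The cost pressure described above can be made concrete with a little arithmetic. The sketch below is purely illustrative: the node counts, spike durations, and hourly rate are hypothetical assumptions, not figures from the interview. It simply compares paying for peak capacity year-round against paying for a small baseline plus peak capacity only during simulation spikes:

```python
# Illustrative cost model: owning peak capacity vs. renting it on demand.
# All numbers are hypothetical assumptions, not figures from the interview.

HOURS_PER_YEAR = 365 * 24

def fixed_farm_cost(peak_nodes, hourly_rate):
    """Pay for the maximum resource level all the time."""
    return peak_nodes * hourly_rate * HOURS_PER_YEAR

def elastic_cost(base_nodes, peak_nodes, spike_weeks, hourly_rate):
    """Pay for a small baseline year-round, plus peak capacity only
    during the weeks when a full-scale simulation is running."""
    spike_hours = spike_weeks * 7 * 24
    baseline = base_nodes * hourly_rate * HOURS_PER_YEAR
    spikes = (peak_nodes - base_nodes) * hourly_rate * spike_hours
    return baseline + spikes

# A hypothetical lab with a 1,000-node peak used one week at a time,
# four times a year, a 10-node baseline, and an assumed $0.10 per node-hour:
fixed = fixed_farm_cost(1000, 0.10)
elastic = elastic_cost(10, 1000, spike_weeks=4, hourly_rate=0.10)
print(f"fixed: ${fixed:,.0f}  elastic: ${elastic:,.0f}")
```

Under these made-up numbers the elastic model costs roughly a tenth of the always-on farm, which is the "expand and contract" economics the interview points to.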
Read the full interview, and learn more about cloud computing in our free guide that introduces you to some of the most important features for genomicists in the cloud.