scio10: Science in the Cloud

Jan 16 2010 - Published under Conferences, open science

John Hogenesch, Assistant Professor of Pharmacology - Penn School of Med

Gene-at-a-time research is giving way to genome-wide work - larger datasets, collaborative research

Last year more was added to GenBank than in all previous years combined (wow!) - growth exceeds Moore's law.

Academia responds by buying storage and clusters - but you need great IT staff, and it's really hard to get and keep them (they go to industry). There are also heating & cooling costs, depreciation, and usage/provisioning problems (machines end up under- or over-utilized). Larger inter-institutional grids are another option, but access is tightly regulated and they are very complex to program for.

Cloud computing: software as a service, infrastructure as a service, platform as a service

They use SaaS for collaboration - Basecamp from 37signals. Collaborating with multiple labs, multiple people. Compare $50/month with no IT support costs to SharePoint: $1k server, $500 license, admin at 5% effort $2k.
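To make the comparison concrete, here is a rough sketch of the arithmetic using only the figures quoted above. The 12-month horizon, and treating the SharePoint figures as a single first-year total, are assumptions on my part; the notes don't say what period the $2k admin cost covers.

```python
# Figures as quoted in the talk notes.
BASECAMP_PER_MONTH = 50      # hosted SaaS, no IT support costs
SHAREPOINT_SERVER = 1000     # server hardware
SHAREPOINT_LICENSE = 500     # software license
SHAREPOINT_ADMIN = 2000      # admin at 5% effort (period assumed to be one year)


def saas_cost(months: int) -> int:
    """Total hosted cost over the given number of months."""
    return BASECAMP_PER_MONTH * months


def self_hosted_cost() -> int:
    """Sum of the quoted SharePoint figures (assumed first-year total)."""
    return SHAREPOINT_SERVER + SHAREPOINT_LICENSE + SHAREPOINT_ADMIN


if __name__ == "__main__":
    months = 12  # assumed comparison horizon
    print(f"SaaS over {months} months: ${saas_cost(months)}")
    print(f"Self-hosted (first year, assumed): ${self_hosted_cost()}")
```

Under those assumptions it comes to $600 versus $3,500 for the first year.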

IaaS for proteomics - example: searching complex samples against a six-frame translated genome. They provisioned 20 AWS nodes running Windows; the search ran for 7 days at a cost of $1400.

In genomics - lots of recent publications using CloudBurst, Crossbow (?), and Hadoop for BLAST/BLAT/R scripts....

BLAT on AWS - using CloudCrowd (NY Times alternative to Hadoop), they provisioned 20 large-memory Ubuntu instances; 85% of sequences were mapped, ~72 hours/$424 (experiments cost $30k with machine and reagents and all - so over the course of the 30 you can do in a year, $600k savings)
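For a feel of what the two AWS runs work out to per machine, here is a back-of-the-envelope sketch from the numbers in these notes (20 nodes for 7 days at $1400; 20 instances for ~72 hours at $424). It assumes every instance ran for the whole duration, which the talk didn't state, so treat the result as an effective blended rate and not an actual AWS price.

```python
def cost_per_instance_hour(total_cost: float, instances: int, hours: float) -> float:
    """Effective cost per instance-hour, assuming all instances ran the full time."""
    return total_cost / (instances * hours)


if __name__ == "__main__":
    # Proteomics search: 20 Windows nodes, 7 days, $1400 total.
    proteomics = cost_per_instance_hour(1400, 20, 7 * 24)
    # BLAT mapping: 20 large-memory Ubuntu instances, ~72 hours, $424 total.
    blat = cost_per_instance_hour(424, 20, 72)
    print(f"Proteomics run: ${proteomics:.2f} per instance-hour")
    print(f"BLAT run:       ${blat:.2f} per instance-hour")
```

That comes to roughly $0.42 and $0.29 per instance-hour respectively.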

q: how much programming to get it ready to go on AWS?

a: about 8 hours with a somewhat experienced programmer - a very experienced one could do it in 1-2 hours - programming is done in Ruby

PaaS - aggregating clouds - genome-wide screen for modifiers of the circadian clock, 300 found (Zhang et al., Cell, 2009). Gene-centric data integration - normally you go to each data site, search for your gene, and then compile the results yourself. ID/synonym resolution is hard. BioGPS - federated search of these gene sources - URL-based scheme, extensible. Puts results from different sources in boxes on BioGPS. Has a catalog search so you can see if you can buy from Invitrogen (sponsor, thank you!) and others. (http://biogps.gnf.org/circadian)
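The general idea of a URL-based federated gene search can be sketched as follows: resolve whatever name the user typed to one canonical symbol, then fill that symbol into a URL template per source. This is only an illustration of the pattern, not how BioGPS is actually implemented; the synonym table, the source names, and the example.org URLs are all hypothetical.

```python
from typing import Dict
from urllib.parse import quote

# Hypothetical synonym table (illustrative only, not curated data).
SYNONYMS: Dict[str, str] = {
    "per2": "PER2",
    "period2": "PER2",
    "clock": "CLOCK",
}

# Hypothetical sources: each is just a URL template with a {gene} slot,
# so adding a source means adding one entry here.
SOURCES: Dict[str, str] = {
    "source_a": "https://source-a.example.org/gene?symbol={gene}",
    "source_b": "https://source-b.example.org/search/{gene}",
}


def resolve(name: str) -> str:
    """Map a user-supplied name or alias to its canonical symbol."""
    key = name.strip().lower()
    if key not in SYNONYMS:
        raise KeyError(f"unknown gene name: {name!r}")
    return SYNONYMS[key]


def federated_urls(name: str) -> Dict[str, str]:
    """Build one lookup URL per source for the resolved gene symbol."""
    symbol = resolve(name)
    return {src: tpl.format(gene=quote(symbol)) for src, tpl in SOURCES.items()}


if __name__ == "__main__":
    for source, url in federated_urls("Period2").items():
        print(source, url)
```

Keeping each source as a plain URL template is what makes a scheme like this extensible; the hard part, as noted in the talk, is the synonym table.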

PaaS use case - publishing in the cloud - PLoS Currents: Influenza. PMIDs used for references, Google Knol to write, moderators decide suitable/unsuitable - not peer review. PLoS will consider expanded versions for their journals. ~52 publications so far. One example has been viewed 7k times.

q: biobase - only mammalian?

a: yes, but code is available (.net) so you could customize

q: small vs. large institutions - does this help people who are under-resourced for equipment?

a: with this we can give you the algorithm and then you could run it on the same service - so this is different from just sharing algorithms

q: writing grants etc. how does that go with cloud services?

a: capital costs (buying servers) typically come out of a different bucket, so this might complicate things. Some in the room have had success, no problems. Some have met skepticism. In the UK they're very concerned about the PATRIOT Act provisions.

q: do you need an AWS specialist?

a: they had someone with an MS in bioinformatics and a BS in bio - picked up how to do the first project in a week, the second was done in 8 hours. Could probably replace that person fairly easily

q: concern with using a free service online - stability/preservation of data

a: after you set up an account, test whether you can get your data back out; if it's super important, then host it on your own site

q: using these in teaching?

a: using Google Wave, using PBwiki, using Blackboard, using the OpenWetWare wiki, (I use OneNote), also Google Docs (they tried wikis first, didn't fly, Google Docs works well for them)

q: proportion of work done in cloud vs. local computing resources

q: boundaries of the institution

a: right now you're either academic or industrial - so this will probably allow independent investigators again: rent some lab time, rent some computing time, and then prototype something. Can also use publicly available data - there are always lots more things to find/use it for than just what the originators foresaw
