Genomics and data stuff
I’ve noticed that a lot of the traffic to this blog is driven by queries about genomics and “big data”. I guess I’d better re-post some of the more useful resources. Perhaps I will even make a mini-series out of it.
So, without further ado: Anyone who is interested in the intersection of genomics and (big) data (science) might want to check out the following.
- Life Science Storage & Data Management – A very nice and thorough write-up of storage and management of genomics data. My only quibble is with the title – life science data does not equal DNA sequences and derivatives thereof.
- Grappling with a Big Data Blizzard in Genomics – A presentation by Prof. Mark Gerstein about data integration and analysis challenges. It is more biologically oriented than the previous item and perhaps hard to follow in parts for non-biologists.
- Titus Brown’s writings on handling huge (meta-)genomics datasets using tricks such as Bloom filters, graph partitioning and count-min sketches. Look at some of his publications (for instance the NSF BIGDATA proposal Low-memory Streaming Prefilters for Biological Sequencing Data; Titus puts his grant applications online for free, which is nice), his YouTube clips, and his blog.
- Atul Butte’s work, for a demonstration of how “big data” from scientific repositories can be “resurrected”, reanalyzed and integrated to produce new findings. See also his YouTube clips, for example Translating a Trillion Points of Data into New Insights into Disease.
- Jonas Almeida’s presentation on “big data” and medicine. Also check out A Self-updating Roadmap of the Cancer Genome Atlas.
Of course there are many others. I’ll follow up with more material during the next few weeks.