Analytics challenges in genomics

Continuing on the theme of data analysis and genomics, here is a presentation I gave for the Data Mining course at Uppsala University in October this year. It talks a little bit about massively parallel DNA sequencing, mentions grand visions such as sequencing millions of genomes, discovering new species by metagenomics and “genomic observatories”, then goes into the practical difficulties and finally suggests some strategies like prediction contests. Enjoy!

1.2 zettabytes of data

OK, so I was a bit slow to discover this, but The Economist has a special report on big data which is freely available online. That is, the individual articles are free, and a PDF compiling them is supposed to cost 3 GBP, but I was able to download it for free here without doing anything special.

A fun fact that I learned from this report is that the total amount of information in the world this year is projected to reach 1.2 ZB (zettabytes), which is 1.2×10^21 bytes. How on earth did they come up with that figure…? Anyway, this report is worth a read, as it touches on things like business analytics, web mining, open government data and augmented cognition, while also giving some well-deserved love to R and open source software.
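The unit arithmetic behind that figure is easy to check. Here is a minimal Python sketch, assuming decimal (SI) prefixes where 1 ZB = 10^21 bytes; the function names `to_bytes` and `convert` are my own and not taken from the report.

```python
from decimal import Decimal

# Decimal (SI) prefixes: each step is a factor of 1000.
# Assumption: the report uses SI units, not binary (2**70) zebibytes.
SI_EXPONENTS = {"kB": 3, "MB": 6, "GB": 9, "TB": 12, "PB": 15, "EB": 18, "ZB": 21}


def to_bytes(amount: str, unit: str) -> int:
    """Convert an amount such as "1.2" in the given unit to a whole number of bytes."""
    return int(Decimal(amount) * Decimal(10) ** SI_EXPONENTS[unit])


def convert(amount: str, from_unit: str, to_unit: str) -> Decimal:
    """Convert an amount between two SI byte units."""
    return Decimal(to_bytes(amount, from_unit)) / Decimal(10) ** SI_EXPONENTS[to_unit]


total_bytes = to_bytes("1.2", "ZB")
print(total_bytes)                   # bytes in 1.2 ZB
print(convert("1.2", "ZB", "TB"))    # the same amount expressed in terabytes
```

The amounts are passed as strings and handled with `Decimal` so that 1.2 is not rounded the way a binary floating-point number would be.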

Open government and municipal data

Several projects having to do with open-access data from governmental and municipal sources have been announced in the past few months. Governments are, of course, already using analytics for a lot of things. This post from A Smarter Planet talks about how the Social Security Administration in the US is “using analytics and predictive modeling to make quicker determinations on disability applications for those in need” and how the US Postal Service is “extracting valuable insights from information on mail delivery to improve on-time delivery performance” (a postal service that improves: now that’s a novelty!). What is new is that (part of) these data will now be available to anyone.

In the US, federal data are or will be made available online. The site seems pretty well-designed, with data sets being queryable online and raw data being downloadable in XML, comma-separated (CSV) and other formats. Among the data sets there are some interesting health-related ones, like the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database.
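As a rough illustration of what you can do with a comma-separated download of this kind, here is a minimal Python sketch using only the standard library. The sample rows and the column names (`year`, `region`, `cases`) are made up for illustration; they do not come from SEER or any other real data set, and a real file would have its own layout.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file.
# The column names and numbers are hypothetical, not real government data.
SAMPLE_CSV = """year,region,cases
2005,North,120
2005,South,95
2006,North,130
2006,South,101
"""


def total_by_column(csv_text, key_column, value_column):
    """Sum the integer values in value_column, grouped by key_column."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        totals[key] = totals.get(key, 0) + int(row[value_column])
    return totals


totals = total_by_column(SAMPLE_CSV, "year", "cases")
for year, cases in sorted(totals.items()):
    print(year, cases)
```

For a real download you would open the file with `open(path, newline="")` and pass the file object to `csv.DictReader` instead of wrapping a string in `io.StringIO`.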

The City of San Francisco has its own data clearinghouse, dataSF. Strongly supported by Mayor Gavin Newsom, it was launched in August. Given the huge population of programmers in SF, it’s no surprise that mash-up applications using the data started to appear quickly, e.g. “Routesy, which offers directions based on real-time city transport feeds; and EcoFinder, which points you to the nearest recycling site for a given item” (see this Guardian article for more). New York City has its own NYC Data Mine, which has come under some criticism. The city is also hosting the BigApps competition, which is meant to “stimulate innovation in the information technology and media industries, and attract and support developer talent to develop web and mobile applications (apps) by using City data.”

The real pioneer in open government data, though, seems to be not the US but Australia. The Australian government recently arranged a “hack day” named GovHack, where they invited programmers to develop mashups and applications around government data during 24 intense hours. Some of the projects that came out of this event were:

LobbyClue, the overall winner, built by a team many of whose members had never met before the event. LobbyClue is an in-depth visualisation of lobbying groups’ relations to government agencies, including tenders awarded, links between the various agencies, and physical office locations.

It’s buggered, mate, which in true Australian style allows you to report buggered toilets, roads, etc., with an easy-to-use graphical interface overlaid on a map. Their idea was to combine this with local government services to fix issues in the community. Built by a team of developers from Lonely Planet.

Know where you live, a stylish presentation of ABS data (along with Flickr Geocoded photos), pulling in relevant information for a particular postcode: rental rates, average income, crime rates, and more. Built by a team of developers who work at News Digital Media.

Pretty cool, isn’t it?

