Follow the Data

A data driven blog

Archive for the tag “Links”

This & that

  • The BigML blog has been on a roll lately with many interesting posts. I particularly liked this one, Bedtime for Boosting, which goes pretty deep into benchmarking various versions of the boosting algorithms we all know and love (?).
  • Mark Gerstein of Yale University has a nice slide deck about the big data blizzard in genomics (<– pdf link). There are lots of ideas here about how to build predictive models based on, for example, ENCODE data. I won’t get into the ongoing controversy around ENCODE here; suffice it to say that I think the ENCODE data sets are a good resource for starting to build statistical models of genomic regulation on a larger scale.
  • The O’Reilly Radar has a good post about how Python data tools just keep getting better.
  • An “ultra-tricky” bioinformatics challenge will be run by Genome Biology on DNA Day (April 25), with a “truly awesome” prize. Intriguing.

Summer reading

Some nice reading for the summer (in case of a rainy day of course):

  • Prediction, Learning and Games (PDF link) – Nice textbook on prediction. Via @ML_hipster (worth following on Twitter if you like @bigdatahipster and/or authentic, hand-crafted decision trees)
  • Data Science 101, a very nice blog which points to a multitude of resources
  • School of Data and the accompanying Data Wrangling Handbook
  • Agile Data by Russell Jurney (who is well worth following on Twitter and especially Quora). This book isn’t finished yet but can be viewed in its current state of development at the given link, which is within the Open Feedback Publishing System at O’Reilly Media. So you can, on one hand, read the book (or parts of it) for free before publication, and on the other hand, provide feedback and thus shape the contents of the book.
  • (edit 17/7 2012) Might as well throw this one in: Data Jujitsu: The Art of Turning Data into Product by DJ Patil, a free O’Reilly Radar report (epub/PDF/mobile).

Links without a common theme

  • Are we ready for a true data disaster? Interesting Infoworld article that talks about possibilities for devastating “data spills” that could have effects as bad as the oil spill, or worse.
  • Monkey Analytics – a “web based computation tool” that lets users run R, Python and Matlab commands in the cloud.
  • Blogs and tweets could predict the future. New Scientist article that mentions Google’s study from last year where they tried to use search data to predict various economic variables. A lot of organizations have seized upon that idea, and lately we have seen examples such as Recorded Future, a company that attempts to “mine the future” using future-related online text sources. Google famously used the “predictions from search data” idea to predict flu outbreaks. One of the interesting things here, I think, is that people’s searches (which could be viewed naïvely as ways to obtain data) actually become data in themselves; data that can be used as predictors in statistical models. The Physics of Data is an interesting video where Google’s Marissa Mayer talks about this topic and a lot of other googly stuff (I don’t really get the name of the presentation though, despite her attempt to justify it in the beginning …).
  • Wikiposit aims to be a “Wikipedia of numerical data.” It aggregates thousands of public data sets (currently 110,000) into a single format and offers a simple API to access them. As of now, it only supports time series data, mostly from the financial domain.
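
To make the “searches as predictors” idea above a bit more concrete, here is a minimal sketch in Python. It fits an ordinary least-squares line that predicts an outcome from search volume. The numbers are made up purely for illustration, the names (`fit_line`, `search_volume`, `reported_cases`) are hypothetical, and this is not the method Google or anyone else mentioned here actually used; real systems use many search terms and far more careful modelling.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (invented, not real measurements): weekly search volume for some
# term, and the outcome we would like to predict from it.
search_volume = [10.0, 20.0, 30.0, 40.0, 50.0]
reported_cases = [25.0, 45.0, 65.0, 85.0, 105.0]

intercept, slope = fit_line(search_volume, reported_cases)

# Predict the outcome for a week with a search volume of 60
prediction = intercept + slope * 60.0
print(intercept, slope, prediction)
```

The point is only that the search counts sit on the predictor side of the model, like any other measured variable.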

Link roundup

Here are some interesting links from the past few weeks (or in some cases, months). I’m toying with the idea of just tweeting most of the links I find in the future and reserving the blog for more in-depth ruminations. We’ll see how it turns out. Anyway … here are some links!

Open Data

The collaborative filtering news site Reddit has introduced a new Open Data category.

Following the example of New York and San Francisco (among others), London will launch an open data platform, the London Data Store.

Personal informatics and medicine

Quantified Self has a growing (and open/editable) list of self-tracking and related resources. Notable among those is Personal Informatics, which itself tracks a number of resources – I like the term personal informatics and the site looks slick.

Nicholas Felton’s Annual Report 2009. “Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report.” Amazing guy.

What can I do with my personal genome? A slide show by LaBlogga of Broader Perspectives.

David Ewing Duncan, “the experimental man”, has read Francis Collins’ new book about the future of personalized medicine (Language of Life: DNA and the Revolution in Personalized Medicine) and written a rather lukewarm review of it.

Duncan himself is involved in a very cool experiment (again) – the company Cellular Dynamics International has promised to grow him some personalized heart cells. Say what? Well, basically, they are going to take blood cells from him, “re-program” them back to stem-cell-like cells (induced pluripotent stem cells), and make those differentiate into heart cells. These will of course be a perfect genetic match for him.

Duncan has also put information about his SNPs (single-nucleotide polymorphisms; basically variable DNA positions that differ from person to person) online for anyone to view, and promises to make 2010 the year when he tries to make sense of all the data, including SNP information, that he obtained about his body when he was writing his book Experimental Man. As he puts it, “Producing huge piles of DNA for less money is exciting, but it’s time to move to the next step: to discover what all of this means.”

HolGenTech – a smartphone-based system for scanning barcodes of products and matching them to your genome (!) – that is, it can tell you to avoid some products if you have had a genome scan which found you have a genetic predisposition to react badly to certain substances. I don’t think that the marketing video is done in a very responsible way (it says that the system “makes all the optimal choices for your health and well being every time you shop for your genome”, but this is simply not true – we know too little about genomic risk factors to be able to make any kind of “optimal” choices), but I had to mention it.

The genome they use in the above presentation belongs to the journalist Boonsri Dickinson. Here are some interviews she recently did with Esther Dyson and Leroy Hood, on personalized medicine and systems biology, respectively, at the Personalized Medicine World Conference in January.

Online calculators for cancer outcome and general lifestyle advice. These are very much in the spirit of The Decision Tree blog, through which I in fact found these calculators.

Data mining

Microsoft has patented a system for “Personal Data Mining”. It is pretty heavy reading and I know too little about patents to be able to tell how much this would actually prevent anyone from building various types of recommendation systems and personal data mining tools in the future; probably not to any significant extent?

OKCupid has a fun analysis of various characteristics of profile pictures and how they correlate with online dating success. They mined over 7000 user profiles and associated images. Of course there are numerous caveats in the data interpretation and these are discussed in the comments; still good fun.

A microgaming network has tried to curb data mining of their poker data. Among other things, bulk downloading of hand histories will be made impossible.

Link roundup

Gearing up into Christmas mode, so no proper write-up for these (interesting) links.

Personalized medicine is about data, not (just) drugs. Written by Thomas Goetz of The Decision Tree for Huffington Post. The Decision Tree also has a nice post about why self-tracking isn’t just for geeks.

A Billion Little Experiments (PDF link). An eloquent essay/report about “good” and “bad” patients and doctors, compliance, and access to your own health data.

Latent Semantic Indexing worked well for Netflix, but not for dating. MIT Technology Review writes about how the algorithms used to match people at Match.com (based on latent semantic indexing / SVD) are close to worthless. A bit lightweight, but a fun read.

A podcast about data mining in the mobile world. Featuring Deborah Estrin and Tom Mitchell. Mitchell just recently wrote an article in Science about how data mining is changing: Mining Our Reality (subscription needed). The take-home message (or one of them) is that data mining is becoming much more real-time oriented. Data are increasingly being analyzed on the fly and used to make quick decisions.

How Zeo, the sleep optimizer, actually works. I mentioned Zeo in a blog post in August.
