Follow the Data

A data driven blog

Archive for the category “Links”

Some resources

  • DataTau seems like a worthwhile Reddit-like site devoted to all things data.
  • Foundations of data science [PDF link]. A (quite complete) draft of a rather mathematically oriented book on data science. I haven’t had time to read it yet but it looks interesting. It seems to put quite a lot of emphasis on understanding the quirks of high-dimensional spaces.
  • Techniques to improve the accuracy of your predictive models. Useful (1h 20 min long) video of a presentation given by Phil Brierley at an R user group meeting in Melbourne.

Three angles on crowd science

Some recent news items that illuminate crowd science (advancing science by somehow leveraging a community) from three different angles.

  • The Harvard Clinical and Translational Science Center (or Harvard Catalyst) has “launched a pilot service through which researchers at the university can submit computational problems in areas such as genomics, proteomics, radiology, pathology, and epidemiology” via the TopCoder online competitive community for software development and digital creation. One recently started Harvard Catalyst challenge is called FitnessEstimator. The aim of the project is to “use next-generation sequencing data to determine the abundance of specific DNA sequences at multiple time points in order to determine the fitness of specific sequences in the presence of selective pressure.” As an example, the project abstract notes that such an approach might be used to measure how certain bacterial sequences become enriched or depleted in the presence of antibiotics. (The quotes are from a GenomeWeb article that is behind a paywall.) I think it’s very interesting to use online software development contests for scientific purposes, as a very useful complement to Kaggle competitions, where the focus is more on data analysis. Sometimes, really good code is important too!
  • This press release describes the idea of connectomics (which is very big in neuroscience circles now) and how the connectomics researcher Sebastian Seung and colleagues have developed a new online game, EyeWire, where players trace neural branches “through images of mouse brain scans by playing a simple online game, helping the computer to color a neuron as if the images were part of a three-dimensional coloring book.” The images are actual data from the lab of professor Winfried Denk. “Humans collectively spend 600 years each day playing Angry Birds. We harness this love of gaming for connectome analysis,” says Prof. Seung in the press release. (For similar online games that benefit research, see e.g. Phylo, FoldIt and EteRNA.)
  • Wisdom of Crowds for Robust Gene Network Inference is a newly published paper in Nature Methods, where the authors looked at a kind of community ensemble prediction method. Let’s backtrack a bit. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) initiative is a yearly challenge where contestants try to reverse engineer various kinds of biological networks and/or predict the output of some or all nodes in the network under various conditions. (If it sounds too abstract, go to the link above and check out what the actual challenges have been like.) The DREAM initiative is a nice way to check the performance of the currently touted methods in an unbiased way. In the Nature Methods paper, the authors show that “no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets” and that “Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.” So, in a very wisdom-of-crowds manner (as indeed the paper title suggests), it’s better to combine the predictions of all the contestants than just use the best ones. It’s like taking a composite prediction of all Kaggle competitors in a certain contest and observing that this composite prediction was superior to all individual teams’ predictions. I’m sure Kaggle has already done this kind of experiment; does anyone know?
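The composite-prediction idea can be sketched as simple rank averaging: each method ranks the candidate network links by confidence, and the community prediction is the average of those ranks. The scores below are made-up illustration values, not the paper's data.

```python
import numpy as np

# Hypothetical confidence scores from three inference methods for the same
# five candidate regulatory links (higher = more confident).
scores = np.array([
    [0.9, 0.2, 0.7, 0.1, 0.5],   # method A
    [0.6, 0.3, 0.8, 0.2, 0.4],   # method B
    [0.8, 0.1, 0.9, 0.3, 0.2],   # method C
])

# Rank links within each method (rank 1 = most confident), then average
# the ranks across methods to get the community prediction.
ranks = scores.shape[1] - scores.argsort(axis=1).argsort(axis=1)
community = ranks.mean(axis=0)

# Candidate links sorted by averaged rank, best first.
order = community.argsort()
print(order)
```

The point of the averaging is robustness: a link that one method wildly over-scores gets pulled back toward the middle by the other methods.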

Snippets

Some interesting nuggets gleaned from the web, without much editorial comment :-)

Quick links

Links without a common theme

  • Are we ready for a true data disaster? Interesting Infoworld article that talks about possibilities for devastating “data spills” that could have effects as bad as the oil spill, or worse.
  • Monkey Analytics - a “web based computation tool” that lets users run R, Python and Matlab commands in the cloud.
  • Blogs and tweets could predict the future. New Scientist article that mentions Google’s study from last year where they tried to use search data to predict various economic variables. A lot of organizations have seized upon that idea, and lately we have seen examples such as Recorded Future, a company that attempts to “mine the future” using future-related online text sources. Google famously used the “predictions from search data” idea to predict flu outbreaks. One of the interesting things here, I think, is that people’s searches (which could be viewed naïvely as ways to obtain data) actually become data in themselves; data that can be used as predictors in statistical models. The Physics of Data is an interesting video where Google’s Marissa Mayer talks about this topic and a lot of other googly stuff (I don’t really get the name of the presentation though, despite her attempt to justify it in the beginning …).
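The “searches as predictors” idea boils down to ordinary regression with search-term frequencies as the features. A minimal sketch, using entirely synthetic numbers (this is not Google’s actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly frequencies of three flu-related search terms
# (columns) over 52 weeks (rows) -- purely synthetic data.
searches = rng.random((52, 3))

# Pretend the true weekly case counts are a noisy linear function
# of the search frequencies.
true_weights = np.array([5.0, 2.0, 0.5])
cases = searches @ true_weights + rng.normal(0, 0.1, size=52)

# Fit a linear model: the searches become predictors of the cases.
X = np.column_stack([searches, np.ones(52)])   # add an intercept column
coef, *_ = np.linalg.lstsq(X, cases, rcond=None)
print(coef.round(2))   # recovered weights should be near [5, 2, 0.5, 0]
```

Once fitted, the same model can be applied to this week’s search frequencies to “nowcast” case counts before official statistics arrive, which is the appeal of the approach.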
  • Wikiposit aims to be a “Wikipedia of numerical data.” It aggregates thousands of public data sets (currently 110,000) into a single format and offers a simple API to access them. As of now, it only supports time series data, mostly from the financial domain.

Link roundup

Here are some interesting links from the past few weeks (or in some cases, months). I’m toying with the idea of just tweeting most of the links I find in the future and reserving the blog for more in-depth ruminations. We’ll see how it turns out. Anyway … here are some links!

Open Data

The collaborative filtering news site Reddit has introduced a new Open Data category.

Following the example of New York and San Francisco (among others), London will launch an open data platform, the London Data Store.

Personal informatics and medicine

Quantified Self has a growing (and open/editable) list of self-tracking and related resources. Notable among those is Personal Informatics, which itself tracks a number of resources – I like the term personal informatics and the site looks slick.

Nicholas Felton’s Annual Report 2009. “Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report.” Amazing guy.

What can I do with my personal genome? A slide show by LaBlogga of Broader Perspectives.

David Ewing Duncan, “the experimental man”, has read Francis Collins’ new book about the future of personalized medicine (Language of Life: DNA and the Revolution in Personalized Medicine) and written a rather lukewarm review about it.

Duncan himself is involved in a very cool experiment (again) – the company Cellular Dynamics International has promised to grow him some personalized heart cells. Say what? Well, basically, they are going to take blood cells from him, “re-program” them back to stem-cell-like cells (induced pluripotent cells), and make those differentiate into heart cells. These will of course be a perfect genetic match for him.

Duncan has also put information about his SNPs (single-nucleotide polymorphisms; basically variable DNA positions that differ from person to person) online for anyone to view, and promises to make 2010 the year when he tries to make sense of all the data, including SNP information, that he obtained about his body when he was writing his book Experimental Man. As he puts it, “Producing huge piles of DNA for less money is exciting, but it’s time to move to the next step: to discover what all of this means.”

HolGenTech – a smartphone based system for scanning barcodes of products and matching them to your genome (!) – that is, it can tell you to avoid some products if you have had a genome scan which found you have a genetic predisposition to react badly to certain substances. I don’t think that the marketing video is done in a very responsible way (it says that the system “makes all the optimal choices for your health and well being every time you shop for your genome“, but this is simply not true – we know too little about genomic risk factors to be able to make any kind of “optimal” choices), but I had to mention it.

The genome they use in the above presentation belongs to the journalist Boonsri Dickinson. Here are some interviews she recently did with Esther Dyson and Leroy Hood, on personalized medicine and systems biology, respectively, at the Personalized Medicine World Conference in January.

Online calculators for cancer outcome and general lifestyle advice. These are very much in the spirit of The Decision Tree blog, through which I in fact found these calculators.

Data mining

Microsoft has patented a system for “Personal Data Mining”. It is pretty heavy reading, and I know too little about patents to be able to tell how much this would actually prevent anyone from doing various types of recommendation systems and personal data mining tools in the future; probably not to any significant extent?

OKCupid has a fun analysis about various characteristics of profile pictures and how they correlate to online dating success. They mined over 7000 user profiles and associated images. Of course there are numerous caveats in the data interpretation and these are discussed in the comments; still good fun.

A microgaming network has tried to curb data mining of their poker data. Among other things, bulk downloading of hand histories will be made impossible.

Link roundup

Gearing up into Christmas mode, so no proper write-up for these (interesting) links.

Personalized medicine is about data, not (just) drugs. Written by Thomas Goetz of The Decision Tree for Huffington Post. The Decision tree also has a nice post about why self-tracking isn’t just for geeks.

A Billion Little Experiments (PDF link). An eloquent essay/report about “good” and “bad” patients and doctors, compliance, and access to your own health data.

Latent Semantic Indexing worked well for Netflix, but not for dating. MIT Technology Review writes about how the algorithms used to match people at Match.com (based on latent semantic indexing / SVD) are close to worthless. A bit lightweight, but a fun read.
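For readers who haven’t met the technique: latent semantic indexing compresses a big ratings or term matrix with a truncated SVD and then compares rows in the resulting latent space. A toy sketch with a made-up user-by-interest matrix (this is an illustration of the general SVD idea, not Match.com’s or Netflix’s actual system):

```python
import numpy as np

# Toy user-by-interest matrix (rows: users, columns: interests);
# hypothetical data standing in for profile features.
A = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Truncated SVD keeps only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
users_latent = U[:, :k] * s[:k]   # each user as a point in latent space

# Cosine similarity in the latent space is the basis for matching.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(users_latent[0], users_latent[1]))  # similar users: near 1
print(cosine(users_latent[0], users_latent[2]))  # dissimilar users: near 0
```

The Technology Review piece essentially argues that for dating, high similarity in such a latent space doesn’t translate into actual compatibility, which is a property of the data rather than of the math.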

A podcast about data mining in the mobile world. Featuring Deborah Estrin and Tom Mitchell. Mitchell just recently wrote an article in Science about how data mining is changing: Mining Our Reality (subscription needed). The take-home message (or one of them) is that data mining is becoming much more real-time oriented. Data are increasingly being analyzed on the fly and used to make quick decisions.

How Zeo, the sleep optimizer, actually works. I mentioned Zeo in a blog post in August.

Link roundup

A roundup of some interesting links from the past few weeks.

Brian Mossop of the Decision Tree blog is embarking on a project to find out how much personal data is needed to stay healthy. He will use devices like the Zeo sleep coach and the Nike+ sportband to record his personal data and post updates about what he has found. He’s also promised a longer blog post after 30 days summarizing his experiences.

dnaSnips is a site that compares reports received by the same person from three different direct-to-consumer (DTC) genetic analysis services: 23andMe, deCODEme and Navigenics. Summarizing the experiment, the author feels that all three services give pretty accurate results. (Link found via Anthony Fejes’ blog.)

By now, there are probably few people who haven’t heard about the scientist who turned out to be call girl blogger Belle de Jour. I was intrigued to find that her Amazon wishlist contains hardcore statistical data analysis books like Chris Bishop’s Neural Networks for Pattern Recognition and An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. In fact, she’s previously blogged about her Bayesian theory of relationships, so she’s clearly no slouch when it comes to statistics and data mining.

Algorithms in healthcare

The Medical Quack has made many posts about the use of algorithms (which, of course, go hand in hand with data) in healthcare over the past months. Here are links to a few of them:

  • Staying Alive – Dr. Craig Freied Co-Founder of (Azyxxi) Microsoft Amalga And Hospital Safety (ABC 20/20) – about the “data junkie” behind Microsoft’s Amalga system
  • The Algorithm Report: HealthCare and Other Related Instances
  • Brave New Films – “United Wealth Care” – It’s In the Algorithms
  • Are We Ever Going to Get Some Algorithm Centric Laws Passed for Healthcare!
  • Health Care Insurers Suggest Algorithms and Business Intelligence solutions to provide health insurance solution

It gets lonely in the data mines

From the rather interesting BERG blog, a long and thoughtful post on exploring data using (programming) code in a process called material exploration. In this case, the data exploration is done in the context of developing a new information system called Ashdown, which is related to the British education system. The author of the blog post argues that in data-exploration projects, code becomes “... a sculpting tool, rather than an engineering material“.

BERG has done material explorations before – they were a big part of our Nokia Personalisation project, for instance – and the value of them is fairly immediate when the materials involved are things you can touch.

But Ashdown is a software project for the web – its substrate is data. What’s the value of a material exploration with an immaterial substrate? What does it look like to perform such explorations?

There are problems and risks in data exploration. Of course, there is the exploration vs. exploitation dilemma, which turns up in e.g. reinforcement learning – how should one balance the desire to explore every nook and cranny of the data space versus the ability to recognize a good-enough strategy and put all effort into fine-tuning it?
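The classic toy illustration of that dilemma is the epsilon-greedy bandit strategy: with a small probability you explore a random option, otherwise you exploit the best one seen so far. A minimal sketch with hypothetical reward distributions:

```python
import random

def epsilon_greedy(true_means, steps=10000, eps=0.1, seed=42):
    """Pull arms with noisy Gaussian rewards; explore with probability eps."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(len(true_means))  # explore a random arm
        else:                                     # exploit the best estimate
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental running mean of observed rewards for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy([0.2, 0.5, 0.9])
print(counts)   # most pulls should end up on the best arm (index 2)
```

Setting eps too low risks locking onto a mediocre arm forever; setting it too high wastes pulls on options already known to be poor, which is exactly the trade-off described above.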

The author hints at risks related to becoming overwhelmed, even possessed by the data:

[The] dataset – its meaning, its structure – gets stuck in your head, and it’s easy to lose yourself to it. That often makes it harder to explain to others – you start talking in a different language – so it becomes critical to get it out of your head and onto screens.

It also feels lonely in the data-mines at times. Not because you’re the only person working on it, but because no-one else can speak the language you do; the deeper you get into the data, the harder you have to work to communicate it, and the quicker you forget how little anyone else on the project knows.
