Follow the Data

A data driven blog

Archive for the month “September, 2009”

Good issue of H+

I’ve been casually following h+, the latest magazine by R U Sirius, the man behind Mondo 2000 and many subsequent publications, but until the latest issue (#4) I’ve always felt it’s been a bit too … let’s say transhumanist for my tastes. This issue, though, is quite nice. There are articles about open source medicine, how it feels to have a new sense (perfect sense of direction, in the form of a device called Northpaw), augmented reality on cell phones (Layar etc.), eliminating suffering through brain science and genetic engineering, and much more.

Perhaps the most interesting article from the point of view of this blog is called Open prediction – How sports fans can help save the world. It’s about On the Record Sports, a sports prediction site where sports fans make open predictions, visible to all. The idea is that by aggregating predictions from lots of users (you could call it crowdsourced predictions), you are likely to get a better prediction than if you had asked a single, though knowledgeable, person. However, since the predictions are all open, you can also track your own (and others’) prediction performance and relate it to everyone else’s.

The fact that large groups of people tend to do well when their guesses are combined is well known and has been discussed at length in, for example, Ian Ayres’ book Super Crunchers. Even aggregations of different prediction algorithms (such as in the machine learning techniques bagging and boosting) usually work well – as evidenced by the recently completed Netflix Prize competition – presumably because algorithms (like people) make prediction errors due to slightly different biases, which can be smoothed out in a combined approach.
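To make the smoothing-out argument concrete, here is a minimal simulation sketch. It assumes that every guess is the true value plus independent random noise; the function name and all numbers are invented for illustration and have nothing to do with On the Record Sports' actual data.

```python
import random
import statistics

def crowd_vs_individual(true_value, n_people, noise_sd, seed=0):
    """Compare the error of the averaged guess with the typical individual error."""
    rng = random.Random(seed)
    # Assumption: each guess is the truth plus that person's own independent error.
    guesses = [true_value + rng.gauss(0, noise_sd) for _ in range(n_people)]
    crowd_error = abs(statistics.mean(guesses) - true_value)
    typical_individual_error = statistics.mean(abs(g - true_value) for g in guesses)
    return crowd_error, typical_individual_error

crowd_error, individual_error = crowd_vs_individual(
    true_value=100.0, n_people=500, noise_sd=20.0
)
print(f"crowd error: {crowd_error:.2f}, typical individual error: {individual_error:.2f}")
```

The gain depends on the errors being at least partly independent. If everybody shares the same bias, averaging cannot remove it, which is presumably why combining different people (or different algorithms) helps more than asking the same kind of predictor many times.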

On The Record Sports may therefore soon be sitting on a very interesting database of aggregated predictions. (At what point will they just go to the bookmaker and bet the farm using the crowdsourced predictions?) The h+ article mentions that the company has a pending patent related to the notion of prediction as entertainment, which sounds intriguing. Of course, prediction can be a sport in itself, as the article points out.

All in all, worth checking out.


Self-tracking news

FitBit have started to ship their clip-on device for tracking the amount of calories burnt, steps taken and distance travelled, as well as sleep quality. As explained here, the FitBit contains an accelerometer (many new phones have that as well, but they are bulkier than the FitBit) which has custom algorithms, trained using “ground truth” measurements like breath gas composition, that can accurately estimate calorie consumption for different kinds of movements like walking to the kitchen, jogging or dashing to the bus.

Now the only thing left is to measure how many calories go into your system. Luckily, DailyBurn have just released FoodScanner, an iPhone app that lets you scan barcodes on the food you buy. It’s currently on sale for just 99 cents (!), but unfortunately it might only be available in the U.S. version of the App Store (at least the Swedish one doesn’t have it as I write this). The information you scan can be uploaded to DailyBurn’s web site and added to a growing food database. Seems pretty cool.

Another recent release (Sept. 17) was a book called Total Recall – How the E-memory Revolution Will Change Everything, by Gordon Bell and Jim Gemmell, who are involved with Microsoft Research’s MyLifeBits project. Bell has been trying to record as much of his life as possible – including photos, letters and phone calls – since 1998. A quote from the blurb, where I find the part in italics particularly interesting:

We are capturing so much of our lives now, be it on the date–and location–stamped photos we take with our smart phones or in the continuous records we have of our emails, instant messages, and tweets–not to mention the GPS tracking of our movements many cars and smart phones do automatically. We are storing what we capture either out there in the “cloud” of services such as Facebook or on our very own increasingly massive and cheap hard drives. But the critical technology, and perhaps least understood, is our magical new ability to find the information we want in the mountain of data that is our past. And not just Google it, but data mine it so that, say, we can chart how much exercise we have been doing in the last four weeks in comparison with what we did four years ago. In health, education, work life, and our personal lives, the Total Recall revolution is going to change everything.

Open access but not open data?

Through Mailund on the Internet, I found a depressing report in PLoS ONE: Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. The authors wanted to check whether researchers who publish in Public Library of Science (PLoS) journals, which are devoted to the principle of open access to scientific articles and data, really abide by those principles themselves.

Thus, they selected ten papers published in either of two PLoS journals (PLoS Medicine and PLoS Clinical Trials), and contacted the investigators behind them to ask for the data that the study was based on. Although it is an explicit requirement from PLoS that data from all studies must be shared, only one (!) of the investigators sent the data to the authors. Four investigators responded and refused to share the data even after they had been reminded of PLoS policy, three didn’t respond, two email addresses were invalid and one investigator requested further details.

Why this miserable reply rate, which could be expected from scientists in general, but not from people who publish in PLoS? Is the reason that those who publish in PLoS Medicine and PLoS Clinical Trials just see it as a way to get their research out there without bothering to keep their part of the open-access deal?

Or could it be a case of poor attitude towards students? The authors’ stated reason for requesting the data, in all cases, was “out of personal interest in the topic and the need for original data for master’s level coursework.” Maybe the important scientists behind the articles felt that they did not have to reply to mere master’s students?

I wonder how breaches of this kind should be handled. Should PLoS (or several different publishers) keep a “black list of data non-sharers”? It is of course difficult to go after this kind of thing.

At a minimum, the scientists who did not reply should be forced to read the recent Nature Special on data sharing!

Crowdsourcing dinosaur science

The recently initiated Open Dinosaur project is an excellent example of crowdsourcing in science. The people behind the project are enlisting volunteers to find skeletal measurements from dinosaurs in published articles and submit them into a common database. Or as they put it, “Essentially, we aim to construct a giant spreadsheet with as many measurements for ornithischian dinosaur limb bones as possible.” All contributors (anyone can participate) get to be co-authors on the paper that will be submitted at the end of the project.

One good thing about the project is that its originators have obviously taken pains to help the participants get going. They’ve put up comprehensive tutorials about dinosaur bone structure (!) and about how to locate relevant references and find the correct information in them.

As of yesterday, they had over 300 verified entries, after just ten days. It will be interesting to see other similar efforts in the future.

Personal transcriptomics?

MIT’s Technology Review has an interesting blog post about Hugh Rienhoff, a clinical geneticist and entrepreneur, who is trying to apply personal genomics (or rather, transcriptomics) to find the causes of his daughter Beatrice’s unusual, Marfan’s syndrome-like symptoms. The blog post describes how Illumina (a leading company in DNA sequencing) has sequenced parts of the genomes of Rienhoff, his wife and his daughter, and how he has now spent about a year searching through these genome sequences for mutations that only Beatrice has.

In fact, looking at another blog post, it seems like they are actually sequencing RNA (mRNA, to be specific) rather than genomic DNA. This makes a lot of sense, because RNA sequencing (RNA-seq) gives information about genes that are actually being expressed – transcribed into mRNA and then presumably translated to proteins. This sort of “transcriptome profiling” should potentially be able to give a lot of information about disease states beyond what can be gleaned from a genome scan (although those are, of course, informative as well.)

From the sequencing data, Rienhoff has compiled a list of about 80 genes that are “less active” in Beatrice than in her parents. (I wonder what tissues or cell types they assayed?) According to the Nature blog post, Illumina will be doing similar transcriptome profiling on up to nine family trios (mum, dad, child) where the child has, for instance, autism or Loeys-Dietz syndrome.
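The articles don't say how “less active” was defined, so the following is only a sketch of one plausible trio filter: flag a gene when the child's expression level is at least some fold lower than in both parents. The gene names, the numbers and the `min_fold` threshold are all hypothetical.

```python
# Hypothetical expression levels (arbitrary units) per gene for each family member.
expression = {
    "GENE_A": {"child": 10.0, "mother": 50.0, "father": 45.0},
    "GENE_B": {"child": 30.0, "mother": 28.0, "father": 35.0},
    "GENE_C": {"child": 5.0, "mother": 6.0, "father": 40.0},
}

def less_active_in_child(expression, min_fold=2.0):
    """Genes whose level in the child is at least `min_fold` times lower than in BOTH parents."""
    hits = []
    for gene, levels in expression.items():
        lowest_parent = min(levels["mother"], levels["father"])
        if levels["child"] * min_fold <= lowest_parent:
            hits.append(gene)
    return sorted(hits)

print(less_active_in_child(expression))
```

A real analysis would also have to deal with measurement noise, replicates and differences between tissues, which is part of why the choice of assayed cell type matters.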

A quote from the Technology Review blog post:

One of the biggest challenges, Rienhoff says, is the software available to analyze the data. “To ask the questions I want to ask would take an army,” he says. “I’m trying to connect the dots between being a genomicist and a clinical geneticist. I don’t think anyone here realizes how difficult that is. I’m willing to take it on because it matters to me.”

Reading about this sort of literally personal genomics/medicine made me think of Jay Tenenbaum and his company CollabRx, which offers a “Personalized Oncology Research Program”, where they “…use state-of-the-art molecular and computational methods to profile the tumor and to identify potential treatments among thousands of approved and investigational drugs”. So the approach here is presumably also to do some sort of individual-based transcriptional profiling, but this time on tumor material. After all, cancer is a heterogeneous disease (or a heterogeneous set of diseases) and tumors probably vary widely between patients. Echoing Rienhoff above, Tenenbaum said in an interesting interview a couple of months ago that biology is becoming an information science and that CollabRx are “heavily dependent on systems and computational biology” (=software, algorithms, data analysis, computing infrastructure).

I applaud the efforts of CollabRx, while simultaneously being sceptical about what can be achieved today using this approach in the way of clinical outcomes. But someone has to be the visionary and pave the way.

Stock predictions on Twitter

In a gutsy move, a company called BAMInvestor has started to broadcast its stock market predictions through its Twitter account. As I am writing this, the latest prediction is that crude oil stocks will plunge at the end of September. The BAM in BAMInvestor stands for “Behavioral Analysis of Markets”, which is a method that “…predicts future price movements in human traded markets through the study of market participants’ emotional responses during periods of high emotion and ‘capitulation.’”, according to the web page. A bit further down, there is a sentence that sends my BS detector into overdrive, “The “guts” of the model are based on proprietary computations, the components of which include elements of—but are not exclusive to—the Fibonacci sequence and its golden ratio, fractal studies, and several unique capitulation thresholds.“, but I’ll try to keep an open mind. (Any time fractals are mentioned, I tend to become very skeptical, and with Fibonacci numbers thrown into the mix, it’s starting to sound downright Dan Brown-ish.)

Taxi driving and data mining

Tim O’Reilly and John Battelle’s Web Squared essay from a couple of months back relates an interesting anecdote:

Radar blogger Nat Torkington tells the story of a taxi driver he met in Wellington, NZ, who kept logs of six weeks of pickups (GPS, weather, passenger, and three other variables), fed them into his computer, and did some analysis to figure out where he should be at any given point in the day to maximize his take. As a result, he’s making a very nice living with much less work than other taxi drivers. Instrumenting the world pays off.

I think this kind of thing could be applied in many different professions. It would be interesting to know how well-versed the taxi driver was in statistics. If he wasn’t statistically trained, he presumably used a simple common-sense model, the success of which suggests that large gains can be had simply by quantifying what you do and picking up the major trends. Of course, he may have been a real data-analysis ninja. Either way, it’s probably fair to say, as O’Reilly and Battelle do in their article, that “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skillset. Employers take notice.” Experienced taxi drivers have probably built up an equally effective implicit model of how to get the most income, but the Wellington taxi driver may have been able to “skip ahead” a couple of years using his statistics.
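As a rough idea of what such a simple common-sense model could look like, the sketch below averages fares per neighbourhood and hour and picks the best neighbourhood for each hour. The log entries and field names are invented; nothing is known about what the driver actually recorded beyond the variables mentioned in the quote.

```python
from collections import defaultdict

# Hypothetical pickup log: (neighbourhood, hour of day, fare). Invented for illustration.
pickups = [
    ("airport", 8, 42.0), ("airport", 8, 38.0), ("downtown", 8, 12.0),
    ("downtown", 18, 25.0), ("downtown", 18, 31.0), ("airport", 18, 20.0),
    ("harbour", 18, 15.0),
]

def best_zone_per_hour(records):
    """Return {hour: (zone, average fare)} for the zone with the highest average fare."""
    fares = defaultdict(list)
    for zone, hour, fare in records:
        fares[(zone, hour)].append(fare)
    best = {}
    for (zone, hour), values in fares.items():
        avg = sum(values) / len(values)
        if hour not in best or avg > best[hour][1]:
            best[hour] = (zone, avg)
    return best

for hour, (zone, avg) in sorted(best_zone_per_hour(pickups).items()):
    print(f"{hour:02d}:00 -> wait in {zone} (average fare {avg:.2f})")
```

Average fare per pickup is a crude target. Income per hour, which also accounts for waiting time between passengers, would probably be closer to what a driver cares about.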

Another thought that occurred to me is how one would go about building a generic web-based tool where people can track everyday data with a view towards prediction. It would likely be a combination of something like your.flowingdata for the tracking and predict.i2pi for the simple, no-fuss prediction part. Maybe such an application already exists?

The user would of course still have to put some work into defining the problem properly, like deciding what to record and how to encode it. For instance, the taxi driver mentioned above would have had to think about whether to record his location in terms of, for example, neighbourhoods, streets or exact GPS location (or all three) – each likely giving rise to its own advantages and drawbacks.

A really useful general tracking/prediction tool would probably also need some sort of automatic model optimization and validation framework (e.g. built-in variable selection and cross-validation cycles), which would be mostly kept out of the user’s view (unless the user explicitly wants to see it).
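Here is a minimal sketch of the kind of validation loop such a tool might run behind the scenes: k-fold cross-validation used to choose between two very simple candidate models. The data set, the two models and all names are made up for the example.

```python
import random

def k_fold_mae(rows, fit, k=5, seed=0):
    """Mean absolute error of `fit`, estimated by k-fold cross-validation.

    rows: list of (feature, target) pairs.
    fit:  function taking training rows and returning a predict(feature) function.
    """
    rows = rows[:]                      # copy so the caller's list is not shuffled
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit(train)
        errors.extend(abs(predict(x) - y) for x, y in held_out)
    return sum(errors) / len(errors)

def fit_global_mean(train):
    """Baseline: ignore the feature and always predict the overall mean."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def fit_group_mean(train):
    """Candidate: predict the mean of the training rows sharing the same feature value."""
    overall = sum(y for _, y in train) / len(train)
    groups = {}
    for x, y in train:
        groups.setdefault(x, []).append(y)
    means = {x: sum(ys) / len(ys) for x, ys in groups.items()}
    return lambda x: means.get(x, overall)   # fall back for unseen groups

# Hypothetical data: the fare depends on the zone, plus random noise.
rng = random.Random(1)
rows = [(zone, base + rng.gauss(0, 3))
        for zone, base in [("airport", 40), ("downtown", 25), ("harbour", 15)]
        for _ in range(30)]

candidates = {"global mean": fit_global_mean, "per-zone mean": fit_group_mean}
scores = {name: k_fold_mae(rows, fit) for name, fit in candidates.items()}
print(scores)
print("selected:", min(scores, key=scores.get))
```

Variable selection could reuse the same loop: fit one candidate per subset of recorded variables and keep the subset with the lowest cross-validated error.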

Everything is contagious

I’ve been putting off writing a lengthy blog post on this topic for a while, but today I found that both the New York Times and Wired have new articles out on the same subject (see below), so I might as well point to them while at the same time offering some of my half-baked thoughts.

A couple of weeks back, I was listening to a podcast from the SmartData Collective podcast series where a guy named Korhan Yunak talked about predicting if and when a customer would cancel their mobile phone subscription and switch to another provider. All kinds of demographic information, behavioural data and other things have been used to try to extract features that predict such switches. Yunak explained that recent research had found that essentially, subscription switches propagate through social networks. What does that mean exactly?

Phone companies can construct a customer network by collecting “connections” between customers (for example, by linking everyone that has called or texted each other). By simply looking at a customer’s network neighborhood – their direct connections (often friends) and perhaps the friends of friends – the companies can get a huge boost in their predictive accuracy (I’ve forgotten the exact number and metric, but it was a major improvement).
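The neighborhood idea can be sketched in a few lines. The example below builds an undirected network from call records and computes, for each customer, the share of people within one or two hops who have already switched. The call records and customer names are invented, and this particular feature is just one guess at what might feed into such a model; the podcast did not go into that level of detail.

```python
from collections import defaultdict

# Hypothetical call records (caller, callee) and customers known to have switched.
calls = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "cat"), ("eve", "dan")]
churned = {"bob", "cat"}

def build_network(call_records):
    """Undirected graph: customers are linked if either has called the other."""
    neighbours = defaultdict(set)
    for a, b in call_records:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours

def churned_fraction(customer, neighbours, churned, depth=1):
    """Share of customers within `depth` hops (excluding the customer) who churned."""
    seen, frontier = {customer}, {customer}
    for _ in range(depth):
        frontier = {n for c in frontier for n in neighbours[c]} - seen
        seen |= frontier
    reachable = seen - {customer}
    return len(reachable & churned) / len(reachable) if reachable else 0.0

network = build_network(calls)
for customer in sorted(network):
    print(customer,
          churned_fraction(customer, network, churned),
          churned_fraction(customer, network, churned, depth=2))
```

Numbers like these would then be added as extra columns next to the usual demographic and behavioural features.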

Now, it is not surprising in itself that people talk to each other and influence each other in different ways, but it was surprising to me that the effect was so strong. It made me think of earlier published work which showed that obesity, happiness and smoking are all “socially contagious” in the sense that they seem to spread through social networks.

As I mentioned above, there is a new Wired article by Jonah Lehrer which talks about these things and has nice visualizations of them as well. There’s also a New York Times article on the same theme by Clive Thompson, but I haven’t read it because of the paywall.

These findings, of course, suggest a new kind of “network marketing” (the “old kind” also goes by the name of multi-level marketing). The idea is that you can use information about a customer’s friends’ preferences and shopping behavior to construct more precise targeted ads and other marketing strategies. Companies based around such ideas include Media6°, which “…connects a brand’s existing customers with user segments composed entirely of consumers who are interwoven via the social graph.” Another company, 33Across, “…uses previously untapped social data sources, in combination with advanced social network algorithms, to create unique and scalable audience segments.” Both companies do this by capturing data from social network sites on the web, according to this article.

Launch script for clojure

Yesterday I was trying to get the clojure web framework compojure running, but I had some trouble. It seems that the version of clojure I was running wasn’t compatible with the latest and greatest compojure off github, or that I hadn’t included all the jars needed. I found a tip in one of the comments here for specifying a folder where all jars are included, which takes the pain out of adding each jar individually. So I have now adapted my script for starting clojure (originally taken from elsewhere) so that it includes a full folder’s worth of jars instead of specifying each jar by itself.

Here’s my new launch script for clojure:


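JAVA=${JAVA:-java}   # assumption: fall back to the `java` on the PATH if JAVA is not set

# Add extra folders as specified by `.clojure_libs` file
if [ -f .clojure_libs ]; then
    EXT_DIRS=$EXT_DIRS:`cat .clojure_libs`
fi

if [ -z "$1" ]; then
    # No arguments: start an interactive REPL
    $JAVA -server -Djava.ext.dirs=$EXT_DIRS \
        jline.ConsoleRunner clojure.lang.Repl
else
    # Otherwise run the named script; the argument list is passed through after --
    scriptname=$1
    $JAVA -server -Djava.ext.dirs=$EXT_DIRS clojure.lang.Script $scriptname -- "$@"
fi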

Now that I have a working base default setup of compojure it is time to start writing something useful! Will need to make slime work without hiccups too.

Personal genome glitch uncovered

As recounted in this New Scientist article and commented upon in Bio-IT World, journalist Peter Aldhous managed to uncover a bug in the deCODEme browser (Decode Genetics’ online tool for viewing parts of your own genome). deCODEme is one of a handful of services, including 23andme and Navigenics, that genotype small genetic variations called SNPs (snips; single-nucleotide polymorphisms) in DNA samples submitted by customers. The results are then used to calculate disease risks and other things, which are displayed to the customer in a personalized view of his or her genome.

Aldhous was comparing the output he got from two of these services – deCODEme and 23andme  – and discovered that they were sometimes very different. After patiently going to the bottom of the matter, he discovered that the reason for the discrepancy was that the deCODEme browser sometimes (but not always) displayed jumbled output for mitochondrial sequences. According to Bio-IT World, the bug seems to have been due to an inconsistency between 32-bit and 64-bit computing environments and has now been fixed.

Isn’t this a nice example of computational journalism, where a journalist is skilled or persistent enough to actually analyze the data that is being served up and detect inconsistencies?

I might as well sneak in another New Scientist article about personal genomes. This one urges you to make your genome public in the name of the public good. It mentions the Harvard Personal Genome Project, which aims to enroll 100,000 (!!) participants whose genomes will be sequenced. The first ten participants, some of whom are pretty famous, have agreed to share their DNA sequence freely.

I have no idea whether the Personal Genome Project is related to the Coriell Personalized Medicine Collaborative, which also wants to enroll 100,000 participants in a longitudinal study where the goal is to find out how much utility there is in using personal genome information in health management and clinical decision-making.
