Follow the Data

A data driven blog

Archive for the category “Web sites”

Biology-inspired algorithm design and non-obvious news discovery

There is a new Science article which seems really cool, although I haven’t had time to get past the paywall yet. The title is “A Biological Solution to a Fundamental Distributed Computing Problem” and the gist of it is pretty simple: a research group has found that an important procedure in distributed computing, “maximal independent set selection”, has been solved in a simple and efficient way in a kind of fly’s nervous system development. An algorithm based on the process that occurs in the fly’s immature nervous system can be directly applied to a network of sensors, for example.
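To make "maximal independent set selection" a bit more concrete, here is a minimal sketch of a generic round-based randomized scheme. It is not the algorithm from the Science article (which I haven't read); the function name and the graph representation are my own assumptions. The goal is a set of nodes in which no two are neighbours and to which no further node can be added.

```python
import random

def randomized_mis(adjacency, seed=0):
    """Pick a maximal independent set with a simple randomized scheme.

    adjacency: dict mapping node -> set of neighbouring nodes (undirected).
    Illustrative only; not the fly-inspired algorithm from the paper.
    """
    rng = random.Random(seed)
    undecided = set(adjacency)
    selected = set()
    while undecided:
        # Every undecided node draws a random number this round.
        draw = {node: rng.random() for node in undecided}
        # A node wins if its draw beats all of its undecided neighbours.
        winners = {
            node for node in undecided
            if all(draw[node] < draw[nb]
                   for nb in adjacency[node] if nb in undecided)
        }
        selected |= winners
        # Winners and their neighbours take no part in later rounds.
        for node in winners:
            undecided.discard(node)
            undecided -= adjacency[node]
    return selected

# A small hypothetical sensor network.
graph = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
leaders = randomized_mis(graph, seed=1)
```

Two neighbours can never both win a round, and every node that drops out is either selected or adjacent to a selected node, so the result is both independent and maximal.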

In other news, Bradford Cross, who started the data-driven flight-delay prediction company FlightCaster, is starting a new company called Woven. It will be about discovering news you are interested in, and the platform will explicitly consider a conundrum that I’ve often been thinking about (and possibly mentioned in some earlier blog post): Do you really want to read news that is always perfectly tailored to your interests? Wouldn’t this cause you to miss a lot of interesting information that you get from e.g. browsing the newspaper and “accidentally” reading about things you didn’t know about but which are actually kind of interesting? Bradford Cross mentions this in a recent interview and says that he started to “miss the serendipity that a newspaper provides”. So far so good, but how do you actually build this kind of quasi-random content exposure (I tend to think of it as a kind of beneficial noise) into a news discovery service? I guess we will soon see what Woven has in mind.
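One naive way to picture the "beneficial noise" idea is an epsilon-greedy style mix: most slots go to the stories ranked highest for you, and a small fraction go to random stories from everywhere else. This is purely a sketch under my own assumptions (the function name, the `noise` parameter and the two input lists are hypothetical), and I have no idea whether Woven does anything like it.

```python
import random

def mix_in_serendipity(tailored, other, n=10, noise=0.2, seed=None):
    """Fill n slots, taking a random 'other' story with probability `noise`.

    tailored: stories already ranked by predicted interest (best first).
    other:    stories the recommender would not normally show.
    """
    rng = random.Random(seed)
    pool = list(other)
    rng.shuffle(pool)            # random order for the serendipity pool
    ranked = iter(tailored)
    picks = []
    while len(picks) < n:
        if pool and rng.random() < noise:
            story = pool.pop()   # the "accidental" story
        else:
            story = next(ranked, None)
            if story is None and pool:
                story = pool.pop()   # ran out of tailored stories
        if story is None:
            break                # nothing left at all
        picks.append(story)
    return picks

front_page = mix_in_serendipity(
    ["ml-paper", "genomics", "data-viz"],
    ["gardening", "opera", "cricket"],
    n=4, noise=0.25, seed=42,
)
```

With `noise=0` you get the purely tailored feed, and with `noise=1` a purely random one; the interesting question is of course what sits in between.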

Finally, the PayPal Developer Network (!) has a pretty nice tutorial about analyzing and visualizing the recently released World Bank data using tools like Java servlets, Google Charts and MySQL. The World Bank data would easily deserve a verbose blog post of its own (and I was planning one several months ago) but that will have to wait until I’ve taken a proper look at it.

Identifying migraine triggers and “your genome has a posse”

MyMigraineJournal is an interesting self-tracking site with some statistical weight behind it. The idea is to let migraine sufferers define potential “triggers” of migraines, like red wine or aged cheese, after which they will complete a daily questionnaire about what they ate, drank etc. and whether they had a headache that day (this takes about 30 seconds per day). The site will also try to assess whether any of the triggers seems statistically related to higher (or lower) migraine risk. As outlined here, this is done through logistic regression on one variable (trigger) at a time. The site uses a hierarchical Bayesian model where the prior distribution is initially uniform but will eventually, after enough data has been collected, be derived from the aggregated population of previous users, which I think is a nice touch. They don’t look for interactions between triggers yet, but may add such functionality in the future. A user can download their own complete data in Excel format, or delete some or all of it from the system. I think simple but clever systems like this could prove quite useful to people.
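To give a feel for what "logistic regression on one trigger at a time" means, here is a minimal sketch that fits a single yes/no trigger against headache days with plain maximum likelihood (Newton-Raphson). It deliberately leaves out the hierarchical Bayesian prior the site describes, and the diary data and function name are made up for illustration.

```python
import math

def fit_logistic_1d(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton-Raphson; returns (a, b).

    Plain maximum likelihood, no prior. Not the site's actual model.
    """
    a = b = 0.0
    for _ in range(iterations):
        # Gradient (g) of the log-likelihood and entries (h) of X^T W X.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g_a += yi - p
            g_b += (yi - p) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * step = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical diary: 20 red-wine days (10 headaches), 20 wine-free days (4 headaches).
wine = [1] * 20 + [0] * 20
headache = [1] * 10 + [0] * 10 + [1] * 4 + [0] * 16
intercept, slope = fit_logistic_1d(wine, headache)
odds_ratio = math.exp(slope)
```

For a single binary trigger the fitted slope is simply the log odds ratio of the 2x2 table, here log(4), so the odds of a headache are about four times higher on wine days in this made-up diary. A positive slope suggests a trigger, a negative one a possible protective factor.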

On a related note, Nature Medicine recently ran an interesting article, “Personalized investigation”, about people who use direct-to-consumer genetic tests to learn more about genomes and physiology. The article describes how five early adopters of 23andMe’s SNP tests teamed up to investigate whether SNPs in a gene coding for an enzyme related to vitamin B metabolism were predictive of how the carriers would respond to vitamin supplements. The team then performed a series of experiments where they either took no supplements, took multivitamins, L-methylfolate or a combination of both multivitamins and L-methylfolate. After each phase of the experiment, they took blood tests to measure homocysteine, a biomarker for vitamin B activity.

Now, unless I’m misguided or the news is misreported, such an experiment with five subjects could never get anywhere close to a statistically significant outcome. But with larger cohorts, that could change. A new company called Genomera is developing tools that will allow this kind of self-experimentation study to scale into large numbers of participants. In fact, the Nature Medicine article says that Genomera will “roll out the vitamin study as the first open participatory project under its platform.” As of now, the Genomera site still appears to be mostly under construction, although it does say that the company has trademarked “Your genome has a posse” and related phrases. It sounds like an interesting business concept – I just wish they hadn’t described it as “the Facebook of genomics” in the Nature Medicine article …
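One way to see the small-sample problem, assuming the analysis were a simple paired sign test (my assumption, the article does not say how the data were analyzed): even if all five participants changed in the same direction, the two-sided p-value could not drop below 0.0625. Other tests, such as a paired t-test, can in principle do better with five subjects, so treat this as an illustration rather than a proof.

```python
from math import comb

def min_two_sided_sign_test_p(n):
    """Smallest two-sided p-value a sign test can give with n paired subjects."""
    # Most extreme outcome: all n subjects change in the same direction.
    one_sided = comb(n, n) / 2 ** n
    return min(1.0, 2 * one_sided)

best_case_five = min_two_sided_sign_test_p(5)   # 2 / 32
best_case_six = min_two_sided_sign_test_p(6)    # 2 / 64
```

Under this test, six unanimous participants is the minimum needed to get below the conventional 0.05 threshold, which is the kind of limit that larger cohorts remove.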

MetaOptimize Q+A

The MetaOptimize site has a nice StackOverflow-style question & answer community dedicated to machine learning, data mining, natural language processing and the like. It seems to have gotten off to a nice start. Here, you can enquire about things like the best freely available machine learning textbooks or how to set up Hadoop on your office machine, or more technical details such as whether subsampling biases the ROC-AUC score.

Cool data > big data?

From a 2006 post on Seth Roberts’ blog:

One day on the track I met a professor who had recently gotten tenure. He had only published three articles (maybe he had 700 in the pipeline), so his getting tenure surprised me. I asked him: What’s the secret? What was so great about those three papers? His answer was two words: “Cool data.”


I’m a big believer in cool data. The design goal is: How far can we possibly push it so that it makes it a vivid point? Most academics push it just far enough to get it published. I try to push it beyond that to make it much more vivid. That’s what [Stanley] Milgram did with his experiments. First, he showed obedience to authority in the lab. Then he stripped away a whole lot of things to show how extreme it was. He took away lab coats, the college campus. That’s what made it so powerful.

Machine learning competitions and algorithm comparisons

Tomorrow, 29 May 2010, a lot of (European) people will be watching the Eurovision Song Contest to see which country will take home the prize. Personally, I don’t really care about who wins the contest itself, but I do care (somewhat) about which predictor will win the Eurovision Voting Forecast competition arranged by Kaggle. Kaggle describes itself as “a platform for data mining, bioinformatics and forecasting competitions”. It provides an objective framework for comparing techniques and “allows organizations to have their data scrutinized by the world’s best statisticians.”

Contests like this are fun, but they can also have more serious aims. For instance, Kaggle also hosts a competition about predicting HIV progression based on the virus’ DNA sequence. The currently leading submission has already improved on the best methods reported in the literature, and so a post at Kaggle’s No Free Hunch blog asks whether competitions might be the future of research. I think they may well be, at least in some domains. A few months back, I mentioned an interesting challenge at Innocentive which is essentially a very difficult pure research problem, and it will be interesting to learn how the winning team there did it (if any details are disclosed). (I signed up for this competition myself, but haven’t been able to devote more than one or two hours to it so far, unfortunately.)

There are other platforms for prediction competitions as well, for instance TunedIT’s challenge platform, which allows university teachers to “make their courses more attractive and valuable, through organization of on-line student competitions instead of traditional assignments.” TunedIT also has a research platform where you can run automated tests on machine learning algorithms and get reproducible experimental results. You can also benchmark results against a knowledge base or contribute to and use a repository of various data sets and algorithms.

Another initiative for serious evaluation of machine learning algorithms in various problem domains is MLcomp. Here, you can upload your own datasets (or use pre-loaded ones) and run existing algorithms on them through a web interface. MLcomp then reports various metrics that allow you to compare different methods.

By the way, 22 teams participated in Kaggle’s Eurovision challenge, and Azerbaijan is the clear favorite, having been picked as the winner by 14 teams. Let’s see how it goes tomorrow.


To follow up on yesterday’s post about data sources on the web, I’d like to mention an interesting resource, predict.i2pi, which automatically builds predictive models based on data that you upload. Using it could hardly be simpler – you just have to prepare a comma-separated text file with attributes (predictor variables) and one or more target values (response variables), with the latter being identified as such by putting a star (*) in front of the variable name in the header row. The system will then match your particular data file to a set of suitable prediction algorithms (for example, regression models rather than classification models for a continuous response variable), evaluate the performance of these algorithms on a hold-out set from your data, and output the best results. As the site itself puts it,

Our team of elves will work on your file, running it against a range of model types and keeping track of the best ones. Every now and then we will update your page indicating the best models to date.
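Preparing such a file is easy to script. Here is a minimal sketch of writing a CSV whose response variable is marked with a star in the header row, as described above. The column names and the helper function are hypothetical, and I am only assuming the star convention; any other formatting requirements the site may have are not covered.

```python
import csv
import io

def write_starred_csv(rows, targets, out):
    """Write rows (a list of dicts) as CSV, prefixing target column names with '*'."""
    columns = list(rows[0])
    # Mark response variables in the header row only.
    header = ["*" + name if name in targets else name for name in columns]
    writer = csv.writer(out)
    writer.writerow(header)
    for row in rows:
        writer.writerow([row[name] for name in columns])

# Made-up housing data: two predictors and one continuous response.
houses = [
    {"sqft": 850, "rooms": 2, "price": 210000},
    {"sqft": 1200, "rooms": 3, "price": 305000},
]
buffer = io.StringIO()
write_starred_csv(houses, targets={"price"}, out=buffer)
header_line = buffer.getvalue().splitlines()[0]   # "sqft,rooms,*price"
```

To write to disk instead, pass a file opened with `open("houses.csv", "w", newline="")` as `out`.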

There’s also an API for predict.i2pi, and developers of statistical learning methods are encouraged to integrate their own favourite algorithms into the system. Read this blog post for more details.

For in-depth background on the various statistical learning and machine learning algorithms, you could do worse than to check out the lectures at There’s really an astounding amount of information there about lots of different fields, but in particular computer science, with a skew towards machine learning.

Data sources on the web

So where are all these huge data sets that I (and others) have been talking about? Well, some of them are freely available for download. For example, the extensive Reality Mining data set from MIT (which I have blogged about) is available as a MySQL database for anyone to play around with.

There are a couple of repositories for data sets. Infochimps has hundreds or probably thousands of data sets from a wide variety of sources. Some of the data is directly downloadable from the site, while other data sets are just pointed to. Datamob is a similar, though smaller, resource. Amazon’s Public Data Sets are meant to be used seamlessly from within Amazon’s cloud computing applications, like the Elastic Compute Cloud (EC2). Here, we find massive datasets such as the collection of all publicly available DNA sequences from GenBank.

Peter Skomoroch has a tag for datasets which is probably the most extensive reference for big downloadable data out there (and which makes this blog post rather superfluous …). Due to the magic of, this list is of course dynamic and continuously growing.

Finally, programmableweb is perhaps not strictly about data per se, but provides links to known APIs for access to web-based resources through your own programs.

Patient social networks

Online social networks for patients would seem to be rife with potential for medical discovery. Given a critical mass of patients who are communicating with each other and with a (morally responsible) service provider, there should be opportunities for e.g. relating disease progression to lifestyle, demographics, etc., for discovering unexpected relationships between different diseases, and perhaps for enabling a more precise categorization of diseases. Of course, patients would also be highly motivated to help in the development of new drugs for their particular disease, although it’s not clear how this motivation should best be leveraged.

CureTogether is an interesting effort in this direction. It describes itself using the term Open Source Health Research and aims to enable patients to learn from their peers and to get personalized health information. CureTogether also allows patients (or anyone, really) to participate in and even fund research directed toward a specific disease.

In addition, CureTogether has released a couple of “crowdsourced” books which I think are pretty interesting. Migraine Heroes contains stories and data from 271 migraine patients who describe, in their own words, symptoms, side effects of treatments and so on. The information collected for the book also suggested novel and surprising co-morbid conditions (secondary diseases or disorders in addition to migraine). Endometriosis Heroes is another book in the same vein.

Patients Like Me is another online social network for patients. It was pretty extensively discussed in an Interviews with Innovators podcast, from which I have borrowed a couple of quotes. Basically, Patients Like Me allows patients to share data about their diseases, the drugs they are taking, side effects from treatments and so on. The company makes money by selling this data (in anonymized form) to drug companies, which hopefully leads to a win-win situation where the drug companies get relevant information and the patients get better drugs. The company “puts patients in direct contact with the companies who can help them”, as Jamie Heywood put it in the podcast linked above.

Patients Like Me provides “structured data for illnesses” – a common format for describing different illnesses in a similar way. This is of course crucial for computer readability and efficient statistical analysis. The company sees itself as “training a group of expert consumers to do evidence based decisions” through their service.

An interesting tidbit from the company’s web site:

Future state modeling – Simply “tracking” a patient’s progression has never been the goal for us; we’ve always wanted to take past information and use it to predict the future state of an individual patient. In relatively linear diseases like ALS, that means we can help patients to plan in advance for when they might need a wheelchair or other equipment. It’s often the case that ALS/MND patients don’t get the equipment they need until several months after they could have benefited from having it. Such a tool would give a customized prediction for the individual patient. After all, most of us don’t want to know about the “average” patient, we want to know about a “patient like me”!

Track your habits with Twitter

your.flowingdata is a new service that lets you track your daily habits (which might be any kind of habits – eating, working out, watching TV …) by sending messages to a dedicated Twitter channel. There is a simple syntax for describing what you have been up to. A nifty idea in all its simplicity.

So why track your habits? I’ll borrow this explanation from a Wired article (which, by the way, is worth reading in its entirety, although it perhaps gives too much credit to Nike, as argued in the article comments):

Using a flood of new tools and technologies, each of us now has the ability to easily collect granular information about our lives—what we eat, how much we sleep, when our mood changes.

And not only can we collect that data, we can analyze it as well, looking for patterns, information that might help us change both the quality and the length of our lives. We can live longer and better by applying, on a personal scale, the same quantitative mindset that powers Google and medical research. Call it Living by Numbers—the ability to gather and analyze data about yourself, setting up a feedback loop that we can use to upgrade our lives, from better health to better habits to better performance.

Why not start by tracking your happiness?

Ask a stranger

Ever think about what career you would really be suited for? I know I have. Unfortunately, we humans seem to be really bad at predicting what will make us happy, and according to psychologist Dan Gilbert, when you are faced with a choice, you would be better off asking unknown people who have been in a similar situation instead of listening to your gut instinct.

In the same vein, perhaps we can try asking strangers about our career choices? Path101 is an interesting career site that puts you through a personality test and then suggests careers for you based on the character traits the analysis engine estimates from your answers. In addition, you can get career advice by posting questions, anonymously or openly, to other users. The company calls this community powered career discovery.

Path101 will also analyze your resume and compare it to a database of millions (!) of resumes they have collected. The site delivers lots of interesting statistics about which personality traits tend to be correlated with which jobs, which new careers people in a certain job tend to move to, and so on. There is a lot more to be found if you poke around the site. The IT Conversations podcast did a nice interview with the founders of the company.

A similar though perhaps more light-hearted service is Hunch (not to be confused with an excellent machine learning site called!), which helps you make choices (big or small) by putting you through a quiz about the choice in question. After completing the quiz, you get a recommendation about how to choose. Apparently, Hunch has some sort of algorithm that learns about you while you use the site, so that you get progressively shorter quizzes before the system recommends a decision.

An interesting thing about Hunch is that it has an API, so you can integrate it into your own applications. I’m not sure what kind of application it would be useful for, but presumably someone will figure it out soon!
