Follow the Data

A data driven blog

Archive for the month “July, 2009”

Predictions from Google search data

Google has started reporting some interesting findings about predictions based on web search data. I would guess that these things had been in the works for several years before Google went public with them.

Last year, they introduced Google Flu Trends, which basically monitors influenza-related searches and tries to predict outbreaks early by identifying geographical locations that are suddenly showing a strong increase in such searches. An article describing the system was even published in the very high-profile scientific journal Nature. (Later, people started to use Twitter for flu monitoring.)
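
The basic idea of flagging locations with a sudden increase in searches can be sketched in a few lines of Python. This is a toy illustration and not Google's actual method; the region names, the weekly counts, the four-week baseline and the ratio threshold of 2.0 are all made-up assumptions.

```python
from statistics import mean


def flag_spikes(weekly_counts, baseline_weeks=4, ratio_threshold=2.0):
    """Return regions whose latest weekly search count is at least
    ratio_threshold times the mean of the preceding baseline_weeks weeks."""
    flagged = []
    for region, counts in weekly_counts.items():
        if len(counts) < baseline_weeks + 1:
            continue  # not enough history to form a baseline
        baseline = mean(counts[-(baseline_weeks + 1):-1])
        if baseline > 0 and counts[-1] / baseline >= ratio_threshold:
            flagged.append(region)
    return sorted(flagged)


# Hypothetical weekly counts of flu-related searches (oldest first).
weekly_counts = {
    "Region A": [100, 110, 95, 105, 260],
    "Region B": [80, 85, 90, 88, 92],
}

print(flag_spikes(weekly_counts))
```

Here only Region A is flagged, because its latest count is more than twice its recent average. A real system would also have to deal with seasonality, news-driven search bursts and differences in population size.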

Lately, the official Google research blog has started to write about the possibilities of using Google search data to predict economic variables in the short term. A recent analysis they did, based on claims for unemployment benefits in the U.S., seems to suggest that the U.S. economy is recovering.

From the blog post:

One of the strongest leading indicators of economic activity is the number of people who file for unemployment benefits. Macroeconomists Robert Gordon and James Hamilton have recently examined the historical evidence. According to Hamilton’s summary: “…in each of the last six recessions, the recovery began within 8 weeks of the peak in new unemployment claims.”

Let’s see if the prediction comes true!

The analysis is described in more detail in this paper.

Mood measuring machines

Danwei reports the results of an interesting exercise in data collection in China. Poll machines with buttons for “happy” or “unhappy” were set up at bus stops in six major cities and button-press counts were recorded for two weeks (from July 6 to July 20). Of course, there are various kinds of issues with sample bias here, but it’s still a fun idea.

The winning city was Beijing with about 56% respondents claiming to be happy. The other cities were Shanghai, Guangzhou, Kunming, Chengdu and Xi’an.

The Beijing News (Chinese) also reports that out of the 10 polled bus stops, Dengshi Xikou was the “happiest”, while the “unhappiest” one was, poignantly, The Beijing Children’s Hospital bus stop.

It would be interesting to know whether there are any other trends in the data, like significant differences in reported happiness depending on the time of day.

Netflix prize soon to be awarded

A few years back, an online movie rental site, NetFlix, promised a million dollars to anyone who could improve their recommendation algorithm by a certain percentage (10%). In the same way that Amazon recommends books you might be interested in, NetFlix recommends movies to users.

The NetFlix prize, as the challenge was called, turned out to be more important for data mining and related fields than I, at least, had anticipated. Many teams from all over the world put in enormous efforts to shave off the last percentage points needed to cross the finish line. At times, it was thought that the teams would be doomed to approaching the cut-off point asymptotically, never actually crossing it.

Last week, though, a team called BellKor did achieve the target percentage. That does not yet make them the winners, as other teams still have the chance to submit a better model before July 26. Still, it is likely that they will win. At any rate, someone will have won, which is significant.

Although I have not followed the competition in much depth, I believe some of the lessons learned were about the surprising power of using multiple – and I mean many – predictive models. Some of the best predictors were linear combinations of over a hundred different kinds of models.
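
To make the idea of a linear combination of models concrete, here is a minimal sketch that blends just two sets of predictions using least-squares weights. The ratings and both sets of model predictions are invented for illustration, and the helper names are hypothetical. With a hundred models one would use a linear algebra library instead of the closed-form 2x2 solution used here.

```python
from math import sqrt


def rmse(predictions, targets):
    """Root mean squared error between predictions and true ratings."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))


def fit_blend_weights(preds_a, preds_b, targets):
    """Least-squares weights (w_a, w_b) minimising
    sum((w_a * a + w_b * b - y) ** 2), via the 2x2 normal equations."""
    saa = sum(a * a for a in preds_a)
    sbb = sum(b * b for b in preds_b)
    sab = sum(a * b for a, b in zip(preds_a, preds_b))
    say = sum(a * y for a, y in zip(preds_a, targets))
    sby = sum(b * y for b, y in zip(preds_b, targets))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("the two prediction vectors are collinear")
    w_a = (say * sbb - sab * sby) / det
    w_b = (saa * sby - sab * say) / det
    return w_a, w_b


# Made-up true ratings and predictions from two hypothetical models.
targets = [4, 3, 5, 2, 1]
model_a = [3.5, 3.2, 4.1, 2.8, 1.9]
model_b = [4.6, 2.5, 5.0, 1.2, 1.5]

w_a, w_b = fit_blend_weights(model_a, model_b, targets)
blend = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]

print("model A RMSE:", rmse(model_a, targets))
print("model B RMSE:", rmse(model_b, targets))
print("blend RMSE:  ", rmse(blend, targets))
```

On the data used to fit the weights, the blend can never be worse than either model alone, since using one model by itself is just a special choice of weights. An honest comparison would fit the weights on one set of ratings and measure the error on a separate held-out set.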

In particular, it turned out that methods based on “latent factors” in the data – regularities that can be fished out using so-called matrix factorization methods such as SVD or NMF – were very powerful tools for this application, especially when combined with “neighbourhood-based” methods, which basically make predictions by assuming that similar users (based on how they have previously rated films) will like similar films, or conversely, that films that have been similarly rated by different users will tend to be similar.
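
As a rough sketch of what a latent factor model looks like, the code below learns a small vector of factors for each user and each film by stochastic gradient descent, a style of matrix factorization that became popular during the competition. It is neither a true SVD nor NMF, and it is certainly not the BellKor model; the ratings, the number of factors, the learning rate and the regularization strength are all illustrative assumptions.

```python
import random
from math import sqrt


def train_factors(ratings, n_users, n_items, n_factors=2,
                  learning_rate=0.05, reg=0.02, epochs=500, seed=0):
    """Fit rating ~ mean + dot(user_factors, item_factors) by SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += learning_rate * (err * q_if - reg * p_uf)
                Q[i][f] += learning_rate * (err * p_uf - reg * q_if)
    return mu, P, Q


def predict(model, u, i):
    """Predicted rating of item i by user u."""
    mu, P, Q = model
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(model, ratings):
    """Root mean squared error of the model on (user, item, rating) triples."""
    return sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Made-up (user, item, rating) triples: users 0 and 1 like films 0 and 1,
# users 2 and 3 like films 2 and 3. Some pairs are deliberately missing.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 2),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]

mean_rating = sum(r for _, _, r in ratings) / len(ratings)
baseline_rmse = sqrt(sum((r - mean_rating) ** 2 for _, _, r in ratings) / len(ratings))

model = train_factors(ratings, n_users=4, n_items=4)
trained_rmse = rmse(model, ratings)

print("RMSE when always predicting the mean:", baseline_rmse)
print("RMSE of the factor model (training data):", trained_rmse)
print("predicted rating for unseen pair (user 0, film 3):", predict(model, 0, 3))
```

The learned factors play the role of hidden "taste" dimensions: users and films that point in the same direction get high predicted ratings. The error reported here is on the training data only, so it says nothing about how well the model generalizes.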

One of the members of the winning team BellKor, Yehuda Koren, has recently published an interesting paper which outlines strategies for the hard problem of accounting for preferences that change over time. Elements of these are likely to have been included in the prediction model that crossed the 10% line.

There is a lot to chew on in the paper for the more technically inclined, but I would just like to mention a simple and interesting trend Koren found in the Netflix data: that older movies tend to get higher ratings than newer ones.

Koren sets up two hypotheses that could explain this phenomenon:

- either, the customers will more readily choose to rent a new movie simply based on novelty value, while they would only choose an older movie after a careful selection process, which would lead to a greater likelihood of enjoying the movie,

- or, old movies are just better!

I can think of a couple of other possible explanations as well, such as the effects of nostalgia for movies seen a long time ago, but anyway Koren goes on to compare these two hypotheses using the statistics in the Netflix database. By interpreting parameters in the statistical model he has set up, he concludes that the first explanation (a more careful selection process for old movies) is the more likely one.

The quantification of everything

An interesting quote from Esther Dyson in the Strategy + Business magazine, found via the Quantified Self blog.

It’s the quantification of everything. Not just marketing data — everything. Five years ago you’d read about diabetics who had to take their blood sugar readings or about these weirdos who put on pedometers when they walked. Now, that kind of measurement is everywhere. Web sites that seem at first glance like entertainment or service media are really devoted to managing and interpreting customers’ data about themselves. Mint and Wesabe track your banking data and financial transactions. Skydeck organizes cell-phone records; you can see whom you call most frequently or whom you used to call but haven’t called recently. You can compare your phone call patterns against other people’s. 23andMe does the same thing for genomes. The most fascinating thing in the world is a mirror.

Mobile phones, location and indexing the real world

Mobile phones are rapidly becoming powerful data acquisition devices, as described e.g. in recent (and good) articles in The Economist and Nature. Many phones have cameras, GPS systems and net connections, and some of them sport accelerometers, which can be used to measure the number of calories burnt by the user, or even to track earthquakes.

A number of enterprising researchers have started to mine the location data that can be obtained from mobile phones (through information from mobile towers routing the communication). Last year, the complex-networks guru Albert-László Barabási and co-workers published a paper, Understanding individual human mobility patterns, where they studied movement trajectories of 100,000 (anonymized) mobile phone users. The result reported by the authors – that human movement is not random but shows high spatial and temporal regularity – was perhaps not as impressive as the sheer size of the data set.

For those who would like to try their hand at analyzing mobile phone data, MIT’s Reality Mining project provides an interesting and freely accessible data set. In this project, students carried (Nokia) phones and their trajectories were tracked. The subjects also answered various questions about themselves and their habits. The data gathered for the Reality Mining project included location information (again, through mobile towers), communication data (call records) and proximity data (using Bluetooth).

The researchers behind the project developed algorithms for extracting routine everyday patterns from users’ lives and claim they can predict their subjects’ next actions to a fairly good approximation.

The Economist article linked above quotes one of the MIT researchers, Alex Pentland, as saying that “… some handsets can capture information about individuals, such as their activity levels or even their gait, using built-in motion sensors.” This suggested to me that it might be possible to detect changes in gross motor patterns in an individual, such as those that have been shown to sometimes occur in depressed patients. Thus, a smart phone could be an “early warning system” for depression.

The Reality Mining group has spun off a company, Sense Networks, that aims to bring location-based data to the commercial sphere in a big way. Their slogan is “Indexing the real world using location data for predictive analytics.”

Indexing the real world! Now that would be something.

Currently, Sense Networks offers a service, CitySense, for finding out where the action is in a city. I quote from the web site:

Citysense passively “senses” the most popular places based on actual real-time activity and displays a live heat map. The application intelligently leverages the inherent wisdom of crowds without any change in existing user behavior, in order to navigate people to the hottest spots in a city. [...]

The application learns about where each user likes to spend time – and it processes the movements of other users with similar patterns. In its next release, Citysense will not only answer “where is everyone right now” but “where is everyone like me right now.” Four friends at dinner discussing where to go next will see four different live maps of hotspots and unexpected activity. Even if they’re having dinner in a city they’ve never visited before.

Patient social networks

Online social networks for patients would seem to be rife with potential for medical discovery. Given a critical mass of patients who are communicating with each other and with a (morally responsible) service provider, there should be opportunities for e.g. relating disease progression to lifestyle, demographics, etc., for discovering unexpected relationships between different diseases, and perhaps for enabling a more precise categorization of diseases. Of course, patients would also be highly motivated to help in the development of new drugs for their particular disease, although it’s not clear how this motivation should best be leveraged.

CureTogether is an interesting effort in this direction. It describes itself using the term Open Source Health Research and aims to enable patients to learn from their peers and to get personalized health information. CureTogether also allows patients (or anyone, really) to participate in and even fund research directed toward a specific disease.

In addition, CureTogether has released a couple of “crowdsourced” books which I think are pretty interesting. Migraine Heroes contains stories and data from 271 migraine patients who describe, in their own words, symptoms, side effects of treatments and so on. The information collected for the book also suggested novel and surprising co-morbid conditions (secondary diseases or disorders in addition to migraine). Endometriosis Heroes is another book in the same vein.

Patients Like Me is another online social network for patients. It was pretty extensively discussed in an Interviews with Innovators podcast, from which I have borrowed a couple of quotes. Basically, Patients Like Me allows patients to share data about their diseases, the drugs they are taking, side effects from treatments and so on. The company makes money by selling this data (in anonymized form) to drug companies, which hopefully leads to a win-win situation where the drug companies get relevant information and the patients get better drugs. The company “puts patients in direct contact with the companies who can help them“, as Jamie Heywood put it in the podcast linked above.

Patients Like Me provides “structured data for illnesses” – a common format for describing different illnesses in a similar way. This is of course crucial for computer readability and efficient statistical analysis. The company sees itself as “training a group of expert consumers to do evidence based decisions” through their service.

An interesting tidbit from the company’s web site:

Future state modeling – Simply “tracking” a patient’s progression has never been the goal for us; we’ve always wanted to take past information and use it to predict the future state of an individual patient. In relatively linear diseases like ALS, that means we can help patients to plan in advance for when they might need a wheelchair or other equipment. It’s often the case that ALS/MND patients don’t get the equipment they need until several months after they could have benefited from having it. Such a tool would give a customized prediction for the individual patient. After all, most of us don’t want to know about the “average” patient, we want to know about a “patient like me”!

Track your habits with Twitter

your.flowingdata is a new service that lets you track your daily habits (which might be any kind of habits – eating, working out, watching TV …) by sending messages to a dedicated Twitter channel. There is a simple syntax for describing what you have been up to. A nifty idea in all its simplicity.

So why track your habits? I’ll borrow this explanation from a Wired article (which, by the way, is worth reading in its entirety, although it perhaps gives too much credit to Nike, as argued in the article comments):

Using a flood of new tools and technologies, each of us now has the ability to easily collect granular information about our lives—what we eat, how much we sleep, when our mood changes.

And not only can we collect that data, we can analyze it as well, looking for patterns, information that might help us change both the quality and the length of our lives. We can live longer and better by applying, on a personal scale, the same quantitative mindset that powers Google and medical research. Call it Living by Numbers—the ability to gather and analyze data about yourself, setting up a feedback loop that we can use to upgrade our lives, from better health to better habits to better performance.

Why not start by tracking your happiness?

Sequencing data storm

Today, I attended a talk given by Wang Jun, a humorous and t-shirt-clad whiz kid who set up the bioinformatics arm of Beijing Genomics Institute (BGI) as a 23-year-old PhD student, became a professor at 27, and is now the director of BGI’s facility in Shenzhen, near Hong Kong. Although I work with bioinformatics at a genome institute myself, this presentation really drove home how much storage, computing power and know-how are required for biology now and in the near future.

BGI does staggering amounts of genome sequencing – “If it tastes good, sequence it! If it is useful, sequence it!” as Wang Jun joked – from indigenous Chinese plants to rice, pandas and humans. They have a very interesting individual genome project where they basically apply many different techniques to samples from the same person and compare the results against known references. One of many interesting results from this project was the finding that human genomes not only vary in single “DNA letter” variants (so-called SNPs, single nucleotide polymorphisms) or the number of times certain stretches of DNA are repeated (“copy number variations”) – it now turns out there are DNA snippets that, in largely binary fashion, some people have and some don’t.

Although the existing projects demand a lot of resources and manpower – the BGI has 250 bioinformaticians (!) which is still too few; according to Wang they want to quickly increase this number to 500 – this is nothing compared to what will happen after the next wave of sequencing technologies, when we will start to sequence single cells from different (or the same) tissues in an individual. Already, the data sets generated are so vast that they cannot be distributed over the internet. Wang recounted how he had to bring ten terabyte drives to Europe by himself in order to share his data with researchers at EBI (European Bioinformatics Institute). Now, they are trying out cloud computing as a way to avoid moving the data around.

Wang attributed a lot of BGI’s success to young, hardworking programmers and scientists – many of them university dropouts – who don’t have any preconceptions about science and therefore are prepared to try anything. “These are teenagers that publish in Nature,” said Wang, apparently feeling that he was (at 33) already over the hill. “They don’t run on a 24-hour cycle like us, they run on 36h-cycles and bring sleeping bags to the lab.”

All in all, good fun.

Reverse engineering social security numbers

The latest issue of PNAS (Proceedings of the National Academy of Sciences of the United States of America; a well-known scientific journal) contains two interesting pieces of statistical analysis. Luckily, they are both freely downloadable even if you don’t have access to a subscription.

Predicting Social Security numbers from public data claims that US Social Security numbers (SSNs), which are supposed to be confidential, are actually to a certain extent predictable, at least for younger people, given information such as birth date and location. Basically, the authors (from Carnegie Mellon University) have tried to reverse-engineer the SSN assignment process using available information about this process, including the so-called SSA Death Master File, which is publicly available and contains data about SSN assignments for people who have been reported as dead.

The authors detected various correlations between e.g. date of birth and all nine digits in the SSN, and eventually (after much visual inspection and several rounds of model refinement) constructed a regression model for predicting digits in an SSN based on birth date. They managed to correctly predict the SSN of 8.5% of deceased individuals in fewer than 1,000 tries.

Naturally, this suggests possibilities for e.g. identity theft and poses the question whether social security numbers should be replaced by something else.

Another study in the latest PNAS, NIH funding trajectories and their correlations with US health dynamics from 1950 to 2004, suggests that funding of research relating to certain diseases leads to a time-lagged decrease in deaths due to those diseases – in other words, the research appears to pay off with a time lag. In order to do their analysis, the authors compiled data on NIH (the US National Institutes of Health) funding starting in 1937 and compared those to mortality data for cardiovascular disease, stroke, cancer, and diabetes.
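
One generic way to look for a time-lagged relationship between two yearly series is to shift one against the other and compute the correlation at each lag. The sketch below does this on synthetic numbers constructed so that "deaths" follow "funding" with a three-year delay. Neither the data nor the procedure is taken from the PNAS paper, whose actual methodology may well differ, and a correlation of this kind does not by itself establish cause and effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlations(funding, deaths, max_lag):
    """Correlate funding in year t with deaths in year t + lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = funding[:len(funding) - lag]
        y = deaths[lag:]
        result[lag] = pearson(x, y)
    return result


# Synthetic yearly series: deaths[t] = 100 - 2 * funding[t - 3] for t >= 3,
# with the first three years padded with an arbitrary value.
funding = [10, 12, 11, 15, 14, 18, 17, 21, 20, 24, 22, 26]
deaths = [80, 80, 80] + [100 - 2 * f for f in funding[:-3]]

corrs = lagged_correlations(funding, deaths, max_lag=5)
best_lag = min(corrs, key=corrs.get)  # most negative correlation

for lag, c in corrs.items():
    print(lag, round(c, 3))
print("strongest negative correlation at lag", best_lag)
```

Because the synthetic deaths series is an exact linear function of funding three years earlier, the strongest negative correlation appears at a lag of three years. Real funding and mortality series both trend over time, so a serious analysis would need to account for those trends before reading anything into such a correlation.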

Ask a stranger

Ever think about what career you would really be suited for? I know I have. Unfortunately, we humans seem to be really bad at predicting what will make us happy, and according to psychologist Dan Gilbert, when you are faced with a choice, you would be better off asking unknown people who have been in a similar situation instead of listening to your gut instinct.

In the same vein, perhaps we can try asking strangers about our career choices? Path101 is an interesting career site that puts you through a personality test and then suggests careers for you based on the character traits the analysis engine estimates from your answers. In addition, you can get career advice by posting questions, anonymously or openly, to other users. The company calls this community powered career discovery.

Path101 will also analyze your resume and compare it to a database of millions (!) of resumes they have collected. The site delivers lots of interesting statistics about which personality traits tend to be correlated with which jobs, which new careers people in a certain job tend to move to, and so on. There is a lot more to be found if you poke around the site. The IT Conversations podcast did a nice interview with the founders of the company.

A similar though perhaps more light-hearted service is hunch.com (not to be confused with an excellent machine learning site called hunch.net!), which helps you make choices (big or small) by putting you through a quiz about the choice in question. After completing the quiz, you get a recommendation about how to choose. Apparently, Hunch has some sort of algorithm that learns about you while you use the site, so that you get progressively shorter quizzes before the system recommends a decision.

An interesting thing about Hunch is that it has an API, so you can integrate it into your own applications. I’m not sure what kind of application it would be useful for, but presumably someone will figure it out soon!
