Follow the Data

A data driven blog

Archive for the month “May, 2010”

Viewpoints on self-tracking

Here are some interesting articles on self-tracking published during the spring.

The data-driven life, a very meaty and well-researched article in The New York Times. It’s written by Gary Wolf, who is a co-host of the self-tracking blog, The Quantified Self. Standout quote:

With my spreadsheet, I inadvertently transformed myself into the mean-spirited, small-minded boss I imagined I was escaping through self-employment.

An interview with Nicholas Felton, who publishes a “personal annual report” crammed with visualizations of data he has collected about himself. Standout quote:

I think it would be more accurate to say that the age of the illusion of privacy is over. Your activities have long been transparent to credit card, mobile phone operators and others… now we have been given the tools to reveal this information socially (intentionally or unintentionally).

Numbers from the heart, a highly interesting essay by professor Ramesh Rao, who has done some heavy-duty signal analysis of his heart rate variability while meditating, running and sleeping, amongst other things. Standout quote:

The irony of getting attached to a practice that teaches detachment got me to take a look at Poincare plots of different styles of Yoga.

The essay also includes an interesting passage about entoptic phenomena (visual phenomena generated “internally” by the nervous system).
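For readers who have not met Poincaré plots before: the idea is to plot each beat-to-beat (RR) interval against the one that follows it, and then summarize the spread of the resulting cloud with two standard descriptors, SD1 (short-term variability) and SD2 (longer-term variability). The sketch below is only a generic illustration of that idea, not a reconstruction of Rao's analysis; the function name `poincare_descriptors` and the RR values are made up for the example.

```python
import math
import statistics


def poincare_descriptors(rr_ms):
    """Return (points, sd1, sd2) for a sequence of RR intervals in milliseconds.

    points: the (RR_n, RR_n+1) pairs that make up the Poincare plot.
    sd1:    spread perpendicular to the identity line (short-term variability).
    sd2:    spread along the identity line (longer-term variability).
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")

    # Each interval paired with its successor.
    points = list(zip(rr_ms[:-1], rr_ms[1:]))

    # Rotate the cloud by 45 degrees: one axis measures the difference
    # between successive intervals, the other their sum.
    across = [(b - a) / math.sqrt(2) for a, b in points]
    along = [(b + a) / math.sqrt(2) for a, b in points]

    sd1 = statistics.stdev(across)
    sd2 = statistics.stdev(along)
    return points, sd1, sd2


# Hypothetical RR intervals (ms), invented purely for illustration.
rr_example = [812, 790, 845, 830, 801, 778, 820, 860]
points, sd1, sd2 = poincare_descriptors(rr_example)
print(f"{len(points)} points, SD1 = {sd1:.1f} ms, SD2 = {sd2:.1f} ms")
```

A round, compact cloud means successive beats vary about as much as the overall rhythm does, while a long, thin cloud along the diagonal means the heart rate drifts slowly with little beat-to-beat change.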

Why I stopped tracking by Alexandra Carmichael is a powerful reminder of the potential drawbacks of self-tracking.

Data-driven venture capitalists and more

Via Bradford Cross’ excellent post on data-driven startups (he has one himself – FlightCaster, a flight-delay prediction service that I mentioned last year), I learned the interesting fact that there is now at least one venture capital company that specializes exclusively in data-driven or “big data” startups. This company is IA Ventures, and it “invests in companies that create tools to manage and extract value from massive, occasionally unstructured, often real-time data sets”. I particularly like this sentence from their web page: “Most data generated today is simply treated as exhaust—lost forever along with the valuable insights held in it.” This is very true, and there are sure to be enormous opportunities for those who are clever enough to turn this “exhaust” – in the form of structured or unstructured data – into a product. The above-mentioned post by Bradford Cross tries to suggest some public data sets that might be leveraged by a savvy startup.

One nice example of a company that uses seemingly mundane information – cab pickup frequencies in New York City – to create a useful product is Sense Networks. They perform “some heavy-duty data crunching” on information from taxi companies and mobile phone records to predict the best places to get a cab in NYC. The predictor is implemented as an iPhone application called CabSense. In a recent podcast named Reality Mining for Companies, Alex “Sandy” Pentland, a professor who is also on Sense Networks’ management team, describes how even more trivial information like movement patterns of individuals inside a company can actually be analyzed to improve productivity and working conditions. Did you know that productivity goes up 10% if you have coffee with a cohesive group of co-workers?

Anyway, it will be interesting to see how the data-driven startups funded by IA Ventures turn out. One of them, Recorded Future, has also recently received funding from Google. This company is based in the US and Sweden, and one of the people behind it is Christopher Ahlberg, the founder of Spotfire (a successful analytics company which was built around a user-friendly visualization tool and sold to Tibco a couple of years ago). Recorded Future attempts to predict future events (!) by analyzing and indexing various sources (news, analysis pieces, prognoses etc.) on the web. I assume they use some sort of natural language processing to recognize entities (like names of people and companies, dates etc.) and infer relationships between them from indexed reports. The company’s blog has some interesting visualizations that summarize, for example, the lives of some terrorist suspects who have recently been in the news. My favorite entry in the blog (if only for its name) is “Has Hu Jintao’s behavior changed?” These blog case studies do not contain predictions of future events, but rather offer a kind of proof of concept that the system can reconstruct a reasonable timeline showing important events in a person’s (or maybe a company’s) life and display it in an effective way. I did register for a couple of “Futures”, an email-based service where you get alerts about possible future events connected to a set of keywords, but the only prediction I have received so far was apparently based on some faulty date recognition.

In case you read Swedish (or are able to tolerate Google translations), the best summary I have found of what is currently known about Recorded Future is at the Cornucopia blog.

Peer-reviewed life?

For those curious about where self-tracking (or self-measurement, self-monitoring, personal informatics, or whatever we should call it) might be going in the future, it could be worth glancing through the papers from an interesting workshop, Know Thyself: Monitoring and Reflecting on Facets of One’s Life, which was held in Atlanta in April. The papers have intriguing titles like Life-browsing with a Lifetime of Email, Computational Models of Reflection, Collaborative Capturing of Significant Life Memories and From Personal Health Informatics to Health Self-management. A striking quote from a paper entitled Assisted Self Reflection: Combining Lifetracking, Sensemaking, & Personal Information Management by Moore et al.:

Just as we are able to submit papers to peer-reviewed conferences and journals, we could anonymously share selected portions of our life activities for peer or professional consultation when making major career decisions, learning a new skill or in the process of recovery. By seeing ourselves through the eyes of others, we are more able to normalize behavior patterns and raise awareness of suppressed abnormalities.

I’m not sure I am ready for peer review yet … maybe some day…

Machine learning competitions and algorithm comparisons

Tomorrow, 29 May 2010, a lot of (European) people will be watching the Eurovision Song Contest to see which country will take home the prize. Personally, I don’t really care about who wins the contest itself, but I do care (somewhat) about which predictor will win the Eurovision Voting Forecast competition arranged by kaggle.com. Kaggle describes itself as “a platform for data mining, bioinformatics and forecasting competitions”. It provides an objective framework for comparing techniques and “allows organizations to have their data scrutinized by the world’s best statisticians.”

Contests like this are fun, but they can also have more serious aims. For instance, Kaggle also hosts a competition about predicting HIV progression based on the virus’ DNA sequence. The currently leading submission has already improved on the best methods reported in the literature, and so a post at Kaggle’s No Free Hunch blog asks whether competitions might be the future of research. I think they may well be, at least in some domains. A few months back, I mentioned an interesting challenge at Innocentive which is essentially a very difficult pure research problem, and it will be interesting to learn how the winning team there did it (if any details are disclosed). (I signed up for this competition myself, but unfortunately haven’t been able to devote more than one or two hours to it so far.)

There are other platforms for prediction competitions as well, for instance TunedIT’s challenge platform, which allows university teachers to “make their courses more attractive and valuable, through organization of on-line student competitions instead of traditional assignments.” TunedIT also has a research platform where you can run automated tests on machine learning algorithms and get reproducible experimental results. You can also benchmark results against a knowledge base or contribute to and use a repository of various data sets and algorithms.

Another initiative for serious evaluation of machine learning algorithms in various problem domains is MLcomp. Here, you can upload your own datasets (or use pre-loaded ones) and run existing algorithms on them through a web interface. MLcomp then reports various metrics that allow you to compare different methods.

By the way, 22 teams participated in Kaggle’s Eurovision challenge, and Azerbaijan is the clear favorite, having been picked as the winner by 14 teams. Let’s see how it goes tomorrow.
