Follow the Data

A data driven blog

Archive for the category “Articles”

Not contagious after all?

(via Decision Science News) Ouch! A new paper titled “The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis” (published here and available in manuscript format on arXiv) has come out arguing very strongly against the conclusions drawn by Christakis and Fowler in a series of papers where they put forward the idea that things like obesity and smoking can be transmitted through social networks; a kind of “social contagion.” I blogged about these ideas a while back after both Wired and the New York Times had published articles on them. The title (harsh!) and the abstract speak for themselves:

The chronic widespread misuse of statistics is usually inadvertent, not intentional. We find cautionary examples in a series of recent papers by Christakis and Fowler that advance statistical arguments for the transmission via social networks of various personal characteristics, including obesity, smoking cessation, happiness, and loneliness. Those papers also assert that such influence extends to three degrees of separation in social networks. We shall show that these conclusions do not follow from Christakis and Fowler’s statistical analyses. In fact, their studies even provide some evidence against the existence of such transmission. The errors that we expose arose, in part, because the assumptions behind the statistical procedures used were insufficiently examined, not only by the authors, but also by the reviewers. Our examples are instructive because the practitioners are highly reputed, their results have received enormous popular attention, and the journals that published their studies are among the most respected in the world. An educational bonus emerges from the difficulty we report in getting our critique published. We discuss the relevance of this episode to understanding statistical literacy and the role of scientific review, as well as to reforming statistics education.

Cosma Shalizi has co-authored another paper (available here) which makes a similar point in a much more, let’s say, polite way. My impression is that Shalizi is both sharp and trustworthy (I’ve learned a lot about statistics from his blog) so I’m inclined to think he is on to something.

Biology-inspired algorithm design and non-obvious news discovery

There is a new Science article which seems really cool, although I haven’t had time to get past the paywall yet. The title is “A Biological Solution to a Fundamental Distributed Computing Problem” and the gist of it is pretty simple: a research group has found that an important procedure in distributed computing, “maximal independent set selection”, has been solved in a simple and efficient way in a kind of fly’s nervous system development. An algorithm based on the process that occurs in the fly’s immature nervous system can be directly applied to a network of sensors, for example.
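
To make the problem concrete, here is a minimal Python sketch of maximal independent set selection: pick a set of "leader" nodes so that no two leaders are neighbours and every other node has a leader next to it. This is a generic randomized, round-based scheme, not the fly-derived algorithm from the paper (which I haven't read yet); the function names and the dict-of-sets graph representation are my own assumptions.

```python
import random


def maximal_independent_set(graph, seed=None):
    """Select a maximal independent set by simulating synchronous rounds.

    graph: dict mapping each node to the set of its neighbours (undirected).
    """
    rng = random.Random(seed)
    active = set(graph)   # nodes that have not decided yet
    selected = set()      # nodes chosen as leaders
    while active:
        # Every undecided node draws a random priority for this round.
        priority = {node: rng.random() for node in graph if node in active}
        # A node wins the round if it beats all of its undecided neighbours.
        winners = {
            node for node in priority
            if all(priority[node] > priority[nb]
                   for nb in graph[node] if nb in active)
        }
        selected |= winners
        # Winners and their neighbours take no part in later rounds.
        for node in winners:
            active.discard(node)
            active -= graph[node]
    return selected


def is_maximal_independent_set(graph, chosen):
    """Check both properties: no adjacent leaders, and no node left uncovered."""
    independent = all(graph[n].isdisjoint(chosen) for n in chosen)
    maximal = all(n in chosen or graph[n] & chosen for n in graph)
    return independent and maximal
```

In a real sensor network each node would only compare values with its direct neighbours, so a round needs nothing but local messages; the loop above just simulates that on one machine.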

In other news, Bradford Cross, who started the data-driven flight-delay prediction company FlightCaster, is starting a new company called Woven. It will be about discovering news you are interested in, and the platform will explicitly consider a conundrum that I’ve often been thinking about, which is the following (and possibly mentioned in some earlier blog post): Do you really want to read news that is always perfectly tailored to your interests? Wouldn’t this cause you to miss a lot of interesting information that you get from e.g. browsing the newspaper and “accidentally” reading about things you didn’t know about but which are actually kind of interesting? Bradford Cross mentions this in a recent interview and says that he started to “miss the serendipity that a newspaper provides”. So far so good, but how do you actually build this kind of quasi-random content exposure (I tend to think of it as a kind of beneficial noise) into a news discovery service? I guess we will soon see what Woven has in mind.
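
One naive way to inject such beneficial noise would be to reserve a fraction of each feed for randomly chosen stories. The Python sketch below is purely my own guess at what that could look like; it says nothing about how Woven works, and the function name, the `explore` parameter and the list-based inputs are all hypothetical.

```python
import random


def build_feed(tailored, other, size=10, explore=0.2, seed=None):
    """Mix tailored stories with randomly picked ones.

    tailored: stories ranked by predicted interest, best first.
    other:    stories the recommender would normally not show.
    explore:  probability that a slot goes to a random story.
    """
    rng = random.Random(seed)
    tailored = list(tailored)   # copy so the caller's lists stay untouched
    other = list(other)
    rng.shuffle(other)          # random order, so pop() picks a random story
    feed = []
    while len(feed) < size and (tailored or other):
        # Use a random story with probability `explore`,
        # or whenever the tailored list has run dry.
        take_random = other and (not tailored or rng.random() < explore)
        feed.append(other.pop() if take_random else tailored.pop(0))
    return feed
```

With `explore=0` this degenerates into a perfectly tailored feed, and with `explore=1` into pure noise; the interesting question is what lies in between.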

Finally, the PayPal Developer Network (!) has a pretty nice tutorial about analyzing and visualizing the recently released World Bank data using tools like Java servlets, Google Charts and MySQL. The World Bank data would easily deserve a verbose blog post of its own (and I was planning one several months ago) but that will have to wait until I’ve taken a proper look at it.

The next big idea in language, history and the arts? Data.

This New York Times article is more than a month old, but it ties in quite nicely with the “Culturomics” I mentioned in the previous post.

Funny quote: “This alliance of geeks and poets has generated exhilaration and also anxiety.”

Games and competitions as research tools

The first high-profile paper describing crowdsourced research results has just been published in Nature. (I am excluding things like Folding@home from consideration here, since in those cases the crowds are donating their processor cycles rather than their brainpower.) The paper describes how the game FoldIt (which I blogged about roughly a year ago) was used to refine predicted protein structures. This is an excerpt from the abstract:

Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy. We show that top-ranked Foldit players excel at solving challenging structure refinement problems in which substantial backbone rearrangements are necessary to achieve the burial of hydrophobic residues. Players working collaboratively develop a rich assortment of new strategies and algorithms; unlike computational approaches, they explore not only the conformational space but also the space of possible search strategies. The integration of human visual problem-solving and strategy development capabilities with traditional computational algorithms through interactive multiplayer games is a powerful new approach to solving computationally-limited scientific problems.

So in other words, FoldIt tries to capitalize on intuitive or implicit human problem-solving skills to complement brute-force computational algorithms. Interestingly, all FoldIt players are credited as co-authors of the Nature paper, so technically I could count myself as one of them, seeing that I gave the game a try last year. (It’s a lot of fun, actually.)

I think games and competitions (which are almost the same thing, really) will soon be used a lot more than they are today in scientific research (and of course other areas like productivity, innovation and personal health management, too.) The Kaggle blog had an interesting post about competitions as real-time science. In a short time, Kaggle has set up several interesting prediction contests. The Eurovision Song Contest and Football World Cup contests were, I guess, mostly for fun. The interesting thing about the latter one, though, was that it was set up as a “Take on the quants” contest, where quantitative analysts from leading banks were pitted against other contestants – and they did terribly. Now the quants have a chance to redeem themselves in the INFORMS challenge, which is about their specialty area – stock price movements …

Anyway … the newest Kaggle contest is very interesting for me as a chess enthusiast. It is an attempt to improve on the age-old (well … I think it was introduced in the late 1960s) Elo rating formula, which is still used in official chess ranking lists. This system was invented by a physics professor, Arpad Elo, based mostly on theoretical considerations, but it has done its job OK. The Elo ratings should ideally be able to predict results of games with reasonable accuracy (as an aside, people have also often tried to use them to compare players from different epochs to each other, which is a futile exercise, but that’s a topic for another post), but whether they really do that has not been very thoroughly analyzed. The Elo system also has some less well understood properties like an apparent “rating inflation” (which may or may not be an actual inflation). Some years ago, a statistician named Jeff Sonas started to develop his own system that he claimed was able to predict results of future games more accurately.
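
For reference, the core of the Elo system fits in a few lines. The Python sketch below uses the usual logistic form of the expected score with a 400-point scale; the K-factor of 20 is only an assumption on my part (federations use different values for different groups of players), and the function names are my own.

```python
def expected_score(rating_a, rating_b):
    """Expected score (between 0 and 1) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating, expected, actual, k=20):
    """New rating after one game.

    actual: 1 for a win, 0.5 for a draw, 0 for a loss.
    k:      K-factor, i.e. how strongly a single game moves the rating
            (assumed value, not an official one).
    """
    return rating + k * (actual - expected)


# Two equally rated players: each is expected to score half a point.
print(expected_score(1500, 1500))  # 0.5
```

A prediction contest like this one essentially asks whether some other function of the players' past results beats `expected_score` at forecasting the games that were held back.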

Now, Sonas (with Kaggle) has taken the next step, which is to arrange a competition to see if this will yield an even better system. The competitors get results of 65,000 recent games by top players and attempt to predict the outcome of a further 7,809 games. At the time of writing, there are already two rating systems that are doing better than Elo (see the leaderboard).

By the way, if you think chess research is not serious enough, Kaggle also has a contest about predicting HIV progression. I’m sure they have other scientific prediction contests lined up. (I’ve noticed a couple of interesting – and lucrative – ones at InnoCentive, too.)

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andMe, the company his wife co-founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing what is more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s, we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.
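
The ranking idea in that passage (many independent algorithms, with agreement counted as evidence) can be illustrated with a toy Python sketch. This is emphatically not how Watson is implemented; the `rank_answers` function and the idea of passing in "answerers" as plain callables are my own simplifications.

```python
from collections import Counter


def rank_answers(question, answerers):
    """Rank candidate answers by how many answerers proposed them.

    answerers: callables that take a question and return candidate answers.
    """
    votes = Counter()
    for answerer in answerers:
        # set() so that one answerer cannot vote twice for the same candidate
        for candidate in set(answerer(question)):
            votes[candidate] += 1
    # Candidates proposed by the most answerers come first.
    return votes.most_common()
```

A real system would presumably weight each algorithm by how reliable it has been in the past instead of giving every one an equal vote.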

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

Surprising self-experimentation

Seth Roberts, a pioneer in self-experimentation, has written an extremely interesting article called “The unreasonable effectiveness of my self-experimentation” [PDF link]. In it, he tries to explain why his self-experiments were, in his opinion, so much more successful than a lot of conventional research. As he puts it himself in the paper:

[...] I was not an expert in what I studied and my research cost almost nothing. I did it in my spare time. In spite of this, my self-experimental research was far better than my mainstream research [...]

Roberts describes how he started with self-experimentation by counting his pimples every day and trying a treatment to get rid of them. Eventually, he would discover surprising facts about himself, for example that drinking sugar water would tend to make him lose weight, and that eating breakfast would tend to make him wake up too early (but that standing up a lot would make him wake up later.) One of the main reasons he gives for his success is the freedom from academic pressure:

My self-experimentation was not my job. For a long time, I did not expect to publish it; even later, after I decided to, I did not plan to use it to gain status within a profession. This freed me to (a) do whatever worked and (b) take as long as necessary. Professional scientists cannot try anything and cannot take as long as necessary. As Dyson [...] said, “In almost all the varied walks of life, amateurs have more freedom to experiment and innovate [than professionals].”

The paper is interesting throughout.

Edit 2/6 2010: I found another paper by Roberts, a 61-page whopper called “Self-experimentation as a source of new ideas: Ten examples about sleep, mood, health and weight”, where he goes into a lot more detail (complete with pretty graphs plotted in R) about his various experiments. Definitely worth a look too.

Viewpoints on self-tracking

Here are some interesting articles on self-tracking published during the spring.

The data-driven life, a very meaty and well-researched article in The New York Times. It’s written by Gary Wolf, who is a co-host of the self-tracking blog, The Quantified Self. Standout quote:

With my spreadsheet, I inadvertently transformed myself into the mean-spirited, small-minded boss I imagined I was escaping through self-employment.

An interview with Nicholas Felton, who publishes a “personal annual report” crammed with visualizations of data he has collected about himself. Standout quote:

I think it would be more accurate to say that the age of the illusion of privacy is over. Your activities have long been transparent to credit card, mobile phone operators and others… now we have been given the tools to reveal this information socially (intentionally or unintentionally).

Numbers from the heart, a highly interesting essay by professor Ramesh Rao, who has done some heavy-duty signal analysis of his heart rate variability while meditating, running and sleeping, amongst other things. Standout quote:

The irony of getting attached to a practice that teaches detachment got me to take a look at Poincare plots of different styles of Yoga.

The essay also includes an interesting passage about entoptic phenomena (visual phenomena generated “internally” by the nervous system.)

Why I stopped tracking by Alexandra Carmichael is a powerful reminder of the potential drawbacks of self-tracking.

1.2 zettabytes of data

OK, so I was a bit slow to discover this, but The Economist has a special report on big data which is freely available online. That is, the individual articles are free, and a PDF compiling them is supposed to cost 3 GBP, but I was able to download it for free here without doing anything special.

A fun fact that I learned from this report is that the total amount of information in the world this year is projected to reach 1.2 ZB (zettabytes) – which is 1.2×10^21 bytes. How on earth did they come up with that figure…? Anyway, this report is worth a read, as it touches on things like business analytics, web mining, open government data and augmented cognition, while also giving some well deserved love to R and open source software.

CouchDB: MapReduce for the masses

CouchDB, which has been an Apache Software Foundation project for a while now, is a document-oriented database that can be queried with simple JavaScript functions in map/reduce fashion. CouchDB is built on Erlang, a very interesting functional language designed for extreme reliability in the telecom industry. One of the advantages of Erlang is its support for parallelism: add more cores and servers, and the map/reduce queries will go faster. Conventional databases like MySQL or PostgreSQL are much harder to scale across several servers, so if you have built your application around a single database, the end game is usually to buy one really big machine. With technology like CouchDB, that problem largely goes away. This neat interactive demo shows what CouchDB is all about.
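
To show the shape of a map/reduce query, here is a small Python sketch that mimics the flow on an in-memory list of documents. Real CouchDB views are JavaScript functions stored in the database, where the map function calls emit(key, value); everything below (the function names, the "tags" field, the `run_view` helper) is made up for illustration and is not CouchDB's API.

```python
from collections import defaultdict


def map_doc(doc):
    """Yield (key, value) pairs for one document, similar in spirit to emit()."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(key, values):
    """Combine all values that were emitted under the same key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Apply map_fn to every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


posts = [
    {"title": "Self-tracking", "tags": ["health", "data"]},
    {"title": "Big data report", "tags": ["data"]},
    {"title": "Untagged post"},
]
print(run_view(posts, map_doc, reduce_values))  # tag counts per key
```

Since the map step looks at one document at a time, the documents can be split across cores or servers and the partial results combined afterwards, which is what makes this style of query scale out.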

Body computing, preventive, predictive and social medicine

There have been many interesting articles and blog posts about the future of medicine, and specifically about the need to automatically monitor various physiological parameters, and, importantly, to start focusing more on health rather than disease; prevention rather than curing. The latter point has been stressed by Adam Bosworth, the former head of Google Health, in interviews like this one (audio) and this one (video, “The Body 2.0”). Bosworth is one of the founders of a company, Keas, that wants to help people understand their health data, set health goals and pursue them. He has a new blog post where he talks about machine learning in the context of health care. He (probably rightly) sees health care as lagging behind in adoption of predictive analytics. But he thinks this will change:

All the systems emerging to help consumers get personalized advice and information about their health are going to be incredible treasure troves of data about what works. And this will be a virtuous cycle. As the systems learn, they will encourage consumers to increasingly flow data into them for better more personalized advice and encourage physicians to do the same and then this data will help these systems to learn even more rapidly. I predict now that within a decade, no practicing physician will consider treating their patients without the support/advice of the expertise embodied in the machine learning that will have taken place. And finally, we will truly move to an evidence based health care system.

Along similar lines, the Broader Perspective blog writes about the “three tiers of medicine” that may make up the future healthcare system. The first tier consists of automated health monitoring tools that collect information about your health. The second tier is about preventive medicine and involves “health coaches”, who “…incorporate genomic data, together with family history and current phenotype and biomarker data into an overall care plan”. Finally, the third tier is the traditional health care system of today (hospitals, doctors, nurses).

I learned a new term for the enabling technology for the first (data-collection) tier: body computing. The Third Body Computing Conference will be hosted by the University of Southern California on Friday (9 October). The conference’s definition of body computing is that

“Body Computing” refers to an implanted wireless device, which can transmit up-to-the-second physiologic data to physicians, patients, and patients’ loved ones.

A new article about the future of health care in Fast Company also talks about body computing and predictive/preventive health care:

Wireless monitoring and communication devices are becoming a part of our everyday lives. Integrated into our daily activities, these devices unobtrusively collect information for us. For example, instead of doing an annual health checkup (i.e. cardiac risk assessment), near real-time health data access can be used to provide rolling assessments and alert patients of changes to their health risk based on biometrics assessment and monitoring (blood pressure, weight, sleep etc). With predictive health analytics, health information intelligence, and data visualization, major risks or abnormalities can be detected and sent to the doctor, possibly preempting complications such as stroke, heart attack, or kidney disease.

Although the article is named The Future of Health Care Is Social, it actually talks mostly about self-tracking and predictive analytics. It does go into social aspects of future healthcare, like online health/disease-related networks such as PatientsLikeMe or CureTogether. All in all, a nice article.

And finally (if anyone is still awake), it has been widely reported that IBM has joined the sequencing fray and is trying to develop a nanopore-based system, a “DNA transistor”, for cheap sequencing. There are now several players in this area (for example, Oxford Nanopore, Pacific Biosciences, NABSYS) and some of them are bound to lose out – time will tell who will come out on top. Anyway, the reason I mention this is partly that IBM explicitly connected this announcement to healthcare reform and personalized healthcare (IBM CEO also wants to resequence the health-care system) and partly because of the surprising (to me) fact that “[...] IBM also manages the entire health system for Denmark.” Really?

By the way, a good way to get updates on body computing is to follow Dr Leslie Saxon on Twitter.
