Follow the Data

A data driven blog

Archive for the month “December, 2009”

Link roundup

Gearing up into Christmas mode, so no proper write-up for these (interesting) links.

Personalized medicine is about data, not (just) drugs. Written by Thomas Goetz of The Decision Tree for Huffington Post. The Decision Tree also has a nice post about why self-tracking isn’t just for geeks.

A Billion Little Experiments (PDF link). An eloquent essay/report about “good” and “bad” patients and doctors, compliance, and access to your own health data.

Latent Semantic Indexing worked well for Netflix, but not for dating. MIT Technology Review writes about how the algorithms used to match people on a dating site (based on latent semantic indexing / SVD) are close to worthless. A bit lightweight, but a fun read.

A podcast about data mining in the mobile world. Featuring Deborah Estrin and Tom Mitchell. Mitchell recently wrote an article in Science about how data mining is changing: Mining Our Reality (subscription needed). The take-home message (or one of them) is that data mining is becoming much more real-time oriented. Data are increasingly being analyzed on the fly and used to make quick decisions.

How Zeo, the sleep optimizer, actually works. I mentioned Zeo in a blog post in August.


Experiments on the Amazon Mechanical Turk

The Amazon Mechanical Turk is a pretty remarkable marketplace for work which hooks up “requesters” who need to repeat a simple and often tedious task a large number of times with a large pool of “workers” who will complete the task for a small amount of money. It’s a way to have access to an on-demand global workforce at any hour – a bit like the human version of the Amazon Elastic Compute Cloud perhaps. The Mechanical Turk tasks require human intelligence to solve (if they didn’t, you might as well write a program), and many of the requested tasks look like they will be used as benchmarks for predictive algorithms. For example, one of the active tasks right now specifies that the requester will show the worker a LinkedIn profile and then ask him/her to judge which of ten other profiles it is similar to. The payment for completing the task is $0.03 and the allotted time is 10 minutes. This seems like an interesting way to get class labels for calibrating a machine learning algorithm that finds people with similar profiles in LinkedIn.

Experimental Turk is a cool project which leverages the Mechanical Turk for (social science) research purposes. They have recreated several classical experiments in social science and economics by this “crowd-labor” approach. As far as I can tell, most of the previously published, “classical” results have been reconfirmed by the Experimental Turk. For instance, in the “anchoring” experiment (more about anchoring here), 152 workers were asked a question about how many African countries are in the United Nations.

Approximately half of the participants was asked the following question:

Do you think there are more or less than 65 African countries in the United Nations?

The other half was asked the following question:

Do you think there are more or less than 12 African countries in the United Nations?

Both the groups were then asked to estimate the number of African countries in the United Nations. As expected, participants exposed to the large anchor (65) provided higher estimates than participants exposed to the small anchor (12) […] It should be noted that means in our data (42.6 and 18.5 respectively) are very similar to those recently published by Stanovich and West (2008; 42.6 and 14.9 respectively).

The other experiments are also about different kinds of unconscious biases and heuristics. Fun stuff. Scintillae has a guide to doing experiments on the Mechanical Turk.


Just discovered Timetric, another online data visualization/analysis company. They’re based in London and are focusing mostly on time series analysis (which is reflected in the name, of course). Timetric’s platform is pretty open, with good sharing/publishing/embedding functions and an API with client libraries for Python, Ruby and LabVIEW. There’s also a “formula language” for combining different time series using an Excel-like syntax. It’s fun to play around with the 140,000+ time series that they’ve made available, although the data seem rather Britain-centered at the moment.

Oh, and they have two blogs as well: Byline, which is about data in the news, and the Timetric Blog, which is more about the product itself.

War data

An interesting article about analyzing and modeling data from wars and conflicts was published in Nature yesterday. At first glance, it looked like one of those “look, we found another power law” papers, but after I had read this interview with one of the authors, I changed my mind – it’s really quite interesting.

The authors compiled “… a collection of state-of-the-art datasets for a wide range of modern wars from […] a range of sources including NGO reports, media streams, governmental databases and social scientists who are experts in specific conflicts. […] The result was a database of over 54,000 unique events covering 11 different wars. The data collection method utilized an open-source intelligence methodology.”

Even this data-collection effort is interesting in itself. According to Sean Gourley (the researcher interviewed in the link above), the statistical pattern remains similar no matter what data source (governmental/academic/mass media) you are looking at.

One of the important results of the statistical analysis was that there is a “… common pattern in both the size and timing of violent events within modern insurgent wars […] observed […] across multiple different conflicts from Iraq to Sierra Leone [and] independent of geography, ideology, politics or religion.” The patterns are not observed in older wars and thus seem to be unique to modern wars.

Also, attacks aren’t randomly distributed across a conflict but tend to be clustered together. Gourley has an intriguing explanation for this: “The cause of this clustering is coordination via a global signal and competition amongst groups for media exposure and resources.”

The authors also went beyond the statistical analysis and set up a mathematical model describing the structure of conflicts – they even call it a “unified theory of insurgency”.

The interview with Gourley mentions a friend of a friend, Aaron Clauset, who has done similar work on the statistics of terrorist attacks. Incidentally (coming back to the ubiquitous “x is power-law distributed” papers I mentioned in the beginning), one of Aaron’s papers contains very useful methodology for ruling out that a distribution is a power law.
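For the curious, here is a minimal sketch of one common way to approach such a check. I am not claiming it is exactly what the paper does. It assumes a continuous power law above a known cutoff `xmin`, fits the exponent by maximum likelihood, and measures the Kolmogorov-Smirnov distance between the data and the fitted model. The synthetic sample, the function names and the choice of `xmin = 1.0` are all made up for illustration.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n synthetic values from a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    # Inverse-transform sampling; 1 - rng.random() lies in (0, 1], so no division by zero.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, xmin):
    """Maximum-likelihood estimate of the exponent, using only values >= xmin."""
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(xs, alpha, xmin):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    tail = sorted(x for x in xs if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (-(alpha - 1.0))
        # Compare against the empirical CDF just below and just above x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


# Hypothetical data: 10,000 draws from a true power law with exponent 2.5.
rng = random.Random(2009)
data = sample_power_law(10_000, alpha=2.5, xmin=1.0, rng=rng)

alpha_hat = fit_alpha(data, xmin=1.0)
distance = ks_distance(data, alpha_hat, xmin=1.0)
print(f"estimated alpha: {alpha_hat:.2f}, KS distance: {distance:.4f}")
```

A small distance only says that a power law is not obviously wrong. To actually rule one out you would also need to pick `xmin` from the data, compare the distance with what simulated power-law samples give, and test alternatives such as a log-normal.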

Far-out stuff

Some science fiction-type nuggets from the past few weeks:

Google does machine learning using quantum computing. Apparently, a “quantum algorithm” called Grover’s algorithm can search an unsorted database in O(√N) time. The Google blog explains this in layman’s terms:

Assume I hide a ball in a cabinet with a million drawers. How many drawers do you have to open to find the ball? Sometimes you may get lucky and find the ball in the first few drawers but at other times you have to inspect almost all of them. So on average it will take you 500,000 peeks to find the ball. Now a quantum computer can perform such a search looking only into 1000 drawers.
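For readers who want to see where numbers like these come from, here is a small classical simulation of Grover's search. It is only a sketch under stated assumptions: the amplitudes are tracked in an ordinary Python list, so there is no speed-up, and the function names and the 1,024-item toy size are illustrative choices, not anything from Google or D-Wave. The textbook result is that roughly (π/4)·√N iterations are enough, which is on the order of the √N = 1000 peeks mentioned in the quote.

```python
import math


def optimal_iterations(n_items):
    """Textbook number of Grover iterations for one marked item: floor(pi/4 * sqrt(N))."""
    return math.floor(math.pi / 4 * math.sqrt(n_items))


def grover_success_probability(n_items, marked, iterations):
    """Simulate Grover's search on a real-valued state vector and return
    the probability of measuring the marked item."""
    amp = [1.0 / math.sqrt(n_items)] * n_items   # start in a uniform superposition
    for _ in range(iterations):
        amp[marked] = -amp[marked]               # oracle: flip the sign of the marked item
        mean = sum(amp) / n_items
        amp = [2.0 * mean - a for a in amp]      # diffusion: invert every amplitude about the mean
    return amp[marked] ** 2


# Toy cabinet with 1,024 drawers (small enough to simulate quickly).
n_small = 1024
k = optimal_iterations(n_small)
p = grover_success_probability(n_small, marked=123, iterations=k)
print(f"{n_small} drawers: {k} iterations, success probability {p:.4f}")

# The million-drawer cabinet from the quote, arithmetic only.
n_drawers = 1_000_000
classical_average = (n_drawers + 1) / 2          # expected peeks when opening drawers one by one
quantum_queries = optimal_iterations(n_drawers)
print(f"{n_drawers} drawers: ~{classical_average:.0f} classical peeks on average, "
      f"{quantum_queries} Grover iterations")
```

Each iteration nudges a little more probability onto the marked drawer, and after about √N rounds almost all of it sits there. Running more iterations than that makes the success probability go down again.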

I’ve absolutely no clue how this algorithm works – although I did take an introductory course in quantum mechanics many a moon ago, I’ve forgotten everything about it and the course probably didn’t go deep enough to explain it anyway. Google are apparently collaborating with a Canadian company called D-Wave, who develop hardware for realizing something called a “quantum adiabatic algorithm” by “magnetically coupling superconducting loops”. It is interesting that D-Wave are explicitly focusing on machine learning; the home page states that “D-Wave is pioneering the development of a new class of high-performance computing system designed to solve complex search and optimization problems, with an initial emphasis on synthetic intelligence and machine learning applications.”

Speaking of synthetic intelligence, the winter issue of H+ Magazine contains an article by Ben Goertzel where he discusses the possibility that the first artificial general intelligence will arise in China. The well-known AI researcher Hugo de Garis, who runs a lab in Xiamen in China, certainly believes that this will happen. In his words:

China has a population of 1.3 billion. The US has a population of 0.3 billion. China has averaged an economic growth rate of about 10% over the past 3 decades. The US has averaged 3%. The Chinese government is strongly committed to heavy investment into high tech. From the above premises, one can virtually prove, as in a mathematical theorem, that China in a decade or so will be in a superior position to offer top salaries (in the rich Southeastern cities) to creative, brilliant Westerners to come to China to build artificial brains — much more than will be offered by the US and Europe. With the planet’s most creative AI researchers in China, it is then almost certain that the planet’s first artificial intellect to be built will have Chinese characteristics.

Some other arguments in favor of this idea mentioned in the article are that “One of China’s major advantages is the lack of strong skepticism about AGI resulting from past failures” and that China “has little of the West’s subliminal resistance to thinking machines or immortal people”.

(By the way, the same issue contains a good article by Alexandra Carmichael on subjects frequently discussed on this blog. The most fascinating detail from that article, to me, was when she mentions “self-organized clinical trials”; apparently users of PatientsLikeMe with ALS had set up their own virtual clinical trial where some of them started to take lithium and some didn’t, after which the outcomes were compared.)

Finally, I thought this methodology for tagging images with your mind was pretty neat. This particular type of mind reading does not seem to have reached a high specificity and sensitivity yet, but that will improve in time.

Mass e-epidemiology

The LifeGene project, which was recently started in Sweden, may in due time generate one of the most complex and interesting data sets ever. The project will study health, lifestyle and genetics (and much more) in the long term in a cohort of 500,000 (this is not a typo!) individuals. Participants will donate blood samples and be subjected to physical measurements (waist and hip circumference, blood pressure etc), but for a smaller subset of participants the study will really go deep, with global analysis of DNA, RNA, protein, metabolite and toxin levels, as well as epigenomics (simplifying a bit, this means genomic information that is not directly encoded in the DNA sequence). Two testing centres have opened during the fall – one in Stockholm and, more recently, one in Umeå.

Environmental factors will be examined too: “Exposures such as diet, physical activity, smoking, prenatal environment, infections, sleep-disorders, socioeconomic and psychosocial status, to name a few, will be assessed.” The data collection will be done through, for instance, mobile phones and the web, with sampling rates adjusted based on age and life events. The project consortium calls the approach e-epidemiology.

This might make each participant feel a bit like David Ewing Duncan, the man who decided to try as many genetic, medical and toxicological tests on himself as he could, and wrote a book about it. Will they suffer from information overload from self-related data? For the statisticians involved, information overload is a certainty. It will be a tough – but interesting – task to collect, store and mine these data. But exactly this kind of project, which relates hereditary factors to environment and lifestyle and correlates these to outcomes (like disease states), is much needed.

The fourth paradigm

A new book about science in the age of big data, The Fourth Paradigm: Data-Intensive Scientific Discovery, is available for downloading (for free). The book was reviewed in Nature today. It’s written by people from Microsoft Research and has a foreword by Gordon Bell, one of the authors of Total Recall: How the E-memory Revolution Will Change Everything.

Individualized cancer research

I have been intrigued for some time by Jay Tenenbaum’s idea to forget about clinical cancer trials and focus on deep DNA and RNA (and perhaps protein) profiling of individual patients in order to optimize a treatment especially for the given patient. (See e.g. this earlier blog post about his company, CollabRx.)

Tenenbaum and Leroy Hood of the Institute for Systems Biology recently wrote about their ideas in an editorial called A Smarter War on Cancer:

One alternative to this conventional approach would be to treat a small number of highly motivated cancer patients as individual experiments, in scientific parlance an “N of 1.” Vast amounts of data could be analyzed from each patient’s tumor to predict which proteins are the most effective targets to destroy the cancer. Each patient would then receive a drug regimen specifically tailored for their tumor. The lack of “control patients” would require that each patient serve as his or her own control, using single subject research designs to track the tumor’s molecular response to treatment through repeated biopsies, a requirement that may eventually be replaced by sampling blood.

This sounds cool, but my gut feeling has been that it’s probably not a realistic concept yet. However, I came across a blogged conference report that suggests there may be some value in this approach already. MassGenomics writes about researchers in Canada who decided to try to help an 80-year-old patient with a rare type of tumor (an adenocarcinoma of the tongue). This tumor was surgically removed but metastasized to the lungs and did not respond to the prescribed drug. The researchers then sequenced the genome (DNA) and transcriptome ([messenger] RNA) of the tumor and a non-tumor control sample. They found four mutations that had occurred in the tumor, and also identified a gene that had been amplified in the tumor and against which there happened to be a drug available in the drug bank. Upon treatment with this drug, all metastases vanished – but unfortunately came back in a resistant form several months later. Still, it is encouraging to see that this type of genome study can be used to delay the spread of tumors, even if just for a couple of months.

A while back, MIT Technology Review wrote about a microfluidic chip which is being used in a clinical trial for prostate cancer. This chip from Fluidigm is meant to analyze gene expression patterns in rare tumor cells captured from blood samples. It is hoped that the expression signatures will be predictive of how different patients respond to different medications. Another microfluidic device from Nanosphere has been approved by the U.S. Food and Drug Administration to be used to “…detect genetic variations in blood that modulate the effectiveness of some drugs.” This would take pharmacogenomics – the use of genome information to predict how individuals will respond to drugs – into the doctor’s office.

“You could have a version of our system in a molecular diagnostics lab running genetic assays, like those for cystic fibrosis and warfarin, or in a microbiology lab running virus assays, or in a stat lab for ER running tests, like the cardiac troponin test, a biomarker to diagnose heart attack, and pharmacogenomic testing for [Plavix metabolism],” says [Nanosphere CEO] Moffitt.

Update 10 Dec:

(a) Rick Anderson commented on this post and pointed to Exicon, a company that offers, among other things, personalized cancer diagnostics based on micro-RNA biomarkers.

(b) Via H+ Magazine, I learned about the Pink Army Cooperative, who do “open source personal drug development for breast cancer.” They want to use synthetic biology to make “N=1 medicines”, that is, drugs developed for one person only. They “…design our drugs computationally using public scientific knowledge and diagnostic data collected from the individual to be treated.”

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network (CKAN), which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data”. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way compared to other services like Amazon’s Public Data Sets, for instance.

Companies specializing in helping users make sense out of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs Data Applied’s file size of <100 kB for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.
