Follow the Data

A data driven blog

Archive for the tag “statistics”

Cool statisticians

I think this New York Times article is the third one I’ve read during the past few months claiming that statisticians will be the big winners on the job market in the next ten years. I wonder whether this trend will be largely confined to U.S. giants like Google and IBM or will spread to other countries; I tend to think it will spread. As economist Erik Brynjolfsson of MIT’s Center for Digital Business says in the article:

“We’re rapidly entering a world where everything can be monitored and measured,” said Erik Brynjolfsson, an economist and director of the Massachusetts Institute of Technology’s Center for Digital Business. “But the big problem is going to be the ability of humans to use, analyze and make sense of the data.”

… and these problems are going to be felt almost everywhere. By the way, Brynjolfsson’s quote is a pretty good description of the themes I want to explore in this blog.

Reverse engineering social security numbers

The latest issue of PNAS (Proceedings of the National Academy of Sciences of the United States of America; a well-known scientific journal) contains two interesting pieces of statistical analysis. Luckily, they are both freely downloadable even if you don’t have access to a subscription.

"Predicting Social Security numbers from public data" claims that US Social Security numbers (SSNs), which are supposed to be confidential, are in fact predictable to a certain extent, at least for younger people, given information such as birth date and location. Basically, the authors (from Carnegie Mellon University) have tried to reverse-engineer the SSN assignment process using available information about it, including the so-called SSA Death Master File, which is publicly available and contains data about SSN assignments for people who have been reported as dead.

The authors detected various correlations between, for example, date of birth and all nine digits of the SSN, and eventually (after much visual inspection and several rounds of model refinement) constructed a regression model for predicting digits in an SSN based on birth date. They managed to correctly predict the SSN of 8.5% of deceased individuals in fewer than 1,000 tries.

Naturally, this opens up possibilities for identity theft, for example, and raises the question of whether Social Security numbers should be replaced by something else.
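To make the idea of "a regression model plus a limited number of tries" a bit more concrete, here is a minimal sketch in Python. It is not the authors' model and has nothing to do with the real SSN assignment scheme: the data are entirely synthetic, and the names `serials`, `fit_line`, `tries_needed` and `MAX_TRIES` are my own, made up for illustration. It assumes identifiers that are handed out roughly in sequence by day, fits a straight line to them, and counts how often the true value would be reached within 1,000 guesses when candidates are tried outward from the prediction.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return mean_y - slope * mean_x, slope


# Synthetic data (an assumption, not real records): one identifier per day,
# increasing by about 40 per day, with a deterministic jitter of up to +/-1000.
days = list(range(365))
serials = [40 * d + 20 * ((37 * d) % 101 - 50) for d in days]

intercept, slope = fit_line(days, serials)

MAX_TRIES = 1000


def tries_needed(day, true_serial):
    """Upper bound on guesses when trying guess, guess+1, guess-1, guess+2, ..."""
    guess = round(intercept + slope * day)
    distance = abs(true_serial - guess)
    return 2 * distance + 1


hits = sum(tries_needed(d, s) <= MAX_TRIES for d, s in zip(days, serials))
hit_rate = hits / len(days)
print(f"fitted slope: {slope:.2f} per day")
print(f"share found within {MAX_TRIES} tries: {hit_rate:.1%}")
```

The point of the sketch is only that a prediction does not have to be exact to be useful to an attacker: it just has to shrink the search space enough for a modest number of guesses to succeed.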

Another study in the latest PNAS, "NIH funding trajectories and their correlations with US health dynamics from 1950 to 2004", suggests that funding of research relating to certain diseases leads to a time-lagged decrease in deaths due to those diseases; in other words, the research appears to pay off with a time lag. In order to do their analysis, the authors compiled data on NIH (the US National Institutes of Health) funding starting in 1937 and compared it with mortality data for cardiovascular disease, stroke, cancer, and diabetes.
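A simple way to look for this kind of delayed relationship is to correlate one series with shifted copies of the other and see which shift gives the strongest (here, most negative) correlation. The sketch below does that in plain Python. It is only an illustration of the general technique, not necessarily what the authors did, and both series are synthetic: the `funding` and `mortality` values, the built-in delay `TRUE_LAG` and the helper names are all assumptions of mine. Real data would be noisy and would need far more care (trends, confounders and so on).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    n = len(driver)
    return {
        lag: pearson(driver[: n - lag], response[lag:])
        for lag in range(max_lag + 1)
    }


# Synthetic series (assumptions, not NIH data): funding grows with a slow wave,
# and mortality falls in response to funding from TRUE_LAG years earlier.
TRUE_LAG = 5
years = range(60)
funding = [50 + 0.5 * t + 10 * math.sin(t / 4) for t in years]
mortality = [100 - 0.8 * funding[max(t - TRUE_LAG, 0)] for t in years]

corrs = lagged_correlations(funding, mortality, max_lag=10)
best_lag = min(corrs, key=corrs.get)  # lag with the most negative correlation

for lag, r in corrs.items():
    print(f"lag {lag:2d}: r = {r:+.4f}")
print("most negative correlation at lag", best_lag)
```

Because the synthetic mortality series is built directly from lagged funding, the scan recovers the delay that was put in. With real series, a strong lagged correlation is a hint worth following up, not proof that the funding caused the decline.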
