The latest issue of PNAS (Proceedings of the National Academy of Sciences of the United States of America; a well-known scientific journal) contains two interesting pieces of statistical analysis. Luckily, they are both freely downloadable even if you don’t have access to a subscription.
"Predicting Social Security numbers from public data" claims that US Social Security numbers (SSNs), which are supposed to be confidential, are in fact predictable to a certain extent, at least for younger people, given information such as birth date and location. Basically, the authors (from Carnegie Mellon University) tried to reverse-engineer the SSN assignment process using available information about it. This included the so-called SSA Death Master File, which is publicly available and contains data about SSN assignments for people who have been reported dead.
The authors detected various correlations between, for example, date of birth and all nine digits of the SSN, and eventually (after much visual inspection and several rounds of model refinement) constructed a regression model for predicting the digits of an SSN from birth date. They managed to correctly predict the SSN of 8.5% of deceased individuals in fewer than 1,000 tries.
Naturally, this opens up possibilities for identity theft, among other things, and raises the question of whether Social Security numbers should be replaced by something else.
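To get a feel for why a regression on birth date can shrink the search space so dramatically, here is a minimal toy sketch in Python. It is not the authors' model: the data are synthetic, the assumption that an identifier grows roughly linearly with birth day is invented purely for illustration, and the names `fit_line` and `tries_needed` are hypothetical helpers. The sketch fits a straight line on half of the records and then counts how many of the remaining identifiers would be found within 1,000 guesses when candidates are tried outward from the prediction.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def tries_needed(predicted, actual):
    """Guesses needed when trying p, p+1, p-1, p+2, p-2, ... around p."""
    d = actual - round(predicted)
    if d == 0:
        return 1
    return 2 * d if d > 0 else 2 * (-d) + 1


# Synthetic records (NOT real SSN data). Assumption for illustration only:
# the identifier increases by about 40 per day of birth, plus random noise.
rng = random.Random(0)
days = [int(rng.random() * 3650) for _ in range(2000)]
numbers = [int(40 * d + rng.gauss(0, 300)) for d in days]

# Fit on the first half, evaluate on the second half.
train_days, test_days = days[:1000], days[1000:]
train_nums, test_nums = numbers[:1000], numbers[1000:]
intercept, slope = fit_line(train_days, train_nums)

hits = sum(
    tries_needed(intercept + slope * d, n) <= 1000
    for d, n in zip(test_days, test_nums)
)
hit_rate = hits / len(test_days)

print(f"fitted slope: {slope:.2f}")
print(f"share guessed within 1,000 tries: {hit_rate:.1%}")
```

The point of the toy is only that once the prediction error is small compared with the number of allowed guesses, a large share of identifiers become reachable by brute force. The real study had to deal with far messier assignment patterns.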
Another study in the latest PNAS, "NIH funding trajectories and their correlations with US health dynamics from 1950 to 2004", suggests that funding of research relating to certain diseases leads to a time-lagged decrease in deaths due to those diseases. In other words, the research appears to pay off, but only after a delay. To do their analysis, the authors compiled data on NIH (the US National Institutes of Health) funding starting in 1937 and compared them with mortality data for cardiovascular disease, stroke, cancer, and diabetes.
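The notion of a time-lagged relationship can be illustrated with a generic lagged-correlation sketch in Python. This is not the method used in the paper, which is not described here. The numbers are made up, and the series is constructed so that "deaths" respond to "funding" exactly three steps later. The helper names `pearson` and `lagged_correlation` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(cause, effect, lag):
    """Correlate cause[t] with effect[t + lag] for a non-negative lag."""
    return pearson(cause[: len(cause) - lag], effect[lag:])


# Made-up funding series (arbitrary units), for illustration only.
funding = [2, 5, 3, 8, 7, 4, 9, 12, 6, 11, 15, 10, 14, 18]

# Constructed assumption: deaths fall in proportion to funding 3 steps earlier.
TRUE_LAG = 3
deaths = [100] * TRUE_LAG + [100 - 2 * f for f in funding[:-TRUE_LAG]]

# Scan a range of lags and pick the one with the most negative correlation.
correlations = {lag: lagged_correlation(funding, deaths, lag) for lag in range(6)}
best_lag = min(correlations, key=correlations.get)

for lag, r in correlations.items():
    print(f"lag {lag}: r = {r:+.3f}")
print(f"most negative correlation at lag {best_lag}")
```

By construction the correlation at lag 3 is -1 (up to rounding), so the scan recovers the built-in delay. With real funding and mortality series the picture is much noisier, and a strong lagged correlation alone does not establish that the funding caused the decline.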