Food and health data set
I stumbled into an amazing data set about food and health, available online here (Google spreadsheet) and described at the Canibais e Reis blog. I found it through the Cluster analysis of what the world eats blog post, which is cool, but which doesn’t go into the health part of the data set. By the way, the R code used in that blog post is useful for learning how to plot things onto a map of the world in R (and it calculates the most deviant food habits in Mexico and USA as a bonus). Also note the first line of that code, which reads the data set directly from a URL into an R data structure, ready to be manipulated. I think it’s pretty neat, but then I am easily impressed.
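The same read-straight-into-a-data-structure idea can be sketched outside R as well. The Python snippet below is only an illustration, not the code from that blog post: `read_table` is a hypothetical helper, the tiny CSV text is made up, and the real spreadsheet URL is deliberately left as a placeholder in the comment.

```python
import csv
import io


def read_table(stream):
    """Parse CSV text from a file-like object into a list of dicts, one per row."""
    return list(csv.DictReader(stream))


# Made-up stand-in for the downloaded spreadsheet (not real data).
sample = io.StringIO("country,cheese,rice\nA,1.5,80\nB,20.0,3\n")
rows = read_table(sample)
print(rows[0]["country"], float(rows[1]["cheese"]))

# For a real URL the same helper could be fed from the standard library:
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:   # url is a placeholder
#       rows = read_table(io.TextIOWrapper(resp, encoding="utf-8"))
```

Everything comes back as strings here, so numeric columns need an explicit `float(...)` before any analysis.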
The Canibais e Reis author was interested in data on the relationship between nutrition, lifestyle and health worldwide, but those data were dispersed over various sources and used different formats. He therefore (heroically) combined information from sources like the FAO Statistical Yearbook (for world nutrition data), the British Heart Foundation (for world statistics on heart disease, diabetes, obesity, cholesterol and so on) and the WHO Global Health Atlas and WHO Statistical Information System (for general world health statistics like mortality, sanitation, drinking water, etc.). After cleaning up the data set and removing incomplete entries, he ended up with a complete matrix of 101 nutrition, health and lifestyle variables for 86 countries. Let the mining begin!
As the blog post describing the data points out, there are bound to be many confounding variables and a lot of non-independence in the data set, so it would be a good idea to apply tools like PCA (see e.g. the recent article Principal Components for Modeling), canonical correlation analysis or something similar as a pre-processing step. I haven’t had time to do more than fiddle around a bit. For example, I ran a quick PCA on the food-related part of the matrix to try to find the major direction of variation in world diets. The first principal component (which, at 19.8% of the variance, is not very dominant) reflects a division between rice-eating countries and “meat and wheat” countries with high consumption of animal products, wheat, meat and sugar.
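For readers who want to try something similar, here is a minimal PCA sketch in Python with NumPy. It is not the analysis described above: the matrix is random, made-up data standing in for the food columns, the `pca` helper is hypothetical, and standardising each variable before the decomposition is an assumption about how one might treat variables measured in different units.

```python
import numpy as np


def pca(X):
    """Return (explained_variance_ratio, components) for a rows-by-variables matrix."""
    # Centre and scale each column so no variable dominates through its units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions; s holds the singular values.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    var = s**2 / (Z.shape[0] - 1)
    return var / var.sum(), Vt


# Made-up matrix: 6 "countries" x 3 "food" variables, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X[:, 1] = -X[:, 0] + 0.1 * rng.normal(size=6)  # two strongly opposed variables

ratio, components = pca(X)
print("share of variance on PC1:", round(float(ratio[0]), 3))
print("loadings on PC1:", np.round(components[0], 2))
```

The first entry of `ratio` plays the role of the 19.8% figure above, and opposite-signed loadings on the first component are what a contrast such as "rice" versus "meat and wheat" would look like.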
Canibais e Reis provides a dynamic Excel file where several types of analysis have been performed. It’s fun to explore the unexpected correlations (or absent correlations) that pop up (the worksheets BEST and WORST in the Excel file). One surprising finding that emerges is that cholesterol is not correlated with cardiovascular disease across this data set (in fact there is a slight negative correlation).
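A correlation of this kind is easy to compute by hand. The sketch below implements the ordinary Pearson coefficient in plain Python. Whether the Excel file uses Pearson or some other measure is an assumption, and the two short lists are made-up numbers with neutral names, not values from the data set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up per-country values for two variables (illustration only).
var_a = [4.8, 5.1, 5.5, 5.9, 6.0]
var_b = [310, 280, 300, 250, 260]

r = pearson(var_a, var_b)
print("correlation:", round(r, 2))
```

A value near zero or below it across countries says nothing about causation for individuals, which is where the confounding variables mentioned earlier come in.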
My favourite finding, though, is that cheese consumption is not positively correlated with death from non-communicable diseases or cardiovascular diseases. Those correlations may be massively influenced by confounding variables, but they are negative enough that I choose to continue chomping on those cheeses …