Follow the Data

A data driven blog

Three fun new Google tools

Far be it from me to be a Google cheerleader, but they’ve recently released three interesting new products that have left me a bit impressed.

The first one is the Books Ngram Viewer. This is linked to a fresh Science article, Quantitative Analysis of Culture Using Millions of Digitized Books. Google has gotten quite far in digitizing, indexing and mining the contents of a substantial part of all available books, and the Science article presents a kind of first-pass, bird’s-eye-view analysis. Calling this the “culturome”, as they have done, is a bit silly, but the amazing N-gram data sets they have made available compensate amply for that lapse of taste. These data are not only fun to play with in the Ngram Viewer; they will also be extremely useful as a background data set for text analysis applications and for linguistics in general.

In the Ngram Viewer, you can create timelines for the occurrence of single words or phrases of up to five words (5-grams). I’ve only played around a little, but I’ve already been baffled by how often Norway is mentioned in English literature between 1900 and 1950 compared to its Nordic neighbours. I have also wondered about the blip you get around the year 1900 when searching for “DNA” and “gene”. The double-helix structure of DNA was published in 1953, and this is reflected in the graph, but why is DNA mentioned more often, and seemingly in correlation with “gene”, around 1900? I suppose this may be some kind of dating artifact, where books dated ’00 are assigned to 1900 instead of 2000, or something like that.
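For readers who have not met the term, here is a minimal Python sketch of what an n-gram timeline amounts to. It is not Google’s code: the toy corpus, the function names and the choice to normalize by the number of n-grams in each year are all assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to `phrase`.

    `corpus` is a dict of year -> text. Tokenization is a plain
    whitespace split, which is far cruder than any real pipeline.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    timeline = {}
    for year, text in sorted(corpus.items()):
        grams = ngrams(text.lower().split(), n)
        counts = Counter(grams)
        # Assumed normalization: occurrences divided by all n-grams that year.
        timeline[year] = counts[target] / len(grams) if grams else 0.0
    return timeline

# Hypothetical toy corpus, one short "book" per year.
toy_corpus = {
    1900: "the fjords of norway are deep and the coast of norway is long",
    1950: "the gene is made of dna and the gene is copied",
}

if __name__ == "__main__":
    for year, share in yearly_frequency(toy_corpus, "of norway").items():
        print(year, round(share, 3))
```

A phrase of up to five words works the same way; only `n` changes.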

The second new Google tool is the Body Browser. I needed to install Chrome to be able to use this, but that’s not much of a hassle. The Body Browser is simply an interactive anatomical model of a human body. You can zoom in and out and activate or deactivate display layers representing the nervous system, the circulatory system, bones and muscles. You can also search for body parts (with auto-completion, which helps with hard-to-spell Latin terms). A nifty tool.

Finally, I’ve been playing around a bit with Google Refine. This “power tool for messy data” used to be called Gridworks back when it was being developed by Metaweb, which was later purchased by Google. This tool is pretty great for cleaning up data sets afflicted with things like inconsistent labels and lots of spelling errors. The tool allows you to cluster entries in a column and thus spot entries that were supposed to be identical but differ because of spelling errors or other inconsistencies introduced by whoever input the data.
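To give a feel for the clustering idea, here is a minimal Python sketch of grouping labels by a normalized key. It is only in the spirit of key-collision clustering and is not Refine’s actual implementation; the `fingerprint` and `cluster` names and the sample labels are made up for illustration.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize a label: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical messy column.
labels = ["New York", "new york ", "York, New", "Boston", "boston.", "Chicago"]

if __name__ == "__main__":
    for group in cluster(labels):
        print(group)
```

A key like this catches differences in case, punctuation, whitespace and word order. Genuine misspellings such as “Bostno” would need a fuzzier comparison, for example one based on edit distance.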


2 thoughts on “Three fun new Google tools”

  1. For an explanation of the blips around 1900 (they occur in 1899 and 1905) for modern terms, see: http://www.newscientist.com/blogs/shortsharpscience/2010/12/our-adventures-in-culturomics.html

    It’s due to artefacts in the publication date metadata.
