Follow the Data

A data driven blog

Archive for the tag “Google”

Data data data

I haven’t used Google Plus much since I signed up this summer, but that is changing now that they have launched the “communities” concept and I have found the Data data data and Machine Learning communities, where “big names” and “smart unknowns” alike take part in a lot of interesting discussions. Check them out if you haven’t done so.

Machine learning

While preparing for our next podcast recording, I came across some interesting recent machine learning developments.

The Protocols and Structures for Inference (PSI) project aims to develop an architecture for presenting machine learning algorithms, their inputs (data) and outputs (predictors) as resource-oriented RESTful web services in order to make machine learning technology accessible to a broader range of people than just machine learning researchers.

Why?

Currently, many machine learning implementations (e.g., in toolkits such as Weka, Orange, Elefant, Shogun, SciKit.Learn, etc.) are tied to specific choices of programming language, and data sets to particular formats (e.g., CSV, svmlight, ARFF). This limits their accessability [sic], since new users may have to learn a new programming language to run a learner or write a parser for a new data format, and their interoperability, requiring data format converters and multiple language platforms.

I think it seems promising. The specification is here.

  • BigML, which has been mentioned in passing on this blog, has now published some videos of what the interface actually looks like. It seems quite nice. While watching the videos, I was thinking “OK, this looks really nice, but does it have an API?” Luckily, it turns out that it does, which is good news for us geekier people who don’t just want to use the GUI.
  • Machine learning in Google Goggles. A video describing some real cutting-edge ML research behind Google’s image-recognition app, Google Goggles. Definitely worth checking out.

Google Prediction API open to all

I’ve been eagerly waiting to use the Google Prediction API ever since it was announced, and now (since sometime in May) it’s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address.

Google’s Prediction API is basically a nice way to run your classification and/or prediction tasks through Google’s black-box set of machine learning tools. The way it works is that you upload your training data to Google Storage, which is something like Google’s version of Amazon’s S3: a cloud-based storage system where you store your data in “buckets”. (Google Storage, like S3, uses the term bucket and, also like S3, requires that any letters in bucket names be lower-case.) You can activate both Google Storage and the Prediction API from the Google APIs Console. This is also where you will find the access key that you will need to run prediction tasks (click “API access” in the left-hand menu). You’ll have to give credit card details to pay for potential future usage.

The training examples that you put in Storage need to be formatted according to the specification in the Developer’s Guide. Once they have been uploaded, you can train a model on the uploaded data, make predictions about new examples, update existing models and more, using one of the client libraries or, even simpler, just by copying some of the bash scripts shown on the same page (hidden behind ‘+’ signs that can be expanded). For these bash scripts to work as written on that page, you need to paste your API key into a file called ‘googlekey’ located in the directory from which you are running the script.
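To make the preparation step a bit more concrete, here is a minimal Python sketch of how one might write out training rows and read the key file. It assumes the training format is plain CSV with the label in the first column, the features after it, and no header row; check the Developer’s Guide for the authoritative layout. The function names and the toy examples are my own inventions for illustration, not part of any Google client library.

```python
import csv
import io
from pathlib import Path


def format_training_csv(examples):
    """Serialize (label, features) pairs as CSV: label first, then features, no header row."""
    buffer = io.StringIO()
    # Quote text fields and leave numbers bare (an assumption about the expected format).
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for label, features in examples:
        writer.writerow([label, *features])
    return buffer.getvalue()


def read_api_key(path="googlekey"):
    """Read the API key from the 'googlekey' file, stripping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Hypothetical toy examples: a class label followed by two numeric features.
examples = [("tumor", [2.5, 0.1]), ("normal", [0.3, 1.7])]
training_csv = format_training_csv(examples)
print(training_csv)
```

The resulting text is what you would save to a file and upload to your bucket; `read_api_key()` only mirrors the ‘googlekey’ convention used by the bash scripts and does not contact any service.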

I used this walkthrough example about cancer classification from gene expression data to get up to speed on how Google Prediction API works. Now I’m thinking about what data to throw at it next. Perhaps it would be fun to input some Kaggle contest data sets into it as a kind of “Google baseline” predictor? :-)

Far-out stuff

Some science fiction-type nuggets from the past few weeks:

Google does machine learning using quantum computing. Apparently, a “quantum algorithm” called Grover’s algorithm can search an unsorted database in O(√N) time. The Google blog explains this in layman’s terms:

Assume I hide a ball in a cabinet with a million drawers. How many drawers do you have to open to find the ball? Sometimes you may get lucky and find the ball in the first few drawers but at other times you have to inspect almost all of them. So on average it will take you 500,000 peeks to find the ball. Now a quantum computer can perform such a search looking only into 1000 drawers.
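The arithmetic in that quote is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch of the operation counts, not a simulation of the quantum algorithm; the function names are my own, and the π/4 constant is the usual textbook approximation for Grover’s algorithm with a single marked item.

```python
import math


def classical_average_peeks(n_drawers):
    """Expected number of drawers opened when checking them one by one in random order."""
    return (n_drawers + 1) / 2


def grover_iterations(n_drawers):
    """Approximate Grover iteration count for one marked item: about (pi / 4) * sqrt(N)."""
    return math.floor(math.pi / 4 * math.sqrt(n_drawers))


n = 1_000_000
print(classical_average_peeks(n))  # about half of all drawers
print(math.isqrt(n))               # sqrt(N), the order of magnitude quoted above
print(grover_iterations(n))        # same order, with the pi/4 constant included
```

For a million drawers, the classical average comes out at roughly 500,000 peeks, while the square root of N is 1000, which matches the figures in the quote.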

I’ve absolutely no clue how this algorithm works – although I did take an introductory course in quantum mechanics many a moon ago, I’ve forgotten everything about it and the course probably didn’t go deep enough to explain it anyway. Google are apparently collaborating with a Canadian company called D-Wave, who develop hardware for realizing something called a “quantum adiabatic algorithm” by “magnetically coupling superconducting loops”. It is interesting that D-Wave are explicitly focusing on machine learning; the home page states that “D-Wave is pioneering the development of a new class of high-performance computing system designed to solve complex search and optimization problems, with an initial emphasis on synthetic intelligence and machine learning applications.”

Speaking of synthetic intelligence, the winter issue of H+ Magazine contains an article by Ben Goertzel where he discusses the possibility that the first artificial general intelligence will arise in China. The well-known AI researcher Hugo de Garis, who runs a lab in Xiamen in China, certainly believes that this will happen. In his words:

China has a population of 1.3 billion. The US has a population of 0.3 billion. China has averaged an economic growth rate of about 10% over the past 3 decades. The US has averaged 3%. The Chinese government is strongly committed to heavy investment into high tech. From the above premises, one can virtually prove, as in a mathematical theorem, that China in a decade or so will be in a superior position to offer top salaries (in the rich Southeastern cities) to creative, brilliant Westerners to come to China to build artificial brains — much more than will be offered by the US and Europe. With the planet’s most creative AI researchers in China, it is then almost certain that the planet’s first artificial intellect to be built will have Chinese characteristics.

Some other arguments in favor of this idea mentioned in the article are that “One of China’s major advantages is the lack of strong skepticism about AGI resulting from past failures” and that China “has little of the West’s subliminal resistance to thinking machines or immortal people”.

(By the way, the same issue contains a good article by Alexandra Carmichael on subjects frequently discussed on this blog. The most fascinating detail from that article, to me, was when she mentions “self-organized clinical trials”; apparently users of PatientsLikeMe with ALS had set up their own virtual clinical trial where some of them started to take lithium and some didn’t, after which the outcomes were compared.)

Finally, I thought this methodology for tagging images with your mind was pretty neat. This particular type of mind reading does not seem to have reached a high specificity and sensitivity yet, but that will improve in time.

Predictions from Google search data

Google has started reporting some interesting findings about predictions based on web search data. I would guess that these things have been in the works for several years before Google went public with them.

Last year, they introduced Google Flu Trends, which basically monitors influenza-related searches and tries to predict outbreaks early by identifying geographical locations that are suddenly showing a strong increase in such searches. An article describing the system was even published in the very high-profile scientific journal Nature. (Later, people started to use Twitter for flu monitoring.)
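To illustrate the general idea of spotting a sudden increase per location, here is a toy Python sketch. It is emphatically not Google’s method, which is described in the Nature article; it just flags a region when its latest weekly count exceeds the mean of the earlier weeks by a few standard deviations. The function name, the threshold and the counts are all invented for illustration.

```python
from statistics import mean, stdev


def flag_spikes(weekly_counts, threshold=3.0):
    """Return regions whose latest count exceeds baseline mean + threshold * stdev."""
    flagged = []
    for region, counts in weekly_counts.items():
        # Earlier weeks form the baseline; the final entry is the week being checked.
        baseline, latest = counts[:-1], counts[-1]
        limit = mean(baseline) + threshold * stdev(baseline)
        if latest > limit:
            flagged.append(region)
    return flagged


# Invented toy numbers, purely for illustration (not real search data).
weekly_counts = {
    "Region A": [100, 104, 98, 102, 180],
    "Region B": [50, 52, 49, 51, 53],
}
print(flag_spikes(weekly_counts))
```

With these made-up numbers only “Region A” is flagged, since its last week jumps far above its baseline while “Region B” stays within normal variation.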

Lately, the official Google research blog has started to write about the possibilities of using Google search data to predict economic variables in the short term. A recent analysis of theirs, based on claims for unemployment benefits in the U.S., seems to suggest that the U.S. economy is recovering.

From the blog post:

One of the strongest leading indicators of economic activity is the number of people who file for unemployment benefits. Macroeconomists Robert Gordon and James Hamilton have recently examined the historical evidence. According to Hamilton’s summary: “…in each of the last six recessions, the recovery began within 8 weeks of the peak in new unemployment claims.”

Let’s see if the prediction comes true!

The analysis is described in more detail in this paper.
