Follow the Data

A data driven blog

Archive for the day “February 25, 2012”

New analysis competitions

Some interesting competitions in data analysis / prediction:

Kaggle is managing this year’s KDD Cup, which will be about Weibo, China’s rough equivalent to Twitter (with its stronger support for adding pictures and commenting on posts, it is perhaps closer to a hybrid of Twitter and Facebook). There will be two tasks: (1) predicting which users a given user will follow (all data being anonymized, of course), and (2) predicting click-through rates in online computational advertising systems. According to Gordon Sun, chief scientist at Tencent (the company behind Weibo), the data set to be used is the largest ever released for competition purposes.

CrowdAnalytix, an India-based company with a business idea similar to Kaggle’s, has started a fun quickie competition about sentiment mining. Actually, the competition may already be over, as it ran for just 9 days starting February 16. The input consists of comments left by visitors to a major airport in India, and the goal is to identify and compile actionable and/or interesting information, such as what kinds of services visitors feel are missing.

The Clarity challenge is, for me, easily the most interesting of the three, in that it concerns the use of genomic information in healthcare. This challenge (with a prize sum of $25,000) is, in effect, crowdsourcing genomic/medical research (although only 20 teams will be selected to participate). The goal is to identify and report on potential genetic features underlying medical disorders in three children, given the genome sequences of the children and their parents. These genetic features are presently unknown, which is why this competition really represents something new in medical research. I think this is a very nice initiative; in fact, I had thought of starting something similar at the institute where I work, but this challenge is much better than what I had in mind. It will be very interesting to see what comes out of it.

Wolfram Alpha Pro trial

I was pretty intrigued when I read this blog post about the new “Pro” version of Wolfram Alpha, with passages like:

The key idea is automation. The concept in Wolfram|Alpha Pro is that I should just be able to take my data in whatever raw form it arrives, and throw it into Wolfram|Alpha Pro. And then Wolfram|Alpha Pro should automatically do a whole bunch of analysis, and then give me a well-organized report about my data. And if my data isn’t too large, this should all happen in a few seconds.

And what’s amazing to me is that it actually works.

So I signed up for an account; at $5 a month (introductory price), I would have been willing to pay for a few months just to try it out, but as it happens, they also have a free 2-week trial which I duly activated. Now I was looking forward to those cool automatic PCA plots and linear regression auto-magically appearing upon uploading my data …

The first letdown is that there is a 1-megabyte limit on data uploads, so I guess we can safely say that Wolfram Alpha Pro is not a “big data” thing … Joking aside, 1 MB is really not enough to make this service anything more than a toy analytics sandbox; I, for one, would need to subsample practically every dataset I work with just to be able to upload it for analysis.

Still: the output shown in the blog looked cool, so I tried to upload a few files, with the following results:

1. A CSV file that came from screen-scraping some Kaggle leaderboards for a visualization I was planning to do with Joel but which we never bothered to finish. This CSV file is a bit “dirty” (lots of missing values, and some rows are longer than others), so I wasn’t expecting a clean import, but what actually happens is that Wolfram Alpha Pro just gets stuck forever, with a window saying “Processing file.csv.” I’ve tried this three times now with the same result. This is a bit annoying; if the input can’t be parsed cleanly, it would be more useful to get an error message saying that the data could not be processed.
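Incidentally, the kind of raggedness that seems to trip up the import is easy to detect before uploading. A minimal sketch in Python (profile_csv is my own hypothetical helper, not part of any Wolfram tool):

```python
import csv
from collections import Counter

def profile_csv(path):
    """Tally row lengths and count empty fields in a CSV file,
    to spot ragged rows and missing values before uploading."""
    lengths = Counter()  # row length -> number of rows with that length
    empty = 0            # total number of empty fields seen
    with open(path, newline="") as f:
        for row in csv.reader(f):
            lengths[len(row)] += 1
            empty += sum(1 for field in row if field.strip() == "")
    return lengths, empty
```

If lengths ends up with more than one key, the file has ragged rows — which is exactly the situation where a short error message would be more helpful than an endless “Processing” window.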

2. A CSV file from an old data set about gene expression in neural stem cells that I used to work on. This is a perfectly ordinary CSV file with a regular matrix structure, a header line, and no missing values; it can be readily imported into R without errors, for instance (read.csv("file.txt") works fine). However, after uploading, I get the message “Wolfram|Alpha doesn’t know how to interpret your input.”

3. OK, so no luck with the CSV files; let’s try a tab-separated file. This time I tried a table of protein complex abundance values in a certain type of cancer, something we are working on for a paper. It was successfully imported into Wolfram Alpha Pro, but the “analysis” I got completely missed that it was a numeric data set, and only gave me information about word counts, character frequencies, the frequency of capitalized words, etc. The file starts with two comment lines (beginning with “#”, as is the custom), after which all lines are tab-separated, with a numeric ID in the first column, a complex name (which can consist of several words) in the second column, followed by eight columns of numeric values (complex abundances in different sets of tissues). Apparently the second column is enough to throw off the system.
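For what it’s worth, a file with this layout takes only a few lines to parse by hand; here is a sketch in Python, assuming exactly the structure described above (load_table is my own name for it):

```python
import csv

def load_table(path):
    """Parse a tab-separated table whose lines are either '#' comments
    or rows of: numeric ID, name (may contain spaces), numeric values."""
    rows = []
    with open(path, newline="") as f:
        for fields in csv.reader(f, delimiter="\t"):
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and '#' comment lines
            ident, name, values = fields[0], fields[1], fields[2:]
            rows.append((int(ident), name, [float(v) for v in values]))
    return rows
```

Note that the multi-word name column is unambiguous here, because the delimiter is a tab rather than arbitrary whitespace — so it really shouldn’t confuse an importer.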

4. Last try. Let’s give it something more standard: a tab-separated file with a header line, the other lines consisting of a cell ID in the first column followed by all-numeric values. That is, the first column consists of (one-word) IDs; the rest is numeric. (This is from an old project on single-cell gene expression.) This is a very common way to format tables in flat text files. But again, I get a textual analysis with overrepresented words, etc. I guess I need to remove the ID column. (*removing the ID column*) No, it didn’t help; I still get the textual analysis, although everything except the first line (the column names) is numeric.
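To spell out just how regular this layout is: a complete loader for it fits in a few lines of Python (load_matrix is my own hypothetical name, assuming the header-plus-ID-column format just described):

```python
def load_matrix(path):
    """Load a tab-separated table with a header line and an ID column
    into (column names, dict mapping row ID -> list of floats)."""
    with open(path) as f:
        # First header field labels the ID column; keep the rest.
        columns = f.readline().rstrip("\n").split("\t")[1:]
        data = {}
        for line in f:
            fields = line.rstrip("\n").split("\t")
            data[fields[0]] = [float(v) for v in fields[1:]]
    return columns, data
```

A table that round-trips through a loader like this without a single error is about as clean as flat-file data gets, which makes the word-count analysis all the more puzzling.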

I’m sure I’m doing something wrong, but the point is I shouldn’t need to worry about these things, given what the product claims to be able to do …

Still looking forward to exploring Wolfram Alpha Pro when I’ve figured out what formats it can work with!

P.S. Follow the Data will be launching a podcast series in a few weeks – stay tuned! We’re very excited about that.
