Follow the Data

A data driven blog

Archive for the tag “data-mining”

MetaOptimize Q+A

The MetaOptimize site has a nice StackOverflow-style question & answer community dedicated to machine learning, data mining, natural language processing and the like. It seems to have gotten off to a nice start. Here, you can enquire about things like the best freely available machine learning textbooks or how to set up Hadoop on your office machine, or more technical details such as whether subsampling biases the ROC-AUC score.

TR personalized medicine briefing

MIT’s Technology Review magazine has a briefing on personalized medicine. It’s worth a look, although it’s quite heavily tilted towards DNA sequencing technology (which I am interested in, but there is a lot more to personalized medicine). Not surprisingly, one of the articles in the briefing makes the point that the biggest bottleneck in personalized medicine will be data analysis, the risk being that “…we will end up with a collection of data … unable to predict anything.” (As an aside, I would be moderately wealthy if I had a euro for each time I’d read the phrase “drowning in data”, which appears in the article heading. I think I even rejected that as a name for this blog. It would be nice to see someone come up with a fresh alternative verb to “drowning” …)

Technology Review also has a piece on how IBM has started to put their mathematicians to work in business analytics. They mention a neat technique I hadn’t been aware of: “…they used a technique called high-quantile modeling–which tries to predict, say, the 90th percentile of a distribution rather than the mean–to estimate potential spending by each customer and calculate how much of that demand IBM could fulfill“.
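To make the idea of predicting a high quantile rather than the mean a bit more concrete, here is a minimal sketch. It assumes the standard "pinball" (quantile) loss, which is one common way to do quantile modeling; the article does not say which method IBM actually used. The spending figures are synthetic and every name in the code is made up for illustration.

```python
import random

def pinball_loss(values, guess, q):
    """Average pinball (quantile) loss of a single constant guess at level q."""
    total = 0.0
    for v in values:
        diff = v - guess
        # Guessing too low costs q per unit; guessing too high costs (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(values)

def squared_loss(values, guess):
    """Average squared error of a single constant guess."""
    return sum((v - guess) ** 2 for v in values) / len(values)

def best_constant(values, loss):
    """Brute force: return the observed value with the lowest loss."""
    return min(values, key=loss)

rng = random.Random(0)
# Hypothetical, skewed "potential spending" figures (not IBM's data).
spending = [rng.expovariate(1 / 100.0) for _ in range(500)]

mean_guess = best_constant(spending, lambda g: squared_loss(spending, g))
q90_guess = best_constant(spending, lambda g: pinball_loss(spending, g, 0.9))
share_below = sum(v <= q90_guess for v in spending) / len(spending)

print(f"squared loss picks {mean_guess:.1f}, pinball loss (q=0.9) picks {q90_guess:.1f}")
print(f"share of customers at or below the q=0.9 guess: {share_below:.3f}")
```

Minimizing squared error pulls the guess towards the mean, while the asymmetric pinball loss pushes it up until roughly 90% of the observations lie below it. A real quantile model would replace the constant guess with a function of customer features.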

The last part of the article talks about a very interesting problem: how to model a system where output from the model itself affects the system, or as the article puts it “…situations where a model must incorporate behavioral changes that the model itself has inspired“. I’m surprised the article doesn’t mention the obvious applicability of this to the stock market, where of course thousands of professional and amateur data miners use prediction models (their own and others’) to determine how they buy and sell stocks. Instead, its example comes from traffic control:

For example, […] a traffic congestion system might use messages sent to GPS units to direct drivers away from the site of a highway accident. But the model would also have to calculate how many people would take its advice, lest it end up creating a new traffic jam on an alternate route.
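The traffic example can be turned into a toy calculation. The sketch below is only an illustration under invented assumptions: two routes, a linear congestion curve, made-up capacities, and a single "compliance" number for the share of advised drivers who actually divert. None of this comes from the article.

```python
def travel_time(free_flow_minutes, cars, capacity):
    """Toy congestion curve: travel time grows linearly with load."""
    return free_flow_minutes * (1.0 + cars / capacity)

def route_times(advised_fraction, compliance, total_cars=1000):
    """Travel times (main, alternate) when a share of drivers is told to divert."""
    diverted = total_cars * advised_fraction * compliance
    staying = total_cars - diverted
    main = travel_time(30.0, staying, capacity=400)      # capacity reduced by the accident
    alternate = travel_time(20.0, diverted, capacity=500)
    return main, alternate

def best_advice(compliance, steps=100):
    """Grid search for the advised fraction that minimizes the slower route."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda f: max(route_times(f, compliance)))

# A model that ignores its own influence might simply tell everyone to divert.
naive_main, naive_alternate = route_times(advised_fraction=1.0, compliance=1.0)
print("advise everyone:", naive_main, naive_alternate)

for compliance in (1.0, 0.8, 0.5):
    print(compliance, best_advice(compliance))
```

The point of the sketch is that the best message depends on how many people are expected to follow it: if everyone complies and everyone is advised, the jam just moves to the alternate route.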

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the comprehensive knowledge archive network, which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data“. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way compared to other services like the Amazon Open Data Sets, for instance.

Companies specializing in helping users make sense of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs Data Applied’s file size of <100 kb for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.

Mining data streams, the web, and the climate

I recently came across MOA (Massive Online Analysis), an environment for what its developers call massive data mining, or data stream mining. This New Zealand-based project is related to Weka, a Java-based framework for machine learning which I’ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. The full data set will therefore typically be many times larger than your computer’s memory (which already rules out some commonly used algorithms), and each example can only be briefly examined once, after which it is discarded. As a result, the statistical model has to be updated incrementally, and often must be ready to be applied at any point between training examples.
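As a rough illustration of the stream setting described above (this is not MOA's API, which is Java-based), here is a minimal Python sketch of a model that sees each example once, updates itself incrementally, and can be asked for a prediction at any time. The class, the method names and the synthetic stream are all assumptions made for the example.

```python
import random

class OnlineLinearModel:
    """One-feature linear model trained by stochastic gradient descent."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.seen = 0

    def predict_one(self, x):
        return self.weight * x + self.bias

    def learn_one(self, x, y):
        # Single gradient step on the squared error of this one example.
        error = y - self.predict_one(x)
        self.weight += self.learning_rate * error * x
        self.bias += self.learning_rate * error
        self.seen += 1

def example_stream(n, seed=0):
    """Hypothetical stream: y = 2x + 1 plus noise, yielded one example at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)

model = OnlineLinearModel()
for x, y in example_stream(5000):
    model.learn_one(x, y)  # the example is not stored anywhere after this call
    if model.seen == 100:
        early_prediction = model.predict_one(0.5)  # the model is usable mid-stream

print(model.weight, model.bias)
```

Memory use stays constant no matter how long the stream runs, because only the two parameters and a counter are kept.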

I also came across a press release describing version 2.0 of KnowledgeMiner for Excel, a data mining package apparently used by customers like Pfizer, NASA and Boeing, and based on GMDH (Group Method of Data Handling), a paradigm I hadn’t heard of before. I failed to install KnowledgeMiner for Excel on my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, making it easier to use for non-experts in data mining. (Of course I haven’t tried it, so it’s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:

The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to this link to see the climate change data displayed graphically in a slideshow through the year 2020:

There’s also an interesting new toolkit for web mining from BixoLabs. They’ve built what they call an elastic web mining platform in Amazon’s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and from the tutorials on the site, it seems pretty easy to set it up to crawl the web to your heart’s content.


New Technique Identifies Versions Of The Same Song. I’d love to see this evaluated on Jamaican dancehall/reggae versions (songs sharing the same riddim). They tested their algorithm on “Day Tripper” as performed by the Beatles and another band, but how would it have fared on the Diwali riddim album, for instance?

New Emergency Medical Service System May Predict Caller’s Fate. A paper in BMC Emergency Medicine shows that a computer algorithm is able to predict the patient’s risk of dying at the time of the emergency call. The system described in the paper is used in Yokohama and was tested from 1st October 2008 until 31st March 2009, collecting information from over 60,000 emergency calls. The system can be used to prioritise ambulance responses according to the severity of the patient’s condition.

The Pentagon is performing massive data mining on teens and young adults for military recruitment purposes.

Deep data dive: Swine flu prompts big sales of hand sanitizers. A company called Panjiva is analyzing global trade data to extract real-time information about changes in supply and demand.

This & that

Netflix prize soon to be awarded

A few years back, an online movie rental site, Netflix, promised a million dollars to anyone who could improve their recommendation algorithm by a certain percentage (10%). In the same way that Amazon recommends books you might be interested in, Netflix recommends movies to users.

The Netflix prize, as the challenge was called, turned out to be more important for data mining and related fields than I, at least, had anticipated. Many teams from all over the world put in enormous efforts to shave off the last percentage points needed to cross the finish line. At times, it was thought that the teams would be doomed to approaching the cut-off point asymptotically, never actually crossing it.

Last week, though, a team called BellKor did achieve the target percentage. That does not yet make them the winners, as other teams still have the chance to submit a better model before July 26.  Still, it is likely that they will win. At any rate, someone will have won, which is significant.

Although I have not followed the competition in much depth, I believe some of the lessons learned were about the surprising power of using multiple – and I mean many – predictive models. Some of the best predictors were linear combinations of over a hundred different kinds of models.
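A linear combination of predictors is easy to sketch. The example below blends just two hypothetical models with least-squares weights fitted on one half of some synthetic ratings and evaluated on the other half. It is a toy under stated assumptions (independent errors, invented data), not a description of what any of the competing teams actually did.

```python
import math
import random

def rmse(predictions, targets):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

def fit_blend(pred_a, pred_b, targets):
    """Least-squares weights (w_a, w_b) for w_a * pred_a + w_b * pred_b ~ targets."""
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * y for a, y in zip(pred_a, targets))
    sby = sum(b * y for b, y in zip(pred_b, targets))
    det = saa * sbb - sab * sab  # 2x2 normal equations, solved by Cramer's rule
    return (say * sbb - sab * sby) / det, (saa * sby - sab * say) / det

rng = random.Random(0)
# Synthetic "true" ratings and two imperfect models with independent errors.
truth = [rng.uniform(1.0, 5.0) for _ in range(2000)]
model_a = [t + rng.gauss(0.0, 1.0) for t in truth]
model_b = [t + rng.gauss(0.0, 1.0) for t in truth]

half = len(truth) // 2
w_a, w_b = fit_blend(model_a[:half], model_b[:half], truth[:half])  # fit on first half

blend = [w_a * a + w_b * b for a, b in zip(model_a[half:], model_b[half:])]
rmse_a = rmse(model_a[half:], truth[half:])
rmse_b = rmse(model_b[half:], truth[half:])
rmse_blend = rmse(blend, truth[half:])  # evaluate on the held-out second half
print(rmse_a, rmse_b, rmse_blend)
```

Because the two models make different mistakes, their weighted average cancels part of the error. With a hundred models the same idea applies, only with a larger system of equations to solve.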

In particular, it turned out that methods based on “latent factors” in the data – regularities that can be fished out using so-called matrix factorization methods such as SVD or NMF – were very powerful tools for this application, especially when combined with “neighbourhood-based” methods, which basically make predictions by assuming that similar users (based on how they have previously rated films) will like similar films, or conversely, that films that have been similarly rated by different users will tend to be similar.
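For the latent-factor half of that picture, here is a minimal sketch of matrix factorization trained by stochastic gradient descent on a tiny invented rating table. It covers only the latent-factor idea (not the neighbourhood methods), and the function names, hyperparameters and data are assumptions chosen for illustration rather than anything taken from the prize entries.

```python
import math
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=200, seed=0):
    """Factorize ratings as mu + P[u] . Q[i] using SGD with L2 regularization."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def predict(model, u, i):
    mu, P, Q = model
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_rmse(model, ratings):
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Invented data: users 0-1 like films 0-1, users 2-3 like films 2-3.
# Four (user, film) pairs are deliberately left out.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 0, 1), (3, 1, 1), (3, 3, 5),
]

untrained = train_mf(ratings, n_users=4, n_items=4, epochs=0)
trained = train_mf(ratings, n_users=4, n_items=4)
rmse_before = train_rmse(untrained, ratings)
rmse_after = train_rmse(trained, ratings)
print(rmse_before, rmse_after)
print("prediction for an unseen pair (user 0, film 3):", predict(trained, 0, 3))
```

Each user and each film gets a short vector of learned numbers, and a rating is predicted from how well the two vectors line up. Those vectors are the "latent factors".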

One of the members of the winning team BellKor, Yehuda Koren, has recently published an interesting paper which outlines strategies for the hard problem of accounting for preferences that change over time. Elements of these are likely to have been included in the prediction model that crossed the 10% line.

There is a lot to chew on in the paper for the more technically inclined, but I would just like to mention a simple and interesting trend Koren found in the Netflix data: that older movies tend to get higher ratings than newer ones.

Koren sets up two hypotheses that could explain this phenomenon:

either, the customers will more readily choose to rent a new movie simply based on novelty value, while they would only choose an older movie after a careful selection process, which would lead to a greater likelihood of enjoying the movie,

or, old movies are just better!

I can think of a couple of other possible explanations as well, such as nostalgia for movies seen a long time ago, but anyway Koren goes on to compare these two hypotheses using the statistics in the Netflix database. By interpreting parameters in the statistical model he has set up, he concludes that the first explanation (a more careful selection process for old movies) is the more likely one.

Reverse engineering social security numbers

The latest issue of PNAS (Proceedings of the National Academy of Sciences of the United States of America; a well-known scientific journal) contains two interesting pieces of statistical analysis. Luckily, they are both freely downloadable even if you don’t have access to a subscription.

Predicting Social Security numbers from public data claims that US Social Security numbers (SSNs), which are supposed to be confidential, are in fact predictable to a certain extent, at least for younger people, given information such as birth date and location. Basically, the authors (from Carnegie Mellon University) have tried to reverse-engineer the SSN assignment process using available information about this process, including the so-called SSA Death Master File, which is publicly available and contains data about SSN assignments for people who have been reported as dead.

The authors detected various correlations between e.g. date of birth and all nine digits of the SSN, and eventually (after much visual inspection and several rounds of model refinement) constructed a regression model for predicting digits in an SSN based on birth date. They managed to correctly predict the SSN of 8.5% of deceased individuals in fewer than 1,000 tries.

Naturally, this suggests possibilities for e.g. identity theft and raises the question of whether social security numbers should be replaced by something else.

Another study in the latest PNAS, NIH funding trajectories and their correlations with US health dynamics from 1950 to 2004, suggests that funding of research relating to certain diseases leads to a time-lagged decrease in deaths due to those diseases – in other words, the research appears to pay off, but only after a delay. In order to do their analysis, the authors compiled data on NIH (the US National Institutes of Health) funding starting in 1937 and compared it to mortality data for cardiovascular disease, stroke, cancer, and diabetes.
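A simple way to look for a time-lagged relationship of this kind is to correlate one series with a shifted copy of the other. The sketch below does that on purely synthetic year-to-year changes with a built-in five-year delay. It is not the method used in the PNAS paper, and the series, the lag and all names are invented for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lagged_correlation(cause, effect, lag):
    """Correlate cause[t] with effect[t + lag]."""
    if lag == 0:
        return pearson(cause, effect)
    return pearson(cause[:-lag], effect[lag:])

rng = random.Random(0)
TRUE_LAG = 5  # assumed delay, in years, built into the synthetic data

# Synthetic yearly changes: more funding now means fewer deaths TRUE_LAG years later.
funding_change = [rng.gauss(0.0, 1.0) for _ in range(60)]
deaths_change = [
    (-0.8 * funding_change[t - TRUE_LAG] if t >= TRUE_LAG else 0.0) + rng.gauss(0.0, 0.3)
    for t in range(60)
]

correlations = {
    lag: lagged_correlation(funding_change, deaths_change, lag) for lag in range(11)
}
strongest_lag = min(correlations, key=correlations.get)  # most negative correlation
print(strongest_lag, round(correlations[strongest_lag], 2))
```

Working with year-to-year changes rather than raw levels matters here, because two series that both trend over time will correlate strongly at almost any lag.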
