Follow the Data

A data driven blog

Archive for the month “February, 2013”

“The secret of the big guys” (from FastML)

FastML has an intriguing post called The secret of the big guys, which starts thus:

Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods, you can get better results than from a random forest. And maybe even faster.

It definitely sounds worth trying, and the blog post links to some papers co-authored by Andrew Ng. I haven't had time to try out Sofia-ML yet, or to implement this approach by hand, but I'll definitely give it a shot once I no longer have my hands full.
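The recipe from those papers, learning features with K-means and then fitting a linear model on the cluster encodings, can be sketched in plain NumPy. Everything below (the toy two-blob data, the "triangle" encoding from Coates et al., the least-squares linear fit) is my own minimal illustration of the idea, not code from the post or from Sofia-ML:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=25):
    """Plain Lloyd's algorithm; returns k cluster centroids."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def triangle_features(X, centers):
    """'Triangle' encoding from Coates et al.: max(0, mean_dist - dist)."""
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

# Toy two-class data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

centers = kmeans(X, k=8)
Z = triangle_features(X, centers)

# Linear model on the learned features: least-squares fit to +/-1 targets
A = np.c_[Z, np.ones(len(Z))]
w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
acc = ((A @ w > 0).astype(int) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

On real data the clustering would be run on (patches of) the raw features and the linear model trained with a proper classifier, but the two-stage structure is the same.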


Cost-sensitive learning

I have been looking at a binary classification problem (basically a problem where you are supposed to predict “yes” or “no” based on a set of features) where the cost of misclassifying a “yes” as a “no” is much more expensive than misclassifying a “no” as a “yes”.

Searching the web for hints about how to approach this kind of scenario, I discovered that there are methods explicitly designed for it, such as MetaCost [pdf link] by Domingos and cost-sensitive decision trees, but I also learned that a couple of very general relationships apply to a scenario like mine.

In this paper (pdf link) by Ling and Sheng, it is shown that if your (binary) classifier can produce a posterior probability estimate for predicting (e.g.) "yes" on a test example, then you can make that classifier cost-sensitive simply by choosing the classification threshold (often taken as 0.5 in cost-insensitive classifiers) as p_threshold = FP / (FP + FN), where FP is the cost of a false positive and FN is the cost of a false negative. Equivalently, one can "rebalance" the training set by sampling "yes" and "no" examples in proportion to their misclassification costs, so that p_threshold becomes 0.5. In other words, the prior probabilities of the two classes and the misclassification costs are interchangeable.

So in principle, one could manipulate either the classification threshold or the training set proportions to get a cost-sensitive classifier. Of course, further adjustment may be needed if the classifier you are using does not produce explicit probability estimates.
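The rebalancing route can be sketched just as easily. With a cost ratio r = FN / FP, replicating each "yes" example r times makes the usual 0.5 threshold line up with the cost-sensitive one; the class counts below are made up for illustration:

```python
# Rebalancing instead of thresholding: oversample the costly class
# by the cost ratio. Toy training set: 3 "yes" vs 30 "no" examples.
cost_fp, cost_fn = 1.0, 10.0
r = int(cost_fn / cost_fp)          # oversampling factor for "yes"

train = [("yes", i) for i in range(3)] + [("no", i) for i in range(30)]
rebalanced = [ex for ex in train
              for _ in range(r if ex[0] == "yes" else 1)]

n_yes = sum(lbl == "yes" for lbl, _ in rebalanced)
n_no = sum(lbl == "no" for lbl, _ in rebalanced)
print(n_yes, n_no)
```

In practice one would often pass per-class weights to the learner rather than physically duplicating rows, but the effect on the learned prior is the same.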

The paper is worth reading in full as it shows clearly why these relationships hold and how they are used in various cost-sensitive classification methods.

P.S. This is of course very much related to the imbalanced class problem that I wrote about in an earlier post, but at that time I was not thinking that much about the classification-cost aspect yet.

Industrial postdoc position in genomics & big data (Stockholm)

There is an interesting postdoc position available at AstraZeneca in Mölndal, but located at SciLifeLab in Stockholm. This (bioinformatics) position is about next generation sequencing and data integration, with a definite “big data” slant from what I have heard. Check it out!

