Mining data streams, the web, and the climate
I recently came across MOA (Massive Online Analysis), an environment for what its developers call massive data mining, or data stream mining. This New Zealand-based project is related to Weka, a Java-based machine learning framework that I’ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. As a result, the full data set will typically be many times larger than your computer’s memory (which already rules out some commonly used algorithms), and each example can be examined only once, briefly, before it is discarded. The statistical model therefore has to be updated incrementally, and it often must be ready to be applied at any point between training examples.
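To make those constraints a bit more concrete, here is a minimal sketch in plain Python. It is not MOA's API and not how MOA implements anything; the class `OnlineLogisticRegression`, the helper `prequential_accuracy` and the synthetic stream are all made up for illustration. The idea is simply that each example is used for one prediction and one small model update, then thrown away, and that the model can be asked for a prediction at any point in between. I've assumed a "test-then-train" style of evaluation, where each example is first used to test the current model and only then to train it.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical incremental learner: logistic regression updated by SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.n_seen = 0  # number of training examples processed so far

    def predict_proba(self, x):
        """Probability of class 1; usable at any time, even before training."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so math.exp() cannot overflow
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Update the model from a single example, which is not stored."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        self.n_seen += 1


def synthetic_stream(n_examples, seed=0):
    """Made-up data source: yields (features, label) pairs one at a time."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1 if x[0] + x[1] > 0 else 0
        yield x, y


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each example first, then learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        prediction = 1 if model.predict_proba(x) >= 0.5 else 0
        correct += int(prediction == y)
        total += 1
        model.learn_one(x, y)  # after this call the example is discarded
    return correct / total if total else 0.0


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=2)
    accuracy = prequential_accuracy(model, synthetic_stream(5000))
    print(f"examples seen: {model.n_seen}, accuracy: {accuracy:.3f}")
```

Memory use here depends only on the number of features, not on the length of the stream, which is the property that matters when the data won't fit in memory. A real stream learner would also have to cope with things this toy ignores, such as the underlying distribution changing over time.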
I also came across a press release describing version 2.0 of KnowledgeMiner for Excel, a data mining package apparently used by customers like Pfizer, NASA and Boeing. It is based on GMDH (Group Method of Data Handling), a paradigm I hadn’t heard of before. I failed to install KnowledgeMiner for Excel on my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, which should make it easier to use for non-experts in data mining. (Of course, I haven’t tried it, so it’s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:
“The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to this link to see the climate change data displayed graphically in a slideshow through the year 2020:”
There’s also an interesting new toolkit for web mining from BixoLabs. They’ve built what they call an elastic web mining platform on Amazon’s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and judging from the tutorials on the site, it seems pretty easy to set up to crawl the web to your heart’s content.