Follow the Data

A data driven blog

Archive for the tag “web-mining”

1.2 zettabytes of data

OK, so I was a bit slow to discover this, but The Economist has a special report on big data which is freely available online. That is, the individual articles are free, and a PDF compiling them is supposed to cost 3 GBP, but I was able to download it for free here without doing anything special.

A fun fact that I learned from this report is that the total amount of information in the world this year is projected to reach 1.2 ZB (zettabytes), which is 1.2×10^21 bytes. How on earth did they come up with that figure…? Anyway, this report is worth a read, as it touches on things like business analytics, web mining, open government data and augmented cognition, while also giving some well-deserved love to R and open source software.


Mining data streams, the web, and the climate

I recently came across MOA (Massive Online Analysis), an environment for what its developers call massive data mining, or data stream mining. This New Zealand-based project is related to Weka, a Java-based framework for machine learning which I’ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. The full data set will therefore typically be many times larger than your computer’s memory (which already rules out some commonly used algorithms), and each example can only be briefly examined once, after which it is discarded. As a result, the statistical model has to be updated incrementally, and often must be ready to make predictions at any point between training examples.
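To make the "examine each example once, update incrementally" idea a bit more concrete, here is a minimal sketch in Python. To be clear, this is not MOA's API (MOA is a Java environment) and not necessarily how its learners work internally. It is a plain online logistic regression fed from a generator, and every name in it (`synthetic_stream`, `OnlineLogisticRegression`, `prequential_accuracy`) is made up for illustration. The toy data and the test-then-train evaluation loop are my own assumptions as well.

```python
import math
import random
from typing import Iterator, List, Tuple

Example = Tuple[List[float], int]


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Example]:
    """Yield n labelled examples one at a time (hypothetical toy data).

    The generator never holds more than one example in memory,
    which mimics a stream that is too large to store.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = 1 if x[0] + x[1] > 0 else 0
        yield x, y


class OnlineLogisticRegression:
    """Logistic regression trained by one gradient step per example."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: List[float]) -> float:
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x: List[float]) -> int:
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x: List[float], y: int) -> None:
        """Update the model from a single example, which is then discarded."""
        error = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


def prequential_accuracy(model: OnlineLogisticRegression,
                         stream: Iterator[Example]) -> float:
    """Test-then-train: predict on each example first, then learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # model is usable at any point
        model.learn_one(x, y)                  # single pass, constant memory
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=2)
    accuracy = prequential_accuracy(model, synthetic_stream(5000))
    print(f"prequential accuracy: {accuracy:.3f}")
```

The point of the design is that memory use stays constant no matter how long the stream runs: the model only keeps its weights, and each example is thrown away right after `learn_one` has seen it.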

I also came across a press release describing version 2.0 of KnowledgeMiner for Excel, a data mining package apparently used by customers like Pfizer, NASA and Boeing, and which is based on GMDH (Group Method of Data Handling), a paradigm I hadn’t heard about before. I failed to install KnowledgeMiner for Excel on my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, making it easier to use for non-experts in data mining. (Of course I haven’t tried it, so it’s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:

The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to this link to see the climate change data displayed graphically in a slideshow through the year 2020:

There’s also an interesting new toolkit for web mining from BixoLabs. They’ve built what they call an elastic web mining platform in Amazon’s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and from the tutorials on the site, it seems pretty easy to set it up to crawl the web to your heart’s content.
