## Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link –

- Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the world’s most accurate comprehensive trademark search.
- alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
- Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
- Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping on the “data science” train, as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML. It is apparently trying to tackle a problem that many have thought about or tried to solve, but that has proven very difficult: providing a user-friendly, general solution for building predictive models from a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit). As this blog post puts it, *“the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper)*.”
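
To make the "goes beyond traditional correlation" point concrete, here is a rough Python sketch. It is not the MIC/MINE algorithm from the paper. It only contrasts plain Pearson correlation with mutual information computed on a fixed equal-width grid, which is my own simplification (MIC, as I understand it, searches over many grids and normalizes the result). The function names and the choice of five bins are assumptions made for the illustration.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def binned_mutual_information(xs, ys, bins=5):
    """Mutual information in bits after equal-width binning.

    A crude stand-in for a dependence measure, NOT the MIC statistic.
    Assumes neither variable is constant.
    """
    def to_bins(vs):
        lo, hi = min(vs), max(vs)
        width = (hi - lo) / bins
        # Clamp the maximum value into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in vs]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# A symmetric parabola: y depends completely on x, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
parabola = [x * x for x in xs]
line = [2 * x + 1 for x in xs]

print("Pearson, line:    ", pearson(xs, line))
print("Pearson, parabola:", pearson(xs, parabola))
print("Binned MI, parabola (bits):", binned_mutual_information(xs, parabola))
```

For the parabola, Pearson comes out at essentially zero even though y is fully determined by x, while the binned mutual information is clearly positive. That gap is the kind of thing a measure like MIC is designed to close.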

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think). Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing”: you generate “decoy plots” from the null hypothesis distribution and let a test person try to pick out the actual observed data among the decoys. The procedure has been implemented in an R package called *nullabor*. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

> Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.
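
The "proportion of innocents who look more guilty than the accused" can be computed quite literally. Below is a minimal Python sketch of a permutation test. It is a numerical cousin of the idea, not the visual lineup procedure from the paper and not the *nullabor* API. The test statistic (absolute difference in group means), the function name and the toy numbers are all my own assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_innocents=2000, seed=0):
    """Share of 'innocents' that look at least as guilty as the accused.

    The accused is the observed split into two groups. Innocents are made
    by shuffling the group labels, which is what the null hypothesis
    (no difference between groups) says we are allowed to do.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))  # evidence against the accused
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_guilty = 0
    for _ in range(n_innocents):
        rng.shuffle(pooled)  # one draw from the null distribution
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            at_least_as_guilty += 1
    return at_least_as_guilty / n_innocents


# Made-up measurements, purely for illustration.
treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.9]
control = [4.2, 4.4, 3.9, 4.6, 4.1, 4.3, 4.8, 4.0]

print("p-value, treated vs control:", permutation_p_value(treated, control))
print("p-value, treated vs itself: ", permutation_p_value(treated, treated))
```

I count ties as "at least as guilty" (`>=`), which is the usual conservative choice. The visual version in the paper replaces the numeric statistic with a human observer who tries to spot the real plot among the decoys.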

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.
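
To project clusterings into a space, you first need some notion of how far apart two clusterings are. I have not said above which measure the paper uses, so the sketch below swaps in the classic Rand index purely as an illustration: the fraction of item pairs on which two clusterings agree (both put the pair together, or both keep it apart). The function name and the toy labelings are assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that two clusterings treat the same way."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Three toy clusterings of the same six documents.
method_1 = [0, 0, 1, 1, 2, 2]
method_2 = [1, 1, 0, 0, 2, 2]  # same grouping as method_1, different label names
method_3 = [0, 0, 0, 1, 1, 1]  # a genuinely different grouping

clusterings = {"method_1": method_1, "method_2": method_2, "method_3": method_3}
for (name_a, a), (name_b, b) in combinations(clusterings.items(), 2):
    print(name_a, name_b, "distance:", 1 - rand_index(a, b))
```

A matrix of such pairwise distances (here `1 - rand_index`) is the kind of input you could hand to a multidimensional scaling routine to get a 2D map of clustering methods.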

Mikael, thanks for your mention of and interest in BigML. Your ‘machine learning for regular people’ hits the nail on the head!

Looking forward to your response and feedback once we get started.

Great blog Mikael! HPCC Systems just released their Machine Learning library – a set of algorithms to assist with business intelligence including fully parallel machine learning routines. The HPCC Platform is an open source massive parallel-processing computing platform that solves Big Data problems. Learn more http://hpccsystems.com/ml