<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Follow the Data</title>
	<atom:link href="http://followthedata.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://followthedata.wordpress.com</link>
	<description>A data driven blog</description>
	<lastBuildDate>Mon, 16 Jan 2012 10:33:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='followthedata.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Follow the Data</title>
		<link>http://followthedata.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://followthedata.wordpress.com/osd.xml" title="Follow the Data" />
	<atom:link rel='hub' href='http://followthedata.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Hello 2012!</title>
		<link>http://followthedata.wordpress.com/2012/01/15/hello-2012/</link>
		<comments>http://followthedata.wordpress.com/2012/01/15/hello-2012/#comments</comments>
		<pubDate>Sun, 15 Jan 2012 23:54:00 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Companies]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[bigml]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[sweden]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=789</guid>
		<description><![CDATA[The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I&#8217;ll recap the additions here so you don&#8217;t have to click on that link - Markify is a service that searches a large set of databases for registered trademarks that are similar, in [...]]]></description>
			<content:encoded><![CDATA[<p>The first blog post of the new year. I made some updates to the <a href="http://followthedata.wordpress.com/2011/03/21/big-data-companies-in-sweden/">Swedish big data company list</a> from last year. I&#8217;ll recap the additions here so you don&#8217;t have to click on that link -</p>
<ul>
<li><a href="http://www.markify.com">Markify</a> is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query &#8211; like a name you have thought up for your next killer startup. As described on the company&#8217;s <a href="http://www.markify.com/why-use-an-availability-search.html">website</a>, determining similarity is not that clear-cut, so (according to <a href="http://www.arcticstartup.com/2012/01/13/markify-protects-your-startups-trademark-for-free-plus-tips-on-picking-your-next-companys-name">this write-up</a>) they have adopted a data-driven strategy where they train their algorithm on &#8220;actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.&#8221; They claim it&#8217;s the world&#8217;s most accurate comprehensive trademark search.</li>
<li><a href="http://www.crunchbase.com/company/alatest">alaTest</a> compiles, analyzes and rates product reviews to help customers select the most suitable product for them.</li>
<li><a href="http://www.intellus.se/">Intellus</a> is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an <a href="http://www.xjobb.nu/Formedla-exjobb/VisaExjobb/?exjobbId=52875">ad for a master&#8217;s project</a> out where they propose research to &#8220;find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning&#8221;, where the &#8220;platform for distributed big data analysis is already in place.&#8221; They promise a project at &#8220;the bleeding edge technology of machine learning and distributed big data analysis.&#8221;</li>
<li>Although I haven&#8217;t listed AstraZeneca as a &#8220;big data&#8221; company (yet), they seem to be jumping on the &#8220;data science&#8221; train as they are now advertising for &#8220;data angels&#8221; (!) and &#8220;predictive science data experts.&#8221;</li>
</ul>
<p>On the US stage, I&#8217;m curious about a new company called <a href="https://bigml.com/">BigML</a>, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. <a href="http://blog.bigml.com/">This blog post</a> talks about some of the motivations behind it. I&#8217;ve applied for an invite and will write up a blog post if I get the chance to try it.</p>
<p>Finally, I&#8217;d like to recommend this <a href="http://www.pbs.org/idealab/2012/01/the-top-10-data-mining-links-of-2011006.html">Top 10 data mining links of 2011</a> list. I&#8217;m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the <a href="http://mybiasedcoin.blogspot.com/2011/12/mic-and-mine-short-description.html">MIC/MINE method</a> which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit).  As <a href="http://theoreticalecology.wordpress.com/2011/12/16/the-maximal-information-coefficient/">this blog post</a> puts it, <em>&#8220;the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper)</em>.&#8221;</p>
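<p>To see why a measure that goes beyond ordinary correlation is useful, here is a minimal Python sketch (not MIC itself, and not taken from the paper) showing that the plain Pearson coefficient is essentially zero for a perfectly deterministic but non-linear relationship. The data and the helper name <em>pearson</em> are made up for illustration.</p>

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-10, 11))
linear = [2 * x + 1 for x in xs]    # y depends linearly on x
parabola = [x * x for x in xs]      # y is fully determined by x, but not linearly

r_linear = pearson(xs, linear)      # close to 1
r_parabola = pearson(xs, parabola)  # close to 0 despite the exact dependence
print(round(r_linear, 6), round(r_parabola, 6))
```

<p>The parabola is symmetric around zero, so the positive and negative halves cancel in the covariance. A dependence measure of the kind described above is meant to pick up exactly this sort of pattern.</p>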
<p>Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM&#8217;s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, <a href="http://jonathanstray.com/papers/wickham.pdf">Graphical Inference for Infoviz</a> (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of &#8220;visual hypothesis testing&#8221; based on generating &#8220;decoy plots&#8221; that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called <em>nullabor</em>. I really liked their analogy between hypothesis testing and a trial (the term &#8220;the statistical justice system&#8221;!):</p>
<blockquote><p>Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.</p></blockquote>
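<p>The &#8220;proportion of innocents&#8221; idea in the quote can be sketched as a small permutation test in Python. This is only an illustration under stated assumptions: the measurements are invented, the test statistic (absolute difference in group means) is my choice, and the code counts innocents that look <em>at least as</em> guilty as the accused, which is the usual convention. It is not the procedure from the paper or the nullabor package.</p>

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Share of label-shuffled data sets (the 'innocents') whose statistic
    is at least as extreme as the one observed for the real data."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null hypothesis the labels are arbitrary
        null_stat = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if null_stat >= observed:
            as_extreme += 1
    return as_extreme / n_permutations

# Invented measurements, for illustration only.
treated = [12.1, 11.8, 12.9, 13.4, 12.7, 13.0]
control = [10.2, 10.9, 9.8, 10.5, 11.1, 10.0]

p_clear = permutation_p_value(treated, control)  # groups clearly differ
p_same = permutation_p_value(control, control)   # identical groups
print(p_clear, p_same)
```

<p>When the two groups are identical the observed statistic is zero, so every shuffled data set looks at least as guilty and the p-value is 1. When the groups are well separated, very few shuffles match the observed difference.</p>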
<p>The other very cool article is from Gary King&#8217;s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.</p>
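<p>To project clusterings into a space, one first needs some notion of distance between two clusterings of the same items. The sketch below uses one minus the Rand index for that purpose. This is an assumption made for illustration; the post does not say which metric the paper uses, and the method names and labels here are hypothetical.</p>

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster labels for six documents from three methods.
clusterings = {
    "method_1": [0, 0, 0, 1, 1, 1],
    "method_2": [1, 1, 1, 0, 0, 0],  # same partition, different label names
    "method_3": [0, 0, 1, 1, 2, 2],
}

# Pairwise distances between clusterings: 0 means the partitions are identical.
distances = {
    (a, b): 1.0 - rand_index(clusterings[a], clusterings[b])
    for a, b in combinations(clusterings, 2)
}
for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

<p>A distance table like this is the kind of input that a dimensionality reduction method such as multidimensional scaling could turn into a 2D map of clustering methods.</p>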
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2012/01/15/hello-2012/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Learning from prediction contests</title>
		<link>http://followthedata.wordpress.com/2011/11/28/learning-from-prediction-contests/</link>
		<comments>http://followthedata.wordpress.com/2011/11/28/learning-from-prediction-contests/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 23:33:10 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/2011/11/28/learning-from-prediction-contests/</guid>
		<description><![CDATA[I think there has never been a time when it has been easier to get into machine learning and predictive analytics than right now. Let me explain &#8230; As you probably know, a company called Kaggle organizes predictive analytics competitions where data scientists can earn money from their skills and companies can tap into some [...]]]></description>
			<content:encoded><![CDATA[<p>I think there has never been a time when it has been easier to get into machine learning and predictive analytics than right now. Let me explain &#8230;</p>
<p>As you probably know, a company called <a href="http://www.kaggle.com/">Kaggle</a> organizes predictive analytics competitions where data scientists can earn money from their skills and companies can tap into some of the unknown talent of the world. There are other similar companies, like <a href="http://www.crowdanalytix.com/welcome">CrowdAnalytix</a>, and more specialized/closed variants on the same idea such as <a href="http://www.innocentive.com/">Innocentive</a>, but I think Kaggle has deservedly gotten the most buzz because they have succeeded the best in presenting their business case and vision. For example, <a href="http://media.kaggle.com/strata2011.html">here</a> is a presentation that Jeremy Howard from Kaggle gave at the Strata NY 2011 conference, where he outlines how Kaggle wants to become a &#8220;meritocratic&#8221; platform that allows people who are good at analytics to finally get properly compensated for their skills.</p>
<p>I have known about Kaggle for quite some time and been a fan of their business idea, but with one full-time job, occasional work on the side and two young kids, I figured I&#8217;d never have the time to participate fruitfully in the competitions myself. As it happened, I got the chance to chat with Kaggle&#8217;s CEO Anthony Goldbloom at a conference (Strata 2011 in Santa Clara) and he persuaded me to give the competitions a try. So I finally jumped in, and found that despite not really having the time to spare, I still enjoyed it and learned a lot. So far I&#8217;ve only participated for real in one competition, the <a href="http://www.kaggle.com/c/dunnhumbychallenge">Dunnhumby Shopper Challenge</a>, where the task was to predict (based on historical shopping records on thousands of customers) at what date each customer would next visit the store, and how much money (within $10) he or she would spend. This task turned out to be surprisingly non-standard and was definitely not something that you could just throw your favourite algorithm at right out of the box.</p>
<p>Already from this one competition I learned / noticed several things:</p>
<p>- You can sometimes get pretty far just by using common sense and a very simple conceptual model. In fact the winning entry by Alexander d&#8217;Yakonov (explained <a href="http://blog.kaggle.com/2011/10/16/kernel-density-at-the-checkout/">here</a>) used essentially the same basic idea as my model, although he had added a couple of tricks that I hadn&#8217;t thought of.</p>
<p>- It&#8217;s extremely helpful to learn from your competitors. Kaggle often asks high-scoring contestants to explain how they did it, which is a huge service to the community. For Dunnhumby, there was the winning entry that I linked above, plus <a href="http://blog.kaggle.com/2011/10/20/creatures-of-habit-neil-schneider/">this</a> from Neil Schneider, who placed second, and <a href="http://blog.kaggle.com/2011/10/19/deceitful-beast-william-cukierski/">this</a> from William Cukierski, who placed fourth. Similar explanations for other competitions can be found under the <a href="http://blog.kaggle.com/category/how-i-did-it/">&#8220;How I did it&#8221; tag on Kaggle&#8217;s blog</a>.</p>
<p>- A competition can really motivate you to learn new stuff that you wouldn&#8217;t have dreamed of touching otherwise. The Dunnhumby competition motivated me to learn survival analysis, although I didn&#8217;t end up using that particular statistical framework. (I tried, but couldn&#8217;t get it to work well on the problem.) I also started to brush up on time series analysis.</p>
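<p>As an illustration of how far a &#8220;common sense&#8221; model can go on a task like the Dunnhumby one, here is a minimal Python baseline: predict the next visit from the customer&#8217;s most common gap between visits, and the spend from the median basket. This is a sketch of the general idea only. It is not my competition entry or the winning model, and the shopping history is invented.</p>

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median

def predict_next_visit(visits):
    """visits: list of (date, spend) tuples, oldest first.
    Returns (predicted_date, predicted_spend) from two simple habits:
    the most common gap between visits and the median basket size."""
    dates = [d for d, _ in visits]
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    typical_gap = Counter(gaps).most_common(1)[0][0]
    typical_spend = median(spend for _, spend in visits)
    return dates[-1] + timedelta(days=typical_gap), typical_spend

# Invented history for one customer: mostly weekly shopping trips.
history = [
    (date(2011, 6, 1), 42.10),
    (date(2011, 6, 8), 38.50),
    (date(2011, 6, 15), 45.00),
    (date(2011, 6, 18), 12.30),
    (date(2011, 6, 25), 40.20),
]

next_date, next_spend = predict_next_visit(history)
print(next_date, next_spend)
```

<p>Using the most common gap and the median rather than means keeps the one unusual mid-week top-up trip from dragging both predictions off.</p>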
<p>During the past few days, I&#8217;ve discovered a couple of really, really good resources about how to get started with prediction contests:</p>
<p><a href="http://prezi.com/8fbsaa7mushs/using-r-for-data-mining-competitions/">Using R for data mining competitions</a> by Jonathan Lee shows case studies from Kaggle competitions. It&#8217;s heavy on the R material (which I like) but go look at it even if you don&#8217;t use R, as there is a lot more to the presentation than code.</p>
<p><a href="http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/">Getting in shape for the sport of data science</a> by Jeremy Howard of Kaggle (again). This is a really great nuts-and-bolts talk about how to compete in prediction contests, with useful tips on how to &#8220;munge&#8221; your data into shape using different tools &#8211; even Excel! &#8211; and set up your models. There are many nice tricks here. Finally he explains the ideas behind the random forest algorithm &#8211; &#8220;a lot of crappy predictors that are all crap in a slightly different way.&#8221; For a while I was tempted to apply the same idea to Kaggle competitors &#8211; all are crap in a different way but the occasional competitor stumbles on something good and the rest cancel each other&#8217;s errors out :-) &#8211; but this theory doesn&#8217;t hold, as many of the top competitors (such as Jeremy himself before he joined Kaggle) are consistently good.</p>
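<p>The &#8220;crap in a slightly different way&#8221; intuition can be checked with a small simulation. The Python sketch below is not a random forest; it only shows the averaging effect that forests rely on, assuming each weak predictor makes its own independent error. All numbers are simulated.</p>

```python
import random

rng = random.Random(42)
truth = [rng.uniform(0, 100) for _ in range(200)]

def noisy_predictor(values, noise_sd):
    """A weak predictor: the true value plus its own independent error."""
    return [v + rng.gauss(0, noise_sd) for v in values]

def mean_abs_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Fifty predictors that are all bad, each in a different way.
predictors = [noisy_predictor(truth, noise_sd=20.0) for _ in range(50)]

# The ensemble simply averages the fifty predictions for each item.
ensemble = [sum(column) / len(column) for column in zip(*predictors)]

individual_errors = [mean_abs_error(p, truth) for p in predictors]
average_individual_error = sum(individual_errors) / len(individual_errors)
ensemble_error = mean_abs_error(ensemble, truth)
print(round(average_individual_error, 2), round(ensemble_error, 2))
```

<p>The averaging only helps this much because the errors are independent. If all predictors were wrong in the same way, the average would be wrong in that way too, which is why random forests go to some lengths to decorrelate their trees.</p>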
<p>Soo .. let&#8217;s see if I will have time to dive into the next contest in earnest &#8230;</p>
<p>P. S. Other interesting resources for learning about prediction / machine learning etc. apart from the stellar presentations mentioned above, and the Kaggle &#8220;How I did it&#8221; testimonials, include:</p>
<p><a href="http://www.ai-class.com/">Stanford&#8217;s free online AI class</a> with Peter Norvig and Sebastian Thrun (of Google&#8217;s self-driving car fame) &#8211; I&#8217;ve been trying to follow this, but predictably enough haven&#8217;t had time to keep up</p>
<p><a href="http://www.ml-class.org/course/auth/welcome">Stanford&#8217;s online machine learning class</a> with Andrew Ng &#8211; I&#8217;ve only watched one lecture but it seems really good</p>
<p><a href="http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml">Free lectures from Tom Mitchell&#8217;s machine learning class</a></p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/11/28/learning-from-prediction-contests/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Gavagai</title>
		<link>http://followthedata.wordpress.com/2011/11/25/gavagai/</link>
		<comments>http://followthedata.wordpress.com/2011/11/25/gavagai/#comments</comments>
		<pubDate>Fri, 25 Nov 2011 23:19:20 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=587</guid>
		<description><![CDATA[Gavagai, which I mentioned in a post about Swedish big data companies earlier this year, seems to have come out of stealth mode. They have launched a blog and have started to talk about their Ethersource technology for text analysis. Looking at the use cases in the blog, one gets the impression of a kind [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.gavagai.se/">Gavagai</a>, which I mentioned in a post about <a href="http://followthedata.wordpress.com/2011/03/21/big-data-companies-in-sweden/">Swedish big data companies</a> earlier this year, seems to have come out of stealth mode. They have launched a <a href="http://gavagai.se/blog/">blog</a> and have started to talk about their <a href="http://www.gavagai.se/ethersource-technology.php">Ethersource</a> technology for text analysis. Looking at the use cases in the blog, one gets the impression of a kind of sentiment analysis engine on steroids (although the site also mentions data journalism and author profiling as example applications), but in fact the technology is much more interesting than that. As the <a href="http://www.gavagai.se/ethersource-technology.php">Ethersource page</a> describes, the system does not use any pre-existing knowledge but rather attempts to learn concepts from strings of symbols (or &#8220;computes and tracks relations between terms in symbols in streaming language data&#8221; as the page also has it.) This also makes the platform more or less language-agnostic and thus suitable for multilingual analysis. Of course (?), the system is constantly learning, rather than being &#8220;trained&#8221; and updated in discrete jumps.</p>
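<p>To make the &#8220;constantly learning&#8221; idea concrete, here is a toy Python sketch that tracks relations between terms by updating co-occurrence counts one sentence at a time, with no dictionary or separate training phase. This is purely an illustration of online learning from a text stream. It is not Ethersource&#8217;s actual method, which the page does not spell out, and the class name and example sentences are invented.</p>

```python
from collections import Counter, defaultdict

class StreamingCooccurrence:
    """Keeps running co-occurrence counts; every new sentence updates the
    model immediately, with no separate training phase."""

    def __init__(self, window=2):
        self.window = window
        self.counts = defaultdict(Counter)

    def update(self, sentence):
        # Whitespace tokenisation only, so no language-specific resources.
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            start = max(0, i - self.window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + self.window]
            self.counts[token].update(neighbours)

    def related(self, term, top=3):
        return [word for word, _ in self.counts[term].most_common(top)]

model = StreamingCooccurrence(window=2)
for line in [
    "the network dropped my call again",
    "my call dropped twice today",
    "great coverage on this network",
]:
    model.update(line)  # the model is usable after every single update

top_for_call = model.related("call", top=2)
print(top_for_call)
```

<p>Because nothing here depends on a vocabulary list or grammar, the same code runs unchanged on text in any language that separates words with spaces, which is a small-scale version of the language-agnostic property mentioned above.</p>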
<p>Gavagai&#8217;s <a href="http://gavagai.se/blog/">blog</a> contains some case studies where online social media have been monitored using the system in order to understand customer loyalty of US mobile network operators, reactions to Rick Perry&#8217;s botched interview, the apparently fading interest in Julian Assange, etc.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/11/25/gavagai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Podcasts</title>
		<link>http://followthedata.wordpress.com/2011/07/13/podcasts/</link>
		<comments>http://followthedata.wordpress.com/2011/07/13/podcasts/#comments</comments>
		<pubDate>Wed, 13 Jul 2011 18:12:43 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=585</guid>
		<description><![CDATA[A couple of radio shows/podcasts I&#8217;ve enjoyed in the last few months: Numbers from This American Life. This show from 1998 (!), which predates the Quantified Self movement by about 10 years, is as fresh and interesting as anything from the QS movement today. Available for one dollar or so from iTunes. NPR&#8217;s On the [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of radio shows/podcasts I&#8217;ve enjoyed in the last few months:</p>
<p><a href="http://www.thisamericanlife.org/radio-archives/episode/88/numbers">Numbers</a> from This American Life. This show from 1998 (!), which predates the Quantified Self movement by about 10 years, is as fresh and interesting as anything from the QS movement today. Available for one dollar or so from iTunes.</p>
<p><a href="http://www.onthemedia.org/2011/may/13/?utm_source=feedburner&amp;utm_medium=%24{feed}&amp;utm_campaign=Feed%3A+%24{otm}+%28%24{On+the+Media}%29">NPR&#8217;s On the Media</a> had a good data-themed show on May 13, 2011, with material on data journalism, personal data, the &#8220;<a href="http://www.newyorker.com/online/blogs/newsdesk/2011/01/jonah-lehrer-more-thoughts-on-the-decline-effect.html">decline effect</a>&#8221; and (particularly interestingly) two cautionary tales about relying too much on a data driven approach. Freely available on iTunes (etc.)</p>
<p>I&#8217;ve also enjoyed the <a href="http://sagebase.org/WP/pod/">numerous videos</a> from the Sage Commons Conference in April. The main themes are summarized in the web site slogan &#8220;genomics, health innovation and open science.&#8221; Sage, which I&#8217;ve blogged about several times before, is about open data, open science, disease network modelling, Bayesian networks, amongst other things.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/07/13/podcasts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Genomics / big data / machine learning job opportunities</title>
		<link>http://followthedata.wordpress.com/2011/07/13/genomics-big-data-machine-learning-job-opportunities/</link>
		<comments>http://followthedata.wordpress.com/2011/07/13/genomics-big-data-machine-learning-job-opportunities/#comments</comments>
		<pubDate>Wed, 13 Jul 2011 17:44:40 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=581</guid>
		<description><![CDATA[A couple of interesting job opportunities (both companies are in the US) for people with a genomics/data profile: Ion Flux (a company that I hadn&#8217;t heard of until now) works in &#8220;Clinical Genomics + High-Performance Computing&#8221; and looks for people with the following characteristics: Do you want to work with an innovative, dynamic team that’s [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of interesting job opportunities (both companies are in the US) for people with a genomics/data profile:</p>
<p><a href="http://ionflux.com/blog/careers/">Ion Flux</a> (a company that I hadn&#8217;t heard of until now) works in &#8220;Clinical Genomics + High-Performance Computing&#8221; and looks for people with the following characteristics:</p>
<blockquote><p>Do you want to work with an innovative, dynamic team that’s defining the state of the art in clinical diagnostics and healthcare?</p>
<p>Do you see big data problems as an opportunity to produce big results? Are you fascinated by the science, statistics, algorithms, and infrastructure needed to realize the possibilities?</p></blockquote>
<p><a href="http://halcyonmolecular.com/team/positions-available/">Halcyon Molecular</a>, a DNA sequencing company that &#8220;aims to transform biology into an information science&#8221;, is looking for people who know statistical learning, in particular Hidden Markov Models and Bayesian statistics. Halcyon&#8217;s web page says:</p>
<blockquote><p>Current gene sequencing methods are too slow, too expensive, and too incomplete to make &#8216;personalized medicine&#8217; more than a buzzword.  Our nanorobotics and single-atom detection approach will enable megabase single-molecule reads in milliseconds.</p></blockquote>
<p>To make this a reality, they need people who are good at image processing and the statistical learning approaches mentioned above.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/07/13/genomics-big-data-machine-learning-job-opportunities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Not contagious after all?</title>
		<link>http://followthedata.wordpress.com/2011/06/08/not-contagious-after-all/</link>
		<comments>http://followthedata.wordpress.com/2011/06/08/not-contagious-after-all/#comments</comments>
		<pubDate>Wed, 08 Jun 2011 20:13:23 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[contagious]]></category>
		<category><![CDATA[social-networks]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=573</guid>
		<description><![CDATA[(via Decision Science News) Ouch! A new paper titled &#8220;The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis&#8221; (published here and available in manuscript format on arXiv) has come out arguing very strongly against the conclusions drawn by Christakis and Fowler in a series of papers where they put forward the idea that things like [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=followthedata.wordpress.com&amp;blog=8369482&amp;post=573&amp;subd=followthedata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>(via <a href="http://www.decisionsciencenews.com/2011/06/03/questioning-the-evidence-of-influence-in-social-networks/">Decision Science News</a>) Ouch! A new paper titled &#8220;<em>The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis</em>&#8221; (published <a href="http://www.bepress.com/spp/vol2/iss1/2/">here</a> and available in manuscript format on <a href="http://arxiv.org/abs/1007.2876">arXiv</a>) has come out arguing very strongly against the conclusions drawn by Christakis and Fowler in a series of papers where they put forward the idea that things like obesity and smoking can be transmitted through social networks; a kind of &#8220;social contagion.&#8221; I <a href="http://followthedata.wordpress.com/2009/09/14/everything-is-contagious/">blogged</a> about these ideas a while back after both Wired and the New York Times had published articles on them. The title (harsh!) and the abstract speak for themselves:</p>
<blockquote><p>The chronic widespread misuse of statistics is usually inadvertent, not intentional. We find cautionary examples in a series of recent papers by Christakis and Fowler that advance statistical arguments for the transmission via social networks of various personal characteristics, including obesity, smoking cessation, happiness, and loneliness. Those papers also assert that such influence extends to three degrees of separation in social networks. We shall show that these conclusions do not follow from Christakis and Fowler&#8217;s statistical analyses. In fact, their studies even provide some evidence against the existence of such transmission. The errors that we expose arose, in part, because the assumptions behind the statistical procedures used were insufficiently examined, not only by the authors, but also by the reviewers. Our examples are instructive because the practitioners are highly reputed, their results have received enormous popular attention, and the journals that published their studies are among the most respected in the world. An educational bonus emerges from the difficulty we report in getting our critique published. We discuss the relevance of this episode to understanding statistical literacy and the role of scientific review, as well as to reforming statistics education.</p></blockquote>
<p>Cosma Shalizi has co-authored another paper (available <a href="http://smr.sagepub.com/content/40/2/211.full.pdf+html">here</a>) which makes a similar point in a much more, let&#8217;s say, polite way. My impression is that Shalizi is both sharp and trustworthy (I&#8217;ve learned a lot about statistics from his <a href="http://www.cscs.umich.edu/~crshalizi/weblog/">blog</a>) so I&#8217;m inclined to think he is on to something.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/06/08/not-contagious-after-all/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>MLDemos visualizes what classifiers do</title>
		<link>http://followthedata.wordpress.com/2011/06/07/mldemos-visualizes-what-classifiers-do/</link>
		<comments>http://followthedata.wordpress.com/2011/06/07/mldemos-visualizes-what-classifiers-do/#comments</comments>
		<pubDate>Tue, 07 Jun 2011 20:51:50 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Tools and Software]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=571</guid>
		<description><![CDATA[MLDemos is based on a really nice idea &#8211; to visualize how different classifiers construct the decision boundaries around arbitrary sets of data points. I had of course seen the concept of decision boundaries before; in many machine-learning classes you will draw or at least get to see boundaries or surfaces that delineate the parts [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=followthedata.wordpress.com&amp;blog=8369482&amp;post=571&amp;subd=followthedata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://mldemos.epfl.ch/">MLDemos</a> is based on a really nice idea &#8211; to visualize how different classifiers construct the decision boundaries around arbitrary sets of data points. I had of course seen the concept of decision boundaries before; in many machine-learning classes you will draw or at least get to see boundaries or surfaces that delineate the parts of the sample space where a classifier will yield different predictions. In MLDemos, you get to draw the points in the (2-D) sample space by hand, and you can choose among a variety of algorithms. Or if you want, you can upload your own data sets. The software doesn&#8217;t just do decision boundaries; it also visualizes regression, clustering and dynamical systems in cool and downright beautiful ways.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/06/07/mldemos-visualizes-what-classifiers-do/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Google Prediction API open to all</title>
		<link>http://followthedata.wordpress.com/2011/06/07/google-prediction-api-open-to-all/</link>
		<comments>http://followthedata.wordpress.com/2011/06/07/google-prediction-api-open-to-all/#comments</comments>
		<pubDate>Tue, 07 Jun 2011 20:39:02 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Tools and Software]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[prediction]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=568</guid>
		<description><![CDATA[I&#8217;ve been eagerly waiting to use the Google Prediction API ever since it was announced, and now (since sometime in May) it&#8217;s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address. Google&#8217;s Prediction API is basically a nice way to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=followthedata.wordpress.com&amp;blog=8369482&amp;post=568&amp;subd=followthedata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been eagerly waiting to use the <a href="http://code.google.com/apis/predict/docs/hello_world.html">Google Prediction API</a> ever since it was announced, and now (since sometime in May) it&#8217;s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address.</p>
<p>Google&#8217;s Prediction API is basically a nice way to run your classification and/or prediction tasks through Google&#8217;s black-box set of machine learning tools. The way it works is that you upload your training data to <a href="http://code.google.com/apis/storage/">Google Storage</a>, which is something like Google&#8217;s version of Amazon&#8217;s <a href="http://aws.amazon.com/s3/">S3</a>: a cloud-based storage system where you store your data in &#8220;buckets&#8221;. (Google Storage, like S3, uses the term bucket and, also like S3, requires that any letters in bucket names be lower-case.) You can activate both Google Storage and the Prediction API from the <a href="https://code.google.com/apis/console/#project:988555199071">Google APIs Console</a>. This is also where you will find (click &#8220;API access&#8221; on the left-hand menu) the access key that you will need to run prediction tasks. You&#8217;ll have to give credit card details to pay for potential future usage.</p>
<p>The training examples that you put in Storage need to be formatted according to the <a href="http://code.google.com/apis/predict/docs/developer-guide.html#structuringyourtrainingdata">specification in the Developer&#8217;s Guide</a>. Once they have been uploaded, you can train a model on the uploaded data, make predictions about new examples, update existing models and more using one of the <a href="http://code.google.com/apis/predict/docs/libraries.html">client libraries</a> or, even simpler, by copying some of the bash scripts shown on the same page (hidden behind &#8216;+&#8217; signs, which can be expanded). For these bash scripts to work as written on that page, you need to paste your API key into a file called &#8216;googlekey&#8217; located in the directory from which you run the script.</p>
<p>I used <a href="http://computationalbiologynews.blogspot.com/2011/05/cancer-classification-using-google.html">this walkthrough example about cancer classification from gene expression data</a> to get up to speed on how the Google Prediction API works. Now I&#8217;m thinking about what data to throw at it next. Perhaps it would be fun to input some Kaggle contest data sets into it as a kind of &#8220;Google baseline&#8221; predictor? <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/06/07/google-prediction-api-open-to-all/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Video mining and sports analytics</title>
		<link>http://followthedata.wordpress.com/2011/04/26/video-mining-and-sports-analytics/</link>
		<comments>http://followthedata.wordpress.com/2011/04/26/video-mining-and-sports-analytics/#comments</comments>
		<pubDate>Tue, 26 Apr 2011 21:44:08 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[sports]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=564</guid>
		<description><![CDATA[I was one of those people who were blown away by Deb Roy&#8217;s TED presentation, where he showed how he had collected 90,000 hours worth of home video footage and mined that footage to find out exactly how his son developed language skills. Some commenters remarked that the actual amount of knowledge gained through this [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=followthedata.wordpress.com&amp;blog=8369482&amp;post=564&amp;subd=followthedata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I was one of those people who were blown away by <a href="http://www.ted.com/talks/deb_roy_the_birth_of_a_word.html">Deb Roy&#8217;s TED presentation</a>, where he showed how he had collected 90,000 hours&#8217; worth of home video footage and mined that footage to find out exactly how his son developed language skills. Some commenters remarked that the actual amount of knowledge gained through this exercise was not that impressive. That may be so, but I still thought it was remarkable how Roy&#8217;s MIT Media Lab team had managed to turn all of that video data into gorgeous visualizations that made intuitive sense (like the &#8220;3D density plots&#8221; of word usage throughout the house, for instance). There has to be a serious software infrastructure somewhere in there to enable this kind of analysis. Roy has a company, <a href="http://www.bluefinlabs.com/overview/company/">Bluefin Labs</a>, and I came across an intriguing <a href="http://www.masshightech.com/stories/2009/10/05/daily15-Bluefin-Labs-software-to-scan-sports-video.html">press release from 2009</a> which stated that Bluefin would start to &#8220;index video&#8221; for the consumer sports market, so that sports videos could be easily searched and analyzed. However, it seems Bluefin has since dropped the idea, as their <a href="http://www.bluefinlabs.com/overview/company/">home page</a> now talks about &#8220;measuring consumer response&#8221; through digital channels rather than sports video analytics.</p>
<p>Other companies have taken up the idea, though. A recent Wired article (<a href="http://www.wired.com/playbook/2011/04/nba-data-revolution/all/1">Hoops 2.0: Inside the NBA&#8217;s Data-Driven Revolution</a>) describes a system called <a href="http://www.sportvu.com/basketball.asp">SportVU</a>, which uses video cameras to track the players and the ball to a remarkable level of detail. The technology grew out of missile-tracking applications and the optical recognition algorithms associated with them. It was first applied to soccer, but it was later decided that basketball would be more lucrative. With the SportVU system &#8211; and, not to forget, the analytics to crunch the raw image data &#8211; it&#8217;s possible to track some pretty complicated metrics, as illustrated in <a href="http://sportvuhoops.stats.com/kevin-martin-shooting-metrics/">this example</a> for Kevin Martin.</p>
<p>A different way to quantify basketball player performance is used by Infomotion Sports, as described in <a href="http://sports.yahoo.com/nba/blog/ball_dont_lie/post/Rise-of-the-machines-Will-coaches-cede-control-?urn=nba-330135">this Yahoo! Sports article</a>. Infomotion&#8217;s <a href="http://www.94fifty.com/technology">94Fifty</a> technology, instead of using video cameras, measures the movements of the ball directly using sensors embedded inside the ball. By &#8220;feeling&#8221; the ball&#8217;s motion, one can quantify a lot of things about a player, such as dribbling speed, shot angles and spin. A quote from the article:</p>
<blockquote>
<p id="yui_3_3_0_2_130385378851022">&#8220;A coach can tell me that [a player] needs to work on his left hand,&#8221; Kamil said. &#8220;[But] we can tell you that his right hand is 14 percent more dominant than his left.&#8221;</p>
</blockquote>
<p>Of course, a lot of people would be skeptical as to how much you really gain from this kind of data-driven approach compared to the good old gut feeling of a coach. I count myself among the skeptics, but I think the technology is intriguing.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/04/26/video-mining-and-sports-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
		<item>
		<title>Two good presentations</title>
		<link>http://followthedata.wordpress.com/2011/04/26/two-good-presentations/</link>
		<comments>http://followthedata.wordpress.com/2011/04/26/two-good-presentations/#comments</comments>
		<pubDate>Tue, 26 Apr 2011 10:28:40 +0000</pubDate>
		<dc:creator>Mikael Huss</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geo-analytics]]></category>
		<category><![CDATA[presentations]]></category>

		<guid isPermaLink="false">http://followthedata.wordpress.com/?p=559</guid>
		<description><![CDATA[Two good recent presentations: Pete Skomoroch&#8217;s tutorial on &#8220;geo analytics&#8221;. A really good slide deck that manages to introduce [Elastic] MapReduce, Pig, Amazon Mechanical Turk, Python Natural Language Toolkit, GitHub and probably other stuff I don&#8217;t remember as seamless components in a project in a way that makes complete sense. In fact I liked this [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=followthedata.wordpress.com&amp;blog=8369482&amp;post=559&amp;subd=followthedata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Two good recent presentations:</p>
<p><a href="http://www.slideshare.net/pskomoroch/geo-analytics-tutorial">Pete Skomoroch&#8217;s tutorial on &#8220;geo analytics&#8221;</a>. A really good slide deck that manages to introduce [Elastic] MapReduce, Pig, Amazon Mechanical Turk, Python Natural Language Toolkit, GitHub and probably other stuff I don&#8217;t remember as seamless components in a project in a way that makes complete sense. In fact I liked this much better than the same presenter&#8217;s talk at Strata 2011, maybe because this one goes into more detail.</p>
<p>Similarly, I like Recorded Future developer <a href="http://assets.en.oreilly.com/1/event/56/Large%20datasets%20in%20MySQL%20on%20Amazon%20EC2%20Presentation.pdf">Anders Karlsson&#8217;s presentation about how Recorded Future uses Amazon EC2</a> because it gives a lot of detail about nitty-gritty, day-to-day data analysis/storage challenges. Karlsson also provides some non-obvious thoughts on what works well, and less well, when trying to manage very large data sets on EC2.</p>
]]></content:encoded>
			<wfw:commentRss>http://followthedata.wordpress.com/2011/04/26/two-good-presentations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/bf36ba627303241ad267f96b76f2b095?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mickot</media:title>
		</media:content>
	</item>
	</channel>
</rss>
