Follow the Data

A data driven blog

Archive for the category “People”

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadley's PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis, such as reshaping data into a “tidy” form that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages), and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also written the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread (a small sketch follows after this list).
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” designed for responsive, interactive graphics and streaming visualization on the web. This looked really nice. For example, you can easily add sliders to a plot so you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R Markdown documents with knitr so that the resulting report can contain sliders and other interactive elements (this is obviously not possible with PDFs, though). ggvis will be released on CRAN “in a week or so”.
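To make the tidyr verbs a little more concrete, here is a minimal sketch on a made-up two-country data set (my own example, not code from the talk):

```r
# Not on CRAN at the time of writing; install from GitHub:
# devtools::install_github("hadley/tidyr")
library(tidyr)

# A small "messy" data set (made-up numbers): one column per year
messy <- data.frame(
  country = c("Sweden", "Norway"),
  y2012   = c(10, 20),
  y2013   = c(11, 25)
)

# gather() stacks the year columns into key/value pairs (long, "tidy" format)
tidy <- gather(messy, year, cases, -country)

# spread() is the inverse of gather(): back to one column per year
spread(tidy, year, cases)

# separate() splits one column into several, e.g. a combined "country_year" key:
# separate(data, col = country_year, into = c("country", "year"), sep = "_")
```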

Hadley also highlighted the magrittr package, which implements a pipe operator for R (Magritte/pipe … get it? (groan)). The pipe looks like %>%, and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.
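As a rough illustration of the readability argument, here is a small sketch using the built-in mtcars data set (my own example, not from the talk):

```r
library(magrittr)  # provides the %>% pipe
library(dplyr)

# Without the pipe: nested calls that have to be read inside-out
arrange(
  summarise(group_by(filter(mtcars, cyl != 8), cyl), mean_mpg = mean(mpg)),
  desc(mean_mpg)
)

# With the pipe: the same steps, read top to bottom
mtcars %>%
  filter(cyl != 8) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
```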

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary.)

A quotable Domingos paper

I’ve been (re-)reading Pedro Domingos’ paper, A Few Useful Things to Know About Machine Learning, and wanted to share some quotes that I like.

  • (…) much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in [textbooks].
  • Most textbooks are organized by representation [rather than the type of evaluation or optimization] and it’s easy to overlook the fact that the other components are equally important.
  • (…) if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it.
  • Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.
  • What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (…)
  • (…) strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.
  • Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter cover only a fraction of about 10^-18 of the input space. This is what makes machine learning both necessary and hard.
  • (…) the most useful learners are those that facilitate incorporating knowledge.
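(A quick sanity check of the 10^-18 figure, assuming boolean features as in the paper: 100 binary variables give 2^100 ≈ 1.3 × 10^30 possible inputs, so 10^12 examples cover roughly 10^12 / 2^100 ≈ 8 × 10^-19, i.e. about 10^-18, of the input space.)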

Another interesting recent paper by Domingos is What’s Missing in AI: The Interface Layer.

Previously, Domingos has done a lot of interesting work on, for instance, why Naïve Bayes often works well even though its assumptions are not fulfilled, and why bagging works well. Those are just the ones I remember; I’m sure there is a lot more.

Follow the Data podcast, episode 4: Self-tracking with Niklas Laninge

In this episode of our podcast, we shift our focus from the “big data” themes of episodes 1-3 to personal data and self-tracking. We talked to Niklas Laninge, founder of Psykologifabriken (“The Psychology Factory”) and COO of Hoa’s Tool Shop, both relatively new Stockholm-based startups that use applied psychology in innovative ways to facilitate lasting behavior change – in the case of the latter company, using digital tools such as smartphone apps. Niklas is also an avid collector of data on himself and describes some things he has found out by analyzing those data – remarking that “When my [Nike] Fuelband broke, part of myself broke as well.”

At one point, I (Mikael) miserably failed to get the details right about The Human Face of Big Data project, which I erroneously call “Faces of Big Data” in the podcast. Also, I said that it was created by Greenplum, when in fact it was developed by Against All Odds productions (Rick Smolan and Jennifer Erwitt) and sponsored by EMC (of which Greenplum is a division.)

Some of the things we discussed:

– Viary, a tool that facilitates behavior change in organizations or individuals

– Clinical trials showing promising results from using Viary to treat depression

– “Dance-offs” as a fun way to interact with people on the dance floor and get an extreme exercise session

Listen to the podcast | Follow The Data #4: Self-tracking with Niklas Laninge

Follow the Data podcast, episode 3: Grokking Big Data with Paco Nathan

In this third episode of the Follow the Data podcast we talk to Paco Nathan, Data Scientist at Concurrent Inc.

Podcast link: http://s3.amazonaws.com/follow_the_data/FollowTheData_03_Podcast.mp3

Paco’s blog: http://ceteri.blogspot.se/

The running time is about one hour.

Paco’s internet connection died just as we were about to start the podcast, so he had to connect via Skype on his iPhone. We apologize on behalf of his internet provider in Silicon Valley for the reduced sound quality this caused.

Here are a few links to the stuff we discussed:

http://www.cascading.org/
An application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop.

http://clojure.org/
A dialect of Lisp that runs on the JVM.

https://github.com/twitter/scalding
A Scala library that makes it easy to write MapReduce jobs in Hadoop.

http://www.cascading.org/multitool/
A simple command line interface for building large-scale data processing jobs based on Cascading.

http://en.wikipedia.org/wiki/CAP_theorem
States that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.

http://www.nature.com/news/nanopore-genome-sequencer-makes-its-debut-1.10051
An article on the USB-sized Oxford Nanopore MinION sequencer.

http://datakind.org/
Previously known as Data Without Borders, this organisation aims to do good with big data.

http://www.climate.com/
Prediction-based insurance for farmers.

http://en.wikipedia.org/wiki/All_Watched_Over_by_Machines_of_Loving_Grace_(TV_series)
An interesting take on how programming culture has affected life. Link to episode #2 (http://vimeo.com/29875053), “The use and abuse of vegetational concepts” – about how the idea of ecosystems sprang out of the notion of harmony in nature, how this influenced cybernetics, and the perils of taking this animistic concept too far.

http://scratch.mit.edu/
A great way to teach kids to code.

http://www.stencyl.com/
Another interesting tool for teaching kids to code and build games.

http://www.minecraft.net/
Free-form virtual world building game.

http://www.yelloworb.com/orbblog/
Some info on an Arduino-based wireless wind measurement project by Karl-Petter Åkesson (in Swedish).

http://www.fringeware.com/
A pioneering internet retailer that Paco co-founded.

Practical advice for machine learning: bias, variance and what to do next

The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice” part, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).

His advice here focuses on the concepts of bias and variance in statistical learning. I had been vaguely aware of the “bias-variance tradeoff” and “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.

As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.

Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.

There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, for instance a regularization parameter that penalizes complex models in many linear-modelling approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.
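To see the K-NN end of this tradeoff in numbers, here is a minimal sketch using the class package on simulated data (my own example, not from the course). With K = 1 the training error is zero by construction, since every point is its own nearest neighbour, while the held-out error is clearly higher; a larger K trades some training error for better generalization.

```r
library(class)  # provides knn()

set.seed(1)
# Simulated two-class data with two noisy numeric features (made up for illustration)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.8) > 0, "A", "B"))
train <- sample(n, 100)

for (k in c(1, 15)) {
  # Training error: predict the training points themselves
  train_err <- mean(knn(x[train, ], x[train, ], y[train], k = k) != y[train])
  # Held-out error: predict the points not used for training
  test_err <- mean(knn(x[train, ], x[-train, ], y[train], k = k) != y[-train])
  cat(sprintf("k = %2d  training error = %.2f  held-out error = %.2f\n",
              k, train_err, test_err))
}
```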

Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:

– Get more training examples?

(Answer: When you have high variance. More training examples will not fix a high bias, because your underlying model will still not be able to approximate the correct function.)

– Try smaller sets of features?

(Answer: When you have high variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”.)

– Try to obtain new features?

(Answer: Usually works well when you suffer from high bias.)

Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)
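Here is one way such a learning-curve loop might look in R, using a deliberately too-simple linear model on simulated nonlinear data so that the high-bias pattern described below shows up (my own sketch, not code from the course; swapping in a much more flexible model would instead produce the high-variance pattern, with a persistent gap between the two curves):

```r
set.seed(1)
# Simulated data with a nonlinear signal (made up for illustration)
n_all <- 500
x <- runif(n_all, -3, 3)
y <- sin(x) + rnorm(n_all, sd = 0.3)
d <- data.frame(x = x, y = y)

cv <- d[401:500, ]             # fixed cross-validation set
sizes <- seq(20, 400, by = 20) # training set sizes to try

mse <- function(fit, data) mean((predict(fit, data) - data$y)^2)

curves <- t(sapply(sizes, function(m) {
  train <- d[sample(400, m), ]     # random subset of the training pool
  fit <- lm(y ~ x, data = train)   # deliberately high-bias: a straight line
  c(train_error = mse(fit, train), cv_error = mse(fit, cv))
}))

matplot(sizes, curves, type = "l", lty = 1,
        xlab = "Training set size", ylab = "Mean squared error")
legend("topright", legend = colnames(curves), lty = 1, col = 1:2)
```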

In the typical high-bias case, the cross-validation error will initially go down and then plateau as the number of training examples grows. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.

In the typical high-variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high-variance case, says Ng, is that the difference between the cross-validation error and the training set error is large. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.

To summarize (using pictures from this PDF):

– Learning curves can tell you whether you appear to suffer from high bias or high variance.

– You can base your next step on what you find using the learning curves.

I think it’s nice to have these kinds of rules of thumb for when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classification models.

Quick links

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andme, the company his wife founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s disease, we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases circulating more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

What do you do with a personal genome?

Now that the full sequencing of a person’s genome can be done for well below USD 10,000 – Complete Genomics recently announced having sequenced three genomes at consumables costs of between $1,726 and $8,005 – the question is what you would be able to do, today, with information about your genome.

Personalized Medicine recently published an article, Living with my personal genome, by Jim Watson (co-discoverer of the structure of DNA). The article is very short, but it does tell us that Watson has changed his behavior in at least one way: he now takes beta-blockers only once a week instead of every day, because he discovered that he has an enzyme variant which causes him to metabolize the drug slowly, making him “…constantly fall asleep at inappropriate moments.” Apparently it took a whole-genome scan to realize that this was abnormal!

Quantified Self has reported on its third New York Show & Tell session, where Esther Dyson, who has also had her genome sequenced, discussed what she had found out (video here). However, rather than the full genome sequence (which she calls “disappointing” at the beginning of the talk, saying that “it tells me nothing, I can’t interpret it” – if you think you could interpret it better, it’s online here), she focuses on her report from 23andme, which records information about a million SNPs (single-letter variations in the DNA) in each individual. She shows some rather nifty tools like the Relative Finder, which can be used to identify potential cousins.

Another early whole-genome sequencee, Steven Pinker, wrote a long and thoughtful article about his genome a while back in The New York Times. Definitely worth a read.

Informavores

There’s a pretty interesting interview with a German thinker called Frank Schirrmacher, and comments on that interview, at edge.org. (I like this format – it’s a bit like those new online scientific journals where you can read the reviewers’ comments to the authors.) Schirrmacher talks about the concept of informavores,

…the human being as somebody eating information. So you can, in a way, see that the Internet and that the information overload we are faced with at this very moment has a lot to do with food chains, has a lot to do with food you take or not to take, with food which has many calories and doesn’t do you any good, and with food that is very healthy and is good for you.

He has some interesting thoughts on “dislocated” thought and the concept of free will …

…thinking itself somehow leaves the brain and uses a platform outside of the human body. And that, of course, is the Internet and it’s the cloud. Very soon we will have the brain in the cloud. And that raises the question about the importance of thoughts. For centuries, what was important for me was decided in my brain. But now, apparently, it will be decided somewhere else.

… and prediction:

What will this mean for the question of free will? Because, in the bottom line, there are, of course, algorithms, who analyze or who calculate certain predictabilities. And I’m wondering if the comfort of free will or not free will would be a very, very tough issue of the future.

[…]

The way we predict our own life, the way we are predicted by others, through the cloud, through the way we are linked to the Internet, will be matters that impact every aspect of our lives.

The interview is worth reading in full, as are the comments. I actually agree with many of the commenters who criticize Schirrmacher’s views, but the debate is interesting and he definitely has some novel ideas.

Two interviews

From The Future at Work podcast, a short video interview with Deborah Estrin about participatory sensing. This is essentially about people collectively compiling data, for instance using their cell phones (since the cell phone is today’s most ubiquitous and easy-to-use data collection device). Estrin describes an application of participatory sensing, What’s Invasive, where people locate invasive plants using their iPhone or Android phone. This could be done, for instance, in a national park, where both employees and trekkers would be able to snap geo-coded photos (through GPS, although the photos do not strictly need to be geo-coded; they can be annotated later through a website). There’s a strong overlap with citizen science here.

Estrin also briefly describes an interesting application which traces your own path through a city over days, weeks or years and mashes up the spatial information with data on air quality. Air quality varies in different locations in a city and over time, but with this application you can get a pretty good approximation of the pollution you tend to get exposed to. This may prompt a change in your regular bike route, for instance. (Bonus link: The Beijing air quality Twitter feed)

Also, H+ has an interview with Pattie Maes, who delivered the stunning Sixth Sense TED talk, where she tried to show what it could be like to have a “sixth sense for data”, as she put it.
