Follow the Data

A data driven blog

Archive for the category “People”

Follow the Data podcast, episode 4: Self-tracking with Niklas Laninge

In this episode of our podcast, we shift our focus from the “big data” themes in episodes 1-3 to personal data and self-tracking. We talked to Niklas Laninge, founder of Psykologifabriken (“The Psychology Factory”) and COO of Hoa’s Tool Shop. Both are relatively new Stockholm-based startups that use applied psychology in innovative ways to facilitate lasting behavior change; the latter does so with digital tools such as smartphone apps. Niklas is also an avid collector of data on himself and describes some things he has found out by analyzing those data – and remarks that “When my [Nike] Fuelband broke, part of myself broke as well.”

At one point, I (Mikael) miserably failed to get the details right about The Human Face of Big Data project, which I erroneously call “Faces of Big Data” in the podcast. Also, I said that it was created by Greenplum, when in fact it was developed by Against All Odds Productions (Rick Smolan and Jennifer Erwitt) and sponsored by EMC (of which Greenplum is a division).

Some of the things we discussed:

- Viary, a tool that facilitates behavior change in organizations or individuals

- Clinical trials showing promising results from using Viary to treat depression

- “Dance-offs” as a fun way to interact with people on the dance floor and get an extreme exercise session

Listen to the podcast | Follow The Data #4 : Self Tracking with Niklas Laninge

Follow the Data podcast, episode 3: Grokking Big Data with Paco Nathan

In this third episode of the Follow the Data podcast we talk to Paco Nathan, Data Scientist at Concurrent Inc.

Podcast link: http://s3.amazonaws.com/follow_the_data/FollowTheData_03_Podcast.mp3

Paco’s blog: http://ceteri.blogspot.se/

The running time is about one hour.

Paco’s internet connection died just as we were about to start the podcast, so he had to connect via Skype on his iPhone. We apologize on behalf of his internet provider in Silicon Valley for the reduced sound quality this caused.

Here are a few links to stuff we discussed:

http://www.cascading.org/
An application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop.

http://clojure.org/
A dialect of Lisp that runs on the JVM.

https://github.com/twitter/scalding
A Scala library that makes it easy to write MapReduce jobs in Hadoop.

http://www.cascading.org/multitool/
A simple command line interface for building large-scale data processing jobs based on Cascading.

http://en.wikipedia.org/wiki/CAP_theorem
states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, Partition tolerance

http://www.nature.com/news/nanopore-genome-sequencer-makes-its-debut-1.10051
an article on the USB-sized Oxford Nanopore MinION sequencer

http://datakind.org/
Previously known as Data Without Borders, this organisation aims to do good with Big Data.

http://www.climate.com/
Prediction based insurance for farmers.

http://en.wikipedia.org/wiki/All_Watched_Over_by_Machines_of_Loving_Grace_(TV_series)
An interesting take on how programming culture has affected life. Link to episode #2 (http://vimeo.com/29875053)  “The use and abuse of vegetational concepts” – about how the idea of ecosystems came to be, sprung out of the notion of harmony in nature, how this influenced cybernetics and the perils of taking this animistic concept too far.

http://scratch.mit.edu/
A great way to teach kids to code.

http://www.stencyl.com/
Another interesting tool for teaching kids to code and build games.

http://www.minecraft.net/
Free form virtual reality game.

http://www.yelloworb.com/orbblog/
Some info on an Arduino-based wireless wind measurement project by Karl-Petter Åkesson (in Swedish).

http://www.fringeware.com/
A pioneering internet retailer that Paco was one of the founders of.

Practical advice for machine learning: bias, variance and what to do next

The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice” part, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).

His advice here focuses on the concepts of bias and variance in statistical learning. I had been vaguely aware of the “bias and variance tradeoff” and the “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.

As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.
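To make the under-fitting example concrete, here is a minimal sketch in plain Python. The toy data set (y = x² on the integers from -5 to 5) and the helper names `fit_line` and `mse` are my own illustrative assumptions, not something taken from the course. It fits a straight line by ordinary least squares, and then fits the same kind of model on a squared feature instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy quadratic relationship (an assumption made for illustration).
xs = [float(x) for x in range(-5, 6)]
ys = [x ** 2 for x in xs]

# A straight line in x cannot follow the curve, whatever its parameters.
slope, intercept = fit_line(xs, ys)
linear_error = mse(xs, ys, slope, intercept)

# The same model on the feature x**2 can represent the relationship exactly.
squared = [x ** 2 for x in xs]
slope_sq, intercept_sq = fit_line(squared, ys)
quadratic_error = mse(squared, ys, slope_sq, intercept_sq)

print("training error, linear in x:   ", linear_error)
print("training error, linear in x**2:", quadratic_error)
```

The straight line is left with a large training error even though it is the best possible line, while adding the squared feature removes the error. That is the sense in which obtaining new features is the remedy for high bias.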

Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.
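The K-NN behaviour described above can be reproduced with a small sketch. Everything in it is an illustrative assumption of mine: the one-dimensional toy data, the two deliberately mislabelled training points standing in for noise, and the function names `knn_predict` and `error_rate`.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(train, points, k):
    """Fraction of points whose predicted label differs from the given one."""
    wrong = sum(1 for x, label in points if knn_predict(train, x, k) != label)
    return wrong / len(points)


def true_label(x):
    # Assumed ground truth for the toy problem: class 1 from x = 5 upwards.
    return 0 if x < 5 else 1


# Training set on the integers 0..9, with two labels flipped as noise.
train = [(float(x), true_label(x)) for x in range(10)]
train[2] = (2.0, 1)
train[7] = (7.0, 0)

# Held-out points lie between the training points and carry the true labels.
held_out = [(x + 0.4, true_label(x + 0.4)) for x in range(10)]

for k in (1, 3):
    print("K =", k,
          "training error:", error_rate(train, train, k),
          "held-out error:", error_rate(train, held_out, k))
```

With K=1 every training point is its own nearest neighbour, so the training error is zero, but the two noisy labels are carried straight over to nearby held-out points. With K=3 the noise is outvoted: the training error goes up and the held-out error goes down.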

There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, like for instance a regularization parameter that penalizes complex models in many linear modelling type approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.
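As a small illustration of such a control knob, here is a sketch of ridge regression for a single feature with no intercept, where the penalised least-squares solution has the closed form w = Σxy / (Σx² + penalty). The data values and the name `ridge_weight` are assumptions chosen for the example; this is not code from the lecture.

```python
def ridge_weight(xs, ys, penalty):
    """Weight minimising sum((y - w*x)**2) + penalty * w**2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Toy data lying roughly on y = 2x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

penalties = (0.0, 1.0, 10.0, 100.0)
weights = [ridge_weight(xs, ys, penalty) for penalty in penalties]

for penalty, weight in zip(penalties, weights):
    print("penalty", penalty, "-> weight", round(weight, 3))
```

A penalty of zero gives the ordinary least-squares weight. Raising the penalty shrinks the weight towards zero, which trades some extra bias for lower variance.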

Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:

- Get more training examples?

(Answer: When you have high variance. More training examples will not fix a high bias, because your underlying model will still not be able to approximate the correct function.)

- Try smaller sets of features?

(Answer: When you have high variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”.)

- Try to obtain new features?

(Answer: Usually works well when you suffer from high bias.)

Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)
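The procedure can be sketched in a few lines of plain Python. This is only a rough illustration under assumptions of my own: the model is a straight line fitted by least squares, the data come from a noisy quadratic, and the names `fit_line`, `mean_squared_error`, `learning_curve` and `sample` are hypothetical. The random subsets are taken as the first m examples of a shuffled copy of the training set, which is one simple way of doing the random selection.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) pairs; returns a prediction function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def learning_curve(train, cv, sizes, seed=0):
    """Return (m, training error, cross-validation error) for each size m."""
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    curve = []
    for m in sizes:
        subset = shuffled[:m]          # random subset of m training examples
        model = fit_line(subset)       # train on the subset only
        curve.append((m,
                      mean_squared_error(model, subset),
                      mean_squared_error(model, cv)))
    return curve


# Toy data: a noisy quadratic, which a straight line cannot capture.
rng = random.Random(1)


def sample(n):
    return [(x, x ** 2 + rng.gauss(0, 1))
            for x in (rng.uniform(-3, 3) for _ in range(n))]


train, cv = sample(200), sample(200)
curve = learning_curve(train, cv, [2, 5, 10, 50, 200])
for m, train_err, cv_err in curve:
    print(m, round(train_err, 2), round(cv_err, 2))
```

Plotting the second and third columns against the first gives the two learning curves. With only two training examples the line fits them exactly, so the training error starts at essentially zero and then rises as more examples are added.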

In the typical high bias case, the cross-validation error will initially go down and then plateau as the number of training examples grows. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.

In the typical high variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high variance case, says Ng, is that the difference between the cross-validation error and the training set error is high. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.
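The two diagnostics can be written down as a rough rule of thumb. The sketch below is my own simplification, not code from the course: the function name `diagnose` and both thresholds (`acceptable_error` and `gap_tolerance`) are assumptions that you would have to choose for your own problem. The suggested actions are the ones discussed above.

```python
def diagnose(train_error, cv_error, acceptable_error, gap_tolerance):
    """Map final learning-curve values to a diagnosis and suggested next steps.

    Both thresholds are problem-specific assumptions supplied by the caller.
    """
    findings = {}
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        findings["high bias"] = ["try to obtain new features"]
    if cv_error - train_error > gap_tolerance:
        # Good fit on the training set that does not carry over to new data.
        findings["high variance"] = ["get more training examples",
                                     "try a smaller set of features"]
    return findings


# Both errors high and close together: looks like high bias.
print(diagnose(0.30, 0.32, acceptable_error=0.10, gap_tolerance=0.05))

# Low training error, much higher cross-validation error: high variance.
print(diagnose(0.02, 0.25, acceptable_error=0.10, gap_tolerance=0.05))
```

The two checks are kept independent on purpose, since a model can in principle show both problems at once.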

To summarize (using pictures from this PDF):

- Learning curves can tell you whether you appear to suffer from high bias or high variance.

- You can base your next step on what you found using the learning curves:

I think it’s nice to have these kinds of rules of thumb when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classification models.

Quick links

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out through a test from 23andme, the company his wife founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing what is more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

What do you do with a personal genome?

Now that the full sequencing of a person’s genome can be done for well below USD10,000 – Complete Genomics recently announced having sequenced three genomes for consumables costs between $1,726 and $8,005 – the question is what you would be able to do, today, with information about your genome.

Personalized Medicine recently published an article, Living with my personal genome, by Jim Watson (co-discoverer of the structure of DNA). The article is very short but it does tell us that Watson has changed his behavior in at least one way: he now takes beta-blockers only once a week instead of every day, because he discovered that he has an enzyme variant which causes him to metabolize the drug slowly, making him “…constantly fall asleep at inappropriate moments.” Apparently it took a whole-genome scan to realize that was abnormal!

Quantified Self has reported on its third New York Show & Tell session, where Esther Dyson, who also has had her genome sequenced, discussed what she had found out (video here). However, rather than the full genome sequence (which she calls “disappointing” in the beginning of the talk, saying that “it tells me nothing, I can’t interpret it” – if you think you could interpret it better, it’s online here), she focuses on her report from 23andme, which records information about a million SNPs (single-letter variations in the DNA) in each individual. She shows some rather nifty tools like the Relative Finder, which can be used to identify potential cousins.

Another early whole-genome sequencee, Steven Pinker, wrote a long and thoughtful article about his genome a while back in the New York Times. Definitely worth a read.

Informavores

There’s a pretty interesting interview with a German thinker called Frank Schirrmacher, and comments on that interview, at edge.org. (I like this format – it’s a bit like those new online scientific journals where you can read the reviewers’ comments to the authors.) Schirrmacher talks about the concept of informavores,

…the human being as somebody eating information. So you can, in a way, see that the Internet and that the information overload we are faced with at this very moment has a lot to do with food chains, has a lot to do with food you take or not to take, with food which has many calories and doesn’t do you any good, and with food that is very healthy and is good for you.

He has some interesting thoughts on “dislocated” thought and the concept of free will …

…thinking itself somehow leaves the brain and uses a platform outside of the human body. And that, of course, is the Internet and it’s the cloud. Very soon we will have the brain in the cloud. And the raises the question about the importance of thoughts. For centuries, what was important for me was decided in my brain. But now, apparently, it will be decided somewhere else.

… and prediction:

What will this mean for the question of free will? Because, in the bottom line, there are, of course, algorithms, who analyze or who calculate certain predictabilities. And I’m wondering if the comfort of free will or not free will would be a very, very tough issue of the future.

[...]

The way we predict our own life, the way we are predicted by others, through the cloud, through the way we are linked to the Internet, will be matters that impact every aspect of our lives.

The interview is worth reading in full, as are the comments. I actually agree with many of the commenters who criticize Schirrmacher’s views, but the debate is interesting and he definitely has some novel ideas.

Two interviews

From The Future at Work podcast, a short video interview with Deborah Estrin about participatory sensing. This is essentially about people collectively compiling data, for instance using their cell phones (since that is today’s most ubiquitous and easy-to-use data collection device). Estrin describes an application of participatory sensing, What’s Invasive, where people locate invasive plants using their iPhone or Android. This could be, for instance, in a national park, where both employees and trekkers would be able to snap geo-coded photos (through GPS, although the photos do not strictly need to be geo-coded; they can be annotated later through a website). There’s a strong overlap with citizen science here.

Estrin also briefly describes an interesting application which traces your own path through a city over days, weeks or years and mashes up the spatial information with data on air quality. Air quality varies in different locations in a city and over time, but with this application you can get a pretty good approximation of the pollution you tend to get exposed to. This may prompt a change in your regular bike route, for instance. (Bonus link: The Beijing air quality Twitter feed)

Also, H+ has an interview with Pattie Maes, who delivered the stunning Sixth Sense TED talk, where she tried to show what it could be like to have a “sixth sense for data”, as she put it.

Body computing, preventive, predictive and social medicine

There have been many interesting articles and blog posts about the future of medicine, and specifically about the need to automatically monitor various physiological parameters, and, importantly, to start focusing more on health rather than disease; prevention rather than curing. The latter point has been stressed by Adam Bosworth, the former head of Google Health, in interviews like this one (audio) and this one (video, “The Body 2.0”). Bosworth is one of the founders of a company, Keas, that wants to help people understand their health data, set health goals and pursue them. He has a new blog post where he talks about machine learning in the context of health care. He (probably rightly) sees health care as lagging behind in adoption of predictive analytics. But he thinks this will change:

All the systems emerging to help consumers get personalized advice and information about their health are going to be incredible treasure troves of data about what works. And this will be a virtuous cycle. As the systems learn, they will encourage consumers to increasingly flow data into them for better more personalized advice and encourage physicians to do the same and then this data will help these systems to learn even more rapidly. I predict now that within a decade, no practicing physician will consider treating their patients without the support/advice of the expertise embodied in the machine learning that will have taken place. And finally, we will truly move to an evidence based health care system.

Along similar lines, the Broader Perspective blog writes about the “three tiers of medicine” that may make up the future healthcare system. The first tier consists of automated health monitoring tools that collect information about your health. The second tier is about preventive medicine and involves “health coaches”, who “…incorporate genomic data, together with family history and current phenotype and biomarker data into an overall care plan”. Finally, the third tier is the traditional health care system of today (hospitals, doctors, nurses).

I learned a new term for the enabling technology for the first (data-collection) tier: body computing. The Third Body Computing Conference will be hosted by the University of Southern California on Friday (9 October). The conference’s definition of body computing is that

“Body Computing” refers to an implanted wireless device, which can transmit up-to-the-second physiologic data to physicians, patients, and patients’ loved ones.

A new article about the future of health care in Fast Company also talks about body computing and predictive/preventive health care:

Wireless monitoring and communication devices are becoming a part of our everyday lives. Integrated into our daily activities, these devices unobtrusively collect information for us. For example, instead of doing an annual health checkup (i.e. cardiac risk assessment), near real-time health data access can be used to provide rolling assessments and alert patients of changes to their health risk based on biometrics assessment and monitoring (blood pressure, weight, sleep etc). With predictive health analytics, health information intelligence, and data visualization, major risks or abnormalities can be detected and sent to the doctor, possibly preempting complications such as stroke, heart attack, or kidney disease.

Although the article is named The Future of Health Care Is Social, it actually talks mostly about self-tracking and predictive analytics. It does go into social aspects of future healthcare, like online health/disease-related networks such as PatientsLikeMe or CureTogether. All in all, a nice article.

And finally (if anyone is still awake), it has been widely reported that IBM has joined the sequencing fray and is trying to develop a nanopore-based system, a “DNA transistor”, for cheap sequencing. There are now several players in this area (for example, Oxford Nanopore, Pacific Biosciences, NABSYS) and some of them are bound to lose out – time will tell who will emerge on top. Anyway, the reason I mentioned this is partly that IBM explicitly connected this announcement to healthcare reform and personalized healthcare (IBM CEO also wants to resequence the health-care system) and partly because of the surprising (to me) fact that “[...] IBM also manages the entire health system for Denmark.” Really?

By the way, a good way to get updates on body computing is to follow Dr Leslie Saxon on Twitter.

Good issue of H+

I’ve been casually following h+, the latest magazine by R U Sirius, the man behind Mondo 2000 and many subsequent publications, but until the latest issue (#4) I’ve always felt it’s been a bit too … let’s say transhumanist for my tastes. This issue, though, is quite nice. There are articles about open source medicine, how it feels to have a new sense (perfect sense of direction, in the form of a device called Northpaw), augmented reality on cell phones (Layar etc.), eliminating suffering through brain science and genetic engineering, and much more.

Perhaps the most interesting article from the point of view of this blog is called Open prediction – How sports fans can help save the world. It’s about On the Record Sports, a sports prediction site where sports fans make open predictions, visible to all. The idea is that by aggregating predictions from lots of users (you could call it crowdsourced predictions), you are likely to get a better prediction than if you had asked a single, though knowledgeable, person. However, since the predictions are all open, you can also track your own (and others’) prediction performance and relate it to everyone else’s.

The fact that large groups of people tend to do well when their guesses are combined is well-known and has been discussed at length in, for example, Ian Ayres’ book Super Crunchers. Even aggregations of different prediction algorithms (such as in the machine learning techniques bagging and boosting) usually work well – as evidenced by the recently completed Netflix competition – presumably because algorithms (like people) make prediction errors due to slightly different biases, which can be smoothed out in a combined approach.
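The averaging effect is easy to simulate. The sketch below rests on assumptions of my own: each guesser’s error is drawn independently from the same zero-centred normal distribution, and the true value, the noise level and the crowd size are arbitrary. It has nothing to do with how On the Record Sports actually aggregates predictions.

```python
import random
import statistics

rng = random.Random(42)

truth = 100.0          # the quantity being guessed (arbitrary)
crowd_size = 500       # number of independent guessers (arbitrary)
noise = 20.0           # spread of an individual guess (arbitrary)

# Each guess is the truth plus that person's own independent error.
guesses = [truth + rng.gauss(0, noise) for _ in range(crowd_size)]

# How far off is a typical individual, and how far off is the average guess?
individual_errors = [abs(guess - truth) for guess in guesses]
typical_error = statistics.mean(individual_errors)
crowd_error = abs(statistics.mean(guesses) - truth)

print("mean individual error:", round(typical_error, 2))
print("error of the averaged guess:", round(crowd_error, 2))
```

The averaged guess can never be further off than the mean individual error, and with independent errors it is usually much closer. The caveat is the independence assumption: a bias shared by everyone in the crowd does not average out.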

On The Record Sports may therefore soon be sitting on a very interesting database of aggregated predictions. (At what point will they just go to the bookmaker and bet the farm using the crowdsourced predictions?) The h+ article mentions that the company has a pending patent related to the notion of prediction as entertainment, which sounds intriguing. Of course, prediction can be a sport in itself, as the article points out.

All in all, worth checking out.
