Follow the Data

A data driven blog

Archive for the category “Events”

Hacking open government data

I spent last weekend with my talented colleagues Robin Andéer and Johan Dahlberg participating in the Hack For Sweden hackathon in Stockholm, where the idea is to find the most clever ways to make use of open data from government agencies. Several government entities were actively supporting and participating in this well-organized though perhaps slightly unfortunately named event (I got a few chuckles from acquaintances when I mentioned my participation).

Our idea was to use data from Kolada, a database containing more than 2000 KPIs (key performance indicators) for different aspects of life in the 290 Swedish municipalities (think “towns” or “cities”, although the correspondence is not exactly 1-to-1), to get a “birds-eye view” of how similar or different the municipalities/towns are in general. Kolada has an API that allows piecemeal retrieval of these KPIs, so we started by essentially scraping the database (a bulk download option would have been nice!) to get a table of 2,303 times 290 data points, which we then wanted to be able to visualize and explore in an interactive way.

One of the points behind this app is that it is quite hard to wrap your head around the large number of performance indicators, which might be a considerable mental barrier for someone trying to do statistical analysis on Swedish municipalities. We hoped to create a “jumping-board” where you can quickly get a sense of what is distinctive for each municipality and which variables might be of interest, after which a user would be able to go deeper into a certain direction of analysis.

We ended up using the Bokeh library for Python to make a visualization where the user can select municipalities and drill down a little bit to the underlying data, and Robin and Johan cobbled together a web interface (available at http://www.kommunvis.org).  We plotted the municipalities using principal component analysis (PCA) projections after having tried and discarded alternatives like MDS and t-SNE. When the user selects a town in the PCA plot, the web interface displays its most distinctive (i.e. least typical) characteristics. It’s also possible to select two towns and get a list of the KPIs that differ the most between the two towns (based on ranks across all towns). Note that all of the KPIs are named and described in Swedish, which may make the whole thing rather pointless for non-Swedish users.
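The rank-based comparison of two towns described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code behind Kommunvis: the KPI names, municipality names and values are made-up toy data, the helper names `ranks` and `most_different_kpis` are hypothetical, and ties are broken alphabetically for simplicity. The PCA projection itself is not covered here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: KPI name -> {municipality: value}.
# These are NOT real Kolada figures.
KPIS: Dict[str, Dict[str, float]] = {
    "kpi_a": {"Alpha": 10.0, "Beta": 20.0, "Gamma": 30.0, "Delta": 40.0},
    "kpi_b": {"Alpha": 5.0, "Beta": 4.0, "Gamma": 3.0, "Delta": 6.0},
    "kpi_c": {"Alpha": 1.0, "Beta": 9.0, "Gamma": 2.0, "Delta": 3.0},
}


def ranks(values: Dict[str, float]) -> Dict[str, int]:
    """Rank municipalities by value (1 = lowest); ties are broken by name."""
    ordered = sorted(values, key=lambda town: (values[town], town))
    return {town: position for position, town in enumerate(ordered, start=1)}


def most_different_kpis(
    kpis: Dict[str, Dict[str, float]], town_a: str, town_b: str, top: int = 3
) -> List[Tuple[str, int]]:
    """Return the KPIs whose ranks differ the most between two towns."""
    diffs = []
    for name, values in kpis.items():
        # Skip KPIs that are missing for either town.
        if town_a not in values or town_b not in values:
            continue
        r = ranks(values)
        diffs.append((name, abs(r[town_a] - r[town_b])))
    # Largest rank difference first; KPI name as a stable tie-breaker.
    diffs.sort(key=lambda item: (-item[1], item[0]))
    return diffs[:top]


print(most_different_kpis(KPIS, "Alpha", "Beta"))
```

Working on ranks rather than raw values means that KPIs measured in very different units (percentages, kronor, counts) can be compared on the same footing.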

The code is on GitHub and the current incarnation of the app is at Kommunvis.

Perhaps unsurprisingly, there were lots of cool projects on display at Hack for Sweden. The overall winners were the Ge0Hack3rs team, who built a striking 3D visualization of different parameters for Stockholm (e.g. the density of companies, restaurants etc.) as an aid for urban planners and visitors. A straightforward but useful service which I liked was Cykelranking, built by the Sweco Position team, an index for how well each municipality is doing in terms of providing opportunities for bicycling, including detailed info on bicycle paths and accident-prone locations.

This was the third time that the yearly Hack for Sweden event was held, and I think the organization was top-notch: large, spacious locations with a seemingly infinite supply of coffee, food, and snacks, as well as helpful government agency data specialists in green T-shirts whom you were able to consult with questions. We definitely hope to be back next year with fresh new ideas.

This was more or less a 24-hour hackathon (Saturday morning to Sunday morning), although our team certainly used less time (we all went home to sleep on Saturday evening). A lot of the apps built were quite impressive, so I asked some other teams how much they had prepared in advance. All of them claimed not to have prepared anything, but I suspect most teams did as ours did (and for which I am grateful): prepared a little dummy/bare-bones application just to make sure they wouldn’t get stuck in configuration, registering accounts etc. on the competition day. I think it’s a good thing in general to require (as this hackathon did) that the competitors state clearly in advance what they intend to do, and to prod them a little bit to prepare in advance so that they can really focus on building functionality on the day(s) of the hackathon instead of fumbling around with installation.

Quick notes

  • I’ve found the Data Skeptic to be a nice podcast about data science and related subjects. For example, the “data myths” episode and the one with Matthew Russell (who wrote Mining the Social Web) are fun.
  • When I was in China last month, the seat pocket in front of me in the cab we took from the Beijing airport had a glossy magazine in it. The first feature article was about big data (大数据) analysis applied to Chinese TV series and movies, Netflix-style. Gotta beat those Korean dramas! One of the hotels we stayed in Beijing had organized an international conference on big data analytics the day before we arrived at the hotel. The signs and posters were still there. Anecdotes, not data, but still.
  • November was a good meetup month in Stockholm. The Machine Learning group had another good event at Spotify HQ, with interesting presentations from Watty, both about how to “data bootstrap” a startup when you discover that the existing data you’ve acquired is garbage and you need to start generating your own in a hurry, and about the actual nitty-gritty details of their algorithms (which model and predict energy consumption from different devices in households by deconvoluting a composite signal). There was also a talk about embodied cognition and robotics by Jorge Davila-Chacon (slides here). Also, in an effort to revive the Stockholm Big Data group, I co-organized (together with Stefan Avestad from Ericsson) a meetup with Paco Nathan on Spark. The slides for the talk, which was excellent and extremely appreciated by the audience, can be found here. Paco also gave a great workshop the next day on how to actually use Spark. Finally, I’ve joined the organizing committee of SRUG, the Stockholm R useR group, and have started to plan some future meetups there. The next one will be on December 9 and will deal with how Swedish governmental organizations use R.
  • Erik Bernhardsson of Spotify has written a fascinating blog post combining two of my favorite subjects: chess and deep learning. He has trained a 3-layer deep and 2048-unit wide network on 100 million games from FICS (the Free Internet Chess Server, where I, incidentally, play quite often). I’ve often thought about why it seems to be so hard to build a chess engine that really learns the game from scratch, using actual machine learning, rather than the rule- and heuristic-based programs that have ruled the roost, and which have been pre-loaded with massive opening libraries and endgame tablebases (giving the optimal move in any position with at most N pieces; I think that N is currently about 7). It would be much cooler to have a system that just learns implicitly how to play and does not rely on knowledge. Well, Erik seems to have achieved that, kind of. The cool thing is that this program does not need to be told explicitly how the pieces move; it can infer it from data. Since the system is using amateur games, it sensibly enough does not care about the outcome of each game (that would be a weak label for learning). I do think that Erik is a bit optimistic when he writes that “Still, even an amateur player probably makes near-optimal moves for most time.” Most people who have analyzed their own games, or online games, with a strong engine know that amateur games are just riddled with blunders. (I remember the old Max Euwe book “Chess master vs chess amateur”, which also demonstrated this convincingly … but I digress.) Still, a very impressive demonstration! I once supervised a master’s thesis where the aim was to teach a neural network to play some specific endgames, and even that was a challenge. As Erik notes in his blog post, his system needs to be tried against a “real” chess engine. It is reported to score around 33% against Sunfish, but that is a fairly weak engine, as I found out by playing it half an hour ago.

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first by Pawel Herman, who gave a very good general introduction to the roots, history, present and future of deep learning, and a more applied talk by Josephine Sullivan, where she showed some impressive results obtained by her group in image recognition as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as a set of more and more abstract visual categories in the case of picture classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.
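To make the idea of DNA “words” concrete, here is a minimal sketch of how a sequence can be broken into overlapping k-mers and counted, which is one plausible way of preparing input for a learning system. The function names `kmers` and `kmer_counts` and the example sequence are made up for illustration, and skipping windows that contain ambiguous bases such as N is a simplifying assumption.

```python
from collections import Counter
from typing import Iterator

VALID_BASES = set("ACGT")


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer in the sequence, left to right."""
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(word) <= VALID_BASES:
            yield word


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


# Toy example sequence, not real data.
print(kmer_counts("ACGTACGT", 3))
```

A sequence of length L yields at most L - k + 1 k-mers, so millions of short reads quickly turn into the billions of “words” mentioned above.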

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Online analysis contests and animal testing

I’d like to draw your attention to two online data analysis challenges that both, in their way, address drug testing on animals and how results of such testing translate to human physiology.

CAMDA 2013 (12th international conference on critical assessment of massive data analysis) is a conference that focuses on massive data sets in the life sciences. This year, it has two associated analysis challenges, one of which is “prediction of drug compatibility from an extremely large toxicogenomic data set.” The data set used in this challenge contains over 20,000 genome expression microarrays, each measuring perhaps about 20,000 genes in the liver of rats treated with mainly human drugs. There are two questions that the organizers want to address:

  • Question 1: Can we replace animal studies with in vitro assays? [“in vitro” literally means “in glass”, for instance in a test tube]
  • Question 2: Can we predict liver injury in humans using toxicogenomics data from animals?

Meanwhile, the SBV (systems biology verification) Improver project, which ran a prediction contest last year that was covered in this blog, is starting its new Species Translation Challenge,  which also aims to address how “translatable” biological events in rats or mice are to humans. This challenge, which has four sub-challenges, aims to answer the following questions:

  • Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species?
  • Which biological pathways, functions and gene expression profiles are most robustly translated?
  • Which gene expression profiles and associated biological pathways / functions are most robustly translated?
  • Does translation depend on the nature of the stimulus or data type collected such as protein phosphorylation and cytokine responses?
  • Which computational methods are most effective for inferring gene, phosphorylation and pathway response from one species to another?

I think it will be very interesting to see how these challenges play out and to compare their respective outcomes.

BigData.SG and The human face of big data

By an amazing coincidence, I was able to attend a session of the Singapore big data meetup group, BigData.SG, after having attended the NGS Asia 2012 conference here in the Lion City. This group was started earlier this year and tries to meet once a month (a more ambitious schedule than the Stockholm group). Today, about 40 people were in attendance, and I had a nice time chatting to some of them. The invited speaker was Michael Howard, VP of marketing at Greenplum. He had one nice quip – “big data means so little to so many” – and talked a little bit about Chorus, a collaborative data science platform from Greenplum which I hadn’t heard about. He hinted that Chorus and Kaggle have something big going on together – something that will revolutionize the whole crowdsourced prediction “business.” It will be interesting to see what it is.

Earlier today, Howard had announced the Human Face of Big Data project, which has been / will be launched in several cities all over the world today (probably still hasn’t launched in the US). The project, which “lets people compare themselves to each other”, uses a downloadable app (for Android; the iOS version wasn’t working yet) that you can use to collect data about yourself. There is “passive data collection”: how far and at what speed you’ve moved, how many Bluetooth hot spots you’ve passed, and so on, and active collection through questions that the app asks you; either “serious” questions such as whether you would modify the genes of your unborn infant if given the opportunity (and if so, what would you improve – immune system, intelligence, …) – apparently men and women answered this very differently – or more open-ended “fantasy” questions.

The app also lets you find your “data doppelganger”, which is of course the user who is most similar to you in terms of the collected data. Howard said that despite the short time since the launch, the app has already yielded interesting information about gender differences and topics of interest.
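How the app actually computes a “data doppelganger” was not described, but the idea of finding the most similar user can be sketched as a nearest-neighbour search. Everything below is an assumption for illustration: the user names, the three made-up features (say, distance moved, Bluetooth hot spots passed and questions answered) and the choice of plain Euclidean distance.

```python
import math
from typing import Dict, List

# Hypothetical users with made-up feature vectors, e.g.
# [km moved, Bluetooth hot spots passed, questions answered].
USERS: Dict[str, List[float]] = {
    "you": [5.0, 12.0, 3.0],
    "anna": [4.5, 11.0, 3.0],
    "bo": [1.0, 40.0, 9.0],
    "cy": [5.0, 30.0, 2.0],
}


def doppelganger(users: Dict[str, List[float]], name: str) -> str:
    """Return the other user whose feature vector is closest to `name`'s."""
    target = users[name]
    others = (other for other in users if other != name)
    # Euclidean distance; math.dist requires Python 3.8 or later.
    return min(others, key=lambda other: math.dist(target, users[other]))


print(doppelganger(USERS, "you"))
```

In practice the features would probably need to be put on a common scale first, since otherwise the one with the largest numeric range dominates the distance.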

Stockholm Big Data Meetup

The first meetup of the Stockholm Big Data group was organized yesterday (Sep 6 2012) by Mikael Hussain at the Klarna headquarters. The venue was packed, with close to 100 people attending and others unfortunately left out (due to fire regulations). Apparently a lot of people (including us) had been thirsting for this sort of event.

The format was 1.5h of rapid talks (supposed to be 10 min each but probably a bit longer in practice) on widely different topics – we will refer to Marina Santini’s excellent writeup for details on the talks – followed by socializing in the pub around the corner. Follow the Data was represented by me (Mikael) as I gave a short talk about the benefits of competing in (and organizing) online prediction contests.

During the course of the event, I learned about three companies that I didn’t know about and who are all actively looking for analytics and big data talent:

  • Campanja – online advertising, heavily into Erlang and AI. Looking to fill several positions of different kinds
  • Svensk Lånemarknad (~Swedish Loan Exchange?) – help customers find the best banks and loans for them – looking to fill a predictive analytics position
  • Tink – not quite sure what they are doing (the home page is a bit cryptic) – looking for developers

I’m sure there were other companies as well looking to recruit – I only had time to talk to a small fraction of the participants, obviously!

All in all, I think the meetup was a lot of fun and I am looking forward to more meetups in Stockholm soon.

Meetup groups for Big Data & Predictive Modeling and Quantified Self in Stockholm

Two interesting new meetup groups have formed in Stockholm (well, there are other interesting ones, but for the purposes of this blog these two are the most exciting):

  • Big Data & Predictive Modeling
  • Quantified Self

Fun!

Health Hack Day ’12: Day 1 impressions

So as mentioned in the previous post, Health Hack Day ’12 in Stockholm is underway right now; it started with a number of lectures and a party yesterday and the actual hacking will start today, with the winning apps to be presented tomorrow. You can follow the #hhd12 hashtag on Twitter or go to the link above to see the recorded lectures.

I thought the arrangements and speaker line-up yesterday were surprisingly good, which bodes well for the survival of the Health Hack Day concept; in fact, I’m sure they will be back next year. The lectures (which were recorded and can be viewed online at the link above) were given in a smallish space (part of a fin de siècle apartment complex now used as an office hotel for creative types, located near Stureplan in central Stockholm) decorated with thousands of yellow strips of paper hanging down from the ceiling – a nice-looking installation which also provided some relief from the heat in the room when the wind occasionally blew in through the window and turned the paper strips into a giant ceiling fan. Meanwhile, visitors could sip some excellent free coffee (from Stockholm roast).

Hoa Ly is a young, enterprising fellow who works for Psykologifabriken (“The Psychology Factory”) and his own sister company Hoa’s Tool Shop (both of these companies were involved in arranging the event), as well as doing clinical psychology research at Linköping university and being a successful DJ. He talked about behavior change through digital tools, exemplifying with the Viary mobile & web app which has been used successfully for depression treatment but, as I understand it, is quite general in nature so you could track any kind of behavior & goals (incidentally, the statistics interface looks a lot like the WordPress interface where I look at access statistics for this blog!) Hoa also talked about correlating data from different sources like Viary, the Zeo sleep tracker and exercise data from heiaheia.com. Integrating data from different sources is of course very interesting but I didn’t feel we quite got any really solid concrete examples here, just a general sense that it should be useful. Anyway. The most intriguing part of Hoa’s talk was when he described the launch of a new project to “disrupt the whole dance music industry” (or words to that effect). The idea is to treat DJ performances as scientific experiments and “gather data from the audience”, for instance by measuring adrenaline levels in response to song selections. Hoa and his partners have created a new  country called Yamarill (link in Swedish) to construct a narrative around which this project will be built. The inauguration of the new country will apparently be celebrated on June 1 at the Hoa’s Tool Shop office spaces. The Yamarill “delegation” has already played several DJ gigs “combining electronic dance music, technology and psychology” as they say in the linked interview (I might also add “quirky clothes”).

Pernilla Rydmark from .SE talked about different forms of crowdfunding and presented five Swedish platforms for it. .SE is also introducing an interesting form of funding called “guaranteed funding” where they pick projects that are already popular on crowdfunding platforms and promise to fund them up to their stated goal in case they don’t succeed in reaching it through the crowdfunding platform. Thus, the goal of the funding is rather paradoxically that no one should get it (because .SE is hoping that the projects will get fully funded by the crowd.)

Bill Day from RunKeeper talked about the need for an open, global health platform and presented HealthGraph, a free platform with tens of millions of users initiated by the RunKeeper team but which is expanding far beyond that community.

Mathias Karlsson from Calmark presented his company’s approach to rapid blood biomarker testing, which is making consumable platforms for colorimetric assays (the measurement of interest is transformed into a color) which can be analyzed on the spot using, for example, a smartphone camera. He brought a developer team who will attempt to build a new test (for bilirubin) into the platform in 24 hours during the hackathon part of the event.

Linus Bengtsson from FlowMinder described intriguing reality mining (or in less spectacular terms, call log analysis) work where data from mobile phone providers was used to track the movements of people during and after the Haiti earthquake, and the subsequent cholera outbreak. Linus and his team tracked 1.9 million SIM cards from Port-Au-Prince residents to obtain their estimates on migration patterns. FlowMinder is a non-profit and provides free analysis of the same kind during any kind of global disaster (in collaboration with mobile telephony providers, naturally.)

Sara Eriksson and Johan Nilsson from United Minds talked about the “new health”, including a lot of topics that have been frequently mentioned on this blog, like 23andme, PatientsLikeMe, and even the MinION sequencer from Oxford Nanopore. I had heard / thought about most of it before but what I took away from it was the concept of “biosociality” as coined by Paul Rabinow, and also that only 37% of surveyed Stockholm smart phone users did *not* want to collect data on themselves through the phone; a whopping 59% wanted not only to collect the data but to analyze it themselves.

Megan Miller from Bonnier (a Swedish media company which has an enormous influence in the media here; however, Megan was working for its US branch) described Teemo, a platform for “digital wellness”, with components of collaborative adventuring and social exercise (you try to accomplish “quests” together with your friends by exercising). Teemo looks like it has a pretty nifty design, inspired by paper cuts and Nordic (=Helsinki?) design style. As Megan put it, Teemo wants to “put fun first and track behavior in the background.”

We will see whether Follow the Data has the energy to visit again tomorrow and see what apps have come out of the hackathon, which should be starting in a few hours from now!
