Follow the Data

A data driven blog

Is it unusually cold today?

The frequently miserable Swedish weather often makes me think “Is it just me, or is it unusually cold today?” Occasionally, it’s the reverse scenario – “Hmm, seems weirdly warm for April 1st – I wonder what the typical temperature this time of year is?” So I made myself a little Shiny app which is now hosted here. I realize it’s not so interesting for people who don’t live in Stockholm, but then again I have many readers who do … and it would be dead simple to create the same app for another Swedish location, and probably many other locations as well.

The app uses three different data sources, all from the Swedish Meteorological and Hydrological Institute (SMHI). The estimate of the current temperature is taken from the "latest hour" data for Stockholm-Bromma (query). For the historical temperature data, I use two sources with different granularity: one data set that goes back to 1756 and contains daily averages, and another that only goes back to 1961 but has temperatures at 06:00 (6 am), 12:00 (noon) and 18:00 (6 pm). The latter makes it easier to compare against the current temperature, at least if you happen to check close to one of those times.
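
The app itself is written in R/Shiny, but just to illustrate the shape of the "latest hour" lookup, here is a minimal Python sketch against SMHI's open data API. The station id (97200 for Stockholm-Bromma) and the exact JSON layout are my assumptions based on SMHI's documented URL pattern, so verify them against SMHI's station list before relying on this.

import requests

SMHI_URL = ("https://opendata-download-metobs.smhi.se/api/version/latest/"
            "parameter/1/station/{station}/period/latest-hour/data.json")

def latest_temperature(station=97200):
    """Fetch the most recent hourly air temperature reading for a station."""
    resp = requests.get(SMHI_URL.format(station=station), timeout=10)
    resp.raise_for_status()
    # The "value" field holds a list of observations; take the latest one.
    obs = resp.json()["value"][-1]
    return float(obs["value"])

print("Stockholm-Bromma, latest hour: %.1f degrees C" % latest_temperature())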

Hacking open government data

I spent last weekend with my talented colleagues Robin Andéer and Johan Dahlberg participating in the Hack for Sweden hackathon in Stockholm, where the idea is to find the most clever ways to make use of open data from government agencies. Several government entities were actively supporting and participating in this well-organized, though perhaps slightly unfortunately named, event (I got a few chuckles from acquaintances when I mentioned my participation).

Our idea was to use data from Kolada, a database containing more than 2,000 KPIs (key performance indicators) covering different aspects of life in the 290 Swedish municipalities (think "towns" or "cities", although the correspondence is not exactly 1-to-1), to get a bird's-eye view of how similar or different the municipalities are in general. Kolada has an API that allows piecemeal retrieval of these KPIs, so we started by essentially scraping the database (a bulk download option would have been nice!) to get a table of 2,303 × 290 data points, which we then wanted to visualize and explore interactively.
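
As a rough sketch of that piecemeal retrieval, something like the following pulls one KPI for all municipalities; looping over all KPI ids then builds the full table. The response layout assumed here (a "values" list with one entry per municipality, each holding gendered measurements where "T" is the total) is my recollection of Kolada's v2 API and should be double-checked against its documentation.

import requests

BASE = "http://api.kolada.se/v2"

def kpi_values(kpi_id, year=2015):
    """Fetch one KPI for all municipalities for a given year."""
    resp = requests.get("%s/data/kpi/%s/year/%d" % (BASE, kpi_id, year),
                        timeout=30)
    resp.raise_for_status()
    rows = {}
    for entry in resp.json().get("values", []):
        # Each entry carries a municipality id and a list of measurements;
        # take the total (gender == "T") value where one is reported.
        for v in entry.get("values", []):
            if v.get("gender") == "T" and v.get("value") is not None:
                rows[entry["municipality"]] = v["value"]
    return rows

# Building the full table is then a loop over all KPI ids, e.g.
# table = pandas.DataFrame({k: kpi_values(k) for k in kpi_ids})
# which gives municipalities as rows and KPIs as columns.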

One of the points behind the app is that it is quite hard to wrap your head around such a large number of performance indicators, which can be a considerable mental barrier for anyone trying to do statistical analysis on Swedish municipalities. We hoped to create a jumping-off point where you can quickly get a sense of what is distinctive for each municipality and which variables might be of interest, after which you can go deeper into a particular direction of analysis.

We ended up using the Bokeh library for Python to build a visualization where the user can select municipalities and drill down a little into the underlying data, and Robin and Johan cobbled together a web interface (available at http://www.kommunvis.org). We plotted the municipalities using principal component analysis (PCA) projections, after having tried and discarded alternatives like MDS and t-SNE. When the user selects a town in the PCA plot, the web interface displays its most distinctive (i.e. least typical) characteristics. It's also possible to select two towns and get a list of the KPIs that differ the most between them (based on ranks across all towns). Note that all of the KPIs are named and described in Swedish, which may make the whole thing rather pointless for non-Swedish users.
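
In outline, the two analysis steps look something like this. This is a sketch rather than our exact code: it assumes the scraped table is a pandas DataFrame with municipalities as rows and KPIs as columns, and uses scikit-learn's PCA.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_coords(table):
    """Project the municipality-by-KPI table onto its first two PCs."""
    # KPIs live on wildly different scales, so standardize each column;
    # missing values are imputed with the column mean to keep things simple.
    X = StandardScaler().fit_transform(table.fillna(table.mean()))
    coords = PCA(n_components=2).fit_transform(X)
    return pd.DataFrame(coords, index=table.index, columns=["PC1", "PC2"])

def most_distinctive(table, town, n=10):
    """Rank-based 'least typical' KPIs for one municipality."""
    # Rank every KPI across all municipalities; the KPIs where a town sits
    # furthest from the median rank are its least typical characteristics.
    ranks = table.rank(axis=0, pct=True)
    return (ranks.loc[town] - 0.5).abs().sort_values(ascending=False).head(n)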

The code is on GitHub and the current incarnation of the app is at Kommunvis.

Perhaps unsurprisingly, there were lots of cool projects on display at Hack for Sweden. The overall winners were the Ge0Hack3rs team, who built a striking 3D visualization of different parameters for Stockholm (e.g. the density of companies, restaurants and so on) as an aid for urban planners and visitors. A straightforward but useful service that I liked was Cykelranking, built by the Sweco Position team: an index of how well each municipality provides for bicycling, including detailed information on bicycle paths and accident-prone locations.

This was the third time the yearly Hack for Sweden event was held, and I thought the organization was top-notch: large, spacious premises with a seemingly infinite supply of coffee, food and snacks, as well as helpful data specialists from the government agencies, in green T-shirts, whom you could consult with questions. We definitely hope to be back next year with fresh new ideas.

This was more or less a 24-hour hackathon (Saturday morning to Sunday morning), although our team certainly used less time than that (we all went home to sleep on Saturday evening). Even so, many of the apps built were quite impressive, so I asked some other teams how much they had prepared in advance. All of them claimed not to have prepared anything, but I suspect most teams did what ours did (and for which I am grateful): prepared a little bare-bones dummy application in advance, just to make sure they wouldn't get stuck in configuration, account registration and so on on the competition day. I think it's a good idea in general to require (as this hackathon did) that competitors state clearly in advance what they intend to do, and to prod them a little to prepare beforehand, so that they can really focus on building functionality on the day(s) of the hackathon instead of fumbling around with installation.

ASCII Autoencoder

Joel and I were playing around with TensorFlow, the deep learning library that Google recently released and that you have no doubt heard of. We had put together a little autoencoder implementation and were trying to get a handle on how well it was working.

An autoencoder can be viewed as a neural network where the final layer, the output layer, is supposed to reconstruct the values that were fed into the input layer, possibly after some distortion of the inputs (like forcing a fraction of them to zero, as in dropout, or adding some random noise). When the inputs are corrupted in this way, it's called a denoising autoencoder, and the purpose of adding the noise or dropout is to make the system discover more robust statistical regularities in the input data (there is some good discussion here).

An autoencoder often has fewer nodes in the hidden layer(s) than in the input layer and is then used to learn a more compact and general representation of the data (the code or encoding). With only one hidden layer and linear activation functions, the encoding should be essentially the same as the one you get from PCA (principal component analysis), but non-linear activation functions (e.g. sigmoid and tanh) will yield different representations, and multi-layer or stacked autoencoders add a hierarchical aspect.
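
For concreteness, here is a minimal denoising autoencoder along these lines. Our code used TensorFlow's then-current low-level API; this sketch uses the later tf.keras interface instead, and the layer sizes and noise level are illustrative rather than anything we actually ran.

import numpy as np
import tensorflow as tf

n_features, n_hidden = 75, 20

inputs = tf.keras.Input(shape=(n_features,))
# Corrupt the inputs during training; a denoising autoencoder has to
# reconstruct the clean signal from the distorted version.
corrupted = tf.keras.layers.GaussianNoise(0.1)(inputs)
code = tf.keras.layers.Dense(n_hidden, activation="tanh")(corrupted)
reconstruction = tf.keras.layers.Dense(n_features, activation="linear")(code)

autoencoder = tf.keras.Model(inputs, reconstruction)
autoencoder.compile(optimizer="adam", loss="mse")

# Train the network to reproduce its own (uncorrupted) inputs.
X = np.random.rand(500, n_features).astype("float32")  # stand-in data
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)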

Some references on autoencoders:

Ballard (1987) – Modular learning in neural networks

Andrew Ng’s lecture notes on sparse autoencoders

Vincent et al. (2010) – Stacked denoising autoencoders

Tan et al. (2015) – ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions

Anyway, we were trying out some different parametrizations of the autoencoder (its training performance can depend quite a lot on how the weights are initialized, the learning rate, and the number of hidden nodes) and felt it was a bit boring to just look at a single number (the reconstruction error). We wanted to get a feel for how training progresses across the input data matrix, so we made the script output, every 1,000 rounds of training, a colored block of text in the terminal where the background color represents the binned absolute difference between the target value and the reconstructed value. The "best" bin (bin 0) is dark green and means that the reconstruction is very close to the original input; the "bad" bins have reddish colors. If a data point has shifted to a new bin in the last 1,000 rounds (i.e. the reconstruction has improved or deteriorated noticeably), a colored digit indicating the new bin is shown in the foreground. (This makes more sense when you actually look at it.) We only show the first 75 training examples and the first 75 features, so if your data set is larger than that you won't see all of it.
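
The gist of the display logic is simple; the sketch below is a simplified reconstruction, not the actual script, and the bin thresholds and ANSI color codes are illustrative (see the GitHub repo for the real thing).

import numpy as np

BIN_COLORS = [42, 43, 103, 101, 41]  # ANSI background codes: green -> red
BIN_EDGES = [0.05, 0.1, 0.2, 0.4]    # absolute-error thresholds between bins

def error_bins(target, reconstruction):
    """Assign each matrix cell to an error bin (0 = best)."""
    return np.digitize(np.abs(target - reconstruction), BIN_EDGES)

def print_block(bins, prev_bins=None, max_rows=75, max_cols=75):
    """Paint one colored block; digits mark cells that changed bin."""
    for i, row in enumerate(bins[:max_rows]):
        cells = []
        for j, b in enumerate(row[:max_cols]):
            changed = prev_bins is not None and prev_bins[i][j] != b
            char = str(b) if changed else " "
            cells.append("\033[%dm%s\033[0m" % (BIN_COLORS[b], char))
        print("".join(cells))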

The code is on GitHub. There are command-line switches for controlling the number of hidden nodes, learning rate, and other such things. There are probably many aspects that could be improved but we thought this was a fun way to visualize the progress and see if there are any regions that clearly stand out in some way.

Here are a few screenshots of an example execution of the script.

As the training progresses, the overall picture gets a bit greener (the reconstructions get closer to the input values) and the reconstructions get a bit more stable (i.e. fewer values have a digit on them to indicate that the reconstruction has improved or deteriorated). The values under each screenshot indicate the number of training cycles and the mean squared reconstruction error.
