Follow the Data

A data driven blog

Archive for the tag “meetup”

Quick notes

  • I’ve found the Data Skeptic to be a nice podcast about data science and related subjects. For example, the “data myths” episode and the one with Matthew Russell (who wrote Mining the Social Web) are fun.
  • When I was in China last month, the seat pocket in front of me in the cab we took from the Beijing airport had a glossy magazine in it. The first feature article was about big data (大数据) analysis applied to Chinese TV series and movies, Netflix-style. Gotta beat those Korean dramas! One of the hotels we stayed at in Beijing had organized an international conference on big data analytics the day before we arrived. The signs and posters were still there. Anecdotes, not data, but still.
  • November was a good meetup month in Stockholm. The Machine Learning group had another good event at Spotify HQ, with interesting presentations from Watty, both about how to “data bootstrap” a startup when you discover that the existing data you’ve acquired is garbage and need to start generating your own in a hurry, and about the actual nitty-gritty details of their algorithms (which model and predict energy consumption from different devices in households by deconvoluting a composite signal). There was also a talk about embodied cognition and robotics by Jorge Davila-Chacon (slides here). Also, in an effort to revive the Stockholm Big Data group, I co-organized (together with Stefan Avestad from Ericsson) a meetup with Paco Nathan on Spark. The slides for the talk, which was excellent and extremely appreciated by the audience, can be found here. Paco also gave a great workshop the next day on how to actually use Spark. Finally, I’ve joined the organizing committee of SRUG, the Stockholm R useR group, and have started to plan some future meetups there. The next one will be on December 9 and will deal with how Swedish governmental organizations use R.
  • Erik Bernhardsson of Spotify has written a fascinating blog post combining two of my favorite subjects: chess and deep learning. He has trained a 3-layer-deep, 2048-unit-wide network on 100 million games from FICS (the Free Internet Chess Server, where I, incidentally, play quite often). I’ve often thought about why it seems to be so hard to build a chess engine that really learns the game from scratch, using actual machine learning, rather than the rule- and heuristic-based programs that have ruled the roost, and which have been pre-loaded with massive opening libraries and endgame tablebases (giving the optimal move in any position with N or fewer pieces; I think that N is currently about 7). It would be much cooler to have a system that just learns implicitly how to play and does not rely on built-in knowledge. Well, Erik seems to have achieved that, kind of. The cool thing is that this program does not need to be told explicitly how the pieces move; it can infer it from data. Since the system is using amateur games, it sensibly enough does not care about the outcome of each game (that would be a weak label for learning). I do think that Erik is a bit optimistic when he writes that “Still, even an amateur player probably makes near-optimal moves for most time.” Most people who have analyzed their own games, or online games, with a strong engine know that amateur games are just riddled with blunders. (I remember the old Max Euwe book “Chess master vs chess amateur”, which also demonstrated this convincingly … but I digress.) Still, a very impressive demonstration! I once supervised a master’s thesis where the aim was to teach a neural network to play some specific endgames, and even that was a challenge. As Erik notes in his blog post, his system needs to be tried against a “real” chess engine. It is reported to score around 33% against Sunfish, but that is a fairly weak engine, as I found out by playing it half an hour ago.

Videos for rainy days

If, like me, you are on vacation, you might have time to watch some of these cool videos on a rainy day (unless you read books instead, as you probably should):

Video from the Europe-wide machine learning meetup on June 15. Andrew Ng’s talk is probably the highlight, but I also enjoyed Muthu Muthukrishnan’s “On Sketching” and Sam Bessalah’s “Abstract algebra for stream learning.”


Video (+ slides) from the Deep Learning meetup in Stockholm on June 9
I saw these live and they were quite good. After some preliminaries, the first presentation (by Pawel Herman) starts at around 00:06:00 in the video and the second presentation (by Josephine Sullivan) starts at around 1:18:30.


Videos from the Big Data in Biomedicine event at Stanford. Obviously I haven’t seen all of these, but the ones I have seen have been of really high quality. I particularly enjoyed the talk by Google’s David Glazer on the search behemoth’s efforts in genomics and Sandrine Dudoit on the role of statisticians in data science (where she echoes to some extent Terry Speed’s pessimistic views), but I think all of the talks are worth watching.

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadley’s PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis, such as reshaping data into a “tidy” form that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also made the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” which is designed for responsive dynamic graphics and streaming visualization, and is meant for the web. This looked really nice. For example, you can easily add sliders to a plot so you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R markdown documents with knitr so that the resulting report can contain sliders and other things. This is obviously not possible with PDFs though. ggvis will be released on CRAN “in a week or so”.
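To make the “tidy” reshaping idea a bit more concrete, here is a minimal sketch in plain Python. It is not the tidyr API: the function names merely echo gather and spread, and the table, column names and values are made up for illustration.

```python
# Illustrative sketch only: plain-Python stand-ins for the reshaping idea,
# not the tidyr API. All names and values below are hypothetical.

def gather(rows, id_col, key, value):
    """Wide -> long: one output row per (id, measured column) pair."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append({id_col: row[id_col], key: col, value: val})
    return long_rows


def spread(rows, id_col, key, value):
    """Long -> wide: one output row per id, one column per distinct key."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[key]] = row[value]
    return list(wide.values())


wide_table = [
    {"patient": "a", "week1": 5.1, "week2": 4.8},
    {"patient": "b", "week1": 6.0, "week2": 6.3},
]
long_table = gather(wide_table, "patient", "week", "score")
round_trip = spread(long_table, "patient", "week", "score")
```

The long table has one row per patient and week, which is the shape that grouping and plotting tools tend to prefer; spreading it again gives back the original wide table.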

Hadley also highlighted the magrittr package, which implements a pipe operator for R (Magritte/pipe … get it? (groan)). The pipe looks like %>% and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.
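The readability argument is easier to appreciate with an example. The sketch below imitates the piping style in plain Python; it is neither magrittr nor dplyr, and the helper names (pipe, keep_if, add_column, sort_by) and the flat-price data are invented for the illustration.

```python
from functools import reduce

# Illustrative sketch only: a left-to-right "pipe" over small helper functions.
# None of these names come from magrittr or dplyr.

def pipe(value, *steps):
    """Feed `value` through each step in order, like x %>% f %>% g."""
    return reduce(lambda acc, step: step(acc), steps, value)


def keep_if(predicate):
    """Row filter, in the spirit of a 'filter' verb."""
    return lambda rows: [r for r in rows if predicate(r)]


def add_column(name, fn):
    """Derived column, in the spirit of a 'mutate' verb."""
    return lambda rows: [{**r, name: fn(r)} for r in rows]


def sort_by(name):
    """Row ordering, in the spirit of an 'arrange' verb."""
    return lambda rows: sorted(rows, key=lambda r: r[name])


flats = [
    {"id": 1, "price": 3_000_000, "sqm": 50},
    {"id": 2, "price": 2_400_000, "sqm": 30},
    {"id": 3, "price": 5_000_000, "sqm": 100},
]

result = pipe(
    flats,
    keep_if(lambda r: r["sqm"] >= 40),
    add_column("price_per_sqm", lambda r: r["price"] / r["sqm"]),
    sort_by("price_per_sqm"),
)
```

The steps read top to bottom in the order they are applied, instead of inside out as nested function calls would.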

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary).
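The “more points than pixels” argument boils down to summarizing before plotting. As a rough sketch (not how bigvis itself is implemented, and with a made-up bin count and random data), one could bin the values into about as many bins as there are horizontal pixels and plot the counts instead of the raw points:

```python
import random

# Illustrative sketch only: fixed-width binning as a stand-in for
# "summarize before plotting". The 200-bin width is an arbitrary assumption.

def bin_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # The right edge (v == hi) is folded into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


random.seed(0)
points = [random.random() for _ in range(100_000)]
counts = bin_counts(points, 0.0, 1.0, 200)  # 100,000 points -> 200 numbers
```

However many points there are, only the 200 bin counts need to be drawn.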

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first by Pawel Herman, who gave a very good general introduction to the roots, history, present and future of deep learning, and the second a more applied talk by Josephine Sullivan, where she showed some impressive results obtained by her group in image recognition as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as a set of increasingly abstract visual categories in the case of picture classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.
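For readers who have not worked with k-mers: they are simply all the overlapping substrings of length k in a sequence. A minimal sketch of how one might turn a read into k-mer counts, as a possible input representation, is shown below; the example sequence and the choice of k = 3 are arbitrary.

```python
from collections import Counter

# Illustrative sketch only: count overlapping k-mers in a DNA string.
# The sequence and k below are arbitrary examples.

def kmer_counts(sequence, k):
    """Return a Counter of all overlapping substrings of length k."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = kmer_counts("ACGTACGT", 3)
```

A sequence of length L yields L - k + 1 k-mers, so the eight-letter example gives six 3-mers, some of them repeated.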

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Machine learning goings-on in Stockholm

The predictive analytics scene in Stockholm hasn’t been very vibrant, but at least we now have the Machine Learning Stockholm meetup group, which had its inaugural session hosted at Spotify on February 25 under the heading “Graph-parallel machine learning”. There was a short introduction to graph-centric computing paradigms and hands-on demos of GraphLab and Giraph.

The Stockholm R useR group has hosted good analytics-themed meetups from time to time. On Saturday (March 29), they will organize a hackathon with two tracks: one predictive modelling contest about predicting flat prices (always a timely theme in this town) and one “R for beginners” track.

Finally, the Stockholm-based Watty are looking for a head of machine learning development (or maybe several machine learning engineers; see ad) to lead a team that will diagnose the energy use of buildings and work to minimize energy waste.

BigData.SG and The human face of big data

By an amazing coincidence, I was able to attend a session of the Singapore big data meetup group, BigData.SG, after having attended the NGS Asia 2012 conference here in the Lion City. This group was started earlier this year and tries to meet once a month (a more ambitious schedule than the Stockholm group). Today, about 40 people were in attendance, and I had a nice time chatting to some of them. The invited speaker was Michael Howard, VP of marketing at Greenplum. He had one nice quip – “big data means so little to so many” – and talked a little bit about Chorus, a collaborative data science platform from Greenplum which I hadn’t heard about. He hinted that Chorus and Kaggle have something big going on together – something that will revolutionize the whole crowdsourced prediction “business.” It will be interesting to see what it is.
Earlier today, Howard had announced the Human Face of Big Data project, which has been / will be launched in several cities all over the world today (probably still hasn’t launched in the US). The project, which “lets people compare themselves to each other”, uses a downloadable app (for Android; the iOS version wasn’t working yet) that you can use to collect data about yourself. There is “passive data collection”: how far and at what speed you’ve moved, how many Bluetooth hot spots you’ve passed, and so on, and active collection through questions that the app asks you; either “serious” questions such as whether you would modify the genes of your unborn infant if given the opportunity (and if so, what would you improve – immune system, intelligence, …) – apparently men and women answered this very differently – or more open-ended “fantasy” questions.

The app also lets you find your “data doppelganger”, which is of course the user who is most similar to you in terms of the collected data. Howard said that despite the short time since the launch, the app has already yielded interesting information about gender differences and topics of interest.
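How the app actually computes the doppelganger was not explained, but the simplest reading of “most similar user” is a nearest-neighbour search. The sketch below assumes each user is reduced to a short numeric feature vector; the users, features and values are entirely made up.

```python
import math

# Illustrative sketch only: nearest neighbour by Euclidean distance.
# The feature vectors (e.g. km moved, hot spots passed, questions answered)
# are hypothetical; the real app's method is not known to me.

def doppelganger(me, others):
    """Return the name of the user whose feature vector is closest to `me`."""
    return min(others, key=lambda name: math.dist(me, others[name]))


users = {
    "ann": [5.0, 12.0, 3.0],
    "bo": [1.0, 40.0, 9.0],
    "cy": [4.5, 10.0, 2.0],
}
match = doppelganger([4.0, 11.0, 2.0], users)
```

In practice the features would need to be put on comparable scales first, otherwise the one with the largest numeric range dominates the distance.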

Stockholm R useR Group inaugural meeting

Yesterday, the Stockholm R useR group had its inaugural meeting, hosted by Statisticon, a statistical consulting firm with offices in the heart of the city. It was a smaller and more intimate affair than the Stockholm Big Data meetup last week, with perhaps 25 people attending. If my memory serves, the entities represented at the meeting were Statisticon itself, the Swedish Institute for Communicable Disease Control, Klarna, Stockholm University (researchers in 3 different fields), the Swedish Pensions Agency, and Karolinska Institutet.

There were two themes that came up again and again: firstly, reproducible dynamic reporting – everyone seemed to either use (and love) or want to learn Sweave (and to some extent knitr), and secondly, managing big data sets in R. Thus it was decided to focus on these for the next meeting: an expert from the group will give a presentation on Sweave, and another group of members will try to collect information on what is available for “big data” in R.

I thought it was interesting to see that the representatives from the Swedish Pensions Agency (there were 3 of them) seemed so committed to R, open source and open data. Nice! It was also mentioned that another employee of the same agency, who wasn’t present, has been developing his own big-data R package for dealing with the 9-million-row table containing pension-related data on all Swedish citizens.

Stockholm Big Data Meetup

The first meetup of the Stockholm Big Data group was organized yesterday (Sep 6 2012) by Mikael Hussain at the Klarna headquarters. The venue was packed, with close to 100 people attending and others unfortunately left out (due to fire regulations). Apparently a lot of people (including us) had been thirsting for this sort of event.

The format was 1.5h of rapid talks (supposed to be 10 min each but probably a bit longer in practice) on widely different topics – we will refer to Marina Santini’s excellent writeup for details on the talks – followed by socializing in the pub around the corner. Follow the Data was represented by me (Mikael) as I gave a short talk about the benefits of competing in (and organizing) online prediction contests.

During the course of the event, I learned about three companies that I didn’t know about and who are all actively looking for analytics and big data talent:

  • Campanja – online advertising, heavily into Erlang and AI. Looking to fill several positions of different kinds
  • Svensk Lånemarknad (~Swedish Loan Exchange?) – help customers find the best banks and loans for them – looking to fill a predictive analytics position
  • Tink – not quite sure what they are doing (the home page is a bit cryptic) – looking for developers

I’m sure there were other companies as well looking to recruit – I only had time to talk to a small fraction of the participants, obviously!

All in all, I think the meetup was a lot of fun and I am looking forward to more meetups in Stockholm soon.

Meetup groups for Big Data & Predictive Modeling and Quantified Self in Stockholm

Two interesting new meetup groups have formed in Stockholm (well, there are other interesting ones but for the purposes of this blog these two are the most exciting):

Fun!
