Follow the Data

A data driven blog

Archive for the month “March, 2012”

Follow the Data podcast, episode 1: Gavagai! Gavagai!

We have made available the first episode of the Follow the Data podcast! Hope you enjoy it.

Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai!

This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting.

Some interesting tidbits from the conversation:

  • The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page.
  • Olsson describes Ethersource as a “semantic processing layer of the big data stack” and a “base technology for semantics.” An alternative, more everyday description would be the one in this nice interview from Scandinavian Startups: “Finding meaning before it is evident.”
  • Ethersource learns meaning from text, which is the core of the technology; use cases include “sentiment analysis on steroids”, textual profiling and market analysis.
  • The Ethersource system is based on intrinsically scalable technology that can ingest any type of linguistic data stream; toward the end of the podcast it turned out that the design mimics computation in the brain and uses “sparse distributed representation”. Gavagai have not been able to “saturate the system” in terms of storage despite ingesting everything they can get their hands on. The underlying technology is “random indexing”, which according to Sahlgren is basically a kind of random projection: a dimensionality reduction method that allows incremental processing (rather than, e.g., running huge SVDs).
  • As a result of the underlying design, Ethersource builds up representations of concepts as it incorporates new data; Gavagai formulates this in the phrase “training equals learning.” The concept-based approach means that the system is extremely good at handling spelling errors and synonyms.
  • Ethersource is not based on concepts such as “documents” or “tweets”, which are completely artificial, according to chief scientist Sahlgren.
  • The system’s design also means that it does not have any problems handling different languages, even languages that use different text encodings.
  • Gavagai did not start out as a “big data” company but they are now relatively comfortable in their role as one.
  • Fredrik Olsson used to work for Recorded Future, which he feels is not a competitor to Gavagai, but would be a perfect customer.
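
To make the “random indexing” idea mentioned above a bit more concrete, here is a minimal sketch of the textbook version of the technique. It is not Gavagai’s implementation, which was not described in detail in the interview; the class name, the dimensionality, the number of non-zero entries and the window size are all assumptions picked for illustration. Each word gets a fixed sparse random “index vector”, and a word’s representation is simply the running sum of the index vectors of its neighbours, so new text can be folded in at any time without recomputing anything.

```python
import math
import random


class RandomIndexer:
    """Toy random indexing model (illustrative only, not Ethersource)."""

    def __init__(self, dim=512, nonzero=8, window=2, seed=0):
        self.dim = dim              # size of the reduced space (assumed)
        self.nonzero = nonzero      # +1/-1 entries per index vector (assumed)
        self.window = window        # neighbours counted on each side (assumed)
        self.rng = random.Random(seed)
        self.index_vectors = {}     # word -> sparse random vector {position: sign}
        self.context_vectors = {}   # word -> accumulated dense vector

    def _index_vector(self, word):
        # Assign a fixed sparse ternary vector the first time a word is seen.
        if word not in self.index_vectors:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index_vectors[word] = {
                p: self.rng.choice((-1, 1)) for p in positions
            }
        return self.index_vectors[word]

    def update(self, tokens):
        # Incremental step: add each neighbour's index vector to the
        # context vector of the focus word.
        for i, word in enumerate(tokens):
            ctx = self.context_vectors.setdefault(word, [0] * self.dim)
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in self._index_vector(tokens[j]).items():
                        ctx[pos] += sign

    def similarity(self, a, b):
        # Cosine similarity between the context vectors of two seen words.
        va, vb = self.context_vectors[a], self.context_vectors[b]
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
        return dot / norm if norm else 0.0


ri = RandomIndexer()
for sentence in ("the cat drinks milk", "the dog drinks water"):
    ri.update(sentence.split())
print(ri.similarity("cat", "dog"))
```

Words that occur in similar contexts (here “cat” and “dog”) end up with similar vectors, and the vector size stays fixed no matter how large the vocabulary grows, which is the property that makes the approach attractive compared with building a full co-occurrence matrix and factorizing it.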

Joel and I were perhaps not very comfortable in our new roles as podcasters and struggled a bit with finding the right words in English. We also recorded a post-show chat in Swedish where we are more relaxed and coherent. Some tidbits from this part, which we also plan to put online at some point:

  • The Gavagai founders have a radical view of linguistics, where there is no hard line between syntax and semantics, but rather a kind of continuum.
  • They don’t believe in sampling, but try to ingest everything they can find into the system.
  • The Gavagai team tries to put aside some time every day to look at interesting concepts and connections between concepts discovered by the system.
  • They expected that a word like apple (Apple) would have a large number of different meanings, but when they looked at data from social media during a specific period in time, it had just three major meanings.
  • Language does its own disambiguation; for example, after Apple has become well-known as a software company, people have started to talk more about “apples” rather than “an apple” when they mean the fruit (if I interpreted Magnus correctly).
  • They view the stock market as a way to validate their semantic analysis. “Stock prices are the closest you can get to an objective validation.”
  • The founders came from a research background, and found that starting Gavagai gave a huge boost to their research activities due to the new pressure to build and release something that works in the “real world”.

On the evening of the day of the interview (March 9, 2012), the Swedish daily Svenska Dagbladet published an article about Gavagai’s Ethersource-based real-time sentiment tracking of the buzz around the contestants who would appear in the Swedish Eurovision finals the following day. In the end, the Ethersource forecasts turned out to be very accurate.

Although it’s far from clear what the next episodes of the podcast will be about, in general we will restrict ourselves to interviewing interesting companies or scientists (rather than just talking amongst ourselves), with a bias towards Swedish interviewees since this is where we are located and it might be interesting for people from other locations to hear what is going on here.

EDIT 17/3 2012: Our podcast jingle was created by Karl Ekdahl, the man behind the awesome Ekdahl Moisturizer, among many other things.

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, a nice Hadoop toolkit for sequencing data that enables running several different types of tasks in distributed fashion. Other things we looked at were the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, blog co-author Joel and I went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for myself in the future.

Stay tuned for the first episode of Follow the Data podcast!
