We have made available the first episode of the Follow the Data podcast! Hope you enjoy it.
Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai!
This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting.
Some interesting tidbits from the conversation:
- The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page.
- Olsson describes Ethersource as a “semantic processing layer of the big data stack” and a “base technology for semantics.” An alternative, more everyday description would be the one in this nice interview from Scandinavian Startups: “Finding meaning before it is evident.”
- Ethersource learns meaning from text, which is the core of the technology; use cases include “sentiment analysis on steroids”, textual profiling and market analysis.
- The Ethersource system is based on intrinsically scalable technology that can ingest any type of linguistic data stream; Gavagai have not been able to “saturate the system” in terms of storage despite ingesting everything they can get their hands on. Toward the end of the podcast, this technology turned out to be based on mimicking computation in the brain and on “sparse distributed representation”. The underlying technique is “random indexing”, which according to Sahlgren is basically a kind of random projection: a dimensionality reduction method that allows incremental processing (rather than, e.g., running huge SVDs).
- As a result of the underlying design, Ethersource builds up representations of concepts as it incorporates new data; Gavagai formulates this in the phrase “training equals learning.” The concept-based approach means that the system is extremely good at handling spelling errors and synonyms.
- Ethersource is not based on concepts such as “documents” or “tweets”, which are completely artificial, according to chief scientist Sahlgren.
- The system’s design also means that it does not have any problems handling different languages, even languages that use different text encodings.
- Gavagai did not start out as a “big data” company but they are now relatively comfortable in their role as one.
- Fredrik Olsson used to work for Recorded Future, which he feels is not a competitor to Gavagai, but would be a perfect customer.
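For readers curious about what “random indexing” looks like in practice, here is a toy sketch in Python. It is not Gavagai's implementation and says nothing about how Ethersource works internally; it only illustrates the general idea as described in the random indexing literature. Every word gets a fixed sparse random “index vector”, and a word's “context vector” is simply the running sum of the index vectors of its neighbours, so new text can be folded in at any time without recomputing anything. The class name `RandomIndexer`, the parameter values (`DIM`, `NONZERO`, `WINDOW`) and the example sentences are all assumptions made up for this illustration.

```python
import math
import random
from collections import defaultdict

DIM = 512      # size of the reduced space (assumed value for this toy example)
NONZERO = 8    # number of +1/-1 entries in each index vector (assumed value)
WINDOW = 2     # neighbouring words considered on each side (assumed value)


class RandomIndexer:
    """Toy random indexing model (hypothetical helper, standard library only)."""

    def __init__(self, dim=DIM, nonzero=NONZERO, window=WINDOW, seed=0):
        self.dim, self.nonzero, self.window = dim, nonzero, window
        self.rng = random.Random(seed)
        self.index = {}  # word -> sparse index vector as (position, sign) pairs
        self.context = defaultdict(lambda: [0] * dim)  # word -> context vector

    def _index_vector(self, word):
        # Assign a fixed sparse random vector the first time a word is seen.
        if word not in self.index:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index[word] = [(p, self.rng.choice((-1, 1))) for p in positions]
        return self.index[word]

    def update(self, tokens):
        # Incremental step: add each neighbour's index vector to the
        # context vector of the word in focus.
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            vec = self.context[word]
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in self._index_vector(tokens[j]):
                        vec[pos] += sign

    def similarity(self, a, b):
        # Cosine similarity between two context vectors (0.0 for unseen words).
        if a not in self.context or b not in self.context:
            return 0.0
        u, v = self.context[a], self.context[b]
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0


sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the ball",
    "the dog chased the ball",
    "stock prices rose sharply today",
    "stock prices fell sharply today",
]
ri = RandomIndexer()
for sentence in sentences:
    ri.update(sentence.split())  # text can be streamed in one piece at a time

print("cat vs dog:   ", ri.similarity("cat", "dog"))
print("cat vs prices:", ri.similarity("cat", "prices"))
```

Words that occur in similar contexts (“cat” and “dog” here) end up with similar context vectors, while unrelated words stay close to orthogonal. Note that the vectors never grow beyond `DIM` entries no matter how much text is ingested, which is the property that makes the approach incremental, in contrast to building a full co-occurrence matrix and factorizing it afterwards.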
Joel and I were perhaps not very comfortable in our new roles as podcasters and struggled a bit with finding the right words in English. We also recorded a post-show chat in Swedish where we are more relaxed and coherent. Some tidbits from this part, which we also plan to put online at some point:
- The Gavagai founders have a radical view of linguistics, where there is no hard line between syntax and semantics, but rather a kind of continuum.
- They don’t believe in sampling, but try to ingest everything they can find into the system.
- The Gavagai team tries to put aside some time every day to look at interesting concepts and connections between concepts discovered by the system.
- They expected that a word like apple (Apple) would have a large number of different meanings, but when they looked at data from social media during a specific period in time, it had just three major meanings.
- Language does its own disambiguation; for example, after Apple has become well-known as a software company, people have started to talk more about “apples” rather than “an apple” when they mean the fruit (if I interpreted Magnus correctly).
- They view the stock market as a way to validate their semantic analysis. “Stock prices are the closest you can get to an objective validation.”
- The founders came from a research background, and found that starting Gavagai gave a huge boost to their research activities due to the new pressure to build and release something that works in the “real world”.
On the evening of the interview day (March 9, 2012), the Swedish daily Svenska Dagbladet published an article about Gavagai’s Ethersource-based real-time sentiment tracking of the buzz around the contestants who would appear in the Swedish Eurovision finals the following day. In the end, the Ethersource forecasts turned out to be very accurate.
Although it’s far from clear what the next episodes of the podcast will be about, in general we will restrict ourselves to interviewing interesting companies or scientists (rather than just talking amongst ourselves), with a bias towards Swedish interviewees since this is where we are located and it might be interesting for people from other locations to hear what is going on here.
EDIT 17/3 2012: Our podcast jingle was created by Karl Ekdahl, the man behind the awesome Ekdahl Moisturizer, among many other things.