Follow the Data

A data driven blog

Archive for the category “Companies”

Personal reflections on data science jobs

A bit more than a year ago, I took the plunge and left my academic job to try my luck as a corporate data scientist, first at IBM (obviously a very big company) and now at Peltarion (a startup which I still want to call small although it is growing rapidly). I am not sure if this blog post is premature or not, but in any case I’d like to share some of my experiences and impressions of the different roles I’ve been in. So without further ado, I present my last three data science positions!

(1) Bioinformatics scientist at Stockholm University (May 2010-May 2017) + freelance gigs.

I was working as a senior bioinformatician at SciLifeLab/Stockholm University in different capacities for seven years. At first I was hired as a general bioinformatics go-to person in a so-called core facility that does DNA sequencing, where I would be involved in a lot of different kinds of things: setting up data pipelines, deciding on quality control routines, trying to figure out what had gone wrong, delivering data to and communicating with customers, performing routine or custom analysis, and sometimes doing some actual research and writing papers. After a while, I moved into a different role where my job was more explicitly to help researchers with data analysis, statistics and programming – more research-oriented and long-term work. In a way, I was an academic data science consultant. Of course, we didn’t really call it “data science” because we were doing science, plain and simple, but in terms of what we did all day, it was in many ways similar to “data science” in industry.

Characteristics of data science in an academic (biology) setting

Note: this is the type of role I have the most experience with, or the most data on, if you will, so I am more confident about the pronouncements here than in the other categories.

  • The final product is almost always a paper. This has some positive and negative implications. On the good side, there is (at least nowadays) a strong focus on reproducibility. On the bad side, there is almost no emphasis on putting predictive models into production or making them easily usable. Code quality can also be spotty as a result.
  • Bioinformatics data scientists tend to be good at data visualization, often in R or Python. They understand the concept of batch effects (drift in distribution parameters) and are good at dealing with high-dimensional data where the number of examples is usually much smaller than the number of dimensions (n << p), for example datasets with measurements of 20,000 genes for 20 different individuals. This makes it necessary for bioinformatics data scientists to be familiar with dimensionality reduction and multivariate methods such as PCA, PLS, t-SNE and so on.
  • They like to use notebooks (Jupyter or R Markdown) to communicate analyses, because these have a similar structure to scientific presentations or manuscripts.
  • They often like to use pipelining tools such as Nextflow, Snakemake or Bpipe to chain operations together.
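To make the n << p situation above concrete, here is a minimal sketch of PCA on a matrix with 20 samples and 20,000 features. Everything in it is an assumption for illustration: the data are synthetic, the "batch effect" is simulated by shifting half of the samples on a subset of genes, and `pca_scores` is a made-up helper name rather than a function from any particular library.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each feature (gene)
    # Economy-size SVD: with n << p, U is n x n and Vt is n x p,
    # so the 20,000 x 20,000 covariance matrix is never formed.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 20_000
X = rng.normal(size=(n_samples, n_genes))
# Simulated batch effect: the first 10 samples are shifted on 1,000 genes.
X[:10, :1000] += 3.0

scores, explained = pca_scores(X)
# Samples from the two simulated batches end up on opposite sides of PC1.
print(np.round(scores[:, 0], 1))
```

Plotting the first two columns of `scores` is the usual first look at such a data set, since a batch effect of this kind tends to show up as a clean split along one of the leading components.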

During this time, I was also consulting part-time (usually less than 10%) for a few startups. From one of these gigs I learned to build very complex processing pipelines with Snakemake. From another, I learned to build obscure functionality for web applications in Shiny. These are both tools that fit naturally into a bioinformatician’s mindset. For yet another customer, I suggested a way to use PCA and MDS to view their data from a global point of view which they had not considered, guiding them onto a path that eventually resulted in this Medium article.

 

(2) Senior Data Scientist at IBM (June-November 2017).

After having been an academic consultant for quite a while, I decided to try to be a corporate one for a change. I got a position at IBM’s consulting arm, Global Business Services, in Kista outside of Stockholm. Since I was only there for six months, I only had time to participate in a handful of projects, which were mostly related to the manufacturing industry. Fortunately, the knowledge of high-dimensional data that I had from biology stood me in good stead when working on these problems. It was not difficult to apply the skills I had obtained from academia in this setting.

Characteristics of data science in a “big consulting” setting

Note: With my short experience, I find it hard to tell whether some of the points below hold for consulting companies in general or are specific to one company (IBM in this case).

  • Pragmatism is the word to summarize data science in “big consulting”. There isn’t time to think through every wrinkle of a problem as there is (albeit in theory only) in academia. The end goal is specified by a contract which you try to fulfill as closely as possible within the allotted amount of time. Notably, your task is not to do as much as possible but to do exactly what has been agreed upon. There is almost always a trade-off between time and model performance.
  • Consultants are good at giving effective presentations. One of the first things I learned was to completely rework the way I had done presentations in academia to more clearly highlight the important findings and tailor them for the management level in companies. Communication is a very important skill for a data science consultant; maybe the most important one.
  • Like in academia, there is also not that much emphasis on productization, because that part will typically be handled by a software engineering team that comes in after you have completed a proof of concept (PoC), if that PoC leads to a longer engagement. On the other hand, the IBM stack (see below) has good support for deploying models, e.g. via Node-RED.
  • (This part might be more or less company-specific) In my team, we did not make very much use of code version control with GitHub, for a couple of different reasons. Since we worked mostly with short PoC projects, the priority was to find a promising approach in the allotted time, after which the software engineering team would come in to build the final implementation given a prolonged contract. Also, some of my colleagues worked mainly with non-code tools such as SPSS Modeler, which has its own built-in version control. We ensured reproducibility mainly through the version control mechanism in Box, where we stored scripts, documentation and metadata.
  • Automated data cleaning and model building (AutoML) are important in this setting because of the time constraints. Data cleaning can yield big “quick wins” but is tedious and a lot can be gained by automating it, for example with packages such as vtreat for R. AutoML with TPOT, auto-sklearn or H2O is interesting for rapidly finding a good-enough model.
  • Feature importance or other types of model explanation are very important for communicating results to customers (also see below). Decision trees are still used surprisingly often, and for random forests and gradient boosting, there is feature importance and various tree-model explanation interfaces.
  • It’s quite common to encounter projects with unbalanced data and to use tools like SMOTE, ADASYN or ROSE to do smart oversampling of the rare class(es). It is also not uncommon that some classes are so rare that one needs to go for an anomaly detection approach rather than standard classification.
  • (Company specific, at least in part) In terms of tooling, there was a larger emphasis on using commercial products (preferably from the IBM ecosystem) such as SPSS Modeler rather than open-source programming languages. Naturally, one has to rapidly become conversant with Bluemix (now called IBM Cloud) offerings and associated products in order to be an effective IBM consultant.
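The "smart oversampling" mentioned in the list above can be sketched in a few lines. The following is a simplified illustration of the idea behind SMOTE (interpolating between a rare-class sample and one of its nearest rare-class neighbours), not the implementation from any of the named packages. The function name `smote_like`, its parameters and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # starting point for each new sample
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(10, 3))  # ten rare-class samples, three features
synthetic = smote_like(X_min, n_new=40)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real rare-class samples, the new points stay inside the region the rare class already occupies, which is the main difference from simply duplicating rows.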

 

(3) Data scientist at Peltarion (Nov 2017-).

In the autumn of 2017, I got an offer from a deep learning company, Peltarion, that I had applied to before starting at IBM. I decided to take it on the strength of the skills of my new colleagues, many of whom I knew from the Stockholm AI and machine learning scene. As the company is a startup, I have worn many hats during the first six months, working in customer projects, writing documentation and blog posts, testing our deep learning platform, sitting together with beta testers, keeping an eye on competitors and so on.

Characteristics of data science in a startup setting: (surely not representative of all startups…)

Note: I suspect that the variance among startups is much higher than among academic groups or big consulting companies, so almost everything here is probably highly company-specific.

  • (Possibly company-specific) There is more emphasis on software engineering practices than in academia or big consulting. Git and GitHub (or some equivalent) are not “nice-to-haves” but the core of the whole enterprise, and frequent pull requests and code reviews are much more common. Virtual environments and containers (e.g. Docker) are important (though also found in academic bioinformatics to a large extent).
  • Data scientists in startups tend to think more about deployment and productization of models, because it hits closer to home (there often isn’t a supporting software engineering team to do that for the data scientists, or the startup is building its own deployment functionality, like we are at Peltarion).
  • (Possibly company-specific) Startup data scientists tend to be more informed about the latest technical advances in machine learning. Consultants don’t have time to keep up as much (or to install and play with the latest tools) and academics are often more interested in keeping up with the latest scientific advances in their specific field rather than general ML news. It is also more important to keep track of competitors.
  • (Possibly company-specific, e.g. Spotify uses Luigi) Reproducibility is achieved by writing libraries rather than chaining together operations with pipelines. Continuous integration (CI), like with Travis or Jenkins, is much more common than in academia, although it is starting to appear there as well. For us at Peltarion, CI is essential because we need to move fast and make every effort to minimize technical debt that could come back and bite us in the future.

I hope you enjoyed this highly subjective look at different kinds of data scientist positions. Feel free to ask questions in the comments section or provide your own views on different roles.

 

Data-intensive wellness companies

I had some trouble coming up with a term to describe the three companies that I will discuss here: Arivale, Q and iCarbonX. What they have in common (in my opinion) is that they

  • explicitly focus on individuals’ health and wellness (wellness monitoring),
  • generate molecular and other data using many different platforms (multi-omics), resulting in tens or hundreds of thousands of measurements for each individual data point,
  • use or claim to use artificial intelligence/machine learning to reach their goals.

So the heading of this blog post could just as well have been for instance “AI wellness companies” or “Molecular wellness monitoring companies”. The point of using “data-intensive” is that they all generate much more extensive molecular data on their users (DNA sequencing, RNA sequencing, proteomics, metagenomics, …) than, say, WellnessFX, LifeSum or more niche wellness solutions.

I associate these three companies with three big names in genomics.

Arivale was founded by Leroy Hood, who is president of the Institute for Systems Biology and was involved in automating DNA sequencing. In connection with Arivale, Hood has talked about dense dynamic data clouds that will allow individuals to track their health status and make better lifestyle decisions. Arivale’s web page also talks a lot about scientific wellness. They have different plans, including a 3,500 USD one-time plan. They sample blood, saliva and the gut microbiome and have special coaches who give feedback on findings, including genetic variants and how well you have done with your FitBit.

Q, or q.bio, (podcast about them here) seems to have grown out of Michael Snyder’s work on iPOPs, “integrative personal omics profiles”, which he first developed with himself as the subject, as the first person to have DNA sequencing, repeated RNA sequencing, metagenomics etc. performed on himself. (He has also been involved in a large number of other pioneering genomics projects.) Q’s web site and blog talk about quantified health and the importance of measuring your physiological variables regularly to get a “positive feedback loop”. In one of their blog posts, they talk about dentistry as a model system where we get regular feedback, have lots and lots of longitudinal data on people’s dental health, and therefore get continuously improving dental status at cheaper prices. They also make the following point: “We live in a world where we use millions of variables to predict what ad you will click on, what movie you might watch, whether you are creditworthy, the price of commodities, and even what the weather will be like next week. Yet, we continue to conduct limited clinical studies where we try and reduce our understanding of human health and pathology to single variable differences in groups of people, when we have enormous evidence that the results of these studies are not necessarily relevant for each and every one of us.”

iCarbonX, a Chinese company, was founded by (and is headed by) Wang Jun, the former wunderkind CEO of Beijing Genomics Institute/BGI. A couple of years ago, he gave an interview to Nature where he talked about why he was stepping down as BGI’s CEO to “devote himself to a new “lifetime project” of creating an AI health-monitoring system that would identify relationships between individual human genomic data, physiological traits (phenotypes) and lifestyle choices in order to provide advice on healthier living and to predict, and prevent, disease.” iCarbonX seems to be the company embodying that idea. Their website mentions “holographic health data” and talks a lot about artificial intelligence and machine learning, more so than the two other companies I highlight here. They also mention plans to profile millions of Chinese customers and to create an “intelligent robot” for personal health management. iCarbonX has just announced a collaboration with PatientsLikeMe, in which iCarbonX will provide “multi-omics characterization services.”

What to make of these companies? They are certainly intriguing and exciting. Regarding the multi-omics part, I know from personal experience that it is very difficult to integrate omics data sets in a meaningful way (that leads to some sort of actionable results), mostly for purely conceptual/mathematical reasons but also because of technical quality issues that impact each platform in a different way. I have seen presentations by Snyder and Hood and while they were interesting, I did not really see any examples of a result that had come through integrating multiple levels of omics (although it is of course useful to have results from “single-level omics” too!).

Similarly, with respect to AI/ML, I expect that a larger number of samples than what these companies have will be needed before, for instance, good deep learning models can be trained. On the other hand, the multi-omics aspect may prove helpful in a deep learning scenario if it turns out that information from different experiments can be combined in some sort of transfer learning setting.

As for the wellness benefits, it will likely be several years before we get good statistics on how large an improvement one can get by monitoring one’s molecular profiles (although it is certainly likely that it will be beneficial to some extent).

Postscript

There are some related companies or projects that I do not discuss above. For example, Craig Venter’s Human Longevity Inc is not dissimilar to these companies but I perceive it as more genome-sequencing focused and explicitly targeting various diseases and aging (rather than wellness monitoring). Google’s/Verily’s Baseline study has some similarities with respect to multi-omics but is anonymized and not focused on monitoring health. There are several academic projects along similar lines (including one with which I am currently affiliated) but this blog post is about commercial versions of molecular wellness monitoring.

Finnish companies that do data science

I should start by saying that I have shamelessly poached this blog post from a LinkedIn thread started by one Ville Niemijärvi of Louhia Consulting in Finland. In my defence, LinkedIn conversations are rather ephemeral and I am not sure how completely they are indexed by search engines, so to me it makes sense to sometimes highlight them in a slightly more permanent manner.

Ville asked for input (and from now on I am paraphrasing and summarising) on companies in Finland that do data analytics “for real”, as in data science, predictive analytics, data mining or statistical modelling. He required that the proposed companies should have several “actual” analysts and be able to show references to work performed in advanced analytics (i.e. not pure visualization/reporting). In a later comment he also mentioned price optimization, cross-sell analysis, sales prediction, hypothesis testing, and failure modelling.

The companies that had been mentioned when I went through this thread are listed below. I’ve tried to lump them together into categories after a very superficial review and would be happy to be corrected if I have gotten something wrong.

[EDIT 2016-02-04 Added a bunch of companies.]

Louhia: analytics consulting (predictive analytics, Azure ML etc.)
BIGDATAPUMP: analytics consulting (Hadoop, AWS, cloud etc.)
Houston Analytics: analytics consulting (analytics partner of IBM)
Top Data Science: analytics and IT consulting
Gofore: IT architecture
Digia: IT consulting
Techila Technologies: distributed computing middleware
CGI: IT consulting, multinational
Teradata: data warehousing, multinational
Avanade: IT consulting, multinational
Deloitte: financial consulting, multinational
Information Builders: business intelligence, multinational
SAS Institute: analytics software, multinational
Tieto: IT services, multinational (but originally Finnish)
Aureolis: business intelligence
Olapcon: business intelligence
Big Data Solutions: business intelligence
Enfo Rongo: business intelligence
Bilot: business intelligence
Affecto: digital services
Siili: digital services
Reaktor: digital services
Valuemotive: digital services
Solita: digital services
Comptel: digital services?
Dagmar: marketing
Frankly Partners: marketing
ROIgrow: marketing
Probic: marketing
Avaus: marketing
InlineMarket: marketing automation
Steeri: customer analytics
Tulos Helsinki: customer analytics
Andumus: customer analytics
Avarea: customer analytics
Big Data Scoring: customer analytics
Suomen Asiakastieto: credit & risk management
Silta: HR analytics
Quva: industrial analytics
Ibisense: industrial analytics
Ramentor: industrial analytics
Indalgo: manufacturing analytics
TTS-Ciptec: optimization, sensor
SimAnalytics: logistics, simulation
Relex: supply chain analytics
Analyse2: assortment planning
Genevia: bioinformatics consultancy
Fonecta: directory services
Monzuun: analytics as a service
Solutive: data visualization
Omnicom: communications agency
NAPA: naval analytics, ship operations
Primor consulting: telecom?

There was an interesting comment saying that CGI manages its global data science “virtual team” from Finland and that they employ several successful Kagglers, one of whom was rated #37 out of 450,000 Kaggle users in 2014.

On a personal note, I was happy to find a commercial company (Genevia) which appears to do pretty much the same thing as I do in my day job at SciLifeLab Stockholm, that is, bioinformatics consulting (often with an emphasis on high-throughput sequencing), except that I do it in an academic context.

 

 

 

Some interesting company announcements

  • Algorithmia, the open marketplace for algorithms, is now live. I find it an interesting concept: to build a community around algorithm development, where users can build on each other’s algorithms and make them available as a web service.
  • SolveBio is also in public beta as of today. Please refer to an older post for some hints on how to use the API.
  • ForecastThis just launched their DSX platform for automated model testing and building. It looked pretty impressive in a small trial with a tricky dataset I have – a large number of models was constructed and run on the dataset with a set of metrics reported for each, and many of those looked better than the ones I had from e.g. random forest, but the trial version of the platform does not include access to the actual models, so I wasn’t able to see the details.
  • Seven Bridges Genomics will release the first graph-based human reference genome and related tools. There has been a lot of talk about graph-based genome references (a graph is a natural way to represent the many kinds of natural variation found among human genomes), but the tools to handle them have been lacking.

Follow the Data podcast, episode 4: Self-tracking with Niklas Laninge

In this episode of our podcast, we shift our focus from the “big data” themes in episodes 1-3 to personal data and self-tracking. We talked to Niklas Laninge, founder of Psykologifabriken (“The Psychology Factory”) and COO of Hoa’s Tool Shop, which are both relatively new startups based in Stockholm and which use applied psychology in innovative ways to facilitate lasting behavior change – in the case of the latter company, using digital tools such as smart phone apps. Niklas is also an avid collector of data on himself and describes some things he has found out by analyzing those data – and remarks that “When my [Nike] Fuelband broke, part of myself broke as well.”

At one point, I (Mikael) miserably failed to get the details right about The Human Face of Big Data project, which I erroneously call “Faces of Big Data” in the podcast. Also, I said that it was created by Greenplum, when in fact it was developed by Against All Odds Productions (Rick Smolan and Jennifer Erwitt) and sponsored by EMC (of which Greenplum is a division).

Some of the things we discussed:

– Viary, a tool that facilitates behavior change in organizations or individuals

– Clinical trials showing promising results from using Viary to treat depression

– “Dance-offs” as a fun way to interact with people on the dance floor and get an extreme exercise session

Listen to the podcast | Follow The Data #4 : Self Tracking with Niklas Laninge

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions, for which the results are shown on a dynamic leaderboard (which they do), but also to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.
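The two-method contract described above can be illustrated with a small sketch. The competition itself required R, so this is only a Python analogy, not the Synapse API: the class name `LinearRiskModel`, the method names `custom_train` and `custom_predict`, the `score_submission` helper and the correlation-based score are all hypothetical stand-ins chosen for the example.

```python
import numpy as np

class LinearRiskModel:
    """Hypothetical submission exposing the two methods the server calls."""

    def custom_train(self, features, outcome):
        # Least-squares fit with an intercept column.
        design = np.column_stack([np.ones(len(features)), features])
        self.coef_, *_ = np.linalg.lstsq(design, outcome, rcond=None)
        return self

    def custom_predict(self, features):
        design = np.column_stack([np.ones(len(features)), features])
        return design @ self.coef_


def score_submission(model, train, test):
    """Stand-in for the server side: train on one set, score on held-out data.

    The score here is a plain Pearson correlation, used only as a placeholder.
    """
    model.custom_train(*train)
    predictions = model.custom_predict(test[0])
    return float(np.corrcoef(predictions, test[1])[0, 1])


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

score = score_submission(LinearRiskModel(), (X[:150], y[:150]), (X[150:], y[150:]))
print(round(score, 3))
```

The point of such a contract is that the organizers, not the participants, run the training and prediction steps, so a leaderboard score can always be traced back to code that actually executes.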

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e.g. Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
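The general shape of such an iteration (metagene, then gene weights, then a new metagene, repeated until nothing changes) can be sketched as follows. This is a loose illustration and not the team's code: it uses positive Pearson correlation raised to a power as the weighting, and both that choice and the exponent value are assumptions that may differ from the published method, for which the arXiv paper is the reference. The data are synthetic, with ten genes sharing a hidden factor.

```python
import numpy as np

def attractor_metagene(expr, seed_gene, exponent=5.0, tol=1e-8, max_iter=200):
    """Iterate metagene -> weights -> metagene until it stops changing.

    expr is a genes x samples matrix. Weights are positive Pearson
    correlations with the current metagene, raised to `exponent`
    (an assumed weighting, chosen for this sketch).
    """
    centered = expr - expr.mean(axis=1, keepdims=True)
    row_norms = np.linalg.norm(centered, axis=1)
    metagene = expr[seed_gene].astype(float)
    for _ in range(max_iter):
        m = metagene - metagene.mean()
        corr = centered @ m / (row_norms * np.linalg.norm(m))
        weights = np.clip(corr, 0.0, None) ** exponent
        updated = weights @ expr / weights.sum()  # weighted average of genes
        converged = np.max(np.abs(updated - metagene)) < tol
        metagene = updated
        if converged:
            break
    return metagene, weights

rng = np.random.default_rng(1)
n_genes, n_samples = 50, 100
expr = rng.normal(size=(n_genes, n_samples))
factor = rng.normal(size=n_samples)  # hidden co-expression signal
expr[:10] = factor + 0.5 * rng.normal(size=(10, n_samples))

metagene, weights = attractor_metagene(expr, seed_gene=0)
top_genes = set(np.argsort(weights)[-10:].tolist())
print(sorted(top_genes))
```

Raising the association to a power is what makes the iteration settle: genes that track the current metagene closely dominate the next average, while weakly associated genes contribute almost nothing.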

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
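The two simplest ways of combining models mentioned above, averaging and majority voting, look roughly like this. The function names and the small prediction arrays are made up for illustration. Rows are models and columns are samples.

```python
import numpy as np

def average_ensemble(predictions):
    """Mean of continuous predictions; rows are models, columns are samples."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def majority_vote(labels):
    """Most common class label per sample; ties go to the smallest label."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    counts = np.array([(labels == c).sum(axis=0) for c in classes])
    return classes[np.argmax(counts, axis=0)]

# Three hypothetical models scoring the same three patients.
risk_scores = [[0.2, 0.8, 0.5],
               [0.4, 0.6, 0.7],
               [0.3, 0.7, 0.9]]
class_calls = [[0, 1, 1],
               [0, 1, 0],
               [1, 1, 0]]

averaged = average_ensemble(risk_scores)  # approximately [0.3, 0.7, 0.7]
voted = majority_vote(class_calls)        # [0, 1, 0]
```

Either combiner only needs the stored predictions of the individual models, which is why an open leaderboard with runnable models makes ensembling so easy.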

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers) and how they looked at millions of models to find general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).

Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Follow the Data podcast, episode 2: King of BigData

In the second episode of the FTD podcast, we talked to big data consultant Johan Pettersson (his company is actually called Big Data AB; what a catch to be able to obtain that name despite the Swedish trademark regulations!) and Thomas Hartwig, CTO of King.com, a company that produces “skill games” where you can win money by being more skillful than your competitors.

We knew practically nothing about King.com coming into the interview and were surprised to learn that they are the second biggest game producer on Facebook! Some other things of note from the interview:

  • King.com currently captures about 1.5 billion game events each day from about 12 million users per day;
  • They don’t have a dedicated data analysis group but rather an “embedded analyst”  in each developer team (each game has its own team);
  • Johan Pettersson does not think the demand for big data specialists or data scientists in Sweden is that high at the moment (although everyone is talking about “big data”, almost no one is really working with it), but it will probably be in 1-2 years.
  • However, good data analysts are in high demand and therefore hard to find.

Podcast link: Follow the Data | Episode 2 – King of BigData Podcast

The “AfterDark” discussion afterwards (in Swedish)
Follow the Data | Episode 2 – King of BigData After Dark

Follow the Data podcast, episode 1: Gavagai! Gavagai!

We have made available the first episode of the Follow the Data podcast! Hope you enjoy it.

Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai!

This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting.

Some interesting tidbits from the conversation:

  • The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page.
  • Olsson describes Ethersource as a “semantic processing layer of the big data stack” and a “base technology for semantics.” An alternative, more everyday description would be the one in this nice interview from Scandinavian Startups: “Finding meaning before it is evident.”
  • Ethersource learns meaning from text, which is the core of the technology; use cases include “sentiment analysis on steroids”, textual profiling and market analysis.
  • The Ethersource system is based on intrinsically scalable technology that can ingest any type of linguistic data stream; toward the end of the podcast, it emerged that the design mimics computation in the brain and uses a “sparse distributed representation”. Gavagai have not been able to “saturate the system” in terms of storage despite ingesting everything they can get their hands on. According to Sahlgren, the underlying technology is “random indexing”, which is basically a kind of random projection approach: a dimensionality reduction method that allows incremental processing (rather than, e.g., running huge SVDs).
  • As a result of the underlying design, Ethersource builds up representations of concepts as it incorporates new data; Gavagai formulates this in the phrase “training equals learning.” The concept-based approach means that the system is extremely good at handling spelling errors and synonyms.
  • Ethersource is not based on concepts such as “documents” or “tweets”, which are completely artificial, according to chief scientist Sahlgren.
  • The system’s design also means that it does not have any problems handling different languages, even languages that use different text encodings.
  • Gavagai did not start out as a “big data” company but they are now relatively comfortable in their role as one.
  • Fredrik Olsson used to work for Recorded Future, which he feels is not a competitor to Gavagai, but would be a perfect customer.
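
To make the “random indexing” idea mentioned above a bit more concrete, here is a minimal toy sketch in Python. It is not Gavagai’s implementation (we know nothing about the Ethersource internals beyond what was said in the interview); it only illustrates the general textbook technique. Each word gets a fixed sparse random “index vector”, and a word’s “context vector” is updated incrementally by adding the index vectors of its neighbours. The class name, the dimensionality (512), the number of non-zero entries (8) and the window size (2) are all assumptions chosen for illustration.

```python
import math
import random


class RandomIndexer:
    """Toy random indexing: context vectors grow incrementally, no SVD needed."""

    def __init__(self, dim=512, nonzero=8, window=2, seed=0):
        self.dim = dim            # size of the reduced space (assumed value)
        self.nonzero = nonzero    # number of +1/-1 entries per index vector
        self.window = window      # neighbours considered on each side
        self.rng = random.Random(seed)
        self.index = {}           # word -> sparse index vector {position: sign}
        self.context = {}         # word -> dense context vector

    def _index_vector(self, word):
        # Create the sparse random index vector the first time a word is seen.
        if word not in self.index:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index[word] = {p: self.rng.choice((-1, 1)) for p in positions}
        return self.index[word]

    def update(self, tokens):
        # Incremental step: add each neighbour's index vector to the word's
        # context vector. New text can be folded in at any time.
        for i, word in enumerate(tokens):
            vec = self.context.setdefault(word, [0.0] * self.dim)
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for pos, sign in self._index_vector(tokens[j]).items():
                    vec[pos] += sign

    def similarity(self, a, b):
        # Cosine similarity between two context vectors (0.0 for unseen words).
        zero = [0.0] * self.dim
        va = self.context.get(a, zero)
        vb = self.context.get(b, zero)
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
        return dot / norm if norm else 0.0


ri = RandomIndexer()
for sentence in ["the cat drinks milk", "the dog drinks milk", "the car needs fuel"]:
    ri.update(sentence.split())

print("cat vs dog:", ri.similarity("cat", "dog"))
print("cat vs car:", ri.similarity("cat", "car"))
```

In this toy corpus “cat” and “dog” occur in identical contexts, so their context vectors end up pointing in the same direction, while “cat” and “car” only share the neighbour “the”. Nothing ever has to be recomputed from scratch when a new sentence arrives, which is the property that makes this kind of approach attractive for streaming text.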

Joel and I were perhaps not very comfortable in our new roles as podcasters and struggled a bit with finding the right words in English. We also recorded a post-show chat in Swedish where we are more relaxed and coherent. Some tidbits from this part, which we also plan to put online at some point:

  • The Gavagai founders have a radical view of linguistics, where there is no hard line between syntax and semantics, but rather a kind of continuum.
  • They don’t believe in sampling, but try to ingest everything they can find into the system.
  • The Gavagai team tries to put aside some time every day to look at interesting concepts and connections between concepts discovered by the system.
  • They expected that a word like apple (Apple) would have a large number of different meanings, but when they looked at data from social media during a specific period in time, it had just three major meanings.
  • Language does its own disambiguation; for example, after Apple has become well-known as a software company, people have started to talk more about “apples” rather than “an apple” when they mean the fruit (if I interpreted Magnus correctly).
  • They view the stock market as a way to validate their semantic analysis. “Stock prices are the closest you can get to an objective validation.”
  • The founders came from a research background, and found that starting Gavagai gave a huge boost to their research activities due to the new pressure to build and release something that works in the “real world”.

In the evening of the day of the interview (March 9, 2012), Swedish daily Svenska Dagbladet released an article about Gavagai’s Ethersource-based real-time sentiment tracking of the buzz around the contestants who would appear in the Swedish Eurovision finals the following day. In the end, the Ethersource forecasts turned out to be very accurate.

Although it’s far from clear what the next episodes of the podcast will be about, in general we will restrict ourselves to interviewing interesting companies or scientists (rather than just talking amongst ourselves), with a bias towards Swedish interviewees since this is where we are located and it might be interesting for people from other locations to hear what is going on here.

EDIT 17/3 2012: Our podcast jingle was created by Karl Ekdahl, the man behind the awesome Ekdahl Moisturizer, among many other things.

Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link –

  • Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the world’s most accurate comprehensive trademark search.
  • alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
  • Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
  • Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping on the “data science” train as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit). As this blog post puts it, “the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper).”
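
To see why one would want a generalization of correlation in the first place, here is a small sketch in Python. Note that this is not MIC itself (I am not attempting to reproduce the method from the paper); it only shows the limitation of the ordinary Pearson coefficient that motivates such measures. The helper function and the example data are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# x values placed symmetrically around zero
xs = [i / 10 for i in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # perfectly linear relationship
parabola = [x * x for x in xs]     # perfectly deterministic, but non-linear

print("linear:  ", pearson(xs, linear))
print("parabola:", pearson(xs, parabola))
```

The parabola is completely determined by x, yet its Pearson correlation with x is essentially zero because the positive and negative halves cancel out. A measure that is sensitive to “any type of systematic pattern” should flag both relationships.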

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing” based on generating “decoy plots” that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called nullabor. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.
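
The “proportion of innocents who look more guilty than the accused” can be sketched as a simple permutation test in Python. This is my own minimal illustration of the analogy, not the procedure from the paper and not the nullabor API (which is an R package); the function name, the choice of test statistic (absolute difference in group means) and the example numbers are all assumptions.

```python
import random


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate a p-value for the difference in means by shuffling group labels."""
    rng = random.Random(seed)

    def stat(a, b):
        # Test statistic: how "guilty" a data set looks.
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(x, y)           # the accused
    pooled = list(x) + list(y)
    as_guilty = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so each shuffle produces one "known innocent".
        rng.shuffle(pooled)
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            as_guilty += 1
    # Count the observed data set itself among the innocents so that
    # the estimate can never be exactly zero.
    return (as_guilty + 1) / (n_perm + 1)


treated = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
control = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
print("clearly separated groups:", permutation_p_value(treated, control))
print("identical groups:", permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4]))
```

The visual method in the paper replaces the numeric test statistic with a human observer: if the real plot is hidden among decoy plots generated under the null hypothesis and the observer still picks it out, that is evidence against the null.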

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.

Big data companies in Sweden

Edit in Feb 2015

A lot of people are still visiting this page. Bear in mind that it was written about four years ago, in 2011, and is unlikely to give a good overview of data-driven companies in Sweden today. Thank you for your cooperation. I may try to update it in some form at some point.

Alphabetically ordered list (see below for context & edit history)

AdAction – ad optimization – Stockholm

alaTest – product review comparisons – Stockholm

Augify – Real-time information capture, interpretation and visualization (or something like that) – Stockholm

Big Data AB – big data consultancy – Stockholm

Brummer – hedge fund – Stockholm

Burt – ad optimization – Gothenburg

Campanja – Online advertising – Stockholm

Experlytics – health analytics – Malmö

ExpertMaker – search, recommendation and discovery – Malmö

Gavagai – text analysis – Stockholm

Intellus – business intelligence – Stockholm

Keybroker – ad optimization – Stockholm

King.com – online gaming – Stockholm

Klarna – online payment services – Stockholm

Markify – trademark search – Stockholm

NeoTechnology – graph database development – Malmö (?)

Recorded Future – temporal analytics – Gothenburg

Saplo – text analysis – Malmö

Spotify – music streaming service – Stockholm (HQ), Gothenburg

Svensk lånemarknad – Helps customers find the best loans – Stockholm

Tailsweep – market communication – Stockholm

Tink – Stockholm

Tripbirds – “social hotel booking” – Stockholm

***

When I was at the Strata conference earlier this spring, I noticed there were very few European participants. While the US and Silicon Valley in particular seemed to be going nuts over “big data” and analytics, I haven’t seen much buzz in Sweden (where I currently live) or in Europe as a whole. This question thread on Quora seems to confirm that there really aren’t that many European big data companies. If there are few in Europe, there should be essentially none in little Sweden. So I thought it would be fun to round up the Swedish big data-related companies that I’m aware of.

I know as well as anyone that the term “big data” is not well defined and that I have inevitably missed many companies (false negative) and perhaps included some that don’t consider themselves big data companies (false positives, at least from their point of view!). I’d be very happy to get feedback from companies either way. By the way, there is a nice discussion (also on Quora) about what the big data space looks like as a whole, mostly from a US perspective. But back to Sweden!

Recorded Future, which I’ve blogged about repeatedly and which has also been covered in Wired (in a rather hyperbolic way, one might add), has development offices in Gothenburg. Its founder Christopher Ahlberg previously created the useful analytics/visualization package Spotfire, which has since been sold to Tibco.

Spotfire sounds a bit like Spotify, which is of course a very popular music streaming service that has taken at least Sweden (and soon the world?) by storm. I know I have moved maybe 80% of my music listening time into Spotify. Like the “old” music recommendation/social network service last.fm, Spotify has a lot of interesting data that could be mined in various ways. They are currently looking for people who know things like Hadoop and Python.

Klarna was covered in a recent Economist article about data-driven finance. It seems to be an interesting company that tries to do everything in Erlang, a functional programming language first developed at Ericsson. Klarna allows customers to shop online by typing their date of birth, name and address – they don’t actually pay until they have received the goods. This is made possible by combining and analyzing data that goes way beyond conventional credit scores.

There are a couple of ad optimization companies (surprise surprise) that hire big data experts. I’ve already blogged about one of them, Burt. Apparently, last fall they were looking for a “big data wizard” to work with Hadoop, Pig, HBase etc. AdAction were recently looking for Hadoop programmers. Keybroker have announced positions for people who know Hadoop, EC2 and Ruby on Rails.

I’m sure that many hedge funds and other financial companies use a lot of big data methods. I’ve heard that some hedge funds within Brummer work extensively with machine learning.

The only real tech company in this roundup is NeoTechnology, the creators of the graph database Neo4j. There was quite a buzz around this graph engine/noSQL database at the Strata conference.

There are also a couple of really interesting natural language processing oriented companies (perhaps Recorded Future could also have fitted into this slot). Saplo offers a text analysis API with functionality for things like entity tagging, sentiment analysis, similarity search and context recognition. Gavagai, a Stockholm-based company, doesn’t offer much information on their web page except for saying that they “… develop and employ automated and scalable methods for retrieving actionable intelligence from dynamic data.” Having met the founders through work a long time ago, I bet they are doing something really cool, though.

Again, I’d be happy to get more suggestions! Any other Swedish big data companies out there?

Update 2011-04-08 Per Mellqvist suggested the addition of Tailsweep, a marketing communication company that describes itself as a “leading media channel in blogs and social media”. According to Per, they have been Hadoop users for a while already. Apparently the name Tailsweep comes from the notion of wanting to “sweep” the long tail of social media.

Update 2012-01-15 Benoit Fallenius suggested the addition of Markify, a “name-screening tool” which identifies registered trademarks that look or sound similar to a given query (like a name you are considering for your new company). They use an algorithm which is trained on actual case literature of disputed trademark claims.

I also found a company called alaTest, which compiles and analyzes product reviews to help customers select the most suitable product for them. In their own words, they do “statistical analysis of review data, statistical matching between products and reviews, natural language processing, data mining and opinion mining. All built on a scalable infrastructure using open source software and modern web technologies like Tornado, Solr and REST.”

Also, Intellus seems to be some sort of business intelligence company, which I mention here because they have an advert out for a master’s thesis project “[…] in the bleeding edge technology of machine learning and distributed big data analysis.”

Update 2012-05-19 Based on our latest podcast episode, we should add King.com (online gaming) and Big Data AB (consultancy).

Update 2012-06-13 Augify is a Stockholm-based startup that aims to “capture, index and store large amounts of fast-changing data in real time”, “tell stories using data visualization and interactive infographics”, etc.

Update 2012-09-06

Tink seems to be in stealth mode, “looking to expand [their] team with brilliant backend developers/data scientists” as of 2012-09

Campanja – online advertising, use a lot of AI and Erlang

Svensk lånemarknad – helps customers find the best loans, looking for predictive analyst as of 2012-09

Update 2012-09-09

Tripbirds – “social hotel booking”

Update 2013-08-26

Experlytics – electronic data capture and prediction for medicine

ExpertMaker – AI based search, recommendation and discovery
