Follow the Data

A data driven blog

Archive for the category “Companies”

Data-intensive wellness companies

I had some trouble coming up with a term to describe the three companies that I will discuss here: Arivale, Q and iCarbonX. What they have in common (in my opinion) is that they

  • explicitly focus on individuals’ health and wellness (wellness monitoring),
  • generate molecular and other data using many different platforms (multi-omics), resulting in tens or hundreds of thousands of measurements for each individual data point,
  • use or claim to use artificial intelligence/machine learning to reach their goals.

So the heading of this blog post could just as well have been for instance “AI wellness companies” or “Molecular wellness monitoring companies”. The point with using “data-intensive” is that they all generate much more extensive molecular data on their users (DNA sequencing, RNA sequencing, proteomics, metagenomics, …) than, say, WellnessFX, LifeSum or more niche wellness solutions.

I associate these three companies with three big names in genomics.

Arivale was founded by Leroy Hood, who is president of the Institute for Systems Biology and was involved in developing automated DNA sequencing. In connection with Arivale, Hood has talked about dense dynamic data clouds that will allow individuals to track their health status and make better lifestyle decisions. Arivale’s web page also talks a lot about scientific wellness. They have different plans, including a 3,500 USD one-time plan. They sample blood, saliva and the gut microbiome and have special coaches who give feedback on findings, including genetic variants and how well you have done with your FitBit.

Q, or, (podcast about them here) seems to have grown out of Michael Snyder’s work on iPOPs, “integrative personal omics profiles”, which he first developed on himself, becoming the first person to undergo DNA sequencing, repeated RNA sequencing, metagenomics etc. in combination. (He has also been involved in a large number of other pioneering genomics projects.) Q’s web site and blog talk about quantified health and the importance of measuring your physiological variables regularly to get a “positive feedback loop”. In one of their blog posts, they talk about dentistry as a model system where we get regular feedback, have lots and lots of longitudinal data on people’s dental health, and therefore get continuously improving dental status at cheaper prices. They also make the following point: “We live in a world where we use millions of variables to predict what ad you will click on, what movie you might watch, whether you are creditworthy, the price of commodities, and even what the weather will be like next week. Yet, we continue to conduct limited clinical studies where we try and reduce our understanding of human health and pathology to single variable differences in groups of people, when we have enormous evidence that the results of these studies are not necessarily relevant for each and every one of us.”

iCarbonX, a Chinese company, was founded by (and is headed by) Wang Jun, the former wunderkind CEO of Beijing Genomics Institute/BGI. A couple of years ago, he gave an interview to Nature where he talked about why he was stepping down as BGI’s CEO to “devote himself to a new “lifetime project” of creating an AI health-monitoring system that would identify relationships between individual human genomic data, physiological traits (phenotypes) and lifestyle choices in order to provide advice on healthier living and to predict, and prevent, disease.” iCarbonX seems to be the company embodying that idea. Their website mentions “holographic health data” and talks a lot about artificial intelligence and machine learning, more so than the two other companies I highlight here. They also mention plans to profile millions of Chinese customers and to create an “intelligent robot” for personal health management. iCarbonX has just announced a collaboration with PatientsLikeMe, in which iCarbonX will provide “multi-omics characterization services.”

What to make of these companies? They are certainly intriguing and exciting. Regarding the multi-omics part, I know from personal experience that it is very difficult to integrate omics data sets in a meaningful way (that leads to some sort of actionable results), mostly for purely conceptual/mathematical reasons but also because of technical quality issues that impact each platform in a different way. I have seen presentations by Snyder and Hood and while they were interesting, I did not really see any examples of a result that had come through integrating multiple levels of omics (although it is of course useful to have results from “single-level omics” too!).

Similarly, with respect to AI/ML, I expect that a larger number of samples than what these companies have will be needed before, for instance, good deep learning models can be trained. On the other hand, the multi-omics aspect may prove helpful in a deep learning scenario if it turns out that information from different experiments can be combined in some sort of transfer learning setting.

As for the wellness benefits, it will likely be several years before we get good statistics on how large an improvement one can get by monitoring one’s molecular profiles (although it is certainly likely that it will be beneficial to some extent.)


There are some related companies or projects that I do not discuss above. For example, Craig Venter’s Human Longevity Inc is not dissimilar to these companies but I perceive it as more genome-sequencing focused and explicitly targeting various diseases and aging (rather than wellness monitoring.) Google’s/Verily’s Baseline study has some similarities with respect to multi-omics but is anonymized and not focused on monitoring health. There are several academic projects along similar lines (including one with which I am currently affiliated) but this blog post is about commercial versions of molecular wellness monitoring.


Finnish companies that do data science

I should start by saying that I have shamelessly poached this blog post from a LinkedIn thread started by one Ville Niemijärvi of Louhia Consulting in Finland. In my defence, LinkedIn conversations are rather ephemeral and I am not sure how completely they are indexed by search engines, so to me it makes sense to sometimes highlight them in a slightly more permanent manner.

Ville asked for input (and from now on I am paraphrasing and summarising) on companies in Finland that do data analytics “for real”, as in data science, predictive analytics, data mining or statistical modelling. He required that the proposed companies should have several “actual” analysts and be able to show references to work performed in advanced analytics (i.e. not pure visualization/reporting). In a later comment he also mentioned price optimization, cross-sell analysis, sales prediction, hypothesis testing, and failure modelling.

The companies that had been mentioned when I went through this thread are listed below. I’ve tried to lump them together into categories after a very superficial review and would be happy to be corrected if I have gotten something wrong.

[EDIT 2016-02-04 Added a bunch of companies.]

Louhia – analytics consulting (predictive analytics, Azure ML etc.)
BIGDATAPUMP – analytics consulting (Hadoop, AWS, cloud etc.)
Houston Analytics – analytics consulting (analytics partner of IBM)
Top Data Science – analytics and IT consulting
Gofore – IT architecture
Digia – IT consulting
Techila Technologies – distributed computing middleware
CGI – IT consulting, multinational
Teradata – data warehousing, multinational
Avanade – IT consulting, multinational
Deloitte – financial consulting, multinational
Information Builders – business intelligence, multinational
SAS Institute – analytics software, multinational
Tieto – IT services, multinational (but originally Finnish)
Aureolis – business intelligence
Olapcon – business intelligence
Big Data Solutions – business intelligence
Enfo Rongo – business intelligence
Bilot – business intelligence
Affecto – digital services
Siili – digital services
Reaktor – digital services
Valuemotive – digital services
Solita – digital services
Comptel – digital services?
Dagmar – marketing
Frankly Partners – marketing
ROIgrow – marketing
Probic – marketing
Avaus – marketing
InlineMarket – marketing automation
Steeri – customer analytics
Tulos Helsinki – customer analytics
Andumus – customer analytics
Avarea – customer analytics
Big Data Scoring – customer analytics
Suomen Asiakastieto – credit & risk management
Silta – HR analytics
Quva – industrial analytics
Ibisense – industrial analytics
Ramentor – industrial analytics
Indalgo – manufacturing analytics
TTS-Ciptec – optimization, sensor
SimAnalytics – logistics, simulation
Relex – supply chain analytics
Analyse2 – assortment planning
Genevia – bioinformatics consultancy
Fonecta – directory services
Monzuun – analytics as a service
Solutive – data visualization
Omnicom – communications agency
NAPA – naval analytics, ship operations
Primor consulting – telecom?

There was an interesting comment saying that CGI manages its global data science “virtual team” from Finland and that they employ several successful Kagglers, one of whom was rated #37 out of 450000 Kaggle users in 2014.

On a personal note, I was happy to find a commercial company (Genevia) which appears to do pretty much the same thing as I do in my day job at SciLifeLab Stockholm, that is, bioinformatics consulting (often with an emphasis on high throughput sequencing), except that I do it in an academic context.




Some interesting company announcements

  • Algorithmia, the open marketplace for algorithms, is now live. I find it an interesting concept: to build a community around algorithm development, where users can build on each other’s algorithms and make them available as a web service.
  • SolveBio is also in public beta as of today. Please refer to an older post for some hints on how to use the API.
  • ForecastThis just launched their DSX platform for automated model testing and building. It looked pretty impressive in a small trial with a tricky dataset I have: a large number of models were constructed and run on the dataset, with a set of metrics reported for each, and many of those looked better than the ones I had from e.g. random forest. However, the trial version of the platform does not include access to the actual models, so I wasn’t able to see the details.
  • Seven Bridges Genomics will release the first graph-based human reference genome and related tools. There has been a lot of talk about graph-based genome references (a graph is a natural way to represent the many kinds of natural variation found among human genomes), but the tools to handle them have been lacking.

Follow the Data podcast, episode 4: Self-tracking with Niklas Laninge

In this episode of our podcast, we shift our focus from the “big data” themes in episodes 1-3 to personal data and self-tracking. We talked to Niklas Laninge, founder of Psykologifabriken (“The Psychology Factory”) and COO of Hoa’s Tool Shop, which are both relatively new startups based in Stockholm and which use applied psychology in innovative ways to facilitate lasting behavior change – in the case of the latter company, using digital tools such as smart phone apps. Niklas is also an avid collector of data on himself and describes some things he has found out by analyzing those data – and remarks that “When my [Nike] Fuelband broke, part of myself broke as well.”

At one point, I (Mikael) miserably failed to get the details right about The Human Face of Big Data project, which I erroneously call “Faces of Big Data” in the podcast. Also, I said that it was created by Greenplum, when in fact it was developed by Against All Odds productions (Rick Smolan and Jennifer Erwitt) and sponsored by EMC (of which Greenplum is a division.)

Some of the things we discussed:

– Viary, a tool that facilitates behavior change in organizations or individuals

– Clinical trials showing promising results from using Viary to treat depression

– “Dance-offs” as a fun way to interact with people on the dance floor and get an extreme exercise session

Listen to the podcast | Follow The Data #4 : Self Tracking with Niklas Laninge

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions, whose results are shown on a dynamic leaderboard (which they do), but also to provide runnable code that is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R), which are called by the server software. Many groups do not like to use R for their model development, but there are ways to easily wrap arbitrary types of code inside R.
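To make the train/predict contract a bit more concrete, here is a minimal sketch in Python rather than R. That choice, the class name, the method names custom_train and custom_predict, and the toy run_submission function standing in for the server are all assumptions made for illustration; the real Synapse interface is the R pair customTrain() and customPredict() mentioned above.

```python
from statistics import mean


class MeanRiskModel:
    """Toy model: predicts every patient's risk as the mean training outcome."""

    def custom_train(self, features, outcomes):
        # Hypothetical analogue of customTrain(): fit on the training data.
        self.baseline = mean(outcomes)

    def custom_predict(self, features):
        # Hypothetical analogue of customPredict(): one prediction per row.
        return [self.baseline for _ in features]


def run_submission(model, train_x, train_y, test_x):
    """Stand-in for the server: it, not the competitor, calls both methods."""
    model.custom_train(train_x, train_y)
    predictions = model.custom_predict(test_x)
    if len(predictions) != len(test_x):
        raise ValueError("expected one prediction per test row")
    return predictions


predictions = run_submission(
    MeanRiskModel(),
    train_x=[[0.1, 2.0], [0.4, 1.5], [0.9, 0.3]],
    train_y=[1.0, 2.0, 6.0],
    test_x=[[0.2, 1.8], [0.7, 0.9]],
)
print(predictions)
```

The point of the design is that the platform only ever sees the two methods, so any model that honours the contract can be re-run, scored and reused by others.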

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e.g. Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
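A rough sketch of what such an iteration could look like is given below. It is not the team's code: it assumes Pearson correlation as the association measure, clips negative associations to zero, and raises the association to an arbitrary exponent of 5 to get the weights. The gene names and numbers are made up. The arXiv paper is the place to look for what the authors actually do.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def attractor_metagene(expression, seed_gene, exponent=5, tol=1e-9, max_iter=100):
    """Iterate from a seed gene towards a stable weighted combination of genes.

    expression maps gene name -> list of values, one per sample.
    Assumption: weights are clipped Pearson correlations raised to `exponent`.
    """
    genes = sorted(expression)
    n_samples = len(expression[seed_gene])
    metagene = list(expression[seed_gene])
    weights = {}
    for _ in range(max_iter):
        # Weight each gene by how strongly it is associated with the metagene.
        raw = {g: max(pearson(metagene, expression[g]), 0.0) ** exponent for g in genes}
        total = sum(raw.values())
        if total == 0:
            raise ValueError("no gene is positively associated with the metagene")
        new_weights = {g: raw[g] / total for g in genes}
        # The new metagene is the weighted linear combination of all genes.
        metagene = [
            sum(new_weights[g] * expression[g][s] for g in genes)
            for s in range(n_samples)
        ]
        converged = bool(weights) and max(
            abs(new_weights[g] - weights[g]) for g in genes
        ) < tol
        weights = new_weights
        if converged:
            break
    return weights, metagene


expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "GENE_B": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "GENE_C": [0.9, 2.2, 3.1, 3.8, 4.9, 6.2],
    "GENE_D": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "GENE_E": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}
weights, metagene = attractor_metagene(expression, seed_gene="GENE_A")
for gene in sorted(weights):
    print(gene, round(weights[gene], 4))
```

In this toy data the three co-expressed genes end up sharing almost all of the weight, which is the "attractor" behaviour the name refers to.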

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
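The two simplest kinds of ensembling mentioned above, averaging and majority voting, can be sketched in a few lines of Python. The model names and predictions are invented for illustration and have nothing to do with the actual leaderboard entries.

```python
from collections import Counter
from statistics import mean


def average_ensemble(prediction_lists):
    """Average numeric predictions from several models, sample by sample."""
    return [mean(sample) for sample in zip(*prediction_lists)]


def majority_vote(label_lists):
    """Pick the most common class label per sample (ties go to the label seen first)."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*label_lists)]


# Hypothetical risk scores from three models for the same three patients.
risk_scores = [
    [0.2, 0.8, 0.5],  # model_a
    [0.4, 0.6, 0.9],  # model_b
    [0.3, 0.7, 0.1],  # model_c
]
averaged = average_ensemble(risk_scores)

# Hypothetical class labels from the same three models.
risk_labels = [
    ["high", "low", "low"],
    ["high", "high", "low"],
    ["low", "high", "low"],
]
voted = majority_vote(risk_labels)
print(averaged, voted)
```

Because the models and the data live on the same platform, this kind of combination only needs the stored predictions, not any cooperation from the original authors.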

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), or how they looked at millions of models to find general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).

Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Follow the Data podcast, episode 2: King of BigData

In the second episode of the FTD podcast, we talked to big data consultant Johan Pettersson (his company is actually called Big Data AB; what a catch to be able to obtain that name despite the Swedish trademark regulations!) and Thomas Hartwig, CTO of, a company that produces “skill games” where you can win money by being more skillful than your competitors.

We knew practically nothing about the company coming into the interview and were surprised to learn that they are the second biggest game producer on Facebook! Some other things of note from the interview:

  • The company currently captures about 1.5 billion game events each day from about 12 million users per day;
  • They don’t have a dedicated data analysis group but rather an “embedded analyst”  in each developer team (each game has its own team);
  • Johan Pettersson does not think the demand for big data specialists or data scientists in Sweden is that high at the moment (although everyone is talking about “big data”, almost no one is really working with it), but it will probably be in 1-2 years.
  • However, good data analysts are in high demand and therefore hard to find.

Podcast link: Follow the Data | Episode 2 – King of BigData Podcast

The “AfterDark” discussion afterwards (in Swedish)
Follow the Data | Episode 2 – King of BigData After Dark

Follow the Data podcast, episode 1: Gavagai! Gavagai!

We have made available the first episode of the Follow the Data podcast! Hope you enjoy it.

Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai!

This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting.

Some interesting tidbits from the conversation:

  • The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page.
  • Olsson describes Ethersource as a “semantic processing layer of the big data stack” and a “base technology for semantics.” An alternative, more everyday description would be the one in this nice interview from Scandinavian Startups: “Finding meaning before it is evident.”
  • Ethersource learns meaning from text, which is the core of the technology; use cases include “sentiment analysis on steroids”, textual profiling and market analysis.
  • The Ethersource system is based on intrinsically scalable technology (which toward the end of the podcast turned out to be based on mimicking computation in the brain and “sparse distributed representation”) which can ingest any type of linguistic data stream; Gavagai have not been able to “saturate the system” in terms of storage despite ingesting everything they can get their hands on. The underlying technology is based on “random indexing” which is basically a kind of random projection approach (according to Sahlgren); a dimensionality reduction method which allows incremental processing (rather than, e.g., running huge SVDs.)
  • As a result of the underlying design, Ethersource builds up representations of concepts as it incorporates new data; Gavagai formulates this in the phrase “training equals learning.” The concept-based approach means that the system is extremely good at handling spelling errors and synonyms.
  • Ethersource is not based on concepts such as “documents” or “tweets”, which are completely artificial, according to chief scientist Sahlgren.
  • The system’s design also means that it does not have any problems handling different languages, even languages that use different text encodings.
  • Gavagai did not start out as a “big data” company but they are now relatively comfortable in their role as one.
  • Fredrik Olsson used to work for Recorded Future, which he feels is not a competitor to Gavagai, but would be a perfect customer.
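For readers who have not come across random indexing, mentioned in the list above, here is a toy sketch of the general idea in Python. It is in no way Ethersource's implementation. The dimensionality, the number of non-zero entries, the window size and the example sentences are all arbitrary assumptions. Each word gets a fixed sparse random "index vector", and a word's "context vector" is simply the running sum of the index vectors of its neighbours, so new text can be added incrementally without recomputing anything.

```python
import math
import random
from collections import defaultdict

DIM = 200      # size of the reduced space (arbitrary choice for this sketch)
NONZERO = 8    # number of +1/-1 entries in each index vector (also arbitrary)


def index_vector(word):
    """Return a fixed sparse random vector for a word as {position: +1 or -1}."""
    rng = random.Random(word)  # seeding with the word makes the vector reproducible
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def ingest(context_vectors, tokens, window=2):
    """Add one token stream incrementally: each word's context vector
    accumulates the index vectors of its neighbours within the window."""
    for i, word in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                for pos, sign in index_vector(tokens[j]).items():
                    context_vectors[word][pos] += sign


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(pos, 0) for pos, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


context_vectors = defaultdict(lambda: defaultdict(int))
stream = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate the food",
    "the dog ate the food",
    "the stock price fell sharply",
    "the stock market fell sharply",
]
for sentence in stream:  # sentences can keep arriving; nothing is rebuilt
    ingest(context_vectors, sentence.split())

sim_cat_dog = cosine(context_vectors["cat"], context_vectors["dog"])
sim_cat_stock = cosine(context_vectors["cat"], context_vectors["stock"])
print(sim_cat_dog, sim_cat_stock)
```

Words that occur in the same contexts ("cat" and "dog" here) end up with near-identical context vectors, while "stock" is pulled in a different direction. The memory used per word is bounded by DIM however much text is ingested.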

Joel and I were perhaps not very comfortable in our new roles as podcasters and struggled a bit with finding the right words in English. We also recorded a post-show chat in Swedish where we are more relaxed and coherent. Some tidbits from this part, which we also plan to put online at some point:

  • The Gavagai founders have a radical view of linguistics, where there is no hard line between syntax and semantics, but rather a kind of continuum.
  • They don’t believe in sampling, but try to ingest everything they can find into the system.
  • The Gavagai team tries to put aside some time every day to look at interesting concepts and connections between concepts discovered by the system.
  • They expected that a word like apple (Apple) would have a large number of different meanings, but when they looked at data from social media during a specific period in time, it had just three major meanings.
  • Language does its own disambiguation; for example, after Apple has become well-known as a software company, people have started to talk more about “apples” rather than “an apple” when they mean the fruit (if I interpreted Magnus correctly).
  • They view the stock market as a way to validate their semantic analysis. “Stock prices are the closest you can get to an objective validation.”
  • The founders came from a research background, and found that starting Gavagai gave a huge boost to their research activities due to the new pressure to build and release something that works in the “real world”.

In the evening of the day of the interview (March 9, 2012), Swedish daily Svenska Dagbladet released an article about Gavagai’s Ethersource-based real-time sentiment tracking of the buzz around the contestants who would appear in the Swedish Eurovision finals the following day. In the end, the Ethersource forecasts turned out to be very accurate.

Although it’s far from clear what the next episodes of the podcast will be about, in general we will restrict ourselves to interviewing interesting companies or scientists (rather than just talking amongst ourselves), with a bias towards Swedish interviewees since this is where we are located and it might be interesting for people from other locations to hear what is going on here.

EDIT 17/3 2012: Our podcast jingle was created by Karl Ekdahl, the man behind the awesome Ekdahl Moisturizer, among many other things.

Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link –

  • Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the world’s most accurate comprehensive trademark search.
  • alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
  • Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
  • Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping on the “data science” train as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit).  As this blog post puts it, “the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper).”

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing” based on generating “decoy plots” that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called nullabor. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.
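The last two sentences of that quote translate almost directly into code. The sketch below is a plain permutation test in Python, not what nullabor does (the package is about plots judged by eye). The "innocents" are generated by shuffling group labels, and the p-value is the proportion of them that look at least as guilty as the observed data. The measurements, group names and number of permutations are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Proportion of label-shuffled data sets whose difference in means is
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))  # the test statistic
    pooled = list(group_a) + list(group_b)
    at_least_as_guilty = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # one "innocent": group labels carry no information
        fake_a, fake_b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(fake_a) - mean(fake_b)) >= observed:
            at_least_as_guilty += 1
    return at_least_as_guilty / n_permutations


# Made-up measurements: the first pair of groups is clearly separated,
# the second pair is not.
treated = [12.1, 11.8, 12.6, 12.9, 13.2, 12.4]
control = [10.2, 10.9, 9.8, 10.5, 11.0, 10.1]
p_separated = permutation_p_value(treated, control)

similar_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
similar_b = [1.5, 2.5, 3.5, 4.5, 5.5, 0.5]
p_similar = permutation_p_value(similar_a, similar_b)
print(p_separated, p_similar)
```

A lineup of decoy plots works on the same principle, except that the comparison between the accused and the innocents is made by a human observer instead of a test statistic.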

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.
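To give a feeling for what "comparing clusterings" can mean in practice, here is a small Python sketch that turns the agreement between pairs of clusterings into distances. It uses the Rand index, which is my own stand-in and not necessarily the measure used in the paper, and the three clusterings are invented.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster labels assigned to the same six documents by three hypothetical methods.
clusterings = {
    "method_1": [0, 0, 1, 1, 2, 2],
    "method_2": [1, 1, 0, 0, 2, 2],  # same partition as method_1, labels renamed
    "method_3": [0, 1, 0, 1, 0, 1],
}

# Pairwise distances between clusterings: 0 means identical partitions.
distances = {
    (a, b): 1.0 - rand_index(clusterings[a], clusterings[b])
    for a, b in combinations(clusterings, 2)
}
for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

A matrix of such distances is the kind of input that could then be projected into two dimensions (for instance with multidimensional scaling) to get a map of methods to explore.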

Big data companies in Sweden

Edit in Feb 2015

A lot of people are still visiting this page. Bear in mind that it was written about four years ago, in 2011, and is unlikely to give a good overview of data-driven companies in Sweden today. Thank you for your cooperation. I may try to update it in some form at some point.

Alphabetically ordered list (see below for context & edit history)

AdAction – ad optimization – Stockholm

alaTest – product review comparisons – Stockholm

Augify – Real-time information capture, interpretation and visualization (or something like that) – Stockholm

Big Data AB – big data consultancy – Stockholm

Brummer – hedge fund – Stockholm

Burt – ad optimization – Gothenburg

Campanja – Online advertising – Stockholm

Experlytics – health analytics – Malmö

ExpertMaker – search, recommendation and discovery – Malmö

Gavagai – text analysis – Stockholm

Intellus – business intelligence – Stockholm

Keybroker – ad optimization – Stockholm – online gaming – Stockholm

Klarna – online payment services – Stockholm

Markify – trademark search – Stockholm

NeoTechnology – graph database development – Malmö (?)

Recorded Future – temporal analytics – Gothenburg

Saplo – text analysis – Malmö

Spotify – music streaming service – Stockholm (HQ), Gothenburg

Svensk lånemarknad – Helps customers find the best loans – Stockholm

Tailsweep – market communication – Stockholm

Tink – Stockholm

Tripbirds – “social hotel booking” – Stockholm


When I was at the Strata conference earlier this spring, I noticed there were very few European participants. While the US and Silicon Valley in particular seemed to be going nuts over “big data” and analytics, I haven’t seen much buzz in Sweden (where I currently live) or in Europe as a whole. This question thread on Quora seems to confirm that there really aren’t that many European big data companies. If there are few in Europe, there should be essentially none in little Sweden. So I thought it would be fun to round up the Swedish big data-related companies that I’m aware of.

I know as well as anyone that the term “big data” is not well defined and that I have inevitably missed many companies (false negatives) and perhaps included some that don’t consider themselves big data companies (false positives, at least from their point of view!). I’d be very happy to get feedback from companies either way. By the way, there is a nice discussion (also on Quora) about what the big data space looks like as a whole, mostly from a US perspective. But back to Sweden!

Recorded Future, which I’ve blogged about repeatedly and which has also been covered in Wired (in a rather hyperbolic way, one might add), has development offices in Gothenburg. Its founder Christopher Ahlberg previously created the useful analytics/visualization package Spotfire, which has since been sold to Tibco.

Spotfire sounds a bit like Spotify, which is of course a very popular music streaming service that has taken at least Sweden (and soon the world?) by storm. I know I have moved maybe 80% of my music listening time into Spotify. Like the “old” music recommendation/social network service, Spotify has a lot of interesting data that could be mined in various ways. They are currently looking for people who know things like Hadoop and Python.

Klarna was covered in a recent Economist article about data-driven finance. It seems to be an interesting company that tries to do everything in Erlang, a functional programming language first developed at Ericsson. Klarna allows customers to shop online by typing their date of birth, name and address – they don’t actually pay until they have received the goods. This is made possible by combining and analyzing data that goes way beyond conventional credit scores.

There are a couple of ad optimization companies (surprise surprise) that hire big data experts. I’ve already blogged about one of them, Burt. Apparently, last fall they were looking for a “big data wizard” to work with Hadoop, Pig, HBase etc. AdAction were recently looking for Hadoop programmers. Keybroker have announced positions for people who know Hadoop, EC2 and Ruby on Rails.

I’m sure that many hedge funds and other financial companies use a lot of big data methods. I’ve heard that some hedge funds within Brummer work extensively with machine learning.

The only real tech company in this roundup is NeoTechnology, the creators of the graph database Neo4j. There was quite a buzz around this graph engine/NoSQL database at the Strata conference.

There are also a couple of really interesting companies oriented toward natural language processing (perhaps Recorded Future could also have fitted into this slot). Saplo offers a text analysis API with functionality for things like entity tagging, sentiment analysis, similarity search and context recognition. Gavagai, a Stockholm-based company, doesn’t offer much information on their web page except for saying that they “… develop and employ automated and scalable methods for retrieving actionable intelligence from dynamic data.” Having met the founders through work a long time ago, I bet they are doing something really cool, though.

Again, I’d be happy to get more suggestions! Any other Swedish big data companies out there?

Update 2011-04-08 Per Mellqvist suggested the addition of Tailsweep, a marketing communication company that describes itself as a “leading media channel in blogs and social media”. According to Per, they have been Hadoop users for a while already. Apparently the name Tailsweep comes from the notion of wanting to “sweep” the long tail of social media.

Update 2012-01-15 Benoit Fallenius suggested the addition of Markify, a “name-screening tool” which identifies registered trademarks that look or sound similar to a given query (like a name you are considering for your new company). They use an algorithm which is trained on actual case literature of disputed trademark claims.

I also found a company called alaTest, which compiles and analyzes product reviews to help customers select the most suitable product for them. In their own words, they do “statistical analysis of review data, statistical matching between products and reviews, natural language processing, data mining and opinion mining. All built on a scalable infrastructure using open source software and modern web technologies like Tornado, Solr and REST.”

Also, Intellus seems to be some sort of business intelligence company, which I mention here because they have an advert out for a master’s thesis project “[…] in the bleeding edge technology of machine learning and distributed big data analysis.”

Update 2012-05-19 Based on our latest podcast episode, we should add (online gaming) and Big Data AB (consultancy).

Update 2012-06-13 Augify is a Stockholm-based startup that aims to “capture, index and store large amounts of fast-changing data in real time”, “tell stories using data visualization and interactive infographics”, etc.

Update 2012-09-06

Tink seems to be in stealth mode, “looking to expand [their] team with brilliant backend developers/data scientists” as of 2012-09

Campanja – online advertising, use a lot of AI and Erlang

Svensk lånemarknad – helps customers find the best loans, looking for predictive analyst as of 2012-09

Update 2012-09-09

Tripbirds – “social hotel booking”

Update 2013-08-26

Experlytics – electronic data capture and prediction for medicine

ExpertMaker – AI based search, recommendation and discovery

Network medicine startups

There are two (well, I’m sure there are really more) interesting new startups that combine medicine with networks, albeit in different ways. NuMedii (which appears to be shorthand for New Indications of Medicines) uses a data-driven approach to discover new indications for previously existing drugs. This is potentially very useful because existing drugs have already gone through rigorous tests for toxicity etc. and are therefore easier to bring to market than a drug developed from scratch. NuMedii’s technology is based on academic work from Stanford and they have a killer team that includes the likes of Atul Butte and Eric Schadt. The company is currently looking for what is essentially a bioinformatics-slanted big data scientist; one of the responsibilities related to this position is to “Architect, develop, maintain, and document a computational infrastructure that efficiently executes complex queries across many terabytes (potentially petabytes!) of disparate data and knowledge on genomics, genetics, pharmaceuticals, and chemicals.” Petabytes!

MedNetworks is also interesting, though a bit different. Its technology is based on the well-publicized work of Nicholas Christakis and colleagues at Harvard about how things like smoking and obesity appear to spread in social networks in an almost contagious way. (As an aside, I saw a random hipster at a Stockholm café sporting a copy of Christakis’ and Fowler’s book Connected: The Surprising Power of Our Social Networks – maybe network science is belatedly going mainstream here too!) MedNetworks studies things like how prescriptions of drugs are affected by the structures of social networks of physicians and patients. They attempt to identify “high influencers” in social networks, which is not necessarily the same as highly connected people. These high influencers have a strong influence on how drug prescribing behavior “diffuses” in a social network. Quoting the company website: “Optimized targeting for promotion based on social network influence provides a more efficient and effective approach to both personal and non-personal promotion.”
