Follow the Data

A data driven blog

Archive for the category “Companies”

Follow the Data podcast, episode 4: Self-tracking with Niklas Laninge

In this episode of our podcast, we shift our focus from the “big data” themes in episodes 1-3 to personal data and self-tracking. We talked to Niklas Laninge, founder of Psykologifabriken (“The Psychology Factory”) and COO of Hoa’s Tool Shop, which are both relatively new startups based in Stockholm and which use applied psychology in innovative ways to facilitate lasting behavior change – in the case of the latter company, using digital tools such as smartphone apps. Niklas is also an avid collector of data on himself and describes some things he has found out by analyzing those data – and remarks that “When my [Nike] Fuelband broke, part of myself broke as well.”

At one point, I (Mikael) miserably failed to get the details right about The Human Face of Big Data project, which I erroneously call “Faces of Big Data” in the podcast. Also, I said that it was created by Greenplum, when in fact it was developed by Against All Odds productions (Rick Smolan and Jennifer Erwitt) and sponsored by EMC (of which Greenplum is a division.)

Some of the things we discussed:

- Viary, a tool that facilitates behavior change in organizations or individuals

- Clinical trials showing promising results from using Viary to treat depression

- “Dance-offs” as a fun way to interact with people on the dance floor and get an extreme exercise session

Listen to the podcast | Follow The Data #4 : Self Tracking with Niklas Laninge

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions, for which the results are shown on a dynamic leaderboard (which they do), but also to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.
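
To make the two-method contract concrete, here is a minimal sketch in Python rather than R. Everything in it is an assumption for illustration: the names custom_train, custom_predict and run_submission are hypothetical stand-ins, and the harness only mimics the idea of server software calling the competitor's code. It is not Synapse's actual interface.

```python
from statistics import mean


class MeanBaselineModel:
    """Toy submission: predicts the training-set mean outcome for every sample."""

    def custom_train(self, features, outcomes):
        # A real model would fit to the features; this baseline ignores them.
        self.mean_ = mean(outcomes)
        return self

    def custom_predict(self, features):
        return [self.mean_ for _ in features]


def run_submission(model, train_x, train_y, test_x):
    """Hypothetical server-side harness: train first, then predict on held-out data."""
    model.custom_train(train_x, train_y)
    return model.custom_predict(test_x)


predictions = run_submission(
    MeanBaselineModel(),
    train_x=[[1.0], [2.0], [3.0]],
    train_y=[10.0, 20.0, 30.0],
    test_x=[[4.0], [5.0]],
)
```

The point of such a contract is that the organizers, not the competitors, produce the predictions, so a leaderboard entry can always be traced back to code that ran.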

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e g Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
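
The iterative idea can be sketched roughly as follows. This is a simplified illustration, not the team's code: it assumes Pearson correlation as the association measure and an exponent of 5 to sharpen the weights, and both choices, like the function names, are assumptions made for this sketch. Starting from a seed gene, each gene is weighted by how strongly it is associated with the current metagene, the metagene is recomputed as the weighted average, and the loop repeats until the weights stop changing.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def attractor_metagene(expr, seed_gene, exponent=5.0, tol=1e-9, max_iter=100):
    """expr maps gene name -> list of expression values (one per sample)."""
    genes = list(expr)
    n_samples = len(expr[seed_gene])
    metagene = list(expr[seed_gene])
    weights = {}
    for _ in range(max_iter):
        # Negatively associated genes get weight zero; the exponent sharpens the rest.
        raw = {g: max(pearson(expr[g], metagene), 0.0) ** exponent for g in genes}
        total = sum(raw.values())
        if total == 0:
            break
        new_weights = {g: w / total for g, w in raw.items()}
        metagene = [
            sum(new_weights[g] * expr[g][i] for g in genes) for i in range(n_samples)
        ]
        converged = bool(weights) and (
            max(abs(new_weights[g] - weights[g]) for g in genes) < tol
        )
        weights = new_weights
        if converged:
            break
    return weights, metagene


# Made-up toy data: A and B are co-expressed, C is anti-correlated, D is noise.
toy_expression = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [5, 4, 3, 2, 1],
    "D": [3, 1, 4, 1, 5],
}
weights, metagene = attractor_metagene(toy_expression, seed_gene="A")
```

On the toy data the co-expressed genes A and B end up carrying almost all of the weight, which is the sense in which the procedure is drawn towards a stable metagene.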

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
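
The two simplest ways of combining models mentioned above, averaging and majority voting, can be sketched in a few lines. The scores and labels are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def average_ensemble(score_lists):
    """Average the numeric predictions of several models, sample by sample."""
    return [mean(column) for column in zip(*score_lists)]


def majority_vote(label_lists):
    """Pick the most common predicted label for each sample."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*label_lists)]


# Each inner list holds one model's predictions for the same two (or three) patients.
risk_scores = [[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]]
risk_labels = [
    ["high", "low", "low"],
    ["high", "high", "low"],
    ["low", "high", "low"],
]

averaged = average_ensemble(risk_scores)
voted = majority_vote(risk_labels)
```

With an even number of voters a tie-breaking rule would be needed; the sketch sidesteps that by using three models.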

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), and how they looked at millions of models to find out general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).

Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Follow the Data podcast, episode 2: King of BigData

In the second episode of the FTD podcast, we talked to big data consultant Johan Pettersson (his company is actually called Big Data AB; what a catch to be able to obtain that name despite the Swedish trademark regulations!) and Thomas Hartwig, CTO of King.com, a company that produces “skill games” where you can win money by being more skillful than your competitors.

We knew practically nothing about King.com coming into the interview and were surprised to learn that they are the second biggest game producer on Facebook! Some other things of note from the interview:

  • King.com currently captures about 1.5 billion game events each day from about 12 million users per day;
  • They don’t have a dedicated data analysis group but rather an “embedded analyst”  in each developer team (each game has its own team);
  • Johan Pettersson does not think the demand for big data specialists or data scientists in Sweden is that high at the moment (although everyone is talking about “big data”, almost no one is really working with it), but it will probably be in 1-2 years.
  • However, good data analysts are in high demand and therefore hard to find.

Podcast link: Follow the Data | Episode 2 – King of BigData Podcast

The “AfterDark” discussion afterwards (in Swedish)
Follow the Data | Episode 2 – King of BigData After Dark

Follow the Data podcast, episode 1: Gavagai! Gavagai!

We have made available the first episode of the Follow the Data podcast! Hope you enjoy it.

Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai!

This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting.

Some interesting tidbits from the conversation:

  • The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page.
  • Olsson describes Ethersource as a “semantic processing layer of the big data stack” and a “base technology for semantics.” An alternative, more everyday description would be the one in this nice interview from Scandinavian Startups: “Finding meaning before it is evident.”
  • Ethersource learns meaning from text, which is the core of the technology; use cases include “sentiment analysis on steroids”, textual profiling and market analysis.
  • The Ethersource system is based on intrinsically scalable technology (which toward the end of the podcast turned out to be based on mimicking computation in the brain and “sparse distributed representation”) which can ingest any type of linguistic data stream; Gavagai have not been able to “saturate the system” in terms of storage despite ingesting everything they can get their hands on. The underlying technology is based on “random indexing” which is basically a kind of random projection approach (according to Sahlgren); a dimensionality reduction method which allows incremental processing (rather than, e.g., running huge SVDs.)
  • As a result of the underlying design, Ethersource builds up representations of concepts as it incorporates new data; Gavagai formulates this in the phrase “training equals learning.” The concept-based approach means that the system is extremely good at handling spelling errors and synonyms.
  • Ethersource is not based on concepts such as “documents” or “tweets”, which are completely artificial, according to chief scientist Sahlgren.
  • The system’s design also means that it does not have any problems handling different languages, even languages that use different text encodings.
  • Gavagai did not start out as a “big data” company but they are now relatively comfortable in their role as one.
  • Fredrik Olsson used to work for Recorded Future, which he feels is not a competitor to Gavagai, but would be a perfect customer.
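
Random indexing, as mentioned in the list above, can be sketched in a few lines. This is a generic textbook-style illustration and not Ethersource's implementation: the dimensionality, the number of non-zero entries, the window size and all names are assumptions. Each word gets a fixed sparse random "index vector", and a word's "context vector" is simply the running sum of the index vectors of the words seen near it, so new text can be folded in incrementally without recomputing anything.

```python
import random
from collections import defaultdict

DIM = 512      # assumed dimensionality of the reduced space
NONZERO = 8    # assumed number of +1/-1 entries per index vector


def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary vector stored as {position: +1 or -1}, fixed per word."""
    rng = random.Random(word)  # seeding with the word keeps the vector stable
    positions = rng.sample(range(dim), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}


class RandomIndexer:
    def __init__(self, window=2):
        self.window = window
        self.context = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Fold one more tokenized text into the context vectors."""
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        self.context[word][pos] += sign

    def similarity(self, a, b):
        """Cosine similarity between the context vectors of two words."""
        va, vb = self.context[a], self.context[b]
        dot = sum(v * vb.get(p, 0) for p, v in va.items())
        na = sum(v * v for v in va.values()) ** 0.5
        nb = sum(v * v for v in vb.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0


indexer = RandomIndexer()
for sentence in [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "stocks fell sharply today",
]:
    indexer.update(sentence.split())

sim_cat_dog = indexer.similarity("cat", "dog")
sim_cat_stocks = indexer.similarity("cat", "stocks")
```

Words that occur in similar contexts ("cat" and "dog" here) end up with similar context vectors, while the memory per word stays bounded by the chosen dimensionality no matter how much text is ingested.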

Joel and I were perhaps not very comfortable in our new roles as podcasters and struggled a bit with finding the right words in English. We also recorded a post-show chat in Swedish where we are more relaxed and coherent. Some tidbits from this part, which we also plan to put online at some point:

  • The Gavagai founders have a radical view of linguistics, where there is no hard line between syntax and semantics, but rather a kind of continuum.
  • They don’t believe in sampling, but try to ingest everything they can find into the system.
  • The Gavagai team tries to put aside some time every day to look at interesting concepts and connections between concepts discovered by the system.
  • They expected that a word like apple (Apple) would have a large number of different meanings, but when they looked at data from social media during a specific period in time, it had just three major meanings.
  • Language does its own disambiguation; for example, after Apple has become well-known as a software company, people have started to talk more about “apples” rather than “an apple” when they mean the fruit (if I interpreted Magnus correctly).
  • They view the stock market as a way to validate their semantic analysis. “Stock prices are the closest you can get to an objective validation.”
  • The founders came from a research background, and found that starting Gavagai gave a huge boost to their research activities due to the new pressure to build and release something that works in the “real world”.

In the evening of the day of the interview (March 9, 2012), Swedish daily Svenska Dagbladet released an article about Gavagai’s Ethersource-based real-time sentiment tracking of the buzz around the contestants who would appear in the Swedish Eurovision finals the following day. In the end, the Ethersource forecasts turned out to be very accurate.

Although it’s far from clear what the next episodes of the podcast will be about, in general we will restrict ourselves to interviewing interesting companies or scientists (rather than just talking amongst ourselves), with a bias towards Swedish interviewees since this is where we are located and it might be interesting for people from other locations to hear what is going on here.

EDIT 17/3 2012: Our podcast jingle was created by Karl Ekdahl, the man behind the awesome Ekdahl Moisturizer, among many other things.

Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link -

  • Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the world’s most accurate comprehensive trademark search.
  • alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
  • Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
  • Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping on the “data science” train as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit).  As this blog post puts it, “the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper).”

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing” based on generating “decoy plots” that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called nullabor. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.
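
The last two sentences of the quote translate almost directly into code. The following is a minimal sketch of a permutation test, which is my own choice of illustration and not the lineup procedure implemented in nullabor: the "innocents" are generated by shuffling group labels, the test statistic is the absolute difference in group means, and the data are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Proportion of label-shuffled data sets that look at least as 'guilty'."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # one draw from the null distribution
        statistic = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if statistic >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Clearly separated groups: few shuffles should look this extreme.
p_separated = permutation_p_value(
    [5.1, 5.3, 5.0, 5.4, 5.2, 5.5], [4.1, 4.0, 4.3, 4.2, 3.9, 4.4]
)
# Heavily overlapping groups: many shuffles should look at least this extreme.
p_overlapping = permutation_p_value(
    [1, 2, 3, 4, 5, 6], [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
)
```

The visual version in the paper replaces the numeric test statistic with a human viewer: if the real plot can be picked out among the decoys, the accused "looks guilty".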

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.

Big data companies in Sweden

Alphabetically ordered list (see below for context & edit history)

AdAction - ad optimization – Stockholm

alaTest - product review comparisons – Stockholm

Augify - Real-time information capture, interpretation and visualization (or something like that) – Stockholm

Big Data AB - big data consultancy – Stockholm

Brummer - hedge fund – Stockholm

Burt - ad optimization – Gothenburg

Campanja - Online advertising – Stockholm

Gavagai - text analysis – Stockholm

Intellus - business intelligence – Stockholm

Keybroker - ad optimization – Stockholm

King.com - online gaming – Stockholm

Klarna - online payment services – Stockholm

Markify - trademark search – Stockholm

NeoTechnology - graph database development – Malmö (?)

Recorded Future - temporal analytics – Gothenburg

Saplo - text analysis – Malmö

Spotify - music streaming service – Stockholm (HQ), Gothenburg

Svensk lånemarknad - Helps customers find the best loans – Stockholm

Tailsweep - market communication – Stockholm

Tink - Stockholm

Tripbirds - “social hotel booking” – Stockholm

***

When I was at the Strata conference earlier this spring, I noticed there were very few European participants. While the US and Silicon Valley in particular seemed to be going nuts over “big data” and analytics, I haven’t seen much buzz in Sweden (where I currently live) or in Europe as a whole. This question thread on Quora seems to confirm that there really aren’t that many European big data companies. If there are few in Europe, there should be essentially none in little Sweden. So I thought it would be fun to round up the Swedish big data-related companies that I’m aware of.

I know as well as anyone that the term “big data” is not well defined and that I have inevitably missed many companies (false negatives) and perhaps included some that don’t consider themselves big data companies (false positives, at least from their point of view!). I’d be very happy to get feedback from companies either way. By the way, there is a nice discussion (also on Quora) about what the big data space looks like as a whole, mostly from a US perspective. But back to Sweden!

Recorded Future, which I’ve blogged about repeatedly and which has also been covered in Wired (in a rather hyperbolic way, one might add), has development offices in Gothenburg. Its founder Christopher Ahlberg previously created the useful analytics/visualization package Spotfire, which has since been sold to Tibco.

Spotfire sounds a bit like Spotify, which is of course a very popular music streaming service that has taken at least Sweden (and soon the world?) by storm. I know I have moved maybe 80% of my music listening time into Spotify. Like the “old” music recommendation/social network service last.fm, Spotify has a lot of interesting data that could be mined in various ways. They are currently looking for people who know things like Hadoop and Python.

Klarna was covered in a recent Economist article about data-driven finance. It seems to be an interesting company that tries to do everything in Erlang, a functional programming language first developed at Ericsson. Klarna allows customers to shop online by typing their date of birth, name and address – they don’t actually pay until they have received the goods. This is made possible by combining and analyzing data that goes way beyond conventional credit scores.

There are a couple of ad optimization companies (surprise surprise) that hire big data experts. I’ve already blogged about one of them, Burt. Apparently, last fall they were looking for a “big data wizard” to work with Hadoop, Pig, HBase etc. AdAction were recently looking for Hadoop programmers. Keybroker have announced positions for people who know Hadoop, EC2 and Ruby on Rails.

I’m sure that many hedge funds and other financial companies use a lot of big data methods. I’ve heard that some hedge funds within Brummer work extensively with machine learning.

The only real tech company in this roundup is NeoTechnology, the creators of the graph database Neo4j. There was quite a buzz around this graph engine/noSQL database at the Strata conference.

There are also a couple of really interesting natural language processing oriented companies (perhaps Recorded Future could also have fitted into this slot): Saplo, which offers a text analysis API with functionality for things like entity tagging, sentiment analysis, similarity search and context recognition. And Gavagai, a Stockholm-based company, doesn’t offer much information on their web page except for saying that they “… develop and employ automated and scalable methods for retrieving actionable intelligence from dynamic data.” Having met the founders through work a long time ago, I bet they are doing something really cool, though.

Again, I’d be happy to get more suggestions! Any other Swedish big data companies out there?

Update 2011-04-08 Per Mellqvist suggested the addition of Tailsweep, a marketing communication company that describes itself as a “leading media channel in blogs and social media”. According to Per, they have been Hadoop users for a while already. Apparently the name Tailsweep comes from the notion of wanting to “sweep” the long tail of social media.

Update 2012-01-15 Benoit Fallenius suggested the addition of Markify, a “name-screening tool” which identifies registered trademarks that look or sound similar to a given query (like a name you are considering for your new company). They use an algorithm which is trained on actual case literature of disputed trademark claims.

I also found a company called alaTest, which compiles and analyzes product reviews to help customers select the most suitable product for them. In their own words, they do  “statistical analysis of review data, statistical matching between products and reviews, natural language processing, data mining and opinion mining. All built on a scalable infrastructure using open source software and modern web technologies like Tornado, Solr and REST.”

Also, Intellus seems to be some sort of business intelligence company, which I mention here because they have an advert out for a master’s thesis project ” [...] in the bleeding edge technology of machine learning and distributed big data analysis.”

Update 2012-05-19 Based on our latest podcast episode, we should add King.com (online gaming) and Big Data AB (consultancy).

Update 2012-06-13 Augify is a Stockholm-based startup that aims to “capture, index and store large amounts of fast-changing data in real time”, “tell stories using data visualization and interactive infographics”, etc.

Update 2012-09-06

Tink seems to be in stealth mode, “looking to expand [their] team with brilliant backend developers/data scientists” as of 2012-09

Campanja – online advertising, use a lot of AI and Erlang

Svensk lånemarknad – helps customers find the best loans, looking for predictive analyst as of 2012-09

Update 2012-09-09

Tripbirds – “social hotel booking”

Network medicine startups

There are two (well, I’m sure there are really more) interesting new startups that combine medicine with networks, albeit in different ways. NuMedii (which appears to be shorthand for New Indications of Medicines) uses a data-driven approach to discover new indications for previously existing drugs. This is potentially very useful because existing drugs have gone through rigorous tests for toxicity etc. and are therefore easier to bring to market than a drug developed from scratch. NuMedii’s technology is based on academic work from Stanford and they have a killer team that includes the likes of Atul Butte and Eric Schadt. The company is currently looking for what is essentially a bioinformatics-slanted big data scientist; one of the responsibilities related to this position is to “Architect, develop, maintain, and document a computational infrastructure that efficiently executes complex queries across many terabytes (potentially petabytes!) of disparate data and knowledge on genomics, genetics, pharmaceuticals, and chemicals.” Petabytes!

MedNetworks is also interesting, though a bit different. Its technology is based on the well-publicized work of Nicholas Christakis and colleagues at Harvard about how things like smoking and obesity appear to spread in social networks in an almost contagious way. (As an aside, I saw a random hipster at a Stockholm café sporting a copy of Christakis’ and Fowler’s book Connected: The Surprising Power of Our Social Networks - maybe network science is belatedly going mainstream here too!) MedNetworks studies things like how prescriptions of drugs are affected by the structures of social networks of physicians and patients. They attempt to identify “high influencers” in social networks, which is not necessarily the same as highly connected people. These high influencers have a strong influence on how drug prescribing behavior “diffuses” in a social network. Quoting the company website: “Optimized targeting for promotion based on social network influence provides a more efficient and effective approach to both personal and non-personal promotion.”

Biology-inspired algorithm design and non-obvious news discovery

There is a new Science article which seems really cool, although I haven’t had time to get past the paywall yet. The title is “A Biological Solution to a Fundamental Distributed Computing Problem” and the gist of it is pretty simple: a research group has found that an important procedure in distributed computing, “maximal independent set selection”, has been solved in a simple and efficient way in a kind of fly’s nervous system development. An algorithm based on the process that occurs in the fly’s immature nervous system can be directly applied to a network of sensors, for example.
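
For readers unfamiliar with the problem: a maximal independent set is a set of nodes in a network such that no two selected nodes are neighbors, and every unselected node has at least one selected neighbor. Since I have not read the article, the sketch below is not its algorithm. It is a generic randomized scheme, simulated centrally, in which each still-undecided node "beeps" with some probability per round and joins the set if none of its neighbors beeped at the same time. The function name, the beep probability and the example graph are assumptions.

```python
import random


def randomized_mis(adjacency, beep_probability=0.5, seed=0):
    """adjacency maps node -> set of neighbors (assumed symmetric)."""
    rng = random.Random(seed)
    active = set(adjacency)   # nodes that have not decided yet
    selected = set()
    while active:
        beeping = {n for n in sorted(active) if rng.random() < beep_probability}
        # A node wins the round if it beeped and heard no neighbor beep.
        winners = {n for n in beeping if not (adjacency[n] & beeping)}
        selected |= winners
        for n in winners:
            active.discard(n)
            active -= adjacency[n]  # neighbors of a winner drop out for good
    return selected


sensor_network = {
    0: {1, 2},
    1: {0, 3},
    2: {0, 3},
    3: {1, 2, 4},
    4: {3},
}
mis = randomized_mis(sensor_network)
```

Each node only needs to know whether any neighbor beeped in the current round, not how many neighbors it has, which is part of what makes this kind of scheme attractive for sensor networks.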

In other news, Bradford Cross, who started the data-driven flight-delay prediction company FlightCaster, is starting a new company called Woven. It will be about discovering news you are interested in, and the platform will explicitly consider a conundrum that I’ve often been thinking about, which is the following (and possibly mentioned in some earlier blog post): Do you really want to read news that is always perfectly tailored to your interests? Wouldn’t this cause you to miss a lot of interesting information that you get from e.g. browsing the newspaper and “accidentally” reading about things you didn’t know about but which are actually kind of interesting? Bradford Cross mentions this in a recent interview and says that he started to “miss the serendipity that a newspaper provides”. So far so good, but how to actually implement this kind of quasi-random content exposure (I tend to think of it as a kind of beneficial noise) into a news discovery service? I guess we will soon see what Woven has in mind.

Finally, the PayPal Developer Network (!) has a pretty nice tutorial about analyzing and visualizing the recently released World Bank data using tools like Java servlets, Google Charts and MySQL. The World Bank data would easily deserve a verbose blog post of its own (and I was planning one several months ago) but that will have to wait until I’ve taken a proper look at it.

Identifying migraine triggers and “your genome has a posse”

MyMigraineJournal is an interesting self-tracking site with some statistical weight behind it. The idea is to let migraine sufferers define potential “triggers” of migraines, like red wine or aged cheese, after which they will complete a daily questionnaire about what they ate, drank etc. and whether they had a headache that day (this takes about 30 seconds per day). The site will also try to assess whether any of the triggers seems statistically related to higher (or lower) migraine risk. As outlined here, this is done through logistic regression on one variable (trigger) at a time. The site uses a hierarchical Bayesian model where the prior distribution is initially uniform but will eventually, after enough data has been collected, be derived from the aggregated population of previous users, which I think is a nice touch. They don’t look for interactions between triggers yet, but may add such functionality in the future. A user can download their own complete data in Excel format, or delete some or all of it from the system. I think simple but clever systems like this could prove quite useful to people.
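For a single yes/no trigger, the one-variable-at-a-time analysis can be sketched in a few lines of Python. This is plain maximum likelihood, not the site’s hierarchical Bayesian model: for one binary predictor, the fitted slope of a logistic regression equals the log odds ratio of the 2x2 table of trigger days versus headache days. The function name, the `pseudo` parameter and the diary data are made up for illustration.

```python
import math

def trigger_log_odds_ratio(days, pseudo=0.5):
    """days: iterable of (exposed, headache) booleans, one pair per diary day.

    With pseudo=0 this is the maximum-likelihood slope of a logistic
    regression with one yes/no predictor. A positive `pseudo` is added to
    every cell so that an empty cell does not give an infinite estimate
    (an assumption of this sketch, not the site's Bayesian prior).
    """
    counts = {(e, h): pseudo for e in (False, True) for h in (False, True)}
    for exposed, headache in days:
        counts[(bool(exposed), bool(headache))] += 1
    odds_exposed = counts[(True, True)] / counts[(True, False)]
    odds_unexposed = counts[(False, True)] / counts[(False, False)]
    return math.log(odds_exposed / odds_unexposed)

# Made-up diary: 10 red-wine days (6 with headache), 20 other days (4 with headache).
diary = ([(True, True)] * 6 + [(True, False)] * 4
         + [(False, True)] * 4 + [(False, False)] * 16)
print(round(trigger_log_odds_ratio(diary, pseudo=0), 2))
```

A value above zero means headaches were relatively more common on trigger days, a value below zero means they were less common. Whether the difference is more than chance is a separate question that this sketch does not answer.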

On a related note, Nature Medicine recently ran an interesting article, “Personalized investigation”, about people who use direct-to-consumer genetic tests to learn more about genomes and physiology. The article describes how five early adopters of 23andMe’s SNP tests teamed up to investigate whether SNPs in a gene coding for an enzyme related to vitamin B metabolism were predictive of how the carriers would respond to vitamin supplements. The team then performed a series of experiments where they either took no supplements, took multivitamins, L-methylfolate or a combination of both multivitamins and L-methylfolate. After each phase of the experiment, they took blood tests to measure homocysteine, a biomarker for vitamin B activity.

Now, unless I’m misguided or the news is misreported, such an experiment with five subjects could never get anywhere close to a statistically significant outcome. But with larger cohorts, that could change. A new company called Genomera is developing tools that will allow this kind of self-experimentation study to scale into large numbers of participants. In fact, the Nature Medicine article says that Genomera will “roll out the vitamin study as the first open participatory project under its platform.” As of now, the Genomera site still appears to be mostly under construction, although it does say that the company has trademarked “Your genome has a posse” and related phrases. It sounds like an interesting business concept – I just wish they hadn’t described it as “the Facebook of genomics” in the Nature Medicine article …
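A back-of-the-envelope illustration of the five-subject problem, under one simplifying assumption: each participant is only scored as “homocysteine went down” or “went up” after a supplement phase, and the result is evaluated with a two-sided exact sign test. The function name is made up. Other tests (a paired t-test, say) use the actual magnitudes and make stronger assumptions, so this is just one way of looking at it.

```python
from math import comb

def sign_test_p_value(n_subjects, n_improved):
    """Two-sided exact sign test against a 50/50 null hypothesis."""
    k = max(n_improved, n_subjects - n_improved)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n_subjects, i) for i in range(k, n_subjects + 1)) / 2 ** n_subjects
    return min(1.0, 2 * tail)

print(sign_test_p_value(5, 5))   # 0.0625
```

So even if all five participants moved in the same direction, this test could not get below 0.0625, which is still above the conventional 0.05 threshold. With six unanimous participants it drops to 0.03125, which hints at how quickly things improve as the cohort grows.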

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andMe, the company his wife founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing what is more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.
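The “many algorithms, then rank by agreement” idea in the quote can be illustrated with a toy Python sketch. This is not IBM’s scoring method, which the article does not spell out. The algorithm names, the confidence values and the simple “add up the confidences” rule are all invented here.

```python
from collections import defaultdict

def rank_candidates(proposals):
    """proposals: list of (algorithm_name, answer, confidence in [0, 1])."""
    support = defaultdict(float)
    for _algorithm, answer, confidence in proposals:
        # Answers reached independently by several algorithms accumulate support.
        support[answer] += confidence
    # Highest accumulated support first; ties broken alphabetically.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical output of four hypothetical question-analysis algorithms.
proposals = [
    ("keyword_match", "Paris", 0.4),
    ("date_reasoner", "Lyon", 0.7),
    ("geo_lookup", "Paris", 0.5),
    ("type_checker", "Paris", 0.3),
]
print(rank_candidates(proposals)[0][0])
```

Here no single algorithm is very sure about “Paris”, but three of them arriving at it from different directions outweighs the one fairly confident vote for “Lyon”.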

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.
