Follow the Data

A data driven blog

Archive for the tag “big-data”

“The future belongs to those who own the robots”

I wanted to share a very interesting blog post that Wilhelm Landerholm (@landerholm) wrote today on his Swedish blog, because I think it deserves a wider audience. Google Translate does a pretty good job of translating it (as Daniel Nicorici (@daniel2nico) pointed out on Twitter), but there were a few rough spots, so I produced a lightly edited version below with permission from the author. So now I hand the floor over to Wilhelm – please enjoy! (For the record, the statement about robots that also appears in the title of this blog post was written by Mattias Östmar (@mattiasostmar) on Twitter before appearing in the blog post.)

The Future
Posted on January 21, 2015

This wonderful future, and this constant discussion of what’s to come. There are many prophets, and many who say, in response to events, “What did I tell you?” – naturally without mentioning all the times they were wrong. What will the prophet who recommends that you fix your interest rate at 3.78 percent today say in the future?

In my world, you build models to explain future events. It is therefore always a bit funny to brainstorm about the future of data analysis and the ever so popular concept of Big Data, because what is described is often actually the world we live in today. As an example, today I saw a slide that began with the words, “Those who are best at Big Data are those who will be the best at operations”. But that is not the future – it is a description of how it is today.

I’ve built so many models I would have lost count if I hadn’t documented them. Several of them have been the difference between survival and death for a business. These results, achieved by myself and my colleagues, have also led to my profession being described as a sexy future profession. I hesitate to concur. Today, no one within my area talks about “creating analysis teams” within the company, even if this is what Swedish companies are currently doing. Those in my world are busy seeking interfaces between my day-to-day reality and customers. The focus is on automation, on AI and Machine Learning. I am one of those who believe that the future does not have room for the Data Scientist of today. The future belongs to those who can control “robots”. And I am not talking about robots like R2D2, but robots like IBM’s Watson.

Today, I build models for large corporations. For companies where the effects of my work can be enormous. But in the future, these solutions will not be restricted to a few large corporations. They will be a part of everything we do. The Internet of Things will generate such volumes of data that analysis must be moved from BIG to MICRO. This will, in turn, mean that the analysis must be automated. This is where we are working on the future today. Today I am putting together “self-driving companies” in a Raspberry Pi and a smartphone. In a few years, we will see self-driving cars. However, I am not sure that there will be a market for everything that moves, and that’s why those of us who were early to jump on the Big Data and Internet of Things train are now looking for contacts with those who own demand.

No matter how many people out there are shouting that the company must build its own analysis team, a BIG DATA UNIT – you’re ten years behind. Because the work that I have done for several customers can now be replaced with a computer for 500 SEK [about 60 USD] at Kjell & Co [a common Swedish chain store for electronic peripherals] and a few [predictive?] models. I focus on being the supplier of the models. The future belongs to those who can automate the analysis. For those who own the robots. Along with those who own demand, they will change, to a large extent, how business is done.

Digital Health Days 2013

The Digital Health Days, which will be held in Stockholm on August 21-22, look like an event that will touch upon many of the things mused upon in this blog – for instance analytics, gamification and self-tracking in relation to medicine and the life sciences. A quick glance at the program reveals session names like “The New Health Enablers: Mobile Health Solutions, Big Data Analytics, Gamification and Games For Health”, “Digital Health Science”, “Smarter Care and Watson for healthcare”, “Computational Health and Big Data Analytics as tools for life science”, etc.

The conference is a bit pricey for the casual visitor (2990 SEK ~ 345 EUR ~ 450 USD) but has a good discount for students, who’ll only need to pay 490 SEK (~56 EUR / ~73 USD).

BigData.SG and The Human Face of Big Data

By an amazing coincidence, I was able to attend a session of the Singapore big data meetup group, BigData.SG, after having attended the NGS Asia 2012 conference here in the Lion City. This group was started earlier this year and tries to meet once a month (a more ambitious schedule than the Stockholm group’s). Today, about 40 people were in attendance, and I had a nice time chatting with some of them. The invited speaker was Michael Howard, VP of marketing at Greenplum. He had one nice quip – “big data means so little to so many” – and talked a little bit about Chorus, a collaborative data science platform from Greenplum which I hadn’t heard about. He hinted that Chorus and Kaggle have something big going on together – something that will revolutionize the whole crowdsourced prediction “business.” It will be interesting to see what it is.
Earlier today, Howard had announced the Human Face of Big Data project, which launched in several cities around the world today (though probably not yet in the US). The project, which “lets people compare themselves to each other”, uses a downloadable app (for Android; the iOS version wasn’t working yet) that you can use to collect data about yourself. There is “passive” data collection – how far and at what speed you’ve moved, how many Bluetooth hot spots you’ve passed, and so on – and active collection through questions the app asks you. These are either “serious” questions, such as whether you would modify the genes of your unborn infant if given the opportunity (and if so, what you would improve – immune system, intelligence, …) – apparently men and women answered this very differently – or more open-ended “fantasy” questions.

The app also lets you find your “data doppelganger”, which is of course the user who is most similar to you in terms of the collected data. Howard said that despite the short time since the launch, the app has already yielded interesting information about gender differences and topics of interest.
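The “data doppelganger” matching is, in essence, a nearest-neighbor search over each user’s feature vector. A minimal sketch in Python of that idea – the feature names and the choice of Euclidean distance are my own assumptions for illustration, not details from the project:

```python
import math

def doppelganger(me, others):
    """Return (user_id, distance) for the user whose collected data
    is closest to mine. Feature vectors are dicts of numeric values."""
    def distance(a, b):
        shared = a.keys() & b.keys()  # compare only features both users have
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))
    return min(((uid, distance(me, vec)) for uid, vec in others.items()),
               key=lambda pair: pair[1])

# Hypothetical passively collected features
me = {"km_walked": 5.2, "avg_speed": 4.1, "bt_hotspots": 17}
others = {
    "user_a": {"km_walked": 1.0, "avg_speed": 3.0, "bt_hotspots": 40},
    "user_b": {"km_walked": 5.0, "avg_speed": 4.0, "bt_hotspots": 15},
}
print(doppelganger(me, others))  # user_b is the closer match
```

In practice features on very different scales (kilometers vs. hotspot counts) would need normalizing before the distances mean much.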

Stockholm R useR Group inaugural meeting

Yesterday, the Stockholm R useR group had its inaugural meeting, hosted by Statisticon, a statistical consulting firm with offices in the heart of the city. It was a smaller and more intimate affair than the Stockholm Big Data meetup last week, with perhaps 25 people attending. If my memory serves, the entities represented at the meeting were Statisticon itself, the Swedish Institute for Communicable Disease Control, Klarna, Stockholm University (researchers in 3 different fields), the Swedish Pensions Agency, and Karolinska Institutet.

There were two themes that came up again and again: firstly, reproducible dynamic reporting – everyone seemed to either use (and love) or want to learn Sweave (and to some extent knitr), and secondly, managing big data sets in R. Thus it was decided to focus on these for the next meeting: an expert from the group will give a presentation on Sweave, and another group of members will try to collect information on what is available for “big data” in R.

I thought it was interesting to see that the representatives from the Swedish Pensions Agency (there were 3 of them) seemed so committed to R, open source and open data. Nice! It was also mentioned that another employee of the same agency, who wasn’t present, has been developing his own big-data R package for dealing with the 9-million-row table containing pension-related data on all Swedish citizens.

Meetup groups for Big Data & Predictive Modeling and Quantified Self in Stockholm

Two interesting new meetup groups have formed in Stockholm: one for Big Data & Predictive Modeling and one for Quantified Self. (There are other interesting ones, but for the purposes of this blog these two are the most exciting.)


Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andMe, the company his wife founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing what is more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s, we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.
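That “many algorithms vote, agreement boosts plausibility” idea can be sketched as a toy ensemble in Python. This is of course not IBM’s actual architecture – the stand-in “algorithms” and the simple agreement count are assumptions for illustration only:

```python
from collections import Counter

def answer(question, generators):
    """Run every candidate-generating algorithm on the question, then
    rank candidates by how many independent algorithms agree on them."""
    candidates = [g(question) for g in generators]
    ranked = Counter(candidates).most_common()
    return ranked[0]  # (best-supported answer, number of agreeing algorithms)

# Toy stand-ins: each "algorithm" returns its best guess for the question.
generators = [
    lambda q: "Toronto",
    lambda q: "Chicago",
    lambda q: "Chicago",
    lambda q: "Chicago",
]
print(answer("example question", generators))  # ('Chicago', 3)
```

Watson’s real ranking stage weighs evidence far more subtly than a raw vote count, but the structure – parallel candidate generation followed by plausibility scoring – is the same.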

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

Data hype!

There’s a lot of talk about data now, and it seems to be accelerating. Just during the last few weeks, I’ve been told that:

  • data is money (and money is data). Quote: “In the new data economy, money is just a certain type of data. Nothing more and nothing less.” [Stephan Noller, The Future of Targeting]
  • the data singularity is here. Quote: “The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.” [Michael Driscoll, Dataspora]
  • (big) data is less about size, more about freedom. Quote: “[T]he data renaissance is here. Be a part of it.” [Bradford Cross for TechCrunch]
  • (big) data is the next big thing. Quote: “This is not last year’s data-mining.  This is data-mining on steroids!” [Jonathan Reichental, PwC Innovation Blog]
  • (open) data went worldwide in 2009. Quote: “The cry of “raw data now,” which I made people make in the auditorium, was heard around the world.” [Tim Berners-Lee]
  • open data and cloud data services are at least as important as open source. Quote (actually from Peter Norvig): “We [=Google] don’t have better algorithms than anyone else. We just have more data.” [Tim O’Reilly]
  • the future belongs to companies who combine public data in the right way and offer analytics-based insights. Quote: “How did Flightcaster, a one time Y Combinator startup, put itself in a position to know more about the state of airline operations than the airlines themselves?” [tecosystems]

Of course, we shouldn’t forget IBM’s data baby, who’s out there generating sensor data even before s(he) has been born. A citizen of the future.

Mass e-epidemiology

The LifeGene project, which was recently started in Sweden, may in due time generate one of the most complex and interesting data sets ever. The project will study health, lifestyle and genetics (and much more) in the long term in a cohort of 500,000 (this is not a typo!) individuals. Participants will donate blood samples and be subjected to physical measurements (waist and hip circumference, blood pressure etc.), but for a smaller subset of participants the study will really go deep, with global analysis of DNA, RNA, protein, metabolite and toxin levels, as well as epigenomics (simplifying a bit, this means genomic information that is not directly encoded in the DNA sequence). Two testing centres have opened during the fall – one in Stockholm and, more recently, one in Umeå.

Environmental factors will be examined too: “Exposures such as diet, physical activity, smoking, prenatal environment, infections, sleep-disorders, socioeconomic and psychosocial status, to name a few, will be assessed.” The data collection will be done through for instance mobile phones and the web, with sampling rates adjusted based on age and life events. The project consortium calls the approach e-epidemiology.

This might make each participant feel a bit like David Ewing Duncan, the man who decided to try as many genetic, medical and toxicological tests on himself as he could, and wrote a book about it. Will they suffer from information overload from self-related data? For the statisticians involved, information overload is a certainty. It will be a tough – but interesting – task to collect, store and mine these data. But exactly this kind of project, which relates hereditary factors to environment and lifestyle and correlates these to outcomes (like disease states), is much needed.

The fourth paradigm

A new book about science in the age of big data, The Fourth Paradigm: Data-Intensive Scientific Discovery, is available for free download. The book was reviewed in Nature today. It’s written by people from Microsoft Research and has a foreword by Gordon Bell, one of the authors of Total Recall: How the E-memory Revolution Will Change Everything.

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network, which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data”. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way compared to other services like the Amazon Public Data Sets, for instance.

Companies specializing in helping users make sense out of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs. Data Applied’s file size of <100 kb for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.
