Follow the Data

A data driven blog

Archive for the tag “big-data”

BigData.SG and The human face of big data

By an amazing coincidence, I was able to attend a session of the Singapore big data meetup group, BigData.SG, after having attended the NGS Asia 2012 conference here in the Lion City. This group was started earlier this year and tries to meet once a month (a more ambitious schedule than the Stockholm group.) Today, about 40 people were in attendance, and I had a nice time chatting to some of them. The invited speaker was Michael Howard, VP of marketing at Greenplum. He had one nice quip – “big data means so little to so many” and talked a little bit about Chorus, a collaborative data science platform from Greenplum which I hadn’t heard about. He hinted that Chorus and Kaggle have something big going on together – something that will revolutionize the whole crowdsourced prediction “business.” It will be interesting to see what it is.
Earlier today, Howard had announced the Human Face of Big Data project, which has been or will be launched in several cities around the world today (it probably still hasn’t launched in the US). The project, which “lets people compare themselves to each other”, uses a downloadable app (for Android; the iOS version wasn’t working yet) that you can use to collect data about yourself. There is “passive data collection” (how far and at what speed you’ve moved, how many Bluetooth hot spots you’ve passed, and so on) and active collection through questions that the app asks you. These are either “serious” questions, such as whether you would modify the genes of your unborn infant if given the opportunity (and if so, what you would improve: immune system, intelligence, …), which men and women apparently answered very differently, or more open-ended “fantasy” questions.

The app also lets you find your “data doppelganger”, which is of course the user who is most similar to you in terms of the collected data. Howard said that despite the short time since the launch, the app has already yielded interesting information about gender differences and topics of interest.
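
I don't know how the app actually computes this, but the general idea of a “data doppelganger” can be sketched as a nearest-neighbour search. The Python sketch below is only an illustration under my own assumptions: each user is a list of numeric features, the features are standardized so that no single one dominates, and similarity is plain Euclidean distance. The function names and the toy data are hypothetical.

```python
import math


def standardize(profiles):
    """Rescale every feature to zero mean and unit variance across users."""
    users = list(profiles)
    n_features = len(profiles[users[0]])
    means = [sum(profiles[u][j] for u in users) / len(users)
             for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((profiles[u][j] - means[j]) ** 2 for u in users) / len(users)
        # A constant feature has zero spread; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return {u: [(profiles[u][j] - means[j]) / stds[j] for j in range(n_features)]
            for u in users}


def doppelganger(profiles, user):
    """Return the other user whose standardized profile is closest to `user`."""
    scaled = standardize(profiles)
    others = (u for u in scaled if u != user)
    return min(others, key=lambda u: math.dist(scaled[user], scaled[u]))


# Hypothetical features: km moved per day, Bluetooth devices passed, answer to a yes/no question.
example_profiles = {
    "ann": [5.0, 12, 1],
    "bob": [5.2, 11, 1],
    "cho": [20.0, 40, 0],
}
print(doppelganger(example_profiles, "ann"))
```

With real app data, the choice of features and of distance measure would matter far more than the search itself.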

Stockholm R useR Group inaugural meeting

Yesterday, the Stockholm R useR group had its inaugural meeting, hosted by Statisticon, a statistical consulting firm with offices in the heart of the city. It was a smaller and more intimate affair than the Stockholm Big Data meetup last week, with perhaps 25 people attending. If my memory serves, the entities represented at the meeting were Statisticon itself, the Swedish Institute for Communicable Disease Control, Klarna, Stockholm University (researchers in 3 different fields), the Swedish Pensions Agency, and Karolinska Institutet.

There were two themes that came up again and again: firstly, reproducible dynamic reporting – everyone seemed to either use (and love) or want to learn Sweave (and to some extent knitr), and secondly, managing big data sets in R. Thus it was decided to focus on these for the next meeting: an expert from the group will give a presentation on Sweave, and another group of members will try to collect information on what is available for “big data” in R.

I thought it was interesting to see that the representatives from the Swedish Pensions Agency (there were 3 of them) seemed so committed to R, open source and open data. Nice! It was also mentioned that another employee of the same agency, who wasn’t present, has been developing his own big-data R package for dealing with the 9-million-row table containing pension-related data on all Swedish citizens.

Meetup groups for Big Data & Predictive Modeling and Quantified Self in Stockholm

Two interesting new meetup groups have formed in Stockholm (well, there are other interesting ones, but for the purposes of this blog these two are the most exciting): one for Big Data & Predictive Modeling and one for Quantified Self.

Fun!

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andMe, the company his wife co-founded) while simultaneously paving the way for a faster way of doing science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea of a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s, we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.
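
To make the ranking idea in the quote concrete, here is a toy Python sketch. It is not IBM's method, just my own minimal illustration of “agreement raises plausibility”: several hypothetical scorers each propose an answer with a confidence, the confidences for identical answers are added up, and the totals are normalized. The function name and the example proposals are made up.

```python
from collections import defaultdict


def rank_answers(proposals):
    """Rank candidate answers by the summed confidence of the scorers proposing them.

    `proposals` is a list of (answer, confidence) pairs, one per hypothetical scorer.
    Returns (answer, share_of_total_confidence) pairs, best first.
    """
    scores = defaultdict(float)
    for answer, confidence in proposals:
        # Normalize trivially different spellings so that agreeing scorers pool their votes.
        scores[answer.strip().lower()] += confidence
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(answer, score / total) for answer, score in ranked]


# Three scorers agree on one answer; a single more confident scorer disagrees.
example = [("Toronto", 0.4), ("Chicago", 0.3), ("chicago", 0.3), ("Chicago ", 0.2)]
for answer, share in rank_answers(example):
    print(answer, round(share, 2))
```

In this toy example the answer backed by three weaker scorers outranks the one backed by a single stronger scorer, which is the effect the article describes.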

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

Data hype!

There’s a lot of talk about data now, and it seems to be accelerating. Just during the last few weeks, I’ve been told that:

  • data is money (and money is data). Quote: “In the new data economy, money is just a certain type of data. Nothing more and nothing less.” [Stephan Noller, The Future of Targeting]
  • the data singularity is here. Quote: “The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.” [Michael Driscoll, Dataspora]
  • (big) data is less about size, more about freedom. Quote: “[T]he data renaissance is here. Be a part of it.” [Bradford Cross for TechCrunch]
  • (big) data is the next big thing. Quote: “This is not last year’s data-mining.  This is data-mining on steroids!” [Jonathan Reichental, PwC Innovation Blog]
  • (open) data went worldwide in 2009. Quote: “The cry of “raw data now,” which I made people make in the auditorium, was heard around the world.” [Tim Berners-Lee]
  • open data and cloud data services are at least as important as open source. Quote (actually from Peter Norvig): “We [=Google] don’t have better algorithms than anyone else. We just have more data.” [Tim O'Reilly]
  • the future belongs to companies who combine public data in the right way and offer analytics-based insights. Quote: “How did Flightcaster, a one time Y Combinator startup, put itself in a position to know more about the state of airline operations than the airlines themselves?” [tecosystems]

Of course, we shouldn’t forget IBM’s data baby, who’s out there generating sensor data even before (s)he has been born. A citizen of the future.

Mass e-epidemiology

The LifeGene project, which was recently started in Sweden, may in due time generate one of the most complex and interesting data sets ever. The project will study health, lifestyle and genetics (and much more) over the long term in a cohort of 500,000 (this is not a typo!) individuals. Participants will donate blood samples and be subjected to physical measurements (waist and hip circumference, blood pressure etc.), but for a smaller subset of participants the study will really go deep, with global analysis of DNA, RNA, protein, metabolite and toxin levels, as well as epigenomics (simplifying a bit, this means genomic information that is not directly encoded in the DNA sequence). Two testing centres have opened during the fall – one in Stockholm and, more recently, one in Umeå.

Environmental factors will be examined too: “Exposures such as diet, physical activity, smoking, prenatal environment, infections, sleep-disorders, socioeconomic and psychosocial status, to name a few, will be assessed.” The data collection will be done through for instance mobile phones and the web, with sampling rates adjusted based on age and life events. The project consortium calls the approach e-epidemiology.

This might make each participant feel a bit like David Ewing Duncan, the man who decided to try as many genetic, medical and toxicological tests on himself as he could, and wrote a book about it. Will they suffer from information overload from self-related data? For the statisticians involved, information overload is a certainty. It will be a tough – but interesting – task to collect, store and mine these data. But exactly this kind of project, which relates hereditary factors to environment and lifestyle and correlates these with outcomes (like disease states), is much needed.

The fourth paradigm

A new book about science in the age of big data, The Fourth Paradigm: Data-Intensive Scientific Discovery, is available as a free download. The book was reviewed in Nature today. It’s written by people from Microsoft Research and has a foreword by Gordon Bell, one of the authors of Total Recall: How the E-memory Revolution Will Change Everything.

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also theinfo.org, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network (CKAN), which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as data.gov.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data”. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology”, which allows users to explore the data in a more sophisticated way than other services such as Amazon’s Public Data Sets.

Companies specializing in helping users make sense of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more: a 10 MB data warehouse and 1-5 users vs Data Applied’s file size limit of <100 KB for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.

Speed of data collection

Can this quote from a new Wall Street Journal article really be true?

In fact, more technical data have been collected in the past year alone than in all previous years since science began, says Johns Hopkins astrophysicist Alexander Szalay, an authority on large data sets and their impact on science.

The article is about how to preserve and capture scientific data. This is a pressing question, as evidenced by another quote:

“Our ability to collect data now outstrips our ability to maintain it for the long run,” says William Michener at the University of New Mexico, who leads a data-preservation network called DataONE. “We lose an awful lot of data that is collected with public funds.”

An interesting point mentioned in the article is that although the advances in information technology mean that we now mostly have data which is much more suitable for preservation (electronic documents rather than hand-written notes and scribbles), it has also led to graduate students starting to communicate a lot by instant messaging, which acts as a sinkhole for a lot of information.

What is big data?

Did you know that the word data means “things given” in Latin? That’s just one of the things I learned from a very interesting (free) article, The Pathologies of Big Data, by former computational neuroscientist Adam Jacobs. He also makes the perceptive comment that the word data tends to get used as a mass noun in English, as if it denoted a substance. (After reading these interesting insights, it was no surprise to learn that Jacobs also has a degree in linguistics.)

The article discusses what “big data” really means in this day and age when we can actually keep, for instance, a dataset containing information about the entire world population in memory (not to mention on disk) on a pretty ordinary Dell server. Jacobs argues that getting stuff into databases is easy, but getting it out (in a useful form) is hard; the bottleneck lies in the analysis rather than the raw data manipulation.
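
A quick back-of-the-envelope calculation shows why a world-population table is within reach of a single server. The Python sketch below uses my own round numbers, not figures taken from the article: about 7 billion people and 16 bytes per person, which would be enough for a handful of compactly encoded fields.

```python
def table_size_gb(rows, bytes_per_row):
    """Approximate size in gigabytes (10^9 bytes) of a fixed-width table."""
    return rows * bytes_per_row / 1e9


# Assumptions for illustration only (not taken from Jacobs' article):
WORLD_POPULATION = 7_000_000_000  # rough round figure
BYTES_PER_PERSON = 16             # e.g. a few small integer-coded fields

size = table_size_gb(WORLD_POPULATION, BYTES_PER_PERSON)
print(f"{size:.0f} GB")  # about 112 GB under these assumptions
```

Under these assumptions the whole table is on the order of a hundred gigabytes: a lot of RAM, but not an exotic amount of storage, which fits the point that holding the data is not the hard part.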

He also argues that most data-processing tools, including standard relational database management systems, are not really built for the kinds of huge datasets we are starting to encounter now. Although we can in principle keep billions of rows of data in RAM, we can’t easily manipulate them using something like PostgreSQL. And other solutions like the statistics programming language R (one of my favourites) run into hard-coded memory usage limits, often about 4 GB.

A recommended read for those interested in the nerdier side of data.

On a related note, O’Reilly released a report, Big Data: Technologies and Techniques for Large-Scale Data, in January. I haven’t read it (it costs quite a lot of money to buy the PDF), but there is a sample PDF which makes for pretty interesting reading in itself.
