Follow the Data

A data driven blog

Archive for the month “November, 2009”

Open government and municipal data

Several projects having to do with open-access data from governmental and municipal sources have been announced in the past few months. Governments are, of course, already using analytics for a lot of things – this post from A Smarter Planet talks about how the Social Security Administration in the US is “using analytics and predictive modeling to make quicker determinations on disability applications for those in need” and how the US Postal Service is “extracting valuable insights from information on mail delivery to improve on-time delivery performance” (a postal service that improves – now that’s a novelty!), but now (part of) these data will be available to anyone.

In the US, federal data are or will be made available at data.gov. The site seems pretty well-designed, with data sets being queryable online and raw data being downloadable in XML, comma-separated values (CSV) and other formats. Among the data sets at data.gov, there are some interesting health-related ones like the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database.

The City of San Francisco has its own data clearinghouse, dataSF. Strongly supported by mayor Gavin Newsom, it was launched in August. Given the huge population of programmers in SF, it’s no surprise that mash-up applications using the data started to appear quickly, e.g. “Routesy, which offers directions based on real-time city transport feeds; and EcoFinder, which points you to the nearest recycling site for a given item” (see this Guardian article for more.) New York City has its own NYC Data Mine, which has come under some criticism. It is hosting the BigApps competition, which is meant to “stimulate innovation in the information technology and media industries, and attract and support developer talent to develop web and mobile applications (apps) by using City data.”

The real pioneer in open government data, though, seems to be not the US but Australia. The Australian government recently arranged a “hack day” named GovHack, where they invited programmers to develop mashups and applications around government data during 24 intense hours. Some of the projects that came out of this event were:

  • LobbyClue, the overall winner, built by a team many of whose members had never met before the event. LobbyClue is an in-depth visualisation of lobbying groups’ relations to government agencies, including tenders awarded, links between the various agencies, and physical office locations.
  • It’s buggered, mate, which, in true Australian style, allows you to report buggered toilets, roads, etc. with an easy-to-use graphical interface overlaid on a map. The idea was to combine this with local government services to fix issues in the community. Built by a team of developers from Lonely Planet.
  • Know where you live, a stylish presentation of ABS data (along with geocoded Flickr photos), pulling in relevant information for a particular postcode: rental rates, average income, crime rates, and more. Built by a team of developers who work at News Digital Media.

Pretty cool, isn’t it?

Link roundup

A roundup of some interesting links from the past few weeks.

Brian Mossop of the Decision Tree blog is embarking on a project to find out how much personal data is needed to stay healthy. He will use devices like the Zeo sleep coach and the Nike+ sportband to record his personal data and post updates about what he has found. He’s also promised a longer blog post after 30 days summarizing his experiences.

dnaSnips is a site that compares reports received by the same person from three different direct-to-consumer (DTC) genetic analysis services: 23andme, deCODEme and Navigenics. Summarizing the experiment, the author feels that all three services give pretty accurate results. (link found via Anthony Fejes’ blog)

By now, there are probably few people who haven’t heard about the scientist who turned out to be call girl blogger Belle de Jour. I was intrigued to find that her Amazon wishlist contains hardcore statistical data analysis books like Chris Bishop’s Neural Networks for Pattern Recognition and An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. In fact, she’s previously blogged about her Bayesian theory of relationships, so she’s clearly no slouch when it comes to statistics and data mining.

The London data cloud

Now that’s what I call data visualization … The city of London is planning to erect a “digital cloud” in its Olympic village before 2012. The cloud would be made up of interconnected plastic bubbles that would float above the city and display different kinds of data: sports results, weather measurements, traffic data and so on. The team behind this Buckminster Fuller-esque project includes people from Google (of course) and MIT (ditto), and, perhaps unexpectedly, author and semiotician Umberto Eco. The home page is visionary (although I thought the narrator of the official video was rather uninspiring) – it talks about “Code rather than Carbon” and “a space alive to the touch, an aerial ecology”. The Cloud is supposed to be self-sufficient in terms of energy, with a zero energy footprint – the people ascending into it will provide energy when they descend and the rest will be provided by solar panels.

Another quote from the official site:

Like all tell-tale signs of brooding weather, the Cloud is a display system. It is both screen and barometer, archive and sensor, past and future. The patterns of its animated skins offer a civic-scale smart-meter for London as a whole, sign-posting particular events, transport patterns, weather forecasts, timetables, and footage either real-time or decades old.

Mobile phone diagnosis

An application I fantasized about in a previous blog post, namely a mobile phone application for early detection of depression, is being developed by Cogito Health, a spin-off company from MIT. The company’s algorithms can use speech features like the tone and pitch of a person’s voice, the length and frequency of pauses and speed of speech to detect mood disorders. CEO Joshua Feast says that “…voice analysis software could provide a natural and noninvasive way for nurses to screen for depression during routine phone calls.”

Not only depression but also specific types of coughing can now be detected by phones. According to the article, coughs can be surprisingly complex, but “…even with a limited amount of data, scientists can distinguish between a healthy, voluntary cough and the involuntary cough of a sick person. Healthy people have slightly louder coughs, about 2 percent louder than a sick person.” The name of the application is kind of funny … iCough. If you have iCough installed on your phone, your doctor can ask you to cough into the phone, after which the “…sound can be run through the computer, compared to all known cough profiles, and a diagnosis can be confirmed in a few seconds”.

An even more surprising way of using a mobile phone for diagnostic purposes is to turn it into a microscope. A company called Microskia is commercializing technology for low-cost cell imaging developed at the California NanoSystems Institute. According to the inventors, phones modified using cheap off-the-shelf hardware plus a special piece of software can detect, for example, “…the asymmetric shape of diseased blood cells or other abnormal cells, or note an increase of white blood cells.” The new technology may prove to be especially helpful for screening for malaria, according to the article.

Existential computing

How cool is this course, called “The Rest of You” and taught at New York University? It was mentioned in a recent blog post at The Quantified Self, which also links to a video of teacher Dan O’Sullivan talking about it.

The Rest of You course is about building tools to quantify your experiences in everyday life, with a special emphasis on unconscious and less intentional things – for example things that are controlled by the autonomic nervous system, like galvanic skin response (which has to do with e.g. fear, anger and sexual arousal) and breathing. As mentioned in the QS blog, a husband and wife team measured their galvanic skin responses while watching a movie, and compared the readouts afterwards. Mostly the responses were similar, but there were many times where one of them had a strong response while the other reacted weakly if at all.

The syllabus includes questions/assignments/material like:

  • What was your day really like? Get an objective picture of your day using light, gravity, sound, image, temperature.
  • How are you really feeling? Get readings from unconsciously controlled reactions: sweat, breath, temperature, electrical, posture, heart, sound, subliminal input, EEG
  • Graphing data in Processing or using Flowing Data, SensorBase, Pachube
  • Using batteries, small microcontrollers, how to make the devices fit on your body, keylogging, and how to get data from a phone
  • Reading about flow and mirror neurons

It sounds excellent already in theory, but looking at some of the students’ blogs really drives home how cool it is. For example, John Kuiphoff wired himself up and devised an experiment for quantifying how well wrist braces (which he got for his carpal tunnel syndrome) stabilize movements during typing. He also did an interesting experiment about how well people can distinguish subtle variations in color. Elizabeth Fuller fitted her cocktail dress out with a proximity sensor and an accelerometer and sent data from the dress onto a computer during a party.

Apart from the tech/data aspects of the whole thing, I like Dan O’Sullivan’s idea about “existential computing”, as he calls it – to use these tools to realize that our conscious experience is actually just a small slice of the sum total of what we go through. The writing assignments pose tough questions about illusions and happiness: What are some of the illusions in my existence? How do they affect your happiness? Can new technologies correct for these illusions? Can gaining insights with a more complete view of your existence improve your life? Can it make society better?

What do you do with a personal genome?

Now that the full sequencing of a person’s genome can be done for well below USD10,000 – Complete Genomics recently announced having sequenced three genomes for consumables costs between $1,726 and $8,005 – the question is what you would be able to do, today, with information about your genome.

Personalized Medicine recently published an article, Living with my personal genome by Jim Watson (co-discoverer of the structure of DNA.) The article is very short but it does tell us that Watson has changed his behavior in at least one way: he now takes beta-blockers only once a week instead of every day, because he discovered that he has an enzyme variant which causes him to metabolize the drug slowly, making him “…constantly fall asleep at inappropriate moments.” Apparently it took a whole-genome scan to realize that was abnormal!

Quantified Self has reported on its third New York Show & Tell session, where Esther Dyson, who also has had her genome sequenced, discussed what she had found out (video here). However, rather than the full genome sequence (which she calls “disappointing” in the beginning of the talk, saying that “it tells me nothing, I can’t interpret it” – if you think you could interpret it better, it’s online here), she focuses on her report from 23andme, which records information about a million SNPs (single-letter variations in the DNA) in each individual. She shows some rather nifty tools like the Relative Finder, which can be used to identify potential cousins.

Another early whole-genome sequencee, Steven Pinker, wrote a long and thoughtful article about his genome a while back in the New York Times. Definitely worth a read.

Mining data streams, the web, and the climate

I recently came across MOA (Massive Online Analysis), an environment for what its developers call massive data mining, or data stream mining. This New Zealand-based project is related to Weka, a Java-based framework for machine learning which I’ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. Therefore the full data set will typically be many times larger than your computer’s memory (which already rules out some commonly used algorithms), and each example can only be briefly examined once, after which it is discarded. Therefore the statistical model has to be updated incrementally, and often must be ready to be applied at any point between training examples.
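To make the "examine once, then discard" constraint concrete, here is a minimal sketch in plain Python. It is not MOA's or Weka's API; the class `OnlinePerceptron`, the helper `prequential_accuracy` and the generator `toy_stream` are hypothetical names made up for illustration, and the perceptron is just one simple learner that happens to support incremental updates. The evaluation loop predicts each example first and only then learns from it, so the model is usable at any point in the stream.

```python
from typing import Iterable, Iterator, List, Tuple

Example = Tuple[List[float], int]  # (features, label in {0, 1})


class OnlinePerceptron:
    """Linear classifier updated one example at a time; it stores no past data."""

    def __init__(self, n_features: int, learning_rate: float = 1.0) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0

    def learn_one(self, x: List[float], y: int) -> None:
        # The error is -1, 0 or +1; the model only changes on a mistake.
        error = y - self.predict(x)
        if error != 0:
            for i, xi in enumerate(x):
                self.weights[i] += self.learning_rate * error * xi
            self.bias += self.learning_rate * error


def prequential_accuracy(stream: Iterable[Example], model: OnlinePerceptron) -> float:
    """Test-then-train: predict each example, learn from it, then let it go."""
    correct = 0
    seen = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn_one(x, y)
        seen += 1
    return correct / seen if seen else 0.0


def toy_stream(n: int) -> Iterator[Example]:
    """Made-up stream: the label is 1 when the first feature exceeds the second."""
    for i in range(n):
        a = float((i * 7) % 11)
        b = float((i * 3) % 5)
        yield [a, b], int(a > b)


if __name__ == "__main__":
    model = OnlinePerceptron(n_features=2)
    print(prequential_accuracy(toy_stream(1000), model))
```

Because the stream is a generator, memory use stays constant however many examples pass through, which is the property that rules out algorithms needing the whole data set at once.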

I also came across a press release describing version 2.0 of KnowledgeMiner for Excel, a data mining package apparently used by customers like Pfizer, NASA and Boeing, and which is based on GMDH (Group Method of Data Handling), a paradigm I hadn’t heard about before. I failed to install KnowledgeMiner for Excel on my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, making it easier to use for non-experts in data mining. (Of course I haven’t tried it, so it’s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:

The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to this link to see the climate change data displayed graphically in a slideshow through the year 2020:

There’s also an interesting new toolkit for web mining from BixoLabs. They’ve built what they call an elastic web mining platform in Amazon’s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and from the tutorials on the site, it seems pretty easy to set it up to crawl the web to your heart’s content.

Informavores

There’s a pretty interesting interview with a German thinker called Frank Schirrmacher, and comments on that interview, at edge.org. (I like this format – it’s a bit like those new online scientific journals where you can read the reviewers’ comments to the authors.) Schirrmacher talks about the concept of informavores,

…the human being as somebody eating information. So you can, in a way, see that the Internet and that the information overload we are faced with at this very moment has a lot to do with food chains, has a lot to do with food you take or not to take, with food which has many calories and doesn’t do you any good, and with food that is very healthy and is good for you.

He has some interesting thoughts on “dislocated” thought and the concept of free will …

…thinking itself somehow leaves the brain and uses a platform outside of the human body. And that, of course, is the Internet and it’s the cloud. Very soon we will have the brain in the cloud. And this raises the question about the importance of thoughts. For centuries, what was important for me was decided in my brain. But now, apparently, it will be decided somewhere else.

… and prediction:

What will this mean for the question of free will? Because, in the bottom line, there are, of course, algorithms, who analyze or who calculate certain predictabilities. And I’m wondering if the comfort of free will or not free will would be a very, very tough issue of the future.

[...]

The way we predict our own life, the way we are predicted by others, through the cloud, through the way we are linked to the Internet, will be matters that impact every aspect of our lives.

The interview is worth reading in full, as are the comments. I actually agree with many of the commenters who criticize Schirrmacher’s views, but the debate is interesting and he definitely has some novel ideas.
