Follow the Data

A data-driven blog


Swedish school fires and Kaggle open data

For quite a while now, I have been rather mystified and intrigued by the fact that Sweden has one of the highest rates of school fires due to arson. According to the Division of Fire Safety Engineering at Lund University, “Almost every day between one and two school fires occur in Sweden. In most cases arson is the cause of the fire.” This is a lot for a small country with fewer than 10 million inhabitants, and the associated costs can run up to a billion SEK (around 120 million USD) per year.

It would be hard to find a suitable dataset for addressing, in a data-driven way, the question of why school arson fires are so frequent in Sweden compared to other countries – but perhaps it would be possible to stay within a Swedish context and find out which properties and indicators of Swedish towns (municipalities, to be exact) might be related to a high frequency of school fires?

To answer this question, I collected data on school fire cases in Sweden between 1998 and 2014 through a website with official statistics from the Swedish Civil Contingencies Agency. As there was no API to allow easy programmatic access to the school fire data, I collected them by a quasi-manual process, downloading XLSX reports generated from the database year by year, after which I joined these with an R script into a single table of school fire cases where the suspected cause was arson (see the GitHub link below for full details!).
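
For concreteness, here is a minimal sketch of what that joining step could look like in R. The file names and the “cause” column (with its Swedish label) are placeholders for illustration; the real details are in the GitHub repo:

    library(readxl)  # read the downloaded XLSX reports
    library(dplyr)   # stack and filter the yearly tables

    # Hypothetical layout: one downloaded report per year, 1998-2014.
    years <- 1998:2014
    files <- sprintf("data/fires_%d.xlsx", years)

    # Read each yearly report, tag it with its year, and stack them.
    fires <- bind_rows(lapply(seq_along(files), function(i) {
      read_excel(files[i]) %>% mutate(year = years[i])
    }))

    # Keep only cases where the suspected cause was arson
    # ("anlagd brand" as a placeholder for the Swedish source label).
    arson <- filter(fires, cause == "anlagd brand")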

To complement these data, I used a list of municipal KPIs (key performance indicators) from 2014 that Johan Dahlberg put together for our contribution to Hack for Sweden earlier this year. These KPIs were extracted from Kolada (a database of Swedish municipality and county council statistics) by repeatedly querying its API.
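
The repeated querying is simple enough to sketch. Kolada exposes a JSON REST API (api.kolada.se/v2); the KPI id below is just a placeholder, and the exact response structure may differ from this sketch:

    library(jsonlite)  # fetch and parse JSON from the web

    # One KPI, one year, all municipalities. Looping over a vector of
    # KPI ids and binding the results builds the full table.
    kpi_id <- "N00945"  # placeholder id
    url <- sprintf("http://api.kolada.se/v2/data/kpi/%s/year/2014", kpi_id)
    resp <- fromJSON(url)

    # In this sketch, the measurements are nested under resp$values.
    str(resp$values)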

There is a GitHub repo containing all the data and detailed information on how it was extracted.

The open Kaggle dataset lives at https://www.kaggle.com/mikaelhuss/swedish-school-fires. So far, the process of uploading and describing the data has been smooth. I’ve learned that each Kaggle dataset has an associated discussion forum and (potentially) a bunch of “kernels”, which are analysis scripts or notebooks in Python, R or Julia. I hope that other people will contribute scripts and analyses based on these data. Please do if you find this dataset intriguing!

Data hype!

There’s a lot of talk about data now, and it seems to be accelerating. Just during the last few weeks, I’ve been told that:

  • data is money (and money is data). Quote: “In the new data economy, money is just a certain type of data. Nothing more and nothing less.” [Stephan Noller, The Future of Targeting]
  • the data singularity is here. Quote: “The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.” [Michael Driscoll, Dataspora]
  • (big) data is less about size, more about freedom. Quote: “[T]he data renaissance is here. Be a part of it.” [Bradford Cross for TechCrunch]
  • (big) data is the next big thing. Quote: “This is not last year’s data-mining. This is data-mining on steroids!” [Jonathan Reichental, PwC Innovation Blog]
  • (open) data went worldwide in 2009. Quote: “The cry of “raw data now,” which I made people make in the auditorium, was heard around the world.” [Tim Berners-Lee]
  • open data and cloud data services are at least as important as open source. Quote (actually from Peter Norvig): “We [=Google] don’t have better algorithms than anyone else. We just have more data.” [Tim O’Reilly]
  • the future belongs to companies who combine public data in the right way and offer analytics-based insights. Quote: “How did Flightcaster, a one time Y Combinator startup, put itself in a position to know more about the state of airline operations than the airlines themselves?” [tecosystems]

Of course, we shouldn’t forget IBM’s data baby, who’s out there generating sensor data even before s(he) has been born. A citizen of the future.

1.2 zettabytes of data

OK, so I was a bit slow to discover this, but The Economist has a special report on big data which is freely available online. That is, the individual articles are free, and a PDF compiling them is supposed to cost 3 GBP, but I was able to download it for free here without doing anything special.

A fun fact that I learned from this report is that the total amount of information in the world this year is projected to reach 1.2 ZB (zettabytes) – which is 1.2×10^21 bytes. How on earth did they come up with that figure…? Anyway, this report is worth a read, as it touches on things like business analytics, web mining, open government data and augmented cognition, while also giving some well-deserved love to R and open source software.
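
The figure is at least easy to put in perspective with some back-of-the-envelope R, assuming a world population of roughly 6.9 billion in 2010:

    bytes_total <- 1.2e21   # 1.2 ZB expressed in bytes
    world_pop   <- 6.9e9    # rough 2010 world population (assumption)

    # Works out to about 170 GB of data per person on the planet.
    bytes_total / world_pop / 1e9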

Link roundup

Here are some interesting links from the past few weeks (or in some cases, months). I’m toying with the idea of just tweeting most of the links I find in the future and reserving the blog for more in-depth ruminations. We’ll see how it turns out. Anyway … here are some links!

Open Data

The collaborative filtering news site Reddit has introduced a new Open Data category.

Following the example of New York and San Francisco (among others), London will launch an open data platform, the London Data Store.

Personal informatics and medicine

Quantified Self has a growing (and open/editable) list of self-tracking and related resources. Notable among those is Personal Informatics, which itself tracks a number of resources – I like the term personal informatics and the site looks slick.

Nicholas Felton’s Annual Report 2009. “Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report.” Amazing guy.

What can I do with my personal genome? A slide show by LaBlogga of Broader Perspectives.

David Ewing Duncan, “the experimental man”, has read Francis Collins’ new book about the future of personalized medicine (The Language of Life: DNA and the Revolution in Personalized Medicine) and written a rather lukewarm review of it.

Duncan himself is involved in a very cool experiment (again) – the company Cellular Dynamics International has promised to grow him some personalized heart cells. Say what? Well, basically, they are going to take blood cells from him, “re-program” them back to stem-cell-like cells (induced pluripotent stem cells), and make those differentiate into heart cells. These will, of course, be a perfect genetic match for him.

Duncan has also put information about his SNPs (single-nucleotide polymorphisms; basically variable DNA positions that differ from person to person) online for anyone to view, and promises to make 2010 the year when he tries to make sense of all the data, including SNP information, that he obtained about his body while writing his book Experimental Man. As he puts it, “Producing huge piles of DNA for less money is exciting, but it’s time to move to the next step: to discover what all of this means.”

HolGenTech – a smartphone-based system for scanning barcodes of products and matching them to your genome (!) – that is, it can tell you to avoid some products if you have had a genome scan which found that you have a genetic predisposition to react badly to certain substances. I don’t think the marketing video was done in a very responsible way (it says that the system “makes all the optimal choices for your health and well being every time you shop for your genome“, but this is simply not true – we know too little about genomic risk factors to be able to make any kind of “optimal” choices), but I had to mention it.
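
The matching itself is mechanically simple, for what it’s worth. Here is a toy version in R, where every SNP id, genotype and substance link is invented purely for illustration and has no medical meaning:

    # Hypothetical lookup: risk genotype at a SNP -> substance to flag.
    risk_table <- data.frame(
      snp       = c("rs0000001", "rs0000002"),
      genotype  = c("AA", "CT"),
      substance = c("substance_x", "substance_y"),
      stringsAsFactors = FALSE
    )

    # The user's (equally hypothetical) genome scan results.
    my_genome <- c(rs0000001 = "AA", rs0000002 = "CC")

    # Ingredients of a scanned product, as looked up from its barcode.
    product_ingredients <- c("substance_x", "substance_z")

    # Flag ingredients whose risk genotype matches the user's genotype.
    hits <- risk_table$genotype == my_genome[risk_table$snp] &
            risk_table$substance %in% product_ingredients
    risk_table$substance[hits]  # -> "substance_x"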

The genome they use in the above presentation belongs to the journalist Boonsri Dickinson. Here are some interviews she recently did with Esther Dyson and Leroy Hood, on personalized medicine and systems biology, respectively, at the Personalized Medicine World Conference in January.

Online calculators for cancer outcome and general lifestyle advice. These are very much in the spirit of The Decision Tree blog, through which I in fact found these calculators.

Data mining

Microsoft has patented a system for “Personal Data Mining”. It is pretty heavy reading, and I know too little about patents to be able to tell how much this would actually prevent anyone from building various types of recommendation systems and personal data mining tools in the future; probably not to any significant extent?

OKCupid has a fun analysis about various characteristics of profile pictures and how they correlate to online dating success. They mined over 7000 user profiles and associated images. Of course there are numerous caveats in the data interpretation and these are discussed in the comments; still good fun.

The Microgaming poker network has tried to curb data mining of its poker data. Among other things, bulk downloading of hand histories will be made impossible.

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also theinfo.org, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network (CKAN), which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as data.gov.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data“. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way compared to other services like the Amazon Open Data Sets, for instance.

Companies specializing in helping users make sense of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs Data Applied’s file size limit of <100 kB for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the two companies’ offerings are, but time will tell.

Open government and municipal data

Several projects having to do with open-access data from governmental and municipal sources have been announced in the past few months. Governments are, of course, already using analytics for a lot of things – this post from A Smarter Planet talks about how the Social Security Administration in the US is “using analytics and predictive modeling to make quicker determinations on disability applications for those in need” and how the US Postal Service is “extracting valuable insights from information on mail delivery to improve on-time delivery performance” (a postal service that improves – now that’s a novelty!), but now (part of) these data will be available to anyone.

In the US, federal data are or will be made available at data.gov. The site seems pretty well designed, with data sets being queryable online and raw data being downloadable in XML, comma-separated (CSV) and other formats. Among the data sets at data.gov, there are some interesting health-related ones, like the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database.
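
Pulling the raw data into an analysis environment is then a one-liner; for example in R, with a placeholder URL since dataset locations change:

    # Placeholder URL -- substitute the CSV download link of a real
    # data.gov dataset.
    url <- "https://data.gov/path/to/some_dataset.csv"

    # read.csv accepts URLs directly, so this fetches and parses in one go.
    dataset <- read.csv(url)
    str(dataset)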

The City of San Francisco has its own data clearinghouse, DataSF. Strongly supported by Mayor Gavin Newsom, it was launched in August. Given the huge population of programmers in SF, it’s no surprise that mash-up applications using the data started to appear quickly, e.g. “Routesy, which offers directions based on real-time city transport feeds; and EcoFinder, which points you to the nearest recycling site for a given item” (see this Guardian article for more). New York City has its own NYC Data Mine, which has come under some criticism. It is hosting the BigApps competition, which is meant to “stimulate innovation in the information technology and media industries, and attract and support developer talent to develop web and mobile applications (apps) by using City data.”

The real pioneer in open government data, though, seems to be not the US but Australia. The Australian government recently arranged a “hack day” named GovHack, where they invited programmers to develop mashups and applications around government data during 24 intense hours. Some of the projects that came out of this event were:

  • The overall winner, LobbyClue, built by a team many of whose members had never met before the event. LobbyClue is an in-depth visualisation of lobbying groups’ relations to government agencies, including tenders awarded, links between the various agencies, and physical office locations.
  • It’s buggered, mate, which, in true Australian style, allows you to report buggered toilets, roads, etc. with an easy-to-use graphical interface overlaid on a map. The idea is to combine this with local government services to fix issues in the community. Built by a team of developers from Lonely Planet.
  • Know where you live, a stylish presentation of ABS data (along with Flickr geocoded photos), pulling in relevant information for a particular postcode: rental rates, average income, crime rates, and more. Built by a team of developers who work at News Digital Media.

Pretty cool, isn’t it?
