So where are all these huge data sets that I (and others) have been talking about? Well, some of them are freely available for download. For example, the extensive Reality Mining data set from MIT (which I have blogged about) is available as a mySQL database for anyone to play around with.
There are a couple of repositories for data sets. Infochimps has hundreds or probably thousands of data sets from a wide variety of sources. Some of the data is directly downloadable from the site, while other data sets are just pointed to. Datamob is a similar, though smaller, resource. Amazon’s Public Data Sets are meant to be used seamlessly from within Amazon’s cloud computing applications, like the Elastic Compute Clusters (EC2). Here, we find massive datasets such as the collection of all publicly available DNA sequences from GenBank.
Peter Skomoroch has a del.icio.us tag for datasets which is probably the most extensive reference for big downloadable data out there (and which makes this blog post rather superfluous …) Due to the magic of del.icio.us, this list is of course dynamic and continuosly growing.
Finally, programmableweb is perhaps not strictly about data per se, but provides links to known APIs for access to web-based resources through your own programs.