Topology and data analysis: Gunnar Carlsson and Ayasdi
A few months ago, I read in Wired [Data-Visualization Firm’s New Software Autonomously Finds Abstract Connections] and the Guardian [New big data firm to pioneer topological data analysis] about Ayasdi, the new data visualization & analytics company founded by Professor Gunnar Carlsson of Stanford, which has received millions in funding from Khosla Ventures, DARPA and other sources. Today, I had the opportunity to hear Carlsson speak at the Royal Institute of Technology (KTH) in Stockholm about the mathematics underlying Ayasdi’s tools. I was very eager to hear how topology (Carlsson’s specialty) connects to data visualization, and about their reported success in classifying tumor samples from patients [Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival].
The talk was very nice. Actually there were two talks – one for a more “general audience” (which, in truth, probably consisted mostly of hardcore maths or data geeks) and one that went much deeper into the mathematics – and completely over my head.
One thing that intrigued me was that almost all of his examples were taken from biology: disease progression, cell cycle gene expression, RNA hairpin folding, copy number variations, mapping genotypes to geographic regions … the list goes on. I suppose it’s simply because Carlsson happens to work a lot with biologists at Stanford, but could it be that biology is an especially fertile area for innovation in data analysis? As Carlsson highlights in a review paper [Topology and Data], data sets from high-throughput biology often have many dimensions that are superfluous or have unknown significance, and there is no good understanding of which distance measures between data points should be used. That is one reason to consider methods that do not break down when the scale is changed, such as topological methods, which are much less sensitive to the actual choice of metric.
(Incidentally, I think there is a potential blog post waiting to be written about how many famous data scientists have come out of a biology/bioinformatics background. Names like Pete Skomoroch, Toby Segaran and Michael Driscoll come to mind, and those were just the ones I thought of instantly.)
Another nice aspect of the talk was that it was no sales pitch for Ayasdi (the company was hardly even mentioned) but more of a bird’s eye view of topology and its relation to clustering and visualization. In my (over)simplified understanding, the methods presented represent the data as a network where the nodes, which are supposed to represent “connected components” in an ideal scenario, are clusters derived using, e.g., hierarchical clustering. However, there is no cutoff value defined for breaking the data into clusters, but instead the whole outcome of the clustering – its profile, so to speak – is encoded and used in the following analysis. Carlsson mentioned that one of the points of this sort of network representation of data was to “avoid breaking things apart”, as clustering algorithms do. He talked about classifying the data point clouds using “barcodes”, ensuring persistence across changes of scale. The details of how these barcodes were calculated were beyond my comprehension.
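To make the barcode idea a little more concrete for myself, here is a minimal sketch of the simplest case only: tracking connected components across all scales, which corresponds to the merge heights of single-linkage clustering. This is my own illustration, not the method from the talk and certainly not Ayasdi's implementation; the function name h0_barcode and the toy points are made up for the example, and the higher-dimensional features (loops, voids) that the full theory handles are not covered.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Return (birth, death) bars for connected components over all scales.

    Every point starts as its own component at scale 0. When the distance
    threshold grows enough to join two components, one of them "dies".
    No clustering cutoff is chosen; the whole merge history is kept.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, smallest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge the two components
            bars.append((0.0, d))    # one component dies at scale d
    bars.append((0.0, inf))          # the final component never dies
    return bars


# Hypothetical toy data: a tight group of three points and a distant pair.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
bars = h0_barcode(points)
for birth, death in bars:
    print(birth, death)
```

In this toy example, three bars end at a small scale and two last much longer (one of them forever). The long bars are the "persistent" ones, and they suggest two well-separated groups without my having picked a cutoff in advance.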
Carlsson showed some examples of how visualizations created using his methods improved on hierarchical clustering dendrograms or PCA/MDS plots. He said that one of the advantages of the method is that it can identify small features in a large data set. These features would, he said, be “washed out” in PCA and not be picked up by standard clustering methods.
I look forward to learning more about topological data analysis. There are some (links to) papers at Ayasdi’s web site if you are interested, e.g.: [Extracting insights from the shape of complex data using topology], [Topology and Data] (already mentioned above), and [Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition].