Follow the Data

A data driven blog

Archive for the category “Emerging fields”

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first, by Pawel Herman, was a very good general introduction to the roots, history, present and future of deep learning; the second, by Josephine Sullivan, was a more applied talk in which she showed some impressive results obtained by her group in image recognition, as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as a set of increasingly abstract visual categories in the case of picture classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.
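To make the “learn a representation” idea concrete, here is a minimal sketch of the simplest building block: a single-hidden-layer autoencoder that compresses a matrix into a few latent dimensions and reconstructs it. Everything here, including the synthetic “expression matrix”, is made up for illustration; a real deep learning system would stack many such layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an expression matrix: 200 "samples" x 50 "genes",
# generated from 5 hidden factors so there is structure to compress.
factors = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 50))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with tied weights: X -> H -> X_hat
n_genes, n_hidden = X.shape[1], 5
W = rng.normal(scale=0.1, size=(n_genes, n_hidden))
b_h, b_o = np.zeros(n_hidden), np.zeros(n_genes)

lr, losses = 0.01, []
for epoch in range(500):
    H = sigmoid(X @ W + b_h)        # encode to a 5-dimensional representation
    X_hat = H @ W.T + b_o           # decode back to "expression" space
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    dH = (err @ W) * H * (1 - H)    # backpropagate through the encoder
    W -= lr * (X.T @ dH + err.T @ H) / len(X)
    b_h -= lr * dH.mean(axis=0)
    b_o -= lr * err.mean(axis=0)
```

After training, each row of H is a compressed “state” describing a sample; in the deep learning setting one would train further layers on top of H to get progressively more abstract representations.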

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.
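As a concrete illustration of what “feeding k-mers to a learning system” usually means in practice, sequences are first turned into fixed-length k-mer frequency vectors. A minimal sketch (the example sequence and function name are mine, not from any particular tool):

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k=3):
    """Turn a DNA sequence into a fixed-order vector of k-mer frequencies."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]  # all 4^k k-mers
    total = max(sum(counts.values()), 1)
    return [counts[km] / total for km in alphabet]

vec = kmer_vector("ACGTACGTAC", k=3)  # a length-64 frequency vector
```

Vectors like these, computed over millions of reads, would be the raw input one could try to hand to a deep learner.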

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Compressive sensing and bioinformatics

Compressive (or compressed) sensing is (as far as I can tell; I have learned everything I know about it from the excellent Nuit Blanche blog) a relatively newly named research subject somewhere around the intersection of signal processing, linear algebra and statistical learning. Roughly speaking, it deals with problems like how to reconstruct a signal from the smallest possible number of measurements and how to complete a matrix (to fill in missing matrix elements – if you think that sounds kind of abstract, this presentation on collaborative filtering by Erik Bernhardsson nicely explains how Spotify uses matrix completion for song recommendations; Netflix and others also do similar things). The famous mathematician Terence Tao, who has done a lot of work in developing compressed sensing, has much better explanations than I can give, for instance this 2007 blog post that exemplifies the theory with the “single-pixel camera” and this PDF presentation where he explains how CS relates to standard linear (Ax = b) equation systems that are under-determined (more variables than measurements). Igor Carron of the Nuit Blanche blog also has a “living document” with a lot of explanations about what CS is about (probably more than you will want to read in a single sitting).

Essentially, compressed sensing works well when the signal is sparse in some basis (this statement will sound much clearer if you have read one of the explanations I linked above). The focus on under-determined equation systems made me think about whether this could be useful for bioinformatics, where we frequently encounter the case where we have few measurements of a lot of variables (think of, e.g., the difficulty of obtaining patient samples, and the large number of gene expression measurements that are taken from them). The question is, though, whether gene expression vectors (for example) can be thought of as sparse in some sense. Another train of thought I had is that it would be good to develop even more approximate and, yes, compressive methods for things like metagenomic sequencing, where the sheer amount of information pretty quickly starts to break the available software. (C Titus Brown is one researcher who is developing software tools to this end.)
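To make the under-determined Ax = b setting concrete: when x is sparse, it can often be recovered from far fewer measurements than unknowns by L1-regularized minimization. Below is a minimal sketch using iterative soft thresholding (ISTA), one of the simplest CS recovery algorithms; the problem sizes and regularization strength are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, k = 40, 100, 4          # 40 measurements, 100 unknowns, only 4 nonzeros
A = rng.normal(size=(n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[rng.choice(p, size=k, replace=False)] = 3 * rng.normal(size=k)
b = A @ x_true                 # under-determined system: Ax = b has many solutions

def ista(A, b, lam=0.05, n_iter=2000):
    """Iterative soft thresholding for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        z = x - grad / L                   # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

x_hat = ista(A, b)
rel_err = float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Ordinary least squares has no hope here (infinitely many exact solutions), but the sparsity prior picks out nearly the right one; this is the core trick that CS applications in biology would lean on.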

Of course, I was far from being the first one to make the connection between compressive sensing and bioinformatics. Olgica Milenkovic had thoughtfully provided a presentation on sparse problems in bioinformatics (problems that could thus be addressed with CS techniques).

Apart from the applications outlined in the above presentation, I was excited to see a new paper about a CS approach to metagenomics:

Quikr – A method for rapid reconstruction of bacterial communities using compressed sensing

Also there are a couple of interesting earlier publications:

A computational model for compressed sensing RNAi cellular screening

Compressive sensing DNA microarrays

Follow the Data podcast, episode 3: Grokking Big Data with Paco Nathan

In this third episode of the Follow the Data podcast we talk to Paco Nathan, Data Scientist at Concurrent Inc.

Podcast link:

Paco’s blog:

The running time is about one hour.

Paco’s internet connection died just as we were about to start the podcast, so he had to connect via Skype on his iPhone. We apologize on behalf of his internet provider in Silicon Valley for the reduced sound quality this caused.

Here are a few links to stuff we discussed:
An application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop.
A dialect of Lisp that runs on the JVM.
A Scala library that makes it easy to write MapReduce jobs in Hadoop.
A simple command line interface for building large-scale data processing jobs based on Cascading.
The CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability and partition tolerance.
An article on the USB-sized Oxford Nanopore MinION sequencer.
Previously known as Data Without Borders this organisation aims to do good with Big Data.
Prediction-based insurance for farmers.
All Watched Over by Machines of Loving Grace (TV series) – an interesting take on how programming culture has affected life. Link to episode #2, “The use and abuse of vegetational concepts”, about how the idea of ecosystems came to be, sprung from the notion of harmony in nature, how it influenced cybernetics, and the perils of taking this animistic concept too far.
A great way to teach kids to code.
Another interesting tool for teaching kids to code and build games.
Free form virtual reality game.
Some info on arduino-based wireless wind measurement project by Karl-Petter Åkesson (in Swedish).
A pioneering internet retailer that Paco co-founded.

Games and competitions as research tools

The first high-profile paper describing crowdsourced research results has just been published in Nature. (I am excluding things like folding@home from consideration here, since in those cases the crowds are donating their processor cycles rather than their brainpower.) The paper describes how the game FoldIt (which I blogged about roughly a year ago) was used to refine predicted protein structures. This is an excerpt from the abstract:

Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy. We show that top-ranked Foldit players excel at solving challenging structure refinement problems in which substantial backbone rearrangements are necessary to achieve the burial of hydrophobic residues. Players working collaboratively develop a rich assortment of new strategies and algorithms; unlike computational approaches, they explore not only the conformational space but also the space of possible search strategies. The integration of human visual problem-solving and strategy development capabilities with traditional computational algorithms through interactive multiplayer games is a powerful new approach to solving computationally-limited scientific problems.

So in other words, FoldIt tries to capitalize on intuitive or implicit human problem-solving skills to complement brute-force computational algorithms. Interestingly, all FoldIt players are credited as co-authors of the Nature paper, so technically I could count myself as one of them, seeing that I gave the game a try last year. (It’s a lot of fun, actually.)

I think games and competitions (which are almost the same thing, really) will soon be used a lot more than they are today in scientific research (and of course other areas like productivity, innovation and personal health management, too). The Kaggle blog had an interesting post about competitions as real-time science. In a short time, Kaggle has set up several interesting prediction contests. The Eurovision Song Contest and Football World Cup contests were, I guess, mostly for fun. The interesting thing about the latter one, though, was that it was set up as a “Take on the quants” contest, where quantitative analysts from leading banks were pitted against other contestants – and they did terribly. Now the quants have a chance to redeem themselves in the INFORMS challenge, which is about their specialty area – stock price movements …

Anyway … the newest Kaggle contest is very interesting for me as a chess enthusiast. It is an attempt to improve on the age-old (well … I think it was introduced in the late 1960s) Elo rating formula, which is still used in official chess ranking lists. This system was invented by a statistician, Arpad Elo, based mostly on theoretical considerations, but it has done its job OK. The Elo ratings should ideally be able to predict results of games with reasonable accuracy (as an aside, people have also often tried to use it to compare players from different epochs to each other, which is a futile exercise, but that’s a topic for another post), but whether it really does that has not been very thoroughly analyzed. The Elo system also has some less well understood properties like an apparent “rating inflation” (which may or may not be an actual inflation). Some years ago, a statistician named Jeff Sonas started to develop his own system that he claimed was able to predict results of future games more accurately.
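For reference, the Elo formula itself is simple: a player’s expected score is a logistic function of the rating difference, and ratings are nudged toward actual results by a K-factor. A sketch (K = 10 is a common choice for top-level play, but federations differ):

```python
def elo_expected(r_a, r_b):
    """Expected score (between 0 and 1) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=10):
    """Return both players' updated ratings; score_a is 1, 0.5 or 0."""
    e_a = elo_expected(r_a, r_b)
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1 - score_a) - (1 - e_a)))

# A 2400 player beating a 2200 player gains only a little rating,
# because the win was already expected (~76% of the time).
new_a, new_b = elo_update(2400, 2200, 1.0)
```

Note that the update is zero-sum: whatever one player gains, the other loses, which is one reason systematic effects like inflation are hard to reason about from the formula alone.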

Now, Sonas (with Kaggle) has taken the next step, which is to arrange a competition to see if this will yield an even better system. The competitors get results of 65,000 recent games by top players and attempt to predict the outcome of a further 7,809 games. At the time of writing, there are already two rating systems that are doing better than Elo (see the leaderboard).

By the way, if you think chess research is not serious enough, Kaggle also has a contest about predicting HIV progression. I’m sure they have other scientific prediction contests lined up (I’ve noticed a couple of interesting – and lucrative – ones at Innocentive too.)

Data-driven venture capitalists and more

Via Bradford Cross’ excellent post on data-driven startups (he has one himself – FlightCaster, a flight-delay prediction service that I mentioned last year), I learned the interesting fact that there is now at least one venture capital company that specializes exclusively in data-driven or “big data” startups. This company is IA Ventures, and it “invests in companies that create tools to manage and extract value from massive, occasionally unstructured, often real-time data sets”. I particularly like this sentence from their web page: “Most data generated today is simply treated as exhaust—lost forever along with the valuable insights held in it.” This is very true, and there are sure to be enormous opportunities for those who are clever enough to turn this “exhaust” – in the form of structured or unstructured data – into a product. The above-mentioned post by Bradford Cross tries to suggest some public data sets that might be leveraged by a savvy startup.

One nice example of a company that uses seemingly mundane information – cab pickup frequencies in New York City – to create a useful product is Sense Networks. They perform “some heavy-duty data crunching” on information from taxi companies and mobile phone records to predict the best places to get a cab in NYC. The predictor is implemented as an iPhone application called CabSense.  In a recent podcast named Reality Mining for Companies, Alex “Sandy” Pentland, a professor who is also on Sense Networks’ management team, describes how even more trivial information like movement patterns of individuals inside a company can actually be analyzed to improve productivity and working conditions. Did you know that productivity goes up 10% if you have coffee with a cohesive group of co-workers?

Anyway, it will be interesting to see how the data-driven startups funded by IA Ventures turn out. One of them, Recorded Future, has also recently received funding from Google. This company is based in the US and Sweden, and one of the people behind it is Christopher Ahlberg, the founder of Spotfire (a successful analytics company which was built around a user-friendly visualization tool and sold to Tibco a couple of years ago). Recorded Future attempts to predict future events (!) by analyzing and indexing various sources (news, analysis pieces, prognoses etc.) on the web. I assume they use some sort of natural language processing to recognize entities (like names of people and companies, dates etc.) and infer relationships between them from indexed reports. The company’s blog has some interesting visualizations that summarize, for example, the lives of some terrorist suspects who have recently been in the news. My favorite entry in the blog (if only for its name) is “Has Hu Jintao’s behavior changed?” These blog case studies do not contain predictions of future events, but rather a kind of proof of concept that the system can reconstruct a reasonable timeline showing important events in a person’s (or maybe a company’s) life and display it in an effective way. I did register for a couple of “Futures”, an email-based service where you get alerts about possible future events connected to a set of keywords, but the only prediction I have received so far was apparently based on some faulty date recognition.

In case you read Swedish (or are able to tolerate Google translations), the best summary I have found of what is currently known about Recorded Future is at the Cornucopia blog.

Peer-reviewed life?

For those curious about where self-tracking (or self-measurements/self-monitoring/personal informatics, or whatever we should call it) might be going in the future, it could be worth glancing through the papers from an interesting workshop, Know Thyself: Monitoring and Reflecting on Facets of One’s Life, which was held in Atlanta in April. The papers have intriguing titles like Life-browsing with a Lifetime of Email, Computational Models of Reflection, Collaborative Capturing of Significant Life Memories and From Personal Health Informatics to Health Self-management. A striking quote from a paper entitled Assisted Self Reflection: Combining Lifetracking, Sensemaking, & Personal Information Management by Moore et al:

Just as we are able to submit papers to peer-reviewed conferences and journals, we could anonymously share selected portions of our life activities for peer or professional consultation when making major career decisions, learning a new skill or in the process of recovery. By seeing ourselves through the eyes of others, we are more able to normalize behavior patterns and raise awareness of suppressed abnormalities.

I’m not sure I am ready for peer review yet … maybe some day…

Stream computing for babies

A Smarter Planet has a nice video about how IBM have used stream computing (basically meaning, I think, real-time analysis of massive streams of unstructured data) to improve the detection of life-threatening complications in prematurely born babies. Doctors at The Hospital for Sick Children in Toronto wanted to try to use real-time information to detect changes in the condition of critically ill “preemies”. They set up a system to measure streams of physiological data about e.g. respiration and heart rate and analyze them on the fly. In a cute comparison, the narrator says that the IBM InfoSphere “…enables massive amounts of data to be correlated and analyzed for patterns and trends at more than 200 times a second, faster than a hummingbird flaps its wings.”
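Just to illustrate the flavor of this kind of on-the-fly analysis (this is in no way IBM’s actual method, and all the numbers are invented): a toy detector that consumes a stream of vital-sign readings one value at a time and flags readings far outside a sliding window of recent history.

```python
from collections import deque
from statistics import mean, stdev

def stream_alerts(stream, window=20, threshold=3.0):
    """Flag values more than `threshold` standard deviations away from the
    mean of the preceding `window` values, one reading at a time."""
    buf = deque(maxlen=window)   # only the recent past is kept in memory
    alerts = []
    for i, x in enumerate(stream):
        if len(buf) == window:
            m, s = mean(buf), stdev(buf)
            if s > 0 and abs(x - m) > threshold * s:
                alerts.append(i)
        buf.append(x)
    return alerts

# A steady, slightly noisy "heart rate" with one sudden spike at index 30.
readings = [80.0 + 0.5 * ((i * 7) % 5) for i in range(60)]
readings[30] = 140.0
alerts = stream_alerts(readings)
```

The key property shared with real stream processing is that nothing is stored beyond a bounded window, so the same code can run forever on an unbounded feed.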

A very nice application of stream analytics – and as a bonus, the video uses Terry Riley’s A Rainbow in Curved Air as part of its soundtrack (I think).

Computational advertising course

I’ve written about one company that exemplifies how advertising is becoming more data-driven, and now I find there is a Stanford university course about computational advertising. One of the lecture note PDFs defines computational advertising as “A principled way to find the ‘best match’ between a user in a context and a suitable ad”. Although I agree with this O’Reilly Radar blog post in thinking that it’s a stretch to call computational advertising a “scientific discipline”, the lecture notes are nevertheless fun and interesting to read. The instructors are from Yahoo! Research and probably a lot of the material that they cover is actually being used by Yahoo! in some way.

Far-out stuff

Some science fiction-type nuggets from the past few weeks:

Google does machine learning using quantum computing. Apparently, a “quantum algorithm” called Grover’s algorithm can search an unsorted database in O(√N) time. The Google blog explains this in layman’s terms:

Assume I hide a ball in a cabinet with a million drawers. How many drawers do you have to open to find the ball? Sometimes you may get lucky and find the ball in the first few drawers but at other times you have to inspect almost all of them. So on average it will take you 500,000 peeks to find the ball. Now a quantum computer can perform such a search looking only into 1000 drawers.
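The numbers in that example check out against the standard query counts: a classical search of N drawers needs (N + 1)/2 peeks on average, while Grover’s algorithm needs on the order of √N quantum queries (about (π/4)·√N in the standard analysis), which is roughly the “1000 drawers” in the quote:

```python
import math

N = 1_000_000

classical_avg = (N + 1) / 2                      # expected peeks, random linear search
grover_queries = (math.pi / 4) * math.sqrt(N)    # standard Grover query count
```

For a million drawers this gives about 500,000 classical peeks versus roughly 785 quantum queries – a quadratic, not exponential, speedup.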

I’ve absolutely no clue how this algorithm works – although I did take an introductory course in quantum mechanics many a moon ago, I’ve forgotten everything about it and the course probably didn’t go deep enough to explain it anyway. Google are apparently collaborating with a Canadian company called D-Wave, who develop hardware for realizing something called a “quantum adiabatic algorithm” by “magnetically coupling superconducting loops”. It is interesting that D-Wave are explicitly focusing on machine learning; the home page states that “D-Wave is pioneering the development of a new class of high-performance computing system designed to solve complex search and optimization problems, with an initial emphasis on synthetic intelligence and machine learning applications.”

Speaking of synthetic intelligence, the winter issue of H+ Magazine contains an article by Ben Goertzel where he discusses the possibility that the first artificial general intelligence will arise in China. The well-known AI researcher Hugo de Garis, who runs a lab in Xiamen in China, certainly believes that this will happen. In his words:

China has a population of 1.3 billion. The US has a population of 0.3 billion. China has averaged an economic growth rate of about 10% over the past 3 decades. The US has averaged 3%. The Chinese government is strongly committed to heavy investment into high tech. From the above premises, one can virtually prove, as in a mathematical theorem, that China in a decade or so will be in a superior position to offer top salaries (in the rich Southeastern cities) to creative, brilliant Westerners to come to China to build artificial brains — much more than will be offered by the US and Europe. With the planet’s most creative AI researchers in China, it is then almost certain that the planet’s first artificial intellect to be built will have Chinese characteristics.

Some other arguments in favor of this idea mentioned in the article are that “One of China’s major advantages is the lack of strong skepticism about AGI resulting from past failures” and that China “has little of the West’s subliminal resistance to thinking machines or immortal people”.

(By the way, the same issue contains a good article by Alexandra Carmichael on subjects frequently discussed on this blog. The most fascinating detail from that article, to me, was when she mentions “self-organized clinical trials“; apparently users of PatientsLikeMe with ALS had set up their own virtual clinical trial where some of them started to take lithium and some didn’t, after which the outcomes were compared.)

Finally, I thought this methodology for tagging images with your mind was pretty neat. This particular type of mind reading does not seem to have reached a high specificity and sensitivity yet, but that will improve in time.

Predictive policing

While listening to my backlog of the BBC Arts and Ideas podcast, I stumbled into a discussion of predictive policing in an interview with William Bratton, Chief of the Los Angeles Police Department and former Chief of the NYPD (he is said to have come up with the “zero tolerance” concept). The podcast is here (mp3 link); the predictive policing discussion is toward the end. Of course, this concept evokes something out of Philip K. Dick (“Minority Report”). Bratton has been involved in something called COMPSTAT, “the internationally acclaimed command accountability system that uses computer-mapping technology and timely crime analysis to target emerging crime patterns and coordinate police response.” He calls for enhanced wireless broadband capabilities for public safety, and claims that predictive policing (well, he actually says “utilizing technology”) has so far prevented 300 homicides in Los Angeles, corresponding to a net positive economic impact of 1.2 billion USD (yes, a homicide has a negative impact of 4 million USD!)

Bratton talks about “real-time crime centers” and “hot spot policing” where emerging patterns or trends can be detected early and the area in question can be flooded with police resources. Somewhat in analogy to current healthcare trends, the focus is moving to prevention of crime rather than response to crime.

Here is an interesting article about predictive analytics in policing.

Post Navigation