Research Threads

Currently, I'm most interested in statistical machine translation (SMT), maximum entropy approaches to language processing and, more recently, working with massive streams of social media (e.g. blogs, tweets, prediction markets).

Did you know that, when using blog posts to predict stock prices, Google is a better predictor of Yahoo!'s stock price than Yahoo! itself? Or that tweets are more useful than blogs for modelling people's collective belief in an imminent swine flu epidemic?

Related to these broad areas is the question of how to train and apply large models. For example, our machine translation systems need to run on a cluster of machines, as do our CRFs. Throwing more machines at such models is a quick fix, but it is clear that the most interesting models and datasets will make computational demands that far outstrip whatever resources we have available. Scaling our machine learning methods will become crucial. Randomised algorithms will prove essential here, and our work using Bloom filters for language models is a start in this direction. Additionally, infrastructure to support large-scale experiments is vital. I have been installing and experimenting with Hadoop as a fun way to do this.
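
To give a flavour of the randomised approach, here is a minimal sketch of a plain Bloom filter used to test n-gram membership in a fixed amount of memory. It is a generic textbook construction, not the scheme from our papers: the class name `BloomFilter`, the helper `ngrams`, the SHA-256 salting trick and the sizes chosen are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by k hash functions (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for an unseen item
        # (a false positive), but never False for an item that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus, hypothetical sizes: 1024 bits and 3 hash functions.
corpus = "the cat sat on the mat".split()
bf = BloomFilter(num_bits=1024, num_hashes=3)
for gram in ngrams(corpus, 2):
    bf.add(gram)
```

The filter never stores the n-grams themselves, only bits, so memory use is fixed in advance. The price is a small, tunable chance that an unseen n-gram is reported as present.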

My list of papers is divided by topic, and the topics roughly correspond to my research threads. I'm always happy to take on one or two new PhD students each year.