At the BBC newsHACK, my team built a tool that aligns news reports across languages: a user reading an article online can pick a language from a drop-down box and be shown a near-parallel article in that language.
It's important to note that this isn't machine translation, which would be one (crummy) way to get a similar effect. Instead, we align existing articles based on features they share. This is useful for users, but could also be quite useful for the News Labs team themselves -- a corpus of aligned news reports would let them build some fancier tools and dig into the differences between articles.
For each article, our process was (roughly):
This post is my list of areas where I think we could improve the system.
The current approach to language identification and machine translation (MT) is quite slow, because we call Google's translation service over the web. We should look for MT software that we could run internally -- existing libraries or tools which are as good as Google Translate.
The set of features is rather small at the moment. We could perhaps improve this by comparing the URLs that appear in each article body, or by revisiting the image and video comparison idea. Additionally, we could expand and improve the textual similarity comparison by using the multilingual semantic tagger -- if this works well enough, we could even drop the reliance on machine translation entirely.
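As a rough illustration of the URL idea, here is a minimal sketch that scores two article bodies by the overlap of the links they contain. Everything in it is an assumption for illustration: the function names `extract_urls` and `url_overlap`, the simple regular expression, and the choice of Jaccard similarity are not what we built at the hack.

```python
import re

# Deliberately simple pattern (an assumption): anything starting with
# http:// or https:// up to the next whitespace, quote or bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_urls(body):
    """Return the set of URLs found in an article body.

    Trailing sentence punctuation is stripped so that a link at the
    end of a sentence still matches the same link elsewhere.
    """
    return {url.rstrip(".,;:") for url in URL_PATTERN.findall(body)}


def url_overlap(body_a, body_b):
    """Jaccard similarity of the two articles' URL sets, in [0, 1].

    Returns 0.0 when either article has no URLs, so a missing
    feature never counts as evidence of a match.
    """
    urls_a, urls_b = extract_urls(body_a), extract_urls(body_b)
    if not urls_a or not urls_b:
        return 0.0
    return len(urls_a & urls_b) / len(urls_a | urls_b)


english = "See https://example.org/report and https://example.org/data."
welsh = "Gweler https://example.org/report am fwy."
print(url_overlap(english, welsh))  # one shared URL out of two distinct ones
```

The appeal of a feature like this is that it needs no translation at all, since links are the same whatever language the surrounding text is in.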
Working with some training data, or even just spending longer with unannotated data, would allow us to get some proper machine learning in here and find better weights and thresholds.
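To make "finding better weights and thresholds" concrete, here is a minimal sketch of the simplest thing that could work: a brute-force grid search over a handful of labelled article pairs. It assumes that a candidate pair is scored as a weighted average of per-feature similarities in [0, 1] and accepted when the score reaches a threshold. The feature names `text` and `urls`, the toy data, and the function names are all hypothetical.

```python
from itertools import product


def combined_score(features, weights):
    """Weighted average of feature similarities, kept in [0, 1]."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[name] * features[name] for name in weights) / total


def accuracy(pairs, weights, threshold):
    """Fraction of labelled pairs classified correctly."""
    correct = sum(
        (combined_score(features, weights) >= threshold) == is_match
        for features, is_match in pairs
    )
    return correct / len(pairs)


def grid_search(pairs, feature_names, steps=5):
    """Try every weight/threshold combination on an evenly spaced grid.

    Returns (best_accuracy, weights, threshold). This is exhaustive
    and only practical for a small number of features.
    """
    grid = [i / (steps - 1) for i in range(steps)]
    best = (-1.0, None, None)
    for values in product(grid, repeat=len(feature_names)):
        weights = dict(zip(feature_names, values))
        for threshold in grid:
            score = accuracy(pairs, weights, threshold)
            if score > best[0]:
                best = (score, weights, threshold)
    return best


# Toy labelled data (made up): feature similarities, and whether the
# two articles really are about the same story.
pairs = [
    ({"text": 0.9, "urls": 0.5}, True),
    ({"text": 0.8, "urls": 0.0}, True),
    ({"text": 0.3, "urls": 0.0}, False),
    ({"text": 0.2, "urls": 0.1}, False),
]
best_accuracy, best_weights, best_threshold = grid_search(pairs, ["text", "urls"])
print(best_accuracy, best_weights, best_threshold)
```

With real training data, we would want to hold some pairs back for evaluation rather than tuning and testing on the same set, and a proper classifier would probably replace the grid search quite quickly.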
From an academic point of view, this area's less interesting, but practically we could improve our current offering by introducing features like a search system to find articles, etc. At some point we'd basically just be re-implementing the BBC news website, though.