Improving the Linkatron

At the BBC newsHACK the team I was part of built a tool which aligns news reports across languages, so that users reading an article online could select a language from a drop-down box and be presented with a near-parallel article in the language of their choice.

It's important to note that this isn't machine translation, which would be one (crummy) way to get a similar effect, but rather alignment of the articles based on some corresponding features. This is useful for users, but could also be quite useful for the News Labs team themselves -- a corpus of aligned news reports would let them build some fancier tools and inspect the differences between articles.

For each article, our process was (roughly):

  1. Language Identification -- handled by Google Translate
  2. Machine Translation -- the article summaries and some additional text are translated into English (by Google Translate).
  3. Feature Extraction -- we gather time of publication, mentioned Named Entities and verbs from the (translated) text for each article.
  4. Feature Comparison -- features are compared between candidate pairs of articles (pairs published on the same date in different languages), each feature set giving a similarity score between 0 and 1.
  5. Weighting -- the comparison vectors are weighted by some hand-tweaked weights.
  6. Thresholding -- the summed similarities are compared to a hand-tweaked threshold, and pairs of articles whose score clears it are matched.
  7. Presentation -- the data is fed into a bitchin' UI for users to read from.
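
To make steps 4 to 6 concrete, here is a minimal sketch in Python. The Article container, the Jaccard overlap measure, the linear time decay, the weights and the threshold value are all placeholders chosen for illustration, not necessarily what the hack used.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Article:
    # Hypothetical container for the features gathered in step 3.
    lang: str
    published: datetime
    entities: set = field(default_factory=set)
    verbs: set = field(default_factory=set)


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no evidence (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def time_similarity(a, b, window_hours=24.0):
    """1.0 for identical timestamps, falling linearly to 0.0 at window_hours apart."""
    gap = abs((a.published - b.published).total_seconds()) / 3600.0
    return max(0.0, 1.0 - gap / window_hours)


# Assumed hand-tweaked values; the weights sum to 1 so the score stays in [0, 1].
WEIGHTS = {"entities": 0.5, "verbs": 0.2, "time": 0.3}
THRESHOLD = 0.6


def is_candidate(a, b):
    """Step 4 pre-filter: same publication date, different language."""
    return a.lang != b.lang and a.published.date() == b.published.date()


def score(a, b):
    """Steps 4 and 5: per-feature similarities combined as a weighted sum."""
    sims = {
        "entities": jaccard(a.entities, b.entities),
        "verbs": jaccard(a.verbs, b.verbs),
        "time": time_similarity(a, b),
    }
    return sum(WEIGHTS[name] * sims[name] for name in WEIGHTS)


def is_match(a, b):
    """Step 6: a candidate pair matches when its score reaches the threshold."""
    return is_candidate(a, b) and score(a, b) >= THRESHOLD
```

Keeping the per-feature similarities in a named dictionary means a new feature only needs a new entry in WEIGHTS and one more line in score().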

This post is my list of areas where I think we could improve the system.

Language ID and MT

The current approach for language ID and MT is quite slow, because every call goes over the web to Google's translation service. We should look at software approaches for MT which we could run internally -- existing libraries or tools which are as good as Translate.

Feature Extraction

The set of features is rather small at the moment. We could perhaps improve this by looking at URLs from the article body, or revisiting the image and video comparison idea. Additionally, we could expand and improve the textual similarity comparison by using the multilingual semantic tagger -- if this works well enough, we could even drop the reliance on machine translation entirely.
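
As a rough idea of what the URL feature might look like, the sketch below pulls links out of two article bodies and scores their overlap between 0 and 1, like the other features. The regular expression, the normalisation (lower-cased host without "www.", path without a trailing slash) and the function names are all assumptions for illustration; real article bodies would probably need a proper HTML parser.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: stops at whitespace, quotes, angle brackets and ")".
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_links(body):
    """Return the set of normalised links (host + path) found in an article body."""
    links = set()
    for raw in URL_PATTERN.findall(body):
        # Drop punctuation that often trails a URL in running text.
        parts = urlsplit(raw.rstrip(".,;:"))
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        links.add(host + parts.path.rstrip("/"))
    return links


def link_similarity(body_a, body_b):
    """Jaccard overlap of the two link sets; 0.0 if either article has no links."""
    links_a, links_b = extract_links(body_a), extract_links(body_b)
    if not links_a or not links_b:
        return 0.0
    return len(links_a & links_b) / len(links_a | links_b)
```

A nice property of this feature is that it needs no translation at all: a link to the same source document looks the same in every language.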

Weighting/Thresholding

Working with some training data, or even just spending longer with unannotated data, would allow us to get some proper machine learning in here and find better weights and thresholds.
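
Even before any real machine learning, a small set of hand-labelled pairs would let us pick the threshold from data rather than by feel. The sketch below assumes we have a list of (score, is_true_match) tuples, which is hypothetical since we have no such annotations yet, and sweeps the observed scores for the threshold with the best F1. It only tunes the threshold; learning the weights as well would need something like a logistic regression over the per-feature similarities.

```python
def f1_at(threshold, scored_pairs):
    """F1 of the rule 'score >= threshold means match' on labelled pairs."""
    tp = sum(1 for s, y in scored_pairs if s >= threshold and y)
    fp = sum(1 for s, y in scored_pairs if s >= threshold and not y)
    fn = sum(1 for s, y in scored_pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs):
    """Try every observed score as a threshold and keep the one with the highest F1."""
    candidates = sorted({s for s, _ in scored_pairs})
    return max(candidates, key=lambda t: f1_at(t, scored_pairs))
```

F1 is just one choice of objective here; if a wrong match is worse for readers than a missing one, it may make more sense to favour precision.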

Presentation

From an academic point of view, this area's less interesting, but practically we could improve our current offering by introducing features like a search system to find articles, etc. At some point we'd basically just be re-implementing the BBC news website, though.