Chris Sizemore at the BBC’s Radio Labs demonstrates an experiment in automated metadata, much akin to Open Calais.
Sizemore has taken Wikipedia and has built a simple web application that uses Wikipedia to disambiguate entities in a block of text and suggest broad categories for the content. Because Wikipedia has broad coverage of topics and deep coverage of specific niches, it can provide, as Sizemore writes, for some areas (especially popular culture), a good enough data source for automated classification.
Here’s Sizemore’s methodology –
1. Download entire contents of the English language Wikipedia (careful, that’s a large 4GB+ xml file!)
2. Parse that compressed XML file into individual text files, one per Wikipedia article (and this makes things much bigger, to the tune of 20GB+, so make sure you’ve got the hard drive space cleared)
3. Use a Lucene indexer to create a searchable collection (inc. term vectors) of your new local Wikipedia text files, one Lucene document per Wikipedia article
4. Use Lucene’s ‘MoreLikeThis’ to compare the similarity of a chunk of your own text content to the Wikipedia documents in your new collection
5. Treat the ranked Wikipedia articles returned as suggested categories for your text
Basically what is going on here is that the text you wish to classify is compared to Wikipedia articles and the articles with the ‘closest match’ in terms of content, have their URLs thrown back as potential classification categories.
Combine this with Open Calais and there will be some very interesting results across a broad range of text datasets.
As regular readers will know, we’ve been experimenting quite a bit with Open Calais at the Powerhouse with some exciting initial results. We’ve been looking at the potential of Calais in combination with other data sources including Wikipedia/dbPedia/Freebase and we’ll be watching Sizemore’s experiment with interest.
Perhaps my throwaway line in recent presentations that ‘humans should never have to create metadata’ might actually be becoming closer to a reality.