Today we went live with another one of the new experimental features of our collection database – auto-generation of tags based on semantic parsing.
Throughout the Museum’s collection database you will now find, in the right hand column of the more recently acquired objects (see a quick sample list), a new cluster of content titled “Auto-generated tags”.
We have been experimenting with Reuters’ OpenCalais web service since it launched in January. Now we have made a basic implementation of it applied to records in our collection database, initially as a way of generating extra structured metadata for our objects. We can extract proper names, places (by continent, country, region, state and city), company names, technologies and specialist terms, from object records all without requiring cataloguers to catalogue in this way. Having this data extracted makes it much easier for us to connect objects by manufacturers, people, and places within our own collection as well as to external resources.
Here’s a brief description of what OpenCalais is in a nutshell from their FAQ –
From a user perspective it’s pretty simple: You hand the web service unstructured text (like news articles, blog postings, your term paper, etc) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.
Using natural language processing and machine learning techniques, the Calais web service looks inside your text and locates the entities (people, places, products, etc), facts (John Doe works for Acme Corp) and events (Jane Doe was appointed as a Board member of Acme Corp) in the text. Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.
Whilst we store the RDF triples and unique hash, we are not making use of these beyond display right now. There is a fair bit of ‘cleaning up’ we have to do first, and we’d like to enlist your help so read on.
Obviously the type of content that we are asking OpenCalais to parse is complex. Whilst it is ideally suited to the more technical objects in our collection as well as our many examples of product design, it struggles with differentiating between content on some object records.
Here is a good example from a recent acquisition of amateur radio equipment used in the 1970s and 1980s.
The OpenCalais tags generated are as follows –
The bad:
The obvious errors which need deleting are the classification of “Ray Oscilloscope” as a person (although that might be a good name for my next avatar!); “Amateur Microprocessor Teleprinter Over Radio” as a company; the rather sinister “Terminal Unit” as an organisation; and the meaningless “metal” as an industry term.
We have included a simple ‘X’ to allow users to delete the ones that are obviously incorrect and will be tracking its use.
These errors and other like them reveal OpenCalais’ history as Clearforest in the business world. The rules it applies when parsing text as well as the entities that it is ‘aware’ of are rooted in the language of enterprise, finance and commerce.
The good:
On the otherhand, by making all this new ‘auto-generated’ tag data available, users can now traverse our collection in new ways, discovering connections between objects that previous remained hidden deep in blocks of text.
Currently clicking any tag will return a search result for that term in the rest of our collection. In a few hours of demonstrations to registrars and cataloguers today many new connections between objects were discovered, and people, who we didn’t expect to be mentioned in our collection documentation, revealed.
Help us:
Have a play with the auto-tags and see what you can find. Feel free to delete incorrect auto-tags.
We will be improving their operation over the coming weeks, but hope that this is a useful demonstration of some of the potential lying dormant in rich collection records and a real world demonstration of what the ‘semantic web’ might begin to mean for museums. It is important to remember that there is no way that this structured data could be generated manually – the volume of legacy data is too great and the burden on curatorial and cataloguing staff would be too great.