Category: Collection databases

OPAC2.0 – OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data

Post author By Seb Chan
Post date March 31, 2008

Today we went live with another one of the new experimental features of our collection database – auto-generation of tags based on semantic parsing.

Throughout the Museum’s collection database you will now find, in the right hand column of the more recently acquired objects (see a quick sample list), a new cluster of content titled “Auto-generated tags”.

We have been experimenting with Reuters’ OpenCalais web service since it launched in January. We have now applied a basic implementation of it to records in our collection database, initially as a way of generating extra structured metadata for our objects. We can extract proper names, places (by continent, country, region, state and city), company names, technologies and specialist terms from object records, all without requiring cataloguers to catalogue in this way. Having this data extracted makes it much easier for us to connect objects by manufacturers, people, and places within our own collection as well as to external resources.

Here’s a brief description of what OpenCalais is in a nutshell from their FAQ –

From a user perspective it’s pretty simple: You hand the web service unstructured text (like news articles, blog postings, your term paper, etc) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais web service looks inside your text and locates the entities (people, places, products, etc), facts (John Doe works for Acme Corp) and events (Jane Doe was appointed as a Board member of Acme Corp) in the text. Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.

Whilst we store the RDF triples and unique hash, we are not making use of these beyond display right now. There is a fair bit of ‘cleaning up’ we have to do first, and we’d like to enlist your help so read on.

Obviously the type of content that we are asking OpenCalais to parse is complex. Whilst it is ideally suited to the more technical objects in our collection as well as our many examples of product design, it struggles with differentiating between content on some object records.

Here is a good example from a recent acquisition of amateur radio equipment used in the 1970s and 1980s.

The OpenCalais tags generated are as follows –

The bad:

The obvious errors which need deleting are the classification of “Ray Oscilloscope” as a person (although that might be a good name for my next avatar!); “Amateur Microprocessor Teleprinter Over Radio” as a company; the rather sinister “Terminal Unit” as an organisation; and the meaningless “metal” as an industry term.

We have included a simple ‘X’ to allow users to delete the ones that are obviously incorrect and will be tracking its use.
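As a rough illustration of how the delete-and-track idea could be modelled, here is a minimal Python sketch. It is not our production code: the class names, field names and entity type labels (`AutoTag`, `AutoTagSet`, `entity_type`, `"Person"` and so on) are assumptions made up for this example, and the tags are the mis-classified ones from the record above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AutoTag:
    """One auto-generated tag: the extracted text and the type assigned to it."""
    text: str
    entity_type: str      # hypothetical labels, e.g. "Person", "Company"
    deleted: bool = False


class AutoTagSet:
    """Auto-generated tags for one object record, with deletion tracking."""

    def __init__(self, tags: Iterable[AutoTag]) -> None:
        self.tags: List[AutoTag] = list(tags)
        # Count deletions per entity type to see where the parser struggles.
        self.deletions: Counter = Counter()

    def delete(self, text: str) -> bool:
        """Hide a tag (the 'X' click). Returns False if nothing was hidden."""
        for tag in self.tags:
            if tag.text == text and not tag.deleted:
                tag.deleted = True
                self.deletions[tag.entity_type] += 1
                return True
        return False

    def visible(self) -> List[str]:
        """Tags still shown on the object page."""
        return [tag.text for tag in self.tags if not tag.deleted]


record_tags = AutoTagSet([
    AutoTag("Ray Oscilloscope", "Person"),
    AutoTag("Terminal Unit", "Organisation"),
    AutoTag("metal", "IndustryTerm"),
])
record_tags.delete("Ray Oscilloscope")
print(record_tags.visible())
print(dict(record_tags.deletions))
```

Marking a tag as deleted rather than removing it keeps the original parser output intact, so the deletion counts can later show which entity types are most often wrong.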

These errors and others like them reveal OpenCalais’ history as Clearforest in the business world. The rules it applies when parsing text, as well as the entities that it is ‘aware’ of, are rooted in the language of enterprise, finance and commerce.

The good:

On the other hand, by making all this new ‘auto-generated’ tag data available, users can now traverse our collection in new ways, discovering connections between objects that previously remained hidden deep in blocks of text.

Currently, clicking any tag will return search results for that term in the rest of our collection. In a few hours of demonstrations to registrars and cataloguers today, many new connections between objects were discovered, and people we didn’t expect to be mentioned in our collection documentation were revealed.

Help us:

Have a play with the auto-tags and see what you can find. Feel free to delete incorrect auto-tags.

We will be improving their operation over the coming weeks, but hope that this is a useful demonstration of some of the potential lying dormant in rich collection records and a real world demonstration of what the ‘semantic web’ might begin to mean for museums. It is important to remember that there is no way that this structured data could be generated manually – the volume of legacy data is too great and the burden on curatorial and cataloguing staff would be too heavy.

Collection databases Imaging

Microsoft Seadragon, Silverlight and collections

Post author By Seb Chan
Post date March 6, 2008

Last year there was an incredible presentation at TED which featured a demonstration of Seadragon, a technology that Microsoft licensed and has continued to develop.

Whilst the BBC and others have been using the Seadragon spinoff Photosynth quite effectively, Seadragon itself seems to have the most immediate use within the cultural sector with our large volume of 2D digitised resources.

Collection databases Web 2.0

OPAC2.0 – new context features – collections, ‘parts’, and narratives

Post author By Seb Chan
Post date January 7, 2008

As promised some of the new features of our collection database have started to go live.

We have been spending a lot of time working through a range of legacy issues to do with how collection data is structured in our collection management system and how this affects the options for its more flexible use on the web. Part of the problem lies in the way in which museum professionals classify, and another part in how these classification practices become hard-coded into collection management software.

A very big problem for us has been ‘parts’ and ‘collections’ and how they have been historically catalogued. An example is our Hedda Morrison collection of photographs. In our collection management system this set of 349 photographs has an object number (92/1414) with associated structured data. Then, each of the 349 photographs has its own object number (92/1414-1 through to 92/1414-349) related to that of the parent, with a lesser set of structured data – because it is assumed that it inherits some data from its parent.

Now this works perfectly if the user views (and reads) the parent object (trunk) first and then digs down to the part objects (branch) and their associated data (leaves). But in the new world of search it is far more likely that a user will start at a leaf, not only because leaves are more plentifully represented in search results, which flatten the tree structure, but also because a leaf is actually what they are looking for (a photo of an “Itinerant barber” for example).

So finally we have started to reveal these parent/child relationships on object pages. Now looking at the collection record for the aforementioned “Itinerant barber” we reveal that it belongs to a parent object.
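To make the inheritance problem concrete, here is a minimal Python sketch of how a part record might resolve its data through its parent when a user lands directly on the part. It is an illustration under assumptions, not how our collection management system works internally: `ObjectRecord`, `resolved` and `lineage` are hypothetical names, and the field values are placeholders rather than real catalogue data. Only the two object numbers come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ObjectRecord:
    """A collection record that may be a part (child) of another record."""
    number: str
    data: Dict[str, str] = field(default_factory=dict)
    parent: Optional["ObjectRecord"] = None

    def resolved(self) -> Dict[str, str]:
        """Own data layered over whatever is inherited from the parent chain."""
        inherited = self.parent.resolved() if self.parent is not None else {}
        return {**inherited, **self.data}

    def lineage(self) -> List[str]:
        """Object numbers from the top-level parent down to this record."""
        chain: List[str] = []
        node: Optional["ObjectRecord"] = self
        while node is not None:
            chain.append(node.number)
            node = node.parent
        return chain[::-1]


# Placeholder field values; only the object numbers appear in the post.
parent = ObjectRecord("92/1414", {"maker": "parent-level maker",
                                  "summary": "parent-level summary"})
part = ObjectRecord("92/1414-1", {"summary": "part-level summary"}, parent)

print(part.resolved())
print(part.lineage())
```

A search result that lands on the part can then show the parent-level data it silently depends on, along with a link back up the tree.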

You will also notice that for some objects (the barber photo is one) we also reveal that it belongs to a ‘collection’. These are even broader groupings of objects that cross into the forest of other object trees. The barber photograph belongs to the ‘Hedda Morrison Collection’ which also contains her camera, passport, papercuts, Chinese belt toggles and much more. Again by revealing this relationship we open up new pathways for users to traverse the collection.

Here’s the Jenny Kee Collection – a collection and archive of the work of a famous Australian designer.

All this work is a prelude to a much larger feature that will go live very soon – narratives. (Although I think we will call them ‘themes’ on the site). Narratives will operate in a similar manner but allow for much more free association between objects. Narratives will also contain their own text and images so that they operate a little like object groupings. We might have one on 20th Century Australian Design written by a curator, or one on Flying Machines written by one of our education staff. There will probably come a time when users will be able to submit their own object clusters.

One of the computing curators, Stephen Jones, came up with the notion of ‘heterarchical narratives’ (possibly after a quick lunchtime re-reading of Deleuze). This is a good way of describing the way in which they will act as fluid nodes for contextual collection information. Not only are they much more fluid in terms of navigation, they are also much looser in their internal structures of control over knowledge production. Anyone with access to the collection management system can create a new node and associate collection objects with their node – this opens up plenty of opportunity for cross-disciplinary narratives, and cross-organisational collaboration.

Collection databases Web 2.0

New York Times on their own use of collective search intelligence

Post author By Seb Chan
Post date January 4, 2008

Here’s a short piece from the NYT Tech Blog on how the New York Times is using realtime analysis of site search to improve results.

Regular readers will know that we’ve been doing this over at the Powerhouse on our OPAC for a long time. The principles are the same, and the use of actual users’ search relationships can greatly assist the navigation of other users.

Collection databases Digitisation Geotagging & mapping

Brantley on digital collections and the location-awareness OPAC

Post author By Seb Chan
Post date October 19, 2007

Peter Brantley over at O’Reilly has put together a short post on his vision of the future of collections – specifically those held by university libraries – which should have resonance with those in collecting museums.

Collection databases Digitisation Folksonomies Web 2.0

OPAC2.0 – latest tag statistics and trends for simple comparison with Steve project

Post author By Seb Chan
Post date October 15, 2007

Another paper from the Steve researchers has gone online and is generating interesting discussions. It elaborates on the content of an earlier summary podcast. To be presented at ICHIM07, the paper describes some of the emerging patterns in tagging behaviour in the different interface trials.

Collection databases Conferences and event reports Folksonomies Web 2.0 Web metrics

Web Directions South 2007 – presentation and some thoughts

Post author By Seb Chan
Post date October 4, 2007

Web Directions South 07 was lots of fun and there were some great presentations over the two days. Unfortunately conferences are always full of choices and I missed several presentations I’d been looking forward to catching. That said, overall the quality was high and there were only a handful of dull moments. Most of the presentations I saw were not on the tech-side (JS, Ajax, CSS etc) of things – Luke was there to go to those.

Here are some notes from my highlights.

Cameron Adams managed to pack out one of the smaller rooms and by the time his ‘Future of web based interfaces’ was in full flow there were about 50 people standing at the back. Adams’ presentation went through the possibilities of flexible interfaces that are both customisable by the user (much like Netvibes or iGoogle) and that automatically reformat as you use them (like the BBC News pages subtly do).

After my own presentation (see below) it was on to Scott Gledhill’s ‘Is SEO evil?’ to which the answer is, of course, no. SEO and a web standards approach should be complementary. Scott had some lovely images – the menacing gummi bears in particular – and a fascinating case study from News Digital Media around the Steve Irwin death. In this instance, News went out with a web headline that was far more immediate and keyword loaded (“steve irwin dead”) than their major competitor, Sydney Morning Herald/Fairfax, who were more obscure (“crocodile man reported dead”). They tracked the story traffic and referrers by the hour and more than doubled the Fairfax traffic – even after Fairfax adjusted their headline. Scott also told how journalists are now much more SEO content-savvy in their writing and that his team gives the journalists the necessary web reporting tools to be able to track their own stories. This, combined with the highly competitive environment, encourages journalists to further refine and re-edit their stories for performance even after initial publication.

The second day began with an edit of Scott Berkun’s famous Myths of Innovation presentation. Scott’s main message is that you can’t force ‘innovation’ and that it needs time and space to happen organically. In fact, some of the best triggers of innovation are failures and mistakes. He suggests that perhaps we should start including a ‘failures’ budget line in our organisational budgets – accept that they will happen and that we are all the better for it.

George Oates from Flickr spoke about how Flickr manages and facilitates user communities. She started out tracing Flickr back to its origins at Ludicorp as a sort-of MMORPG called Game Neverending. After GNE folded, the community that had grown around it was imported directly into Flickr and they brought their experiences from the game world into the construction and design of Flickr. I found her focus on users, and on the human-to-human communication and relationship management that Flickr does, a timely reminder that in the museum world we cannot expect communities to ‘just happen’ around our content, and that when the seeds of community appear they need careful nurturing. The necessary nurturing is impossible if you move immediately on to the next project.

Adrian Holovaty, the mind behind Chicago Crime and several other datamining and visualisation projects, gave a fascinating presentation about the hidden potential of structured data. Now over in the museum world we are experts at structured data but we rarely make the most of it. Throughout Holovaty’s talk he kept coming back to the ideas of serendipity and free browsing that I’ve been working on with our OPAC. His position was to make everything hyperlinked and let the users build their own paths through the data. To that end he built the Django Databrowse application, which takes a database and basically builds a simple website that allows users to link from anything to anything else. Following Chicago Crime, which took flat datasets from the Chicago Police Department and made them navigable in ways that the Chicago PD had never intended (view crimes by area, visualise hotspots, map your jogging route against reported crimes etc), Holovaty went on to do some great visualisation work at the Washington Post. Here he asked journalists to enter their notes into a simple database as well as turning their notes into stories. This allowed him to build the Faces of the Fallen, which tracks and maps every US soldier killed in Iraq. Faces not only reveals some uncomfortable patterns in the data (deaths by age of soldier, by state etc), it also has allowed linkages to family tributes and newspaper articles about the circumstances of their death. The project returns great value back to reporters and the paper who can now report ‘milestones’ and trends, but also to the community who can now make ‘more sense’ out of what would otherwise be simply seen as a list of names. It ‘humanises’ the data, giving it far greater impact. Holovaty is now working on a community news project Every Block which intends to harvest and aggregate content by ‘block’ from various news sources – automatically creating journalistic stories from raw data. (Reuters already does this with some financial reporting).

There is a growing selection of presentation slides over at Slideshare.

Here’s an edited version of my own presentation slides which use the Powerhouse Museum’s collection search and tagging implementation as a case study of a government implementation of Web 2.0 techniques. Those who have seen my presentations over the recent months will recognise some re-use and re-purposing. For various reasons I have had to remove about 20-30 slides but most of it is there. There is a podcast coming apparently.

Collection databases Interactive Media Young people & museums

C is for collection – an ABC book with collection objects

Post author By Seb Chan
Post date September 10, 2007

Two weeks ago we made a simple ABC book for young children available on our children’s website. It is called ‘C is for collection’ and is a very basic extension of our online collection built in Flash with an XML file supplying the necessary collection data allowing for easy expansion.

A longer term objective of ‘C is for collection’ is to build a database of child-friendly object descriptions and explore the options for a children’s tagging game with the same XML.

Have a play and remember to turn up your speakers. More children’s games are coming shortly.

Collection databases Web 2.0

OPAC2.0 – Latest features update

Post author By Seb Chan
Post date August 12, 2007

We’ve added a whole range of new features to our OPAC that we think further enhance its usability.

Tooltips –

Each ‘feature’ on the search results and object view pages now has an explanatory tooltip. Given the OPAC has become quite complex and there is a lot going on on the screen now, we felt CSS tooltips offered a more practical solution than a ‘help’ screen or more text in the form of user documentation. More tooltips will be added this week to explain museum-centric language like ‘statement of significance’.

Failed search suggestions –

Now when a search term is misspelled or returns no results our system generates a series of possible ‘alternatives’. This is generated on the fly using a calculation called Levenshtein distance. This cycles through each letter of the misspelt word and then queries our table of successful searches for possible matches. These are then ranked and the top 8 variants are presented to the user. In order to make this reasonably quick we have had to rebuild quite a bit of our search technology.
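For the curious, the idea can be sketched in a few lines of Python. This is a simplified illustration rather than our actual implementation: it compares the failed term against every previously successful search instead of cycling through letter-by-letter variants, and the function names, the `max_distance` cut-off, the tie-break on hit counts and the sample search terms are all assumptions made up for the example.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(term: str, successful_searches: Dict[str, int],
            limit: int = 8, max_distance: int = 2) -> List[str]:
    """Rank past successful searches by closeness to a failed term.

    successful_searches maps a search term to how often it succeeded;
    that count is only used here as a tie-break (an assumption).
    """
    term = term.lower()
    scored = []
    for candidate, hits in successful_searches.items():
        distance = levenshtein(term, candidate.lower())
        if 0 < distance <= max_distance:
            scored.append((distance, -hits, candidate))
    scored.sort()
    return [candidate for _, _, candidate in scored[:limit]]


# Hypothetical table of successful searches and their counts.
past = {"telescope": 120, "telephone": 300, "teleprinter": 15, "camera": 50}
print(suggest("telescpe", past))
```

Comparing against every stored search is fine for a small table but slow for a large one, which is one reason a real system would need the kind of search rebuilding mentioned above.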

Opensearch RSS with thumbnails –

About two months ago our Opensearch feed was updated to include thumbnails in search results. We added the thumbnails to ensure that our feed delivered optimal results to the National Library of Australia’s Libraries Australia search. We also use this modified RSS to drive search results of Design Hub.
