Down at the recent PyCon we were excited to hear that Malcolm Tredinnick had taken the Powerhouse’s downloadable collection dataset and was using it to demonstrate some of the issues that come up when working with (semi-)open datasets.
His presentation reveals what every museum knows: the datasets that exist in our collection databases are inherently messy. We’re always working to improve their quality and structure, but if they weren’t publicly available for non-museum people to work on in new ways, we’d never discover many of their flaws.
Here’s his presentation, which is well worth watching if you are a developer or museum technologist thinking of making your raw data available.
There are some modifications and improvements coming to our downloadable data very soon, because data release projects can’t just be a ‘set and forget’ arrangement.
Malcolm’s code for cleaning up our data is up on GitHub.