So, the goal: expose the digital objects in our digital repository so they can be grabbed as whole sets of data. This makes our data open in the Open Access sense, specifically “permitting any user to read, download, copy, distribute, print, search or link to the [digital objects], crawl [their metadata] for indexing, pass them as data to software or use them for any lawful purpose.” Our goal in digitizing the things we digitize is to make them more available for research and use. That means being able not only to go to a web site and run a search we provide to find an item or set of items, but also to take the information about those items directly and use it beyond the user interface points that we provide. Opening our digital repository objects also allows them to be combined with digital objects opened up at other places. The Digital Public Library of America offers a way to combine multiple sets like ours to create larger sets and collections. We need to move beyond only providing access the way we know (web sites with search and browse). We need to set our digital objects out there with enough hooks to make it possible for anyone swooshing by on the Intertubes to grab that data and use it the way they want to use it.
So, the question: where to start? No, seriously, that was my first question. Our digital repository had previous attempts at OAI-harvestable feeds. These exposed specific sets (collections) in a specific way (DC records and sometimes MODS records via XML). It turns out that none of these attempts went directly from the digital repository. Each one extracted data as of a certain point in time into an OAI service and then exposed that snapshot. These feeds are now old and stale (and somehow I’m picturing dried up biscuits and now I am sad – few things are better than a fresh biscuit, with butter and maybe some jam).
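For anyone who hasn’t poked at an OAI feed before, here is a minimal sketch of what a harvester asks for when it requests one set in one metadata format. The base URL and the set name `cushman` are made up for illustration; they are not our actual endpoint or set identifiers. The `verb`, `metadataPrefix` and `set` parameters come from the OAI-PMH protocol, and whether a given feed offers a `mods` prefix depends on how that feed was configured.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the base URL of a real OAI-PMH service.
OAI_BASE_URL = "https://repository.example.edu/oai"


def list_records_url(set_spec, metadata_prefix="oai_dc", base_url=OAI_BASE_URL):
    """Build an OAI-PMH ListRecords request URL for one set in one format."""
    params = {
        "verb": "ListRecords",              # ask for full records, not just identifiers
        "metadataPrefix": metadata_prefix,  # oai_dc is the Dublin Core format every OAI-PMH feed must offer
        "set": set_spec,                    # limit the harvest to a single set (collection)
    }
    return f"{base_url}?{urlencode(params)}"


# "cushman" is an assumed set name, used only as an example.
print(list_records_url("cushman"))
print(list_records_url("cushman", metadata_prefix="mods"))
```

The catch described above still applies: a request like this only returns whatever was loaded into the OAI service, so the records are as fresh as the last extract and no fresher.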
The next step was digging into the digital repository and figuring out what was possible there to provide data feeds. Turns out, not much. We use Fedora, and while it does a terrific job of taking things in, it doesn’t do so well at offering things up for discovery and access. I feel like my discoveries (which I’m sure in Fedora-land are only discoveries to me) might warrant a separate blog post. But I also know the Fedora Commons Mailing Lists exist and seem to be the place to go to ask questions and discuss Fedora-specific angst, so in trying to keep things organized for myself here on the blog, I will hold off for now.
In a serendipitous sort of fashion, our digital repository has undergone Solr-ization in the form of a Blacklight search interface – Digital Collections Search (DCS). This had more to do with our desire to offer better discovery options than with opening our data. Instead of needing to go to a specific site to search or browse a specific collection (Cushman photographs) or a specific format (Archives Online – and good luck figuring out which finding aids have digital objects attached to them), you can use DCS to get a better picture of all of our digital objects, or of the repository as a whole. You still can’t do anything beyond accessing those web sites we provide for you, so the access part is still restrictive, but the discovery part is improved. From what I understand, anything in our digital repository that has a MODS record is automatically indexed into Solr and offered up through DCS.
So we can’t go directly from our digital repository to expose data, the OAI feeds that were previously implemented are not easily maintained or refreshed (obviously, because stale biscuits), and we appear to have a Solr index that is regularly populated from our digital repository. I also appear to have shifted my goal from exposing our digital repository data to exposing our Solr-indexed data. This seems like the trendier way to approach the problem (the path involving the most mustaches and skinny jeans, if you will), but how will it hold up over the long term?
We are now investigating OAI feeds from Solr (imagine my lack of surprise at the existence of a GitHub project for just this sort of thing). We are also looking at what it would mean to adjust the response we provide from Solr so that, instead of only going through Blacklight, we can provide a response in a variety of formats (JSON, XML, CSV). Maybe this is just what we need. Maybe any future changes to available indexing can be managed more gracefully, and the real work lies in making sure our digital repository data is ready for whatever comes along. That is my hopeful ending to an otherwise weird tale of exposing data that should really already be out there.
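To make the “variety of formats” idea a bit more concrete, here is a rough sketch of asking Solr for the same results as JSON, XML or CSV by changing its `wt` (response writer) parameter. The host, the core name `dcs` and the search term are assumptions for illustration only, not our real configuration, and this is one way it could work rather than what we have actually built.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; replace with a real select handler URL.
SOLR_SELECT_URL = "http://localhost:8983/solr/dcs/select"

# Response writers named in the post; Solr selects one via the "wt" parameter.
SUPPORTED_FORMATS = {"json", "xml", "csv"}


def solr_query_url(query, response_format="json", rows=10, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL that returns results in the requested format."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response format: {response_format}")
    params = {
        "q": query,             # the search itself
        "wt": response_format,  # which response writer Solr should use
        "rows": rows,           # how many results to return
    }
    return f"{base_url}?{urlencode(params)}"


# Same (assumed) search term, three different response formats.
for fmt in ("json", "xml", "csv"):
    print(solr_query_url("cushman", fmt))
```

The appeal of this approach is that the index stays the same and only the wrapping changes, so a new format would mean a new parameter value rather than a new extract.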