Readings on combining and exposing library data sets
I feel like I’m seeing calls across a variety of subject domains for sharing data and making it easily available and reusable. National funding models in the U.S. are beginning to require data sharing, so this idea of providing your data for others to use is catching on.
I also finally read Aaron Swartz’s posthumously published “A Programmable Web: An Unfinished Work,” which is an important read for a multitude of reasons. He makes his own call for exposing data in ways that make it easy for people to grab the data they want, or get all of the data, and make use of it however they want (Chs. 5-7). He builds this approach around JSON and web-based technology. I like that, but I think there’s probably still a place for XML in exchanging data in a standardized way or communicating data at an institutional level (feeding our data into DPLA, for example).
With a goal of combining our library data for discovery, access, and reuse, I’ve been trying to uncover a literature review of sorts on combining data sets within a library context. I’ve come upon ideas about how to evaluate and compare data sets for commonalities and how to think about providing data in ways that are actually useful and understandable to researchers outside of the library context. Following is the current state of an annotated bibliography, plus some delicious slices of pie because, well, pie:
Abed, Alea. (2014). Podcast: Project Blacklight, Hydra and libraries in the digital age. Lucidworks. http://www.lucidworks.com/blog/podcast-project-blacklight-hydra-and-libraries-in-the-digital-age/
Bess Sadler from Stanford University discusses Project Hydra and its new developments. They are trying to improve discovery and access for digital libraries by adding a technology stack on top of the inventory system that digital repositories have been up to now. They are also improving this inventory system by providing self-deposit interfaces. Two new areas of work highlighted were GeoBlacklight for GIS data and displaying archival collections effectively in Blacklight.
Breeding, Marshall. (2005). Plotting a new course for metasearch. Computers in Libraries, 25:2, pp. 27-29.
Breeding makes the case for a giant central search of content instead of federated searching (searching against multiple targets). This provides a single access point instead of multiple search interfaces and lessens the burden of searching multiple targets and needing multiple indexes. Making this switch can be difficult since different providers don’t always make metadata openly available for combining.
Emde, Judith Z., Sara E. Morris, and Monica Claassen-Wilson. (2009). Testing an academic library website for usability with faculty and graduate students. Evidence Based Library and Information Practice, 4:4, pp. 24-36.
This article describes findings from a usability study of a library website. Findings include that graduate students tend to get results that are too broad from federated searching. They have to use quotation marks to be precise and results can be too mixed, making it hard to tell what is what. Federated searching is most helpful to graduate students to point out resources or databases they have not previously used. Another finding was that graduate students want subject-specific searching or limited combined subject searching, not cross-subject searching. Subject-specific resource help is most useful when given within a context, such as a course.
Hofmann, Melissa A. and Sharon Q. Yang. (2011). How next-gen r u? A review of academic OPACs in the United States and Canada. Computers in Libraries 31:6, pp. 26-29.
This initial study, which was followed up in 2011, found that of 260 academic libraries surveyed, very few were using federated searching to combine data sources and most were still offering only catalog searching. Where a discovery layer tool was in use, it tended to provide faceted navigation.
Hofmann, Melissa A. (2012). “Discovering” what’s changed: a revisit of the OPACs of 260 academic libraries. Library Hi Tech 30:2, pp. 253-274.
In this 2011 follow-up to a 2009 study that found that discovery layers were not in wide use among academic online library catalogs, more institutions are using discovery layers but there are weaknesses in what these tools can do in terms of unified one-stop searching, recommended items, and relevancy display based on circulation statistics. Interest is shown in the eXtensible Catalog (XC) Metadata Toolkit because it “aggregates metadata from various silos, normalizes (cleans-up) metadata of varying levels of quality, and transform[s]… metadata into a consistent format for use in the discovery layer.” [p. 261]
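To make the quoted aggregate, normalize, transform idea a bit more concrete for myself, here is a minimal Python sketch. It is not the XC Metadata Toolkit's code or API; the two silo record shapes, the field names, and the helper functions are all hypothetical, chosen only to show records of uneven quality being cleaned up and mapped to one consistent format.

```python
# Hypothetical records from two silos with different field names and quality.
catalog_records = [
    {"245a": "  Moby Dick ", "100a": "Melville, Herman", "260c": "1851."},
]
repository_records = [
    {"dc:title": "Moby Dick (scan)", "dc:creator": "Melville, Herman", "dc:date": "1851"},
]


def clean(value):
    """Trim whitespace and trailing punctuation left over from cataloging."""
    return value.strip().rstrip(".,;:/ ")


def normalize_catalog(record):
    """Map the assumed MARC-like catalog fields to the common format."""
    return {
        "title": clean(record.get("245a", "")),
        "creator": clean(record.get("100a", "")),
        "date": clean(record.get("260c", "")),
        "source": "catalog",
    }


def normalize_repository(record):
    """Map the assumed Dublin Core-like repository fields to the common format."""
    return {
        "title": clean(record.get("dc:title", "")),
        "creator": clean(record.get("dc:creator", "")),
        "date": clean(record.get("dc:date", "")),
        "source": "repository",
    }


def aggregate(catalog, repository):
    """Combine both silos into one list of consistently shaped records."""
    normalized = [normalize_catalog(r) for r in catalog]
    normalized += [normalize_repository(r) for r in repository]
    return normalized


unified = aggregate(catalog_records, repository_records)
for record in unified:
    print(record)
```

Keeping a `source` key on every record is one simple way to preserve where each record came from after everything has been folded into one format.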
Johnson, Thomas. (2013). Indexing linked bibliographic data with JSON-LD, BibJSON and Elasticsearch. The Code4Lib Journal, 19. http://journal.code4lib.org/articles/7949
This article describes using JSON to map RDF into JSON-LD (linked data). The main point of interest for me is that indexes were not actually combined but kept separate. This helped to include context along with the index and allowed for different mappings based on discrepancies between data sources. There were no performance issues querying across multiple indexes using JSON.
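The separate-indexes idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the article's implementation: it uses plain in-memory dictionaries instead of Elasticsearch, and the index names, field mappings, sample documents, and the `search` function are all hypothetical.

```python
# Each source keeps its own index and its own field mapping (all hypothetical).
INDEXES = {
    "catalog": {
        "mapping": {"title": "title_statement", "creator": "main_entry"},
        "docs": [
            {"title_statement": "Linked data for libraries", "main_entry": "Doe, Jane"},
        ],
    },
    "repository": {
        "mapping": {"title": "dc_title", "creator": "dc_creator"},
        "docs": [
            {"dc_title": "Library linked data report", "dc_creator": "Roe, Richard"},
        ],
    },
}


def search(field, term, indexes=INDEXES):
    """Query every index separately, translating the field via its own mapping."""
    hits = []
    for name, index in indexes.items():
        local_field = index["mapping"].get(field)
        if local_field is None:
            continue  # this source has no equivalent field
        for doc in index["docs"]:
            if term.lower() in doc.get(local_field, "").lower():
                # Keep the index name with each hit so its context is not lost.
                hits.append({"index": name, "doc": doc})
    return hits


results = search("title", "linked data")
for hit in results:
    print(hit["index"], hit["doc"])
```

Because the documents are never merged, each source can keep its own field names, and every hit still says which index it came from.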
Kipp, Margaret E. I. (2005). Complementary or discrete contexts in online indexing: A comparison of user, creator, and intermediary keywords. Canadian Journal of Information & Library Sciences 29:4, pp. 419-436.
This article describes a study comparing descriptors assigned by different actors in the metadata creation process. 165 articles from CiteULike (a bookmarking web service similar to del.icio.us) were compared based on user-provided tags, author-provided keywords, and intermediary-provided descriptors using the Voorbij scale along with structured thesauri from INSPEC and Library Literature to identify broader, narrower, and related terms. The study found that user tags are quite different from author- and intermediary-provided descriptors and can supplement a controlled vocabulary entryway to content. Additionally, providing both abbreviations and long-form terms helped to expand content use to interdisciplinary research.
Limani, Fidan and Vladimir Radevski. (2013). Enrichment of digital libraries with Web 2.0: Resources for enhanced user search experience. 8th Annual South-East European Doctoral Student Conference: Infusing Research and Knowledge in South-East Europe. South-East European Research Center: Thessaloniki, Greece, 2013. pp. 294-300.
This article proposes connecting “traditional” scientific research resources (indexed, categorized, and searchable) with scientific Web 2.0 data (socially maintained scholarly library services like blogs and wikis) by tagging those Web 2.0 data sources with authoritative links. This introduces Semantic Web connections to tie together these data sources and expose digital library collections more effectively, reducing the “search span and effort” on the part of the user. [p. 299]
Stephens, Owen. (2011). Mashups and open data in libraries. Serials: The Journal for the Serials Community 24:3, pp. 245-250.
Stephens argues that making data open involves more than just licensing – it should refer to “the ease with which data can be used, taking into consideration aspects such as format and access mechanisms.” [p. 246] The most common ways library data is shared are via XML, JSON, and, increasingly, RDF, but these “formats offered are usually familiar only to those who specialize in library data.” [p. 247] Offering APIs to access data makes it easier to understand and use the data, allowing mashups to occur and making new ways of using the data possible.
Thomas, Marliese, Dana M. Caudle, and Cecilia M. Schmitz. (2009). To tag or not to tag? Library Hi Tech, 27:3, pp. 411-434.
This article describes a study comparing user-contributed tags to controlled vocabulary subject headings (LCSH). The comparison identifies broader, narrower, and related terms, and uses the tags to find new terms that can be brought in to enhance the controlled vocabulary used in a system (a “collabulary”). Kipp’s modification of the Voorbij scale was used to compare tags against hierarchical relationships from a thesaurus. Tagging is generally for personal use (such as finding something later), so there needs to be an incentive to create tags.
Tillett, Barbara B. (2000). Authority control on the web. In: Bicentennial Conference on Bibliographic Control for the New Millennium: Confronting the Challenges of Networked Resources and the Web (Washington DC, November 15-17, 2000).
This report discusses the concept of a “mandatory minimal set of data elements… in all authority records to facilitate international exchange or use.” [p. 5] It shows growing support for authority control to manage different sources of common metadata and the idea of common core data points for aligning and relating records from different sources.
Voorbij, Henk J. (1998). Title keywords and subject descriptors: A comparison of subject search entries of books in the humanities and social sciences. Journal of Documentation, 54:4, pp. 466-476.
This article describes results from two studies – one where librarians compared subject descriptors and words in titles for 475 catalog records and rated them on a scale of 1 (subject is the same as the title) to 7 (subject is not at all in the title) and a second where librarians searched on subject and title words for the same topic. Findings suggest that subject descriptors enhanced recall for searches and 37% of the first study’s records were enhanced by subject descriptors. [The scale used for comparison has been used in other studies (Thomas, et al., 2009; Kipp, 2005) with variations in what is being compared but focusing on comparing different types of metadata.]