And now for something kinda sorta different

Posted on August 3, 2015 by Julie

Hey blog, welcome to life – it happens.

Let's pretend life has been mostly along these lines. Also, I'm a sucker for a rainbow lollipop!

So you know those posts I wrote before where I was working on combined data sets from different sources that all have Solr indexes? That topic is – in trying to keep with the cooking/baking/food theme – on the back burner; I’m letting the dough rise; it’s in the fridge marinating; I’m gonna get back to that.

The Rising by scrambldmeggs

Other matters have come to the forefront of my metadata working life, such that I am now embroiled in concepts initially hard to grasp and tasks seemingly insurmountable. It can essentially be summed up in a few bullet points, though, I think:

managing metadata standards as they are expressed in various formats
transitioning data in those metadata standards from one format to another for various uses
explaining all of this so that it is understandable and no one rolls their eyes so far back in their head as to resemble a zombie

I think that last point is the toughest, by far. This stuff becomes so esoteric and full of jargon so quickly. I also think that tends to happen when people don’t quite understand what they are talking about themselves. Just keep reading and you’ll probably see for yourself what I mean.

If not for the caption, you'd be thinking chocolate chip cookie, wouldn't you? It looks like one thing but is actually another - metadata problems in a cookie.

Raisin cookie by hanafan

I read a light-hearted blog called Go Fug Yourself where publicity photos of celebrity fashion are examined for fabulousness and fugliness. The writing is entertaining and no one is personally skewered but it’s gotten to the point where they can’t even talk about people wearing see-through clothing/sheer fabrics anymore. They now talk about it using pizza as an analogy because it is so ubiquitous and terrible at the same time (the see-through clothing, not pizza). It can’t really be discussed any further because it’s been discussed ad nauseam and the writer’s eyes, if not the reader’s, glaze over (but are not see-through).

Krispy Kreme by ableman

I think the same may apply when talking about metadata sometimes. It affects everything and can still be the most difficult part of a project to understand and manage. Maybe it’s better to talk in pure analogies that will keep the reader’s attention than to try and directly discuss, say, the implications of MODS as expressed in RDF (because while complex hierarchical XML is expressible as complex sets of triples, it appears to be highly unusable in practical programmatic terms). For folks who read through that sentence without “blah-blah-blahing” in their heads, we have metadata nerdery in common. Yay us. But trying to express how this is relevant to developers, and to the systems they make, and to the end users who need to use those systems just can’t happen when these issues mean different things to different metadata users and it’s all so massively jumbled together with acronyms and metadata codespeak.

So I am endeavoring to address the bullet points above, in article format and with more detail, using cases I know and other cases I know about where these problems are being tackled in various ways. Whether this results in recommendations for ways to handle metadata transformations for different standards and uses or is just an attempt to plainly explain the problems being encountered, it is going to be helpful to me and hopefully helpful to others as well. I would like a donut now, please.

All the slices of pie

Posted on October 28, 2014 by Julie

Readings on combining and exposing library data sets

I feel like I’m seeing calls across a variety of subject domains for sharing data and making it easily available and reusable. National funding models in the U.S. are beginning to require sharing of data so this idea of providing your data for others to use is kind of catching on.

I also finally read Aaron Swartz’s posthumously published “A Programmable Web: An Unfinished Work,” which is an important read for a multitude of reasons. He makes his own call for exposing data in ways that make it easy for people to grab data they want or get all of the data and make use of it however they want (Chs. 5-7). His ideas implement this around JSON and web-based technology. I like that but I think there’s probably also still a place for XML in exchanging data in a standardized way or communicating data at an institutional level (feeding our data into DPLA, for example).

With a goal of combining our library data for discovery, access, and reuse, I’ve been trying to uncover a literature review of sorts on combining data sets within a library context. I’ve come upon ideas about how to evaluate and compare data sets for commonalities and how to think about providing data in ways that are actually useful and understandable to researchers outside of the library context. Following is the current state of an annotated bibliography, plus some delicious slices of pie because, well, pie:

Slice of Cherry Blueberry Pie by digidi via flickr

Abed, Alea. (2014). Podcast: Project Blacklight, Hydra and libraries in the digital age. Lucidworks. http://www.lucidworks.com/blog/podcast-project-blacklight-hydra-and-libraries-in-the-digital-age/

Bess Sadler from Stanford University discusses Project Hydra and what is happening in new developments. They are trying to improve discovery and access for digital libraries by adding a technology stack on top of the inventory system that digital repositories have been up to now. They are also improving this inventory system by providing self-deposit interfaces. Two new areas of work highlighted were GeoBlacklight for GIS data and displaying archival collections effectively in Blacklight.

Maple-Bourbon Pumpkin Pie by djwtwo via flickr

Breeding, Marshall. (2005). Plotting a new course for metasearch. Computers in Libraries, 25:2, pp. 27-29.

Breeding makes the case for a giant central search of content instead of federated searching (searching against multiple targets). This provides a single access point instead of multiple search interfaces and lessens the burden of searching multiple targets and needing multiple indexes. Making this switch can be difficult since different providers don’t always make metadata openly available for combining.

Emde, Judith Z., Sara E. Morris, and Monica Claassen-Wilson. (2009). Testing an academic library website for usability with faculty and graduate students. Evidence Based Library and Information Practice, 4:4, pp. 24-36.

This article describes findings from a usability study of a library website. Findings include that graduate students tend to get results that are too broad from federated searching. They have to use quotation marks to be precise and results can be too mixed, making it hard to tell what is what. Federated searching is most helpful to graduate students to point out resources or databases they have not previously used. Another finding was that graduate students want subject-specific searching or limited combined subject searching, not cross-subject searching. Subject-specific resource help is most useful when given within a context, such as a course.

Hofmann, Melissa A. and Sharon Q. Yang. (2011). How next-gen r u? A review of academic OPACs in the United States and Canada. Computers in Libraries 31:6, pp. 26-29.

This initial study, which was followed up in 2011, found that of 260 academic libraries surveyed, very few were using federated searching to combine data sources and most were still only offering catalog searching. If there was a discovery layer tool in use, it tended to provide faceted navigation.

Hofmann, Melissa A. (2012). “Discovering” what’s changed: a revisit of the OPACs of 260 academic libraries. Library Hi Tech 30:2, pp. 253-274.

In this 2011 follow-up to a 2009 study that found that discovery layers were not in wide use among academic online library catalogs, more institutions are using discovery layers but there are weaknesses in what these tools can do in terms of unified one-stop searching, recommended items, and relevancy display based on circulation statistics. Interest is shown in the eXtensible Catalog (XC) Metadata Toolkit because it “aggregates metadata from various silos, normalizes (cleans-up) metadata of varying levels of quality, and transform[s]… metadata into a consistent format for use in the discovery layer.” [p. 261]

Dutch Apple Pie a la mode by mattmendoza via flickr

Johnson, Thomas. (2013). Indexing linked bibliographic data with JSON-LD, BibJSON and Elasticsearch. The Code4Lib Journal, 19. http://journal.code4lib.org/articles/7949

This article describes using JSON to map RDF into JSON-LD (linked data). The main point of interest for me is that indexes were not actually combined but kept separate. This helped to include context along with the index and allowed for different mappings based on discrepancies between data sources. There were no performance issues querying across multiple indexes using JSON.

All We are saying is Give Pie a Chance by bitzcelt via flickr

Kipp, Margaret E. I. (2005). Complementary or discrete contexts in online indexing: A comparison of user, creator, and intermediary keywords. Canadian Journal of Information & Library Sciences 29:4, pp. 419-436.

This article describes a study comparing descriptors assigned by different actors in the metadata creation process. 165 articles from CiteULike (a bookmarking web service similar to del.icio.us) were compared based on user-provided tags, author-provided keywords, and intermediary-provided descriptors using the Voorbij scale along with structured thesauri from INSPEC and Library Literature to identify broader, narrower, and related terms. The study found that user tags are quite different from author- and intermediary-provided descriptors and can supplement a controlled vocabulary entryway to content. Additionally, providing both abbreviations and long-form terms helped to expand content use to interdisciplinary research.

Limani, Fidan and Vladimir Radevski. (2013). Enrichment of digital libraries with Web 2.0: Resources for enhanced user search experience. 8th Annual South-East European Doctoral Student Conference: Infusing Research and Knowledge in South-East Europe. South-East European Research Center: Thessaloniki, Greece, 2013. pp. 294-300.

This article proposes connecting “traditional” scientific research resources (indexed, categorized, and searchable) with scientific Web 2.0 data (socially maintained scholarly library services like blogs and wikis) by tagging those Web 2.0 data sources with authoritative links. This introduces Semantic Web connections to tie together these data sources and expose digital library collections more effectively, reducing the “search span and effort” on the part of the user. [p. 299]

key lime pie by roboppy via flickr

Stephens, Owen. (2011). Mashups and open data in libraries. Serials: The Journal for the Serials Community 24:3, pp. 245-250.

Stephens argues that making data open involves more than just licensing – it should refer to “the ease with which data can be used, taking into consideration aspects such as format and access mechanisms.” [p. 246] The most common ways library data is shared are via XML, JSON, and, increasingly, RDF but these “formats offered are usually familiar only to those who specialize in library data.” [p. 247] Offering APIs to access data makes it easier to understand and use the data, allowing mashups to occur and making new ways to use data possible.

Thomas, Marliese, Dana M. Caudle, and Cecilia M. Schmitz. (2009). To tag or not to tag? Library Hi Tech, 27:3, pp. 411-434.

This article describes a study comparing user-contributed tags to controlled vocabulary subject headings (LCSH). The comparison identifies broader, narrower, and related terms in order to find new terms via the tags that can be brought in to enhance the controlled vocabulary used in a system (a “collabulary”). Kipp’s modification of the Voorbij scale was used to look at tags compared to hierarchical relationships from a thesaurus. Tagging is generally for personal use (such as finding something later) so there needs to be an incentive to create tags.

Apple Pie by belochkavita via flickr

Tillett, Barbara B. (2000). Authority control on the web. In: Bicentennial Conference on Bibliographic Control for the New Millennium: Confronting the Challenges of Networked Resources and the Web (Washington DC, November 15-17, 2000).

This report discusses the concept of a “mandatory minimal set of data elements… in all authority records to facilitate international exchange or use.” [p. 5] It shows growing support for authority control to manage different sources of common metadata and the idea of common core data points for aligning and relating records from different sources.

Voorbij, Henk J. (1998). Title keywords and subject descriptors: A comparison of subject search entries of books in the humanities and social sciences. Journal of Documentation, 54:4, pp. 466-476.

This article describes results from two studies – one where librarians compared subject descriptors and words in titles for 475 catalog records and rated them on a scale of 1 (subject is the same as the title) to 7 (subject is not at all in the title) and a second where librarians searched on subject and title words for the same topic. Findings suggest that subject descriptors enhanced recall for searches and 37% of the first study’s records were enhanced by subject descriptors. [The scale used for comparison has been used in other studies (Thomas, et al., 2009; Kipp, 2005) with variations in what is being compared but focusing on comparing different types of metadata.]

So what is it again that I’m trying to do?

Posted on June 21, 2014 by Julie

I have this concept of opening access to our Fedora data, then combining that with our library catalog data and our library web site data (people and services). But what is the goal for this, who would make use of it, and why would they care? The data is related by subject areas and, in that respect, certainly helpful to researchers to be more tightly coupled than it is now. But how would it work and what could people do with it?

I’ve tried imagining scenarios for using this data as I write narratives for conference and grant proposals, but so far my conference proposals and funding tries have fallen kind of flat. That makes me wonder if this is a thing at all or if I’m making up something that nobody wants. I’m also not sure if I have a research question or merely a software development problem. I think there might be a research question in the questions above? Some frustration, first inclination is to become a monk and leave the situation…

Yeah, that's about like me trying to dance.

@neratema via Flickr

Young MC aside, here are some possibilities: First I have examples of making use of the metadata once it’s opened up from Fedora. For instance, a researcher could analyze geographic locations across time for all digitized photographs at Indiana University. A student could do subject analysis across all sheet music collections we host. A historian could map correspondence from different time periods in American history from our collections. Expanding on those examples to examples of using a data set combining digital collections with the library catalog and the library web site seems like the next step. [taking a break, eating a sandwich, thinking about this one]…

The best thing to accompany a lot of thinking is, of course, a good sandwich.

@calamity_hane via Flickr

How about this: finding non-digital resources, real life people, and actual services at our institution related to any of the above examples would make a combined data set useful. That combined data set would offer research-related resources to help understand the data and people and services to support the analysis, interpretation, and dissemination of findings based on that data. Analyzing geographic locations across time for all digitized collections might connect to geocoding books, articles on time-based mapping projects, subject-specific books and articles related to the photographers or time periods from the digitized photos in Fedora. Librarians and training resources for GIS on the libraries web site would be useful for the historian along with university-wide GIS training, online resources, and contact info.

So that’s one example, and we have all of that data that can be combined to provide this info, I’m sure of it. Is there interest in doing this sort of research with our digitized collections? Historypin has shown interest in taking at least a portion of our digital photograph collection and doing just this sort of mapping. The technology for making data work this way is there; it’s a matter of providing the data so researchers can easily access it (combining and exposing our data) and providing the tools, training, and support needed for these same researchers to explore the data. I don’t think any researcher out there working in a subject area that isn’t technology-focused already has a research question directly related to geo-locating all digital photographs at Indiana University, but that exploration can certainly lead to research questions. It’s the same reason we realized digitized versions of these items needed to be created and made available in the first place – so they can be explored by anyone, anywhere, at any time. This is expanding on the way these collections can be explored. This expansion doesn’t answer specific questions, but it opens doors to explore our history and what we know from different perspectives.
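To make the "geographic locations across time" idea a little more concrete, here is a minimal Python sketch of the kind of first pass a researcher might run once the metadata is out in the open. Everything in it is an assumption for illustration: the records are invented sample data, and the field names (`id`, `place`, `year`) are hypothetical rather than anything our repository actually exposes.

```python
from collections import Counter

# Invented sample records, for illustration only. Real records would come
# from whatever feed or export the repository ends up offering.
SAMPLE_RECORDS = [
    {"id": "demo:1", "place": "Bloomington, Indiana", "year": 1941},
    {"id": "demo:2", "place": "Bloomington, Indiana", "year": 1948},
    {"id": "demo:3", "place": "Chicago, Illinois", "year": 1952},
    {"id": "demo:4", "place": "Bloomington, Indiana", "year": 1955},
]


def decade(year):
    """Round a year down to the start of its decade (1948 becomes 1940)."""
    return year // 10 * 10


def count_by_place_and_decade(records):
    """Count how many records fall into each (place, decade) pair."""
    return Counter((r["place"], decade(r["year"])) for r in records)


if __name__ == "__main__":
    counts = count_by_place_and_decade(SAMPLE_RECORDS)
    for (place, dec), total in sorted(counts.items()):
        print(f"{place}, {dec}s: {total}")
```

A table of counts like this is not a research finding, but it is the sort of quick look that can turn into a research question.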

So my research question is: How do you usefully combine digital repository, library catalog, and library web site data so researchers can take it all in and do cool things?

This is a direction but maybe more of a light at the beginning of the tunnel than at the end. Off to do my own discovery…

In a cave? No, seriously, just thinking about discovery and exploration.

@storm-crypt via Flickr

Banging My Head Against the Sun

Posted on March 26, 2014 by Julie

It sounds painful and it kind of was, but progress has been made in the quest to open our digital repository data and that progress involves Solr. To wit, there are a few things I have learned recently:

I can actually figure out a lot more with Solr than I thought
Trying to implement a Solr plugin when you are somewhat challenged to make Solr run and index records in the first place is a clear indication of one’s willfulness
The following movies are really good for taking breaks: Fierce Creatures, The American, the new Muppets movie, and Monuments Men (breaks are important and George Clooney seems to help too!)

The latest chapter of my story begins in Bloomington and ends in Amsterdam, with a few trips around the sun in between. Let’s begin by talking about Solr because we all know what happened when I contemplated opening up the data directly from Fedora.

If you want to grab a version of Solr and give it a whirl, nothing is easier than hitting the Solr download page, unzipping/un-tarring the latest version, and running the magic “java -jar start.jar” command in the magic “core” directory. On a Mac or Linux box, of course (who knows what happens on Windows). You can even index some sample records with a different java command and the admin interface gives you search results and it just works. Magic! Everything’s great. Then, you want to do something more. Like try this Solr install with a giant index of actual data and then see if you can add a plugin for OAI feeds. What could go wrong?

Problem: Giant index of data is for a different Solr version than the one you installed
Problem: OAI plugin requires a different Solr version (3.4.0) than the one you installed (4.7.0) and the one where you got the giant index of data (1.4.1)
Problem: Even if you could get that giant index of data working on a Solr version that would also work for the OAI plugin, you can’t figure out where to put the OAI plugin’s binary (.jar file) because the magic install you used works with Jetty, a self-contained Java server, that mixes everything up between the Java code and the Solr code

We next embark on installing Solr to run on Tomcat, because that’s how I give up on things.

“Brand New Tomcats [CLR]” by Grigor Hristov

I had a giant index of data from Solr 1.4.1 and the oai4solr plugin needed Solr 3.4.0 at least, so 3.4.0 is what I chose and I backed off from using the giant index of data. Instead, I found the pieces we use to index certain types of MODS records from our Fedora digital repository into Solr 1.4.1 and, after copying the important pieces from the Solr 1.4.1 config (I hope), I’ve managed to recreate what could be our Solr index from Fedora using 3.4.0 – with 4 records of data.

Essentially, for those who like pictures, I was dealing with one version of Solr that had an index, like a piece of bread with a pat of butter on it.

Pictures of food always help a blog post, I say.

“Bread & Butter” by Mark H. Anbinder

Then I had another version of Solr that also had an index, like a broom and a dustpan.

“sikth y su amado recogedor” by Iris Shyroii

What I was initially trying to do was butter my broom, or apply an index from one version of Solr to a different version of Solr. This is not a thing, apparently.

OK, a broom made out of butter is apparently a thing, but buttering your broom is not, trust me.

“Butter broom and Monster Book of Monsters” by Sarah Cady

Once Solr was installed on Tomcat, all .jar files had a single place to go and I knew where to put the oai4solr.jar file. But OAI feeds don’t just happen, they are a specialized metadata format surrounding records in more commonly recognizable metadata formats (Dublin Core, for example) and can be called up using a specific URL with specific parameters. So stuff has to be programmed (.jar file) but stuff also has to be configured, and I wasn’t configuring anything well in the code that accompanied the .jar file.
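For anyone wondering what "a specific URL with specific parameters" looks like, here is a small Python sketch that builds OAI-PMH request URLs. The verbs (`ListSets`, `ListRecords`) and the `metadataPrefix` and `set` arguments come from the OAI-PMH protocol itself. The base URL is a made-up placeholder, and `mods` as a prefix name and `example-set` as a set name are assumptions, since each repository decides which prefixes and sets it offers.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: where the feed actually answers depends entirely
# on how the plugin is configured in your own Solr/Tomcat install.
BASE_URL = "http://localhost:8080/solr/oai"


def oai_url(verb, **params):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = {"verb": verb}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


if __name__ == "__main__":
    # Ask which sets (collections) the feed offers.
    print(oai_url("ListSets"))
    # Ask for records as Dublin Core, the format every OAI-PMH feed must support.
    print(oai_url("ListRecords", metadataPrefix="oai_dc"))
    # Ask for one set as MODS; the prefix and set names here are assumptions.
    print(oai_url("ListRecords", metadataPrefix="mods", set="example-set"))
```

The response to each of these is XML wrapping the records, which is the "specialized metadata format surrounding records" part.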

Spending your Sunday afternoon with Java error messages on your localhost server sucks. In the end, clearer minds prevailed (Cliffster) and I was convinced that it was reasonable to email the guy who wrote the oai4solr plugin and put it on Github in the first place. In Amsterdam. On a Sunday night.

He answered. From Amsterdam. On a Sunday night. Lucien van Wouw from the International Institute of Social History, the author of oai4solr, helped me fix up my plugin configuration and I had a working oai4solr plugin on Monday morning. I now have OAI feeds from my Solr index returning sets and record lists in DC and MODS. (I’m not exactly sure how to say his last name, but I’m going with Wow! – including exclamation point – because that was awesome.)

“WoW!” by Laurie Chipps

So now I have a different Solr version with a super-small number of records, possibly configured the same as the original Solr index, definitely with a couple new things that need to be included when indexing Fedora records into Solr, and absolutely with other collections in Fedora that still need to receive this mapping treatment so they can have descriptive information and be included in our Solr index. But most importantly, there is a way forward to open our data with OAI feeds. I banged my head against the Sun and kinda made something happen!

Technicalities of Exposing Digital Repository Data

Posted on March 13, 2014 by Julie

So, the goal: expose our digital objects in our digital repository so they can be grabbed as whole sets of data. This makes our data open in an Open Access way, specifically “permitting any user to read, download, copy, distribute, print, search or link to the [digital objects], crawl [their metadata] for indexing, pass them as data to software or use them for any lawful purpose.” Our goal in digitizing the things we digitize is to make them more available for research and use. So this includes being able to not only go to a web site and conduct a search that we provide to find an item or set of items, but also to directly take the information about those items and make use of it beyond the user interface points that we provide. Opening our digital repository objects also allows all of our digital objects to be combined with digital objects opened up at other places. Digital Public Library of America offers a way to combine multiple sets like ours to create larger sets and collections. We need to move beyond only providing access the way we know (web sites with search and browse). We need to set our digital objects out there with enough hooks to make it possible for anyone swooshing by on the Intertubes to grab that data and use it the way they want to use it.

So, the question: where to start? No seriously, that was my first question. Our digital repository had previous attempts at OAI harvestable feeds. These exposed specific sets (collections) in a specific way (DC records and sometimes MODS records via XML). It turns out that all of these attempts involved not going directly from the digital repository but extracting data from a certain point in time into an OAI service and then exposing that data. These feeds are now old and stale (and somehow I’m picturing dried up biscuits and now I am sad – few things are better than a fresh biscuit, with butter and maybe some jam).

@shutterbean via Flickr

The next step was digging into the digital repository and figuring out what was possible there to provide data feeds. Turns out, not much. We use Fedora and while it does a terrific job of taking things in, it doesn’t do so well at offering things up for discovery and access. I feel like my discoveries (which I’m sure in Fedora-land are only discoveries to me) might warrant a separate blog post, but I also know the Fedora Commons Mailing Lists exist and seem to be the place to go to ask questions and discuss Fedora-specific angst, so in trying to keep things organized for myself here on the blog, I will hold off for now.

In a serendipitous sort of fashion, our digital repository has undergone Solr-ization in the form of a Blacklight search interface – Digital Collections Search (DCS). This had more to do with the desire to provide better discovery options from us than to open our data. Instead of needing to go to a specific site to search or browse a specific collection (Cushman photographs) or a specific format (Archives Online – and good luck figuring out which finding aids have digital objects attached to them), you can use DCS to take in a better picture of all of our digital objects, or the repository as a whole. You still can’t do anything beyond accessing those web sites we provide for you, so the access part is still restrictive, but the discovery part is improved. From what I understand, anything in our digital repository containing a MODS record is automatically indexed into Solr and offered up through DCS.
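I don't know the details of how our indexing actually maps MODS into Solr, so what follows is only a rough Python sketch of the general idea: pull a few values out of a MODS record and flatten them into a Solr-style document. The MODS namespace and element names are real, but the sample record is invented and the Solr field names (`title_t`, `subject_facet`, `date_s`) are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, for illustration only.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Street scene</title></titleInfo>
  <subject><topic>Streets</topic></subject>
  <subject><topic>Automobiles</topic></subject>
  <originInfo><dateCreated>1941</dateCreated></originInfo>
</mods>"""


def mods_to_solr_doc(record_id, mods_xml):
    """Flatten a few MODS elements into a dict shaped like a Solr document.

    The field names are assumptions; a real index would use whatever
    fields its own Solr schema defines.
    """
    root = ET.fromstring(mods_xml)
    title = root.find("mods:titleInfo/mods:title", MODS_NS)
    date = root.find("mods:originInfo/mods:dateCreated", MODS_NS)
    topics = root.findall("mods:subject/mods:topic", MODS_NS)
    return {
        "id": record_id,
        "title_t": title.text.strip() if title is not None and title.text else None,
        "subject_facet": [t.text.strip() for t in topics if t.text],
        "date_s": date.text.strip() if date is not None and date.text else None,
    }


if __name__ == "__main__":
    print(mods_to_solr_doc("demo:1", SAMPLE_MODS))
```

The hierarchy in MODS gets squashed into flat fields on the way in, which is part of why the index is good for discovery but is not the same thing as the record itself.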

So we can’t go directly from our digital repository to expose data, the OAI feeds that were previously implemented are not easily maintained or refreshed (obviously, because stale biscuits), and we appear to have a Solr index that is regularly populated from our digital repository. I also appear to have shifted my goal from exposing our digital repository data to exposing our Solr-indexed data. This seems like the more trendy way to approach the problem (the path involving the most mustaches and skinny jeans, if you will) but how does it survive over the long term?

We are now investigating OAI feeds from Solr (imagine my lack of surprise at the existence of a Github project for just this sort of thing) and what it would mean to adjust the response we provide from Solr so that instead of only going through Blacklight we can provide a response in a variety of ways (JSON, XML, CSV). Maybe this is just what we need and any future changes to available indexing can be managed more gracefully and the real work is involved in making sure our digital repository data is ready for whatever comes along. That is my hopeful ending to an otherwise weird tale of exposing data that should really already be out there.
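As a sketch of what "a response in a variety of ways" could mean in practice, Solr's `wt` (writer type) request parameter picks the response format, and `json`, `xml`, and `csv` are all standard values for it. The Python below only builds the request URLs. The host, port, and path are placeholders for a local install, and the field name in the example query is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for a local Solr; adjust to your own install.
SOLR_SELECT = "http://localhost:8080/solr/select"

SUPPORTED_FORMATS = ("json", "xml", "csv")


def solr_query_url(query, response_format, rows=10):
    """Build a Solr select URL that asks for a particular response format."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    params = {"q": query, "wt": response_format, "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    # Same query, three different shapes of response.
    for fmt in SUPPORTED_FORMATS:
        print(solr_query_url("title_t:street", fmt))
```

The appeal is that the index stays the same and only the packaging changes, so a developer can take JSON while a spreadsheet person takes CSV.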

Metadata and the Meaning of Life

Posted on September 23, 2012 by Julie

Not that I’m trying to figure out big stuff here or anything. But metadata is big, it’s how we make connections. It’s how we give things meaning. Otherwise, everything is a series of level, equivalent parts. Those parts can be found individually and segments can be spotted in the wild. But putting whole things in context, connecting things together, defining things similarly – this is how we figure out what it means to be human, what it means to be compassionate, what it means to care. Granted, this is also how we figure out how to hate and stereotype and kill. But I do think things made from love will, in the end, make things better and make us better. So I will always try to approach metadata with love and positive focus. Call it my Metadata Manifesto.

I want to create ways of connecting ideas, creating new meaning. It’s about providing new doors that can be opened, new opportunities that come from excitement instead of fear. New challenges that are welcome instead of worrisome.

Our ability to understand comes from somewhere. Our collective knowledge means something. Making those connections probably won’t reveal the answer to the meaning of life or anything. (Besides, don’t we already know that’s 42?) However, the more we know and understand, the more meaningful life becomes. We learn to empathize and know what it’s like to be in situations that we ourselves have not experienced. We expand beyond our physical boundaries and make connections – we are metadata.

Notions

Ideas, supposings, in the workings. And the occasional thread.
