Notions

Ideas, supposings, in the workings. And the occasional thread.

Category Archives: Ideas

It’s over but I’m not done

Posted on February 15, 2019 by Julie

What I seem to have come up with from these readings this week is a set of classification schemes and controlled vocabularies that have been used over the years to augment or replace DDC, LCC, and LCSH to make collections in libraries and archives more inclusive and combat bias – the racism, sexism, and exclusionary “norm” view built into these “mainstream” systems.

I’m trying to figure out what to do next. I will be working with a student this summer, and before that I will be discussing bias in metadata in a brown bag presentation. I want to understand the landscape for the brown bag, and I think I am getting there. I also want to have something concrete for the student to work on. The list of classification schemes and controlled vocabularies is incomplete, and there are a couple of meta-lists I have found (LOC has one and Bartoc is another), so those need to be reviewed. I don’t think these meta-lists cover everything in terms of controlled vocabularies representing communities, especially from those communities’ own points of view, but having a more complete list as an output could prove really useful more generally in the library community.

Beyond that is this concept I have of trying out something to see if bias can be combatted through the front-end search interface in addition to the back end where the metadata is created. There seem to be some possibilities with this using Homosaurus. It is a Linked Data source, and the example of its use in IHLIA’s search interface is intriguing to me.

I’m also interested in Olson’s work to connect a controlled vocabulary like that in A Women’s Thesaurus to a “mainstream” source like DDC. If Homosaurus can be connected in some way to LCSH terms (and that is a big if that still needs to be investigated), is there utility in offering that as an entryway resource for searching, to help users connect to items whose records use only LCSH terms? Olson and Ward created a standalone search application for seeing connections between those two sources, but I haven’t seen anything about whether it was ever implemented for research use in an online system. If there isn’t an equivalent or close term in the mainstream source, then there isn’t much point in connecting a controlled vocabulary term, since it would end up lumped into a mainstream category that is too broad or not connected to anything. But if there are connections, are they helpful to provide as a different view into a collection?
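To make that “big if” slightly more concrete, here is a minimal sketch of what checking a Homosaurus term for LCSH connections might look like, assuming the vocabulary resolves to RDF and uses SKOS matching properties to point at id.loc.gov – the term URI and both of those assumptions are mine to verify, not something I have confirmed:

# Sketch only: assumes Homosaurus terms resolve to RDF and use SKOS matching
# properties to point at LCSH (id.loc.gov) – both assumptions to verify.
from rdflib import Graph
from rdflib.namespace import SKOS

term_uri = "https://homosaurus.org/v3/homoit0000001"  # hypothetical term URI

g = Graph()
g.parse(term_uri)  # rdflib tries to negotiate an RDF serialization

for predicate in (SKOS.exactMatch, SKOS.closeMatch):
    for match in g.objects(None, predicate):
        if "id.loc.gov" in str(match):  # candidate LCSH entry point
            print(predicate.n3(g.namespace_manager), match)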

I’m ending this research leave with a lot of questions and I knew I would, but I think I do have a better handle on how to talk about the problem of non-inclusive or exclusionary online research tools and collections. Additionally and more importantly for the topic of bias in metadata, I have a better sense of what has occurred already in efforts to combat that marginalization and make the research process more inclusive through constructing new or modifying current classification schemes and controlled vocabularies.

Posted in Ideas, Metadata, Research Leave 2019

Reading, listing, and still learning

Posted on February 12, 2019 by Julie

Today involved digging into details about different classification schemes and controlled vocabularies and I realized I have enough to start a list! I’m interested to see how this list grows and what meta-characteristics the entries have in common. So far I’m tracking whether the classification scheme or controlled vocabulary is available online, whether it is in Linked Data format, and where I am finding it (online resource, in a book, in print some other way, etc.).
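For what it’s worth, the record I’m keeping for each entry is roughly this shape – the field names below are just my working choices, sketched in Python:

# Rough shape of one entry in the tracking list; field names are working choices.
from dataclasses import dataclass

@dataclass
class VocabEntry:
    name: str                  # e.g. "Homosaurus"
    kind: str                  # "classification scheme" or "controlled vocabulary"
    available_online: bool
    linked_data: bool          # published in a Linked Data format?
    found_in: str              # online resource, book, other print source, etc.

entries = [
    VocabEntry("Homosaurus", "controlled vocabulary",
               available_online=True, linked_data=True, found_in="online resource"),
]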

My readings today were about American Indian classification and subject heading issues in Dewey Decimal Classification, Library of Congress Classification, and Library of Congress Subject Headings, as well as more information about Dorothy B. Porter and her work to organize, increase, and provide access to the African and African American collections that became the Moorland-Spingarn Research Center at Howard University. Practices for classifying American Indian resources have placed much of this content in the historic past, under sections of the catalog about the history of North America (in both DDC and LCC), as if American Indians don’t even exist anymore. And Porter recalled a time when many libraries grouped anything by an African American author under a DDC heading for colonization (and migration). There are clunky ways to somewhat work within these classification systems, but only to a point and only for some material. The limited ability of DDC to expand and the slow pace of change at LC just seem to allow these problems to languish. So new classification schemes and controlled vocabularies have been developed, and I’m learning how they have been used and how they can be applied to aid in the research process. This is where my thoughts turn to Linked Data possibilities, but they aren’t well-formed thoughts yet.

And just to make sure I have some warning lights going off in my head regarding Linked Data, I also read about issues of bias in Knowledge Graphs related to the Semantic Web:

  • data bias (Linked Data from sources being mostly about Europe, Japan, Australia, and the US)
  • schema bias (depending on the ontology you can get very different results for a concept like the article’s example, theater)
  • inferential bias (taking data from a source like DBPedia and running inference results in high-confidence assumptions from the graph that say things like: “if X is a US president, X is male”).

That graph could use some more learning. 
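To make that last kind of bias concrete, here’s a toy sketch (an invented mini-dataset, not real DBPedia output) of how naive inference arrives at “if X is a US president, X is male” with complete confidence:

# Toy illustration of inferential bias: the data contains only male presidents,
# so a naive rule miner reports the pattern with 100% confidence.
triples = [
    ("Lincoln", "office", "US president"), ("Lincoln", "gender", "male"),
    ("Obama", "office", "US president"),   ("Obama", "gender", "male"),
    ("Reagan", "office", "US president"),  ("Reagan", "gender", "male"),
]

presidents = {s for s, p, o in triples if p == "office" and o == "US president"}
male = {s for s, p, o in triples if p == "gender" and o == "male"}

confidence = len(presidents & male) / len(presidents)
print(f"if X is a US president, X is male  (confidence: {confidence:.0%})")
# -> 100%, purely because the source data holds no counterexamples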

This brings up something that is coming across in other readings. Bias on its own isn’t necessarily a problem. Everyone has implicit biases. The problem comes when that implicit bias becomes systemic and is projected as the appropriate or authorized way to organize and interpret classifications and subject matter – bias without recognition or documentation, without transparency. Or, in the case of this knowledge graph example, when results are shown without context.

Posted in Ideas, Metadata, Readings, Research Leave 2019

And now for something kinda sorta different

Posted on August 3, 2015 by Julie

Hey blog, welcome to life – it happens.

Let's pretend life has been mostly along these lines. Also, I'm a sucker for a rainbow lollipop!

biiiiig lick by broterham

So you know those posts I wrote before where I was working on combined data sets from different sources that all have Solr indexes? That topic is – in trying to keep with the cooking/baking/food theme – on the back burner; I’m letting the dough rise; it’s in the fridge marinating; I’m gonna get back to that.

It will be a tasty baked good one day!

The Rising by scrambldmeggs

Other matters have come to the forefront of my metadata working life, such that I am now embroiled in concepts initially hard to grasp and tasks seemingly insurmountable. It can essentially be summed up in a few bullet points, though, I think:

  • managing metadata standards as they are expressed in various formats
  • transitioning data in those metadata standards from one format to another for various uses
  • explaining all of this so that it is understandable and no one rolls their eyes so far back in their head as to resemble a zombie

I think that last point is the toughest, by far. This stuff becomes so esoteric and full of jargon so quickly. I also think that tends to happen when people don’t quite understand what they are talking about themselves. Just keep reading and you’ll probably see for yourself what I mean.

If not for the caption, you'd be thinking chocolate chip cookie, wouldn't you? It looks like one thing but is actually another - metadata problems in a cookie.

Raisin cookie by hanafan

I read a light-hearted blog called Go Fug Yourself where publicity photos of celebrity fashion are examined for fabulousness and fugliness. The writing is entertaining and no one is personally skewered but it’s gotten to the point where they can’t even talk about people wearing see-through clothing/sheer fabrics anymore. They now talk about it using pizza as an analogy because it is so ubiquitous and terrible at the same time (the see-through clothing, not pizza). It can’t really be discussed any further because it’s been discussed ad nauseam and the writer’s eyes, if not the reader’s, glaze over (but are not see-through).

Mmmmm, glaze...

Krispy Kreme by ableman

I think the same may apply when talking about metadata sometimes. It affects everything and can still be the most difficult part of a project to understand and manage. Maybe it’s better to talk in pure analogies that will keep the reader’s attention than to try to directly discuss, say, the implications of MODS as expressed in RDF (because while complex hierarchical XML is expressible as complex sets of triples, it appears to be highly unusable in practical programmatic terms). For folks who read through that sentence without “blah-blah-blahing” in their heads, we have metadata nerdery in common. Yay us. But trying to express how this is relevant to developers, and to the systems they make, and to the end users who need to use those systems just can’t happen when these issues mean different things to different metadata users and it’s all so massively jumbled together with acronyms and metadata codespeak.
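As a toy illustration of the awkwardness (one possible naive conversion, not any official MODS-to-RDF mapping – the namespace and property names here are made up), this is what a single MODS titleInfo element turns into once the intermediate hierarchy becomes blank nodes:

# One nested MODS title element becomes several triples and a blank node under
# a naive "keep the hierarchy" conversion. EX and its properties are invented
# for illustration; this is not an official MODS RDF ontology.
from rdflib import Graph, BNode, Literal, Namespace, URIRef

EX = Namespace("http://example.org/mods/")
g = Graph()

record = URIRef("http://example.org/record/1")
title_info = BNode()  # <titleInfo> ends up as an anonymous intermediate node

g.add((record, EX.titleInfo, title_info))
g.add((title_info, EX.nonSort, Literal("The")))
g.add((title_info, EX.title, Literal("Example Title")))
g.add((title_info, EX.subTitle, Literal("an illustrative subtitle")))

print(len(g), "triples just for one title field")
# Getting the display title back out means traversing the blank node first,
# which is the kind of thing that makes this feel unusable in practice.
for s, p, o in g:
    print(s.n3(), p.n3(), o.n3())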

So I am endeavoring to address the bullet points above, in article format and with more detail, using cases I know and other cases I know about where these problems are being tackled in various ways. Whether this results in recommendations for ways to handle metadata transformations for different standards and uses or is just an attempt to plainly explain the problems being encountered, it is going to be helpful to me and hopefully helpful to others as well. I would like a donut now, please.

Posted in Ideas, Metadata

So what is it again that I’m trying to do?

Posted on June 21, 2014 by Julie

I have this concept of opening access to our Fedora data, then combining that with our library catalog data and our library web site data (people and services). But what is the goal for this, who would make use of it, and why would they care? The data is related by subject areas and, in that respect, would certainly be more helpful to researchers if it were more tightly coupled than it is now. But how would it work and what could people do with it?

I’ve tried imagining scenarios for using this data as I write narratives for conference and grant proposals, but so far my conference proposals and funding attempts have fallen kind of flat, which makes me wonder if this is a thing at all or if I’m making up something that nobody wants. I’m also not sure if I have a research question or merely a software development problem. I think there might be a research question in the questions above? Some frustration; first inclination is to become a monk and leave the situation…

Yeah, that's about like me trying to dance.

@neratema via Flickr

Young MC aside, here are some possibilities. First, I have examples of making use of the metadata once it’s opened up from Fedora. For instance, a researcher could analyze geographic locations across time for all digitized photographs at Indiana University. A student could do subject analysis across all sheet music collections we host. A historian could map correspondence from different time periods in American history from our collections. Expanding from those to examples that use a data set combining digital collections with the library catalog and the library web site seems like the next step. [taking a break, eating a sandwich, thinking about this one]…

The best thing to accompany a lot of thinking is, of course, a good sandwich.

@calamity_hane via Flickr

How about this: finding non-digital resources, real-life people, and actual services at our institution related to any of the above examples would make a combined data set useful. That combined data set would offer research-related resources to help understand the data, and people and services to support the analysis, interpretation, and dissemination of findings based on that data. Analyzing geographic locations across time for all digitized collections might connect to books on geocoding, articles on time-based mapping projects, and subject-specific books and articles related to the photographers or time periods from the digitized photos in Fedora. Librarians and training resources for GIS on the libraries’ web site would be useful for the historian, along with university-wide GIS training, online resources, and contact info.

So that’s one example, and we have all of that data that can be combined to provide this info, I’m sure of it. Is there interest in doing this sort of research with our digitized collections?  Historypin has shown interest in taking at least a portion of our digital photograph collection and doing just this sort of mapping. The technology for making data work this way is there, it’s a matter of providing the data so researchers can easily access it (combining and exposing our data) and providing the tools, training, and support needed for these same researchers to explore the data. I don’t think any researcher out there working in a subject area that isn’t technology-focused already has a research question directly related to geo-locating all digital photographs at Indiana University, but that exploration can certainly lead to research questions. It’s the same reason we realized digitized versions of these items needed to be created and made available in the first place – so they can be explored by anyone, anywhere, at any time. This is expanding on the way these collections can be explored. This expansion doesn’t answer specific questions, but it opens doors to explore our history and what we know from different perspectives.
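As a sketch of what the “analyze geographic locations across time” example might look like once the data is combined and Solr-indexed – where the endpoint URL and the field names (collection_type, decade, place_name) are hypothetical placeholders, not anything that exists today – a pivot-faceted query gets most of the way there:

# Sketch of a pivot-faceted Solr query over a hypothetical combined index.
# The URL and field names are placeholders, not a real index we have today.
import requests

params = {
    "q": "*:*",
    "fq": "collection_type:digitized_photograph",
    "facet": "true",
    "facet.pivot": "decade,place_name",  # counts of places within each decade
    "rows": 0,
    "wt": "json",
}
resp = requests.get("https://example.edu/solr/combined/select", params=params)
pivots = resp.json()["facet_counts"]["facet_pivot"]["decade,place_name"]

for decade in pivots:
    top_places = [(p["value"], p["count"]) for p in decade.get("pivot", [])[:3]]
    print(decade["value"], top_places)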

So my research question is:  How do you usefully combine digital repository, library catalog, and library web site data so researchers can take it all in and do cool things?

This is a direction but maybe more of a light at the beginning of the tunnel than at the end. Off to do my own discovery…

In a cave? No, seriously, just thinking about discovery and exploration.

@storm-crypt via Flickr

Posted in Ideas, Metadata

Technicalities of Exposing Digital Repository Data

Posted on March 13, 2014 by Julie

So, the goal: expose our digital objects in our digital repository so they can be grabbed as whole sets of data. This makes our data open in an Open Access way, specifically “permitting any user to read, download, copy, distribute, print, search or link to the [digital objects], crawl [their metadata] for indexing, pass them as data to software or use them for any lawful purpose.” Our goal in digitizing the things we digitize is to make them more available for research and use. So this includes being able to not only go to a web site and conduct a search that we provide to find an item or set of items, but also to directly take the information about those items and make use of it beyond the user interface points that we provide. Opening our digital repository objects also allows all of our digital objects to be combined with digital objects opened up at other places. Digital Public Library of America offers a way to combine multiple sets like ours to create larger sets and collections. We need to move beyond only providing access the way we know (web sites with search and browse). We need to set our digital objects out there with enough hooks to make it possible for anyone swooshing by on the Intertubes to grab that data and use it the way they want to use it.

So, the question: where to start? No, seriously, that was my first question. Our digital repository has seen previous attempts at OAI-harvestable feeds. These exposed specific sets (collections) in a specific way (DC records, and sometimes MODS records, via XML). It turns out that all of these attempts involved not going directly from the digital repository but extracting data from a certain point in time into an OAI service and then exposing that data. These feeds are now old and stale (and somehow I’m picturing dried-up biscuits and now I am sad – few things are better than a fresh biscuit, with butter and maybe some jam).

Buttered biscuits with jam. Yum.

@shutterbean via Flickr
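For reference, harvesting from one of those feeds boils down to requests like the sketch below. The ListRecords verb, metadataPrefix, and resumptionToken handling come from the OAI-PMH spec itself; the base URL and set name are placeholders I made up:

# Minimal OAI-PMH harvest loop; base URL and set name are placeholders.
import requests
import xml.etree.ElementTree as ET

BASE = "https://example.edu/oai"  # placeholder OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "example_set"}
while True:
    root = ET.fromstring(requests.get(BASE, params=params).content)
    for record in root.iter(f"{OAI}record"):
        header = record.find(f"{OAI}header")
        print(header.findtext(f"{OAI}identifier"))

    # OAI-PMH pages results; keep going until there is no resumptionToken left.
    token = root.find(f".//{OAI}resumptionToken")
    if token is None or not (token.text or "").strip():
        break
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}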

The next step was digging into the digital repository and figuring out what was possible there to provide data feeds. Turns out, not much. We use Fedora, and while it does a terrific job of taking things in, it doesn’t do so well at offering things up for discovery and access. I feel like my discoveries (which I’m sure in Fedora-land are only discoveries to me) might warrant a separate blog post, but I also know the Fedora Commons Mailing Lists exist and seem to be the place to go to ask questions and discuss Fedora-specific angst, so in trying to keep things organized for myself here on the blog, I will hold off for now.

In a serendipitous sort of fashion, our digital repository has undergone Solr-ization in the form of a Blacklight search interface – Digital Collections Search (DCS). This had more to do with our desire to provide better discovery options than with opening our data. Instead of needing to go to a specific site to search or browse a specific collection (Cushman photographs) or a specific format (Archives Online – and good luck figuring out which finding aids have digital objects attached to them), you can use DCS to take in a better picture of all of our digital objects, or the repository as a whole. You still can’t do anything beyond accessing those web sites we provide for you, so the access part is still restrictive, but the discovery part is improved. From what I understand, anything in our digital repository containing a MODS record is automatically indexed into Solr and offered up through DCS.

So we can’t go directly from our digital repository to expose data, the OAI feeds that were previously implemented are not easily maintained or refreshed (obviously, because stale biscuits), and we appear to have a Solr index that is regularly populated from our digital repository. I also appear to have shifted my goal from exposing our digital repository data to exposing our Solr-indexed data. This seems like the more trendy way to approach the problem (the path involving the most mustaches and skinny jeans, if you will) but how does it survive over the long term?

We are now investigating OAI feeds from Solr (imagine my lack of surprise at the existence of a GitHub project for just this sort of thing) and what it would mean to adjust the response we provide from Solr so that, instead of only going through Blacklight, we can provide a response in a variety of ways (JSON, XML, CSV). Maybe this is just what we need: any future changes to available indexing can be managed more gracefully, and the real work becomes making sure our digital repository data is ready for whatever comes along. That is my hopeful ending to an otherwise weird tale of exposing data that should really already be out there.
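The multiple-formats part, at least, is mostly built into Solr already via the wt parameter; a quick sketch, where the endpoint and field names are placeholders for whatever our eventual setup looks like:

# Solr can hand back the same result set as JSON, XML, or CSV via the wt param.
# The endpoint URL and field list are placeholders.
import requests

SOLR = "https://example.edu/solr/dcs/select"  # placeholder endpoint

for fmt in ("json", "xml", "csv"):
    resp = requests.get(SOLR, params={
        "q": "*:*",
        "fl": "id,title_display,subject_topic_facet",  # hypothetical field names
        "rows": 5,
        "wt": fmt,
    })
    print(fmt, resp.headers.get("Content-Type"))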

Posted in Ideas, Metadata
