Last week I had the extremely good fortune to attend the Joint Conference on Digital Libraries (JCDL) in Newark, New Jersey.
It’s been quite a while since I attended a conference. I think the last one I made it to was SIGIR 2013 when it was hosted in TCD (proof of my presence can be found in this Harlem Shake video… Yeah, we’re pretty cool people). Honestly, I’ve felt very disengaged from the research community at large for the last couple of years now. This stems primarily from major problems I’ve had with concentrating on my PhD, so it felt really good to be involved in a conference with a number of academics whose research goals were similar to my own.
The focus this year was on improving the accessibility of digital libraries. While I generally take the terms “digital library” and “cultural heritage collection” to refer to ancient historical collections, JCDL also included web archives in their definition. Although papers on web archives can be interesting in their own right, they aren’t necessarily useful for my research. However, there were two major points of interest which I took home and am hoping to expand upon for my own work.
The first came from a tutorial I attended on the use of the Digital Public Library of America’s API. DPLA houses a collection of several million digitised artefacts spanning a vast range of cultures and time periods. The organisation have quite generously made their API publicly accessible and encourage developers to implement apps and other software using the information that they provide. Among the data that one can pull down from DPLA are free-text descriptions of the curated artefacts and subject tags which give some indication of the nature of each item.
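To make the shape of that data concrete, here is a minimal sketch of pulling description/subject pairs out of a response from the API’s items endpoint. The field names (`docs`, `sourceResource`, `description`, `subject`) follow my reading of the v2 API docs, and the sample record is invented, so treat both as assumptions to verify against a live call.

```python
import json

# A trimmed, invented example of the JSON shape returned by DPLA's v2 items
# endpoint (https://api.dp.la/v2/items?q=...&api_key=...).
sample_response = json.loads("""
{
  "count": 1,
  "docs": [
    {
      "sourceResource": {
        "description": ["A hand-coloured map of Dublin, circa 1610."],
        "subject": [{"name": "Maps"}, {"name": "Dublin (Ireland)"}]
      }
    }
  ]
}
""")

def extract_training_pairs(response):
    """Pull (description text, subject tags) pairs out of an items response."""
    pairs = []
    for doc in response.get("docs", []):
        source = doc.get("sourceResource", {})
        descriptions = source.get("description", [])
        subjects = [s["name"] for s in source.get("subject", [])]
        # Only records with both a description and at least one subject
        # are useful as supervised training examples.
        if descriptions and subjects:
            pairs.append((" ".join(descriptions), subjects))
    return pairs

pairs = extract_training_pairs(sample_response)
print(pairs)
```

Each pair is exactly the input/output example a supervised learner needs.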
I have done some work in topic analysis as part of the cultural trauma work I did with SPECTRESS, but our problem was always that our topic models wouldn’t stabilise. We focused on unsupervised training methods, which meant that even if we did get a stable model, the results wouldn’t exactly be accurate (the literature cites LDA as being about 64% accurate under optimal conditions). We would have to manually prune the data. However, using the information available on DPLA, I see no reason why we couldn’t train a classifier in a supervised manner to tag a collection based on the subjects that DPLA provides.
Using the free-text descriptions as inputs and the subject tags as outputs, I believe we could teach the computer to broadly recognise the nature of the collections it is curating. I think this would give us a much more accurate and reliable tagging system than the unsupervised alternative. The only problem is that DPLA’s subject tags are very specific, far too specific for our purposes, so we may need to find some way of merging related topics to an appropriate level of abstraction before we begin training.
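As a toy illustration of the idea (not a serious classifier), here is a sketch that learns word/tag co-occurrences from a handful of invented description–subject pairs and then tags new text by word overlap. In practice we would use proper features (TF-IDF, say) and a real multi-label learner, but the supervised shape of the problem is the same.

```python
from collections import Counter, defaultdict

# Invented training data standing in for DPLA (description, subject tags) pairs.
training = [
    ("a hand coloured map of dublin city", ["Maps"]),
    ("engraved map showing the river liffey", ["Maps"]),
    ("oil portrait of a merchant and his wife", ["Portraits"]),
    ("portrait photograph of a dublin merchant", ["Portraits"]),
]

def train(pairs):
    """Count how often each word co-occurs with each subject tag."""
    model = defaultdict(Counter)
    for text, tags in pairs:
        for tag in tags:
            model[tag].update(text.split())
    return model

def predict(model, text, min_overlap=3):
    """Tag a description with every subject sharing >= min_overlap words."""
    words = set(text.split())
    return sorted(
        tag for tag, counts in model.items()
        if len(words & set(counts)) >= min_overlap
    )

model = train(training)
print(predict(model, "a coloured map of the city"))  # prints ['Maps']
```

The overlap threshold is an arbitrary stand-in for a learned decision boundary; the point is only that descriptions in, tags out, with no manual pruning.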
Perhaps a further task here would be to find a way to link the extracted subjects to some ontological equivalent, providing us with a semantically annotated corpus of digital cultural artefacts.
The second point of interest came from a talk by Annika Hinze on using semantically linked data to expand a document’s representation in the collection. Query expansion is a very common practice in information retrieval wherein we inject additional terms into a user’s query which we think might be related to their research goal. This allows us to find more documents which we hope will answer the user’s question. Rather than performing query expansion, Annika expanded the contents of the documents themselves by injecting synonymous terms derived from an ontology into the collection’s index. To some extent, she folded semantic information into a flat inverted index.
Intuitively I can see such an approach improving the recall of an IR system, but with potentially severe effects on precision. I would very much like to test this and will be experimenting with this method on the Digital Collections archive over the coming weeks. This actually seems to be behaviour which Solr natively supports, so I think it should be easy enough for me to implement, but experience has taught me that anything that looks easy in computer science simply isn’t.
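For reference, the native Solr behaviour I have in mind is index-time synonym expansion via `SynonymGraphFilterFactory`. A schema fragment might look something like the following; the field type name and the contents of `synonyms.txt` are placeholders of my own, not anything from the talk.

```xml
<!-- Schema fragment: a field type that injects synonyms at index time. -->
<fieldType name="text_expanded" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand="true" writes every synonym of a token into the index;
         synonyms.txt holds lines like: hollow, depression, basin -->
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" expand="true"/>
    <!-- Required after a graph filter when used at index time. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the query-side analyzer free of the synonym filter means the expansion happens once, in the documents, rather than in every query, which matches the document-expansion approach described above.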
One question I have, though, is exactly how the terms should be injected. A term’s weight or relevance within a document is partially dictated by the frequency of its use within the text. If we add a new word, how “relevant” should it be? Does it get the same weight as the synonymous term that led to its injection? I don’t believe so. Not all words are equal even when synonymous. Consider, for example, the word “depression”, which may refer either to a mental state or a dip in the landscape. First, we have to distinguish between these two completely different meanings when performing the insertion. Assuming we were talking about a feature of the landscape, is the word “hollow” perhaps more synonymous than the word “hill”? Should “hollow” therefore receive greater weight than “hill” in the index after its injection? Does it even matter? I think this is worth investigating.
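One way to frame the weighting question is to treat an injected synonym as a fractional occurrence, discounted by how synonymous it is to the original term. Here is a minimal sketch over a toy inverted index; the similarity scores are invented stand-ins for whatever an ontology or embedding model would actually give us.

```python
from collections import defaultdict

# Hypothetical synonym lists with invented similarity scores; in practice
# these might come from an ontology or from embedding distances.
SYNONYMS = {
    "depression": [("hollow", 0.8), ("basin", 0.5)],
}

def index_document(doc_id, text, index):
    """Add a document's terms, plus down-weighted synonyms, to the index."""
    for term in text.split():
        # A real occurrence contributes a full count.
        index[term][doc_id] = index[term].get(doc_id, 0.0) + 1.0
        for synonym, similarity in SYNONYMS.get(term, []):
            # An injected synonym contributes only a fraction of a count.
            index[synonym][doc_id] = index[synonym].get(doc_id, 0.0) + similarity

index = defaultdict(dict)
index_document("d1", "a shallow depression in the landscape", index)
print(index["hollow"])      # d1 scores 0.8, below the 1.0 for "depression"
```

Under this scheme, a query for “hollow” would still retrieve the document (helping recall), but rank it below documents that literally contain “hollow” (limiting the damage to precision). Whether that trade-off holds up is exactly what I want to measure.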
My own submission was a workshop paper, which we submitted to Accessing Cultural Heritage at Scale (ACHS). Workshops are more discursive than other sessions at conferences, so they are a good opportunity to get feedback on what you are doing. I’ve made the slides and my paper available here. My talk was quite abstract due to the project still being in its early stages, but I think it was a good idea to put the work out there, if only to make people aware of what we are doing.
Of course, I was just across the state line from New York, so the day before my flight home I took the opportunity to explore all the touristy areas in Manhattan. That particular experience went by all too quickly. With only a few hours to see as much as possible, I’m surprised by how much I managed to pack into a day: the Statue of Liberty, the World Trade Center, the New York Public Library, the Empire State Building and more. I also found an amazing music shop called the Guitar Center. I could have stayed in there for days and came agonisingly close to walking out of the shop with a new guitar. Unfortunately, the compounding cost of tax and accessories like a guitar case made it infeasible to bring the instrument home. Definitely one of the highlights of the trip though.
I genuinely found the conference to be both a fascinating and inspiring experience. I’ve come back to Dublin with several new ideas for my research and hopefully some new contacts too. Maybe I’ll be able to submit a long paper next year and take part in the main conference. At least it’s something to aim for.