DU Portfolio

SaDDL: Similarities and Duplication in Digital Libraries

Contact:

Dr. Peter Organisciak

RMIS Faculty, University of Denver

 peter.organisciak@du.edu

Timeline:

 Spring 2018 - Spring 2020

About the Grant:

SaDDL evolved out of a desire to address two main challenges related to accessing library collections: recommending thematically similar titles and identifying "best" versions of works in heavily redundant digital libraries. In other words, the project aims to improve recommendation and retrieval. Digital materials allow libraries to think at a larger scale, conceptualizing and analyzing collections in aggregate, so they can limit redundancies and distractions while improving access and discovery.

The project uses the HathiTrust Digital Library (HTDL), a large-scale digital repository that offers access to over 16 million library holdings. The HTDL offers publicly available research data for text mining and large-scale text analysis. SaDDL uses the Extracted Features (EF) Dataset, a HathiTrust dataset that provides page-level word counts for over five billion pages. Because the EF Dataset exposes word counts rather than readable full text, it allows duplication and similarities to be identified at large scale without violating copyright laws. The project's design allows for learning about individual volumes by analyzing large digital collections and then taking that analysis back to the volume level.
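As a rough illustration of why page-level word counts are enough to spot likely duplicates, the sketch below sums per-page counts into a volume-level count and compares two volumes with cosine similarity. This is not SaDDL's actual method or the real EF file format: the function names `volume_counts` and `cosine_similarity`, the toy data, and the choice of cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def volume_counts(pages):
    """Sum per-page word counts into one volume-level Counter (hypothetical helper)."""
    total = Counter()
    for page in pages:
        total.update(page)  # Counter.update adds counts from a mapping
    return total


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings, from 0.0 to 1.0."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (an assumption, not real EF records): each volume is a list of
# pages, and each page is a {word: count} mapping.
scan_a = [{"whale": 3, "sea": 2}, {"ship": 1, "whale": 1}]
scan_b = [{"whale": 4, "sea": 2, "ship": 1}]  # same words, paginated differently
other = [{"garden": 2, "rose": 3}]

sim_same = cosine_similarity(volume_counts(scan_a), volume_counts(scan_b))
sim_diff = cosine_similarity(volume_counts(scan_a), volume_counts(other))
```

Here `sim_same` is close to 1 because the two scans contain the same words in the same proportions, while `sim_diff` is 0 because the volumes share no vocabulary. A real pipeline would likely need to account for OCR noise, differing editions, and partial overlap, which this sketch ignores.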

SaDDL focuses on three research questions:

 How do you disambiguate different levels of duplication in a digital text collection? This question seeks to identify levels of duplication in collections and make those differences explicit. It is the foundation on which the rest of the project builds.

 How do you determine a "best" copy for access or analysis of each work in a digital library? This question seeks to identify the most accurate and most complete digital representation of a work.

 What are the most similar texts and authors to each volume in the HTDL? The final question seeks to identify works that are related to a work through similar themes and/or style.

The project plans to publish its findings in a white paper and to release any useful code, models, and datasets that emerge. The timeline for the project is two years.

Ultimately, SaDDL aims to improve collection access by making recommendations, improve the usefulness of the HTDL by reducing the amount of duplicate text, and distribute the inferred relationships between HTDL texts and authors. While scholars and researchers will be able to turn to this research for insight on large-scale text mining in digital libraries, our hope is that librarians and other information professionals will be able to apply this research to improve areas such as item description, collection development, readers' advisory, and information retrieval.

Read the IMLS grant announcement here.

Shared Work:

Examining patterns of text reuse in digitized text collections (June 3, 2019)

Peter Organisciak, Grace Therrell, Maggie Ryan, & Benjamin MacDonald Schmidt

Poster presented at the Joint Conference on Digital Libraries

Characterizing same work relationships in large-scale digital libraries (2019)

Peter Organisciak, Summer Shetenhelm, Danielle Francisco Albuquerque Vasques, & Krystyna Matusiak

International Conference on Information 2019: Information in Contemporary Society, pp. 419-425

doi:10.1007/978-3-030-15742-5_40

Massive Texts Lab
