  • SaDDL: Similarities and Duplication in Digital Libraries

  • Contact:

    Dr. Peter Organisciak

    RMIS Faculty, University of Denver 

  • Timeline:

    Spring 2018 - Spring 2020


  • About the Grant:

    SaDDL evolved out of a desire to address two main challenges related to accessing library collections: recommending thematically similar titles and identifying "best" versions of works in heavily redundant digital libraries. In other words, the project aims to improve recommendation and retrieval. Digital materials allow libraries to think at a larger scale, conceptualizing and analyzing collections in aggregate, so they can limit redundancies and distractions while improving access and discovery.

    The project uses the HathiTrust Digital Library (HTDL), a large-scale digital repository that offers access to over 16 million library holdings. The HTDL offers publicly available research data for text mining and large-scale text analysis. SaDDL uses the Extracted Features (EF) Dataset, a HathiTrust dataset that provides page-level word counts for over five billion pages. Because the EF Dataset provides word counts rather than full text, it allows duplications and similarities to be identified at a large scale without violating copyright laws. The project's design allows for learning about individual volumes by analyzing large digital collections and then taking that analysis back to the volume level.
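
    As a rough illustration of how page-level word counts can support similarity and duplication detection, the sketch below sums per-page counts into a volume-level count and compares volumes with cosine similarity. This is not SaDDL's actual method or the EF Dataset's file format: the function names, the plain-dictionary page structure, and the toy volumes are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def volume_counts(pages):
    """Sum per-page word counts into one volume-level Counter."""
    total = Counter()
    for page in pages:
        total.update(page)  # Counter.update adds counts from a mapping
    return total


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: each volume is a list of pages,
# and each page is a dict mapping a word to its count on that page.
vol_a = [{"whale": 3, "sea": 2}, {"ship": 1, "whale": 1}]
vol_b = [{"whale": 4, "sea": 2, "ship": 1}]  # same words, different pagination
vol_c = [{"garden": 2, "rose": 5}]           # unrelated vocabulary

sim_ab = cosine_similarity(volume_counts(vol_a), volume_counts(vol_b))
sim_ac = cosine_similarity(volume_counts(vol_a), volume_counts(vol_c))
```

    In this toy case, the two differently paginated copies of the same text score close to 1, while the unrelated volume scores 0. Separating the levels of duplication the project describes would need richer signals than a single similarity score.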

    SaDDL focuses on three research questions:

      How do you disambiguate different levels of duplication in a digital text collection? This question seeks to identify levels of duplication in collections and make those differences explicit. It is the foundation on which the rest of the project builds.

      How do you determine a "best" copy for access or analysis of each work in a digital library? This question seeks to identify the most accurate and most complete digital representation of a work.

      What are the most similar texts and authors to each volume in the HTDL? The final question seeks to identify works that are related to a work through similar themes and/or style.

    The project plans to publish its findings in a whitepaper and to release any useful code, models, and datasets that emerge. The timeline for the project is two years.

    Ultimately, SaDDL aims to improve collection access by making recommendations, improve the usefulness of the HTDL by reducing the number of duplicate texts, and distribute the inferred relationships between HTDL texts and authors. While scholars and researchers will be able to turn to this research for insight on large-scale text mining in digital libraries, our hope is that librarians and other information professionals will be able to apply this research to improve areas such as item description, collection development, readers' advisory, and information retrieval.


    Read the IMLS grant announcement here.

  • Shared Work:

    Examining patterns of text reuse in digitized text collections (June 3, 2019)

    Peter Organisciak, Grace Therrell, Maggie Ryan, & Benjamin MacDonald Schmidt 

    Poster presented at the Joint Conference on Digital Libraries

    Characterizing same work relationships in large-scale digital libraries (2019)

    Peter Organisciak, Summer Shetenhelm, Danielle Francisco Albuquerque Vasques, & Krystyna Matusiak

    International Conference on Information 2019: Information in Contemporary Society, pp. 419-425


This portfolio last updated: 15-Oct-2021 8:37 PM