• Welcome to Massive Texts.

  • The DU Massive Texts Lab studies massive collections of texts through the development and application of computational methods.

    We work with texts at scales too large for a person to read, applying machine learning, text mining, and natural language processing methods to learn the characteristics, relationships, and topics within documents. At an aggregate scale, we study themes across an entire collection of texts: for example, learning about historical or cultural trends from massive digital libraries, identifying duplicate or variant relationships among books, building language models in service of creativity measurement in educational assessment, and quantifying heterogeneity and outlier passages in legislation.

  • SaDDL - Similarity and Duplication in Digital Libraries

    SaDDL is an IMLS-funded project (LG-86-18-0061-18) aimed at identifying and labelling duplicate work relationships in digital libraries, currently working with the 17 million works in the HathiTrust Digital Library. A reliable way to recognize fuzzy duplicates can improve information access and retrieval, support cleaner large-scale text analysis, and help update library records to modern FRBR-based cataloguing standards. In addition, the project seeks to identify the "best" copies of each work for access or analysis, and to generate recommendations for similarly themed texts that libraries can adopt.

    To learn more about the project, visit the SaDDL page.


  • MOODs: Identifying Outlier Passages and Texts in the Legislative Process

    MOODs is a project that analyzes legislative bills for atypical or outlier texts: Misfits, Omnibuses, and Odd Ducks. The project seeks to help people better understand the legislative process, using computers to spotlight novel, and potentially interesting, parts of bills for readers, as well as creating additional quantitative measures for downstream classification and analysis. MOODs is funded through an internal University of Denver grant.

    To learn more about the project, visit the MOODs page.

  • Scoring Creativity Tests

    This project applies text-mining-based methods to creativity tasks administered to students in schools. One main limitation of educational measurement today is its over-reliance on multiple-choice items, which cannot assess students' creativity. Text-mining models have therefore become an important way to automate the scoring of creativity tests, making such testing less expensive and more common in schools. In the future, we hope that a much greater proportion of students in US schools will have the chance to respond to a creativity test, giving their teachers, parents, and school leaders the chance to recognize creative talent and potential that may otherwise have been hidden.

    One main product of this project to date is the Open Creativity Scoring (OCS) tool, which is available for free here. On this page, verbal creativity tasks can be automatically scored, producing reliable and valid scores that are designed to represent a person's ability to think originally.

  • Current Contributors:

      Dr. Peter Organisciak, Director

      Dr. Krystyna Matusiak, Affiliate Faculty

      Dr. Benjamin MacDonald Schmidt, Affiliate Faculty, NYU

      Lindsay Gypin, Graduate Student Associate

      Maggie Ryan, Graduate Student Associate

      Rita Zhang, Graduate Student Associate

      Matthew Durward, Graduate Student Associate

      To learn more about each of our contributors, visit the Members page.


  • Papers:

    Characterizing same work relationships in large-scale digital libraries (2019)

    Peter Organisciak, Summer Shetenhelm, Danielle Francisco Albuquerque Vasques, & Krystyna Matusiak

    International Conference on Information 2019: Information in Contemporary Society, pp. 419-425


    Characterizing same work relationships in large-scale digital libraries (2019)

    Peter Organisciak, Grace Therrell, Maggie Ryan, & Benjamin MacDonald Schmidt

    2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

    doi: 10.1109/JCDL.2019.00071

  • Upcoming Talks:

    Nothing scheduled at this time.

This portfolio last updated: 05-Oct-2020 12:54 PM