Welcome to Massive Texts.
The DU Massive Texts Lab studies massive collections of texts through the development and application of computational methods.
We work with texts at scales too large for a person to read, applying machine learning, text mining, and natural language processing methods to learn the characteristics, relationships, and topics within documents. At an aggregate scale, we study themes across an entire collection of texts; for example, we learn about historical or cultural trends from massive digital libraries, identify duplicate or variant relationships among books, build language models in service of creativity measurement in educational assessment, and quantify heterogeneity and outlier passages in legislation.
SaDDL - Similarity and Duplication in Digital Libraries
SaDDL is an IMLS-funded project (LG-86-18-0061-18) aimed at identifying and labelling duplicate work relationships in digital libraries. It currently focuses on the 17 million works in the HathiTrust Digital Library. A reliable way to recognize fuzzy duplicates can improve information access and retrieval, support cleaner large-scale text analysis, and help update library records to modern FRBR-based cataloguing standards. In addition, the project seeks to identify the "best" copies of each work for access or analysis, and to generate recommendations for similarly-themed texts that libraries can adopt.
To learn more about the project, visit the SaDDL page.
MOODs: Identifying Outlier Passages and Texts in the Legislative Process
MOODs is a project that analyzes legislative bills for atypical or outlier texts - Misfits, Omnibuses, and Odd Ducks. The project seeks to help people better understand the legislative process, using computers to spotlight novel, and potentially interesting, parts of bills for readers, and to create additional quantitative measures for downstream classification and analysis. MOODs is funded through an internal University of Denver grant.
To learn more about the project, visit the MOODs page.
Dr. Peter Organisciak, Director
Dr. Krystyna Matusiak, Affiliate Faculty
Lindsay Gypin, Graduate Student Associate
Andy Lawder, Graduate Student Associate
Neba Nfonsang, Graduate Student Associate
Maggie Ryan, Graduate Student Associate
Summer Shetenhelm, Graduate Student Associate
Grace Therrell, Graduate Student Associate
Danielle Vasques, Graduate Student Associate
To learn more about each of our contributors, visit the Members page.
Characterizing same work relationships in large-scale digital libraries (2019)
Peter Organisciak, Summer Shetenhelm, Danielle Francisco Albuquerque Vasques, & Krystyna Matusiak
International Conference on Information 2019: Information in Contemporary Society, pp. 419-425
Examining patterns of text reuse in digitized text collections (June 3, 2019)
Peter Organisciak, Grace Therrell, Maggie Ryan, & Benjamin MacDonald Schmidt
Poster presented at the Joint Conference on Digital Libraries