
The DU Massive Texts Lab studies massive collections of texts through the development and application of computational methods.

We work with texts at scales too large for a person to read, applying machine learning, text mining, and natural language processing methods to learn the characteristics, relationships, and topics within documents. At an aggregate scale, we study themes across entire collections of texts: for example, learning about historical or cultural trends from massive digital libraries, identifying duplicate or variant relationships among books, building language models in service of creativity measurement in educational assessment, and quantifying heterogeneity and outlier passages in legislative texts.

SaDDL - Similarity and Duplication in Digital Libraries

SaDDL is an IMLS-funded project (LG-86-18-0061-18) aimed at identifying and labelling duplicate work relationships in digital libraries. It currently works with the 17 million works in the HathiTrust Digital Library. A reliable way to recognize fuzzy duplicates can improve information access and retrieval, support cleaner large-scale text analysis, and aid in updating library records to modern FRBR-based cataloguing standards. In addition, the project seeks to identify the "best" copies of each work for access or analysis, and to generate recommendations for similarly themed texts that libraries can adopt.

To learn more about the project, visit the SaDDL page.

MOTES - Measurement of Original Thinking in Elementary Students

MOTES aims to develop a rapidly scorable, freely accessible, reliable, and valid measure of original thinking for elementary school students. MOTES is intended to be widely applied as a universal screening tool for gifted and talented programs in U.S. schools. It comprises four divergent thinking tasks that will be scored using text-mining methodology, meaning that no human reader will be required to score the test. Throughout the project, the reliability, validity, and fairness of MOTES scores across demographic groups will be psychometrically evaluated.

For more information about the project, visit motes.unt.edu.

Scoring Creativity Tests

This project seeks to apply text-mining-based methodology to creativity tasks administered to students in schools. One main limitation of educational measurement today is its over-reliance on multiple-choice items, which cannot assess students' creativity. Text-mining models have therefore become an important way to automate the scoring of creativity tests, making such testing less expensive and more common in schools. In the future, we hope that a much greater proportion of students in U.S. schools will have the chance to respond to a creativity test, giving their teachers, parents, and school leaders the chance to recognize creative talent and potential that might otherwise have remained hidden.

One main product of this project to date is the Open Creativity Scoring (OCS) tool, available for free at openscoring.du.edu. There, verbal creativity tasks can be scored automatically, producing reliable and valid scores designed to represent a person's ability to think originally.

MOODs: Identifying Outlier Passages and Texts in the Legislative Process

MOODs is a project that analyzes legislative bills for atypical or outlier texts: Misfits, Omnibuses, and Odd Ducks. The project seeks to help people better understand the legislative process, using computers to spotlight novel, and potentially interesting, parts of bills for readers, and to create additional quantitative measures for downstream classification and analysis. MOODs is funded through an internal University of Denver grant.

To learn more about the project, visit the MOODs page.

Massive Texts Lab