Digital Humanities Topic Modeling
Terry Kapral, the Humanities Data Librarian for the University Library at the University of Pittsburgh, provided three data sets for an analysis of the digital collections in the Humanities department.
This was an exploratory, unsupervised learning project. The high-level goal was to investigate which topics are present within the humanities digital collection, and how those topics vary over time. Specifically, Mrs. Kapral was interested in answers to the following questions about the data:
- What are the latent topics across the digital items?
- What items are related by topic?
- How do topics change over time with respect to the time period covered by the items within each topic?
- Are there any problems with the data?
These questions were answered through data exploration, including word embeddings and t-SNE plots, and through topic modeling with Latent Dirichlet Allocation (LDA), an unsupervised learning algorithm. Data exploration revealed problems in the data, some of which were mitigated. The final LDA model revealed 19 latent topics from the titles and abstracts in the metadata for the 124,517 digitized items that had a title.
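To make the LDA step concrete, here is a minimal sketch of the idea using a small collapsed Gibbs sampler written in plain Python. It is not the code used in this project, which may well have relied on a standard library implementation. The function names (`fit_lda`, `top_words`), the hyperparameters (`alpha`, `beta`, `n_iter`), and the four toy "titles" are all assumptions made up for illustration; none of them come from the Pitt data sets.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word

def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: counts[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result

# Hypothetical, already-tokenized titles (invented for this sketch).
docs = [
    ["steel", "mill", "workers", "strike"],
    ["steel", "workers", "union", "mill"],
    ["river", "bridge", "photograph", "city"],
    ["city", "bridge", "river", "view"],
]

vocab, doc_topic, topic_word = fit_lda(docs, n_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(f"topic {k}: {words}")
```

The `doc_topic` counts are what link items by topic: two items whose counts concentrate on the same topic are candidates for being related. On a real corpus of this size, the number of topics (19 in the final model here) would be chosen by comparing several candidate models rather than fixed up front as in this toy run.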