HPC-HD: High Performance Computing for the Detection and Analysis of Historical Discourses

Interface of Reception Reader

This project will use HPC to detect discourses from large historical corpora of the eighteenth century (e.g., books, pamphlets, newspapers), and study the interconnections and evolution of the detected discourses. The approach is to analyze historical corpora in a nuanced, thorough fashion: nuanced, because we analyze the available corpora at various levels of conceptual granularity, starting from the raw documents as first elements, and then progressively discovering intermediate linguistic elements (keywords, topics, genres) and higher-level notions (concepts such as “the economy” or “the state” and discourses about them); and thorough, in the sense that the analysis is performed jointly over the entire corpora (billions of words, comprising a large fraction of all existing literature from the period). This approach contrasts traditional historical scholarship, which often uses a single element as a starting point (e.g., a passage attributed to a single well-known historical figure) and then aims to generalize from it, typically using a limited number of documents as corroborating sources. In addition, our approach also contrasts modern historical scholarship, which uses “big data” but performs the analysis at a very aggregate level. Compared to previous scholarship, our approach has the potential to discover unknown and richer insights from historical corpora that traditional approaches have missed.

We expect that such discoveries will occur not through a one-shot computation, but as the outcome of historian-guided exploration in iterative computational workflows. The use of HPC is instrumental in building such workflows for the study of historical corpora. Specifically, HPC resources will be crucial for: the storage, processing, and management of large data volumes; building and deploying large and complex NLP models that are robust to noise and biases in the data; and, finally, providing the guiding historian with explanations for the results and efficiently adapting the existing workflow to the instruction of the historian. In developing and implementing its reusable workflows, the project will use the analysis of economic discourse in the eighteenth century as a case study.

Ananth Mahadevan
Ananth Mahadevan
Machine Learning PhD Student

My research interests include systems for Machine Learning and network science.