This dissertation presents a new tool for exploratory text analysis that attempts to improve the experience of navigating and exploring text and its metadata. The design of the tool was motivated by the unmet need for text analysis tools in the humanities and social sciences. In these fields, it is common for scholars to have hundreds or thousands of text-based source documents of interest from which they extract evidence for complex arguments about society and culture. These collections are difficult to make sense of and navigate. Unlike numerical data, text cannot be condensed, overviewed, and summarized in an automated fashion without losing significant information. And the metadata that accompanies the documents -- often from library records -- does not capture the varied content of the text within.
Furthermore, adoption of computational tools remains low among these scholars despite such tools having existed for decades. A recent study found that the main culprits were poor user interfaces and lack of communication between tool builders and tool users. We therefore took an iterative, user-centered approach to the development of the tool. From reports of classroom usage and interviews with scholars, we developed a descriptive model of the text analysis process and extracted design guidelines for text analysis systems. These guidelines recommend showing overviews of both the content and metadata of a collection; allowing users to separate and compare subsets of data according to combinations of searches and metadata filters; allowing users to collect phrases, sentences, and documents into custom groups for analysis; making the usage context of words easy to see without interrupting the current activity; and making it easy to switch between different visualizations of the same data.
WordSeer, the system we implemented, supports highly flexible slicing and dicing, as well as easier transitions than in other tools between visual analyses, drill-downs, lateral explorations, and overviews of slices in a text collection. The tool uses techniques from computational linguistics, information retrieval, and data visualization.
The contributions of this dissertation are the following. First, we contribute the design and source code of WordSeer Version 3, an exploratory text analysis system. Unlike other current systems for this audience, WordSeer 3 supports collecting evidence, isolating and analyzing subsets of a collection, making comparisons based on collected items, and exploring a new idea without interrupting the current task. Second, we give a descriptive model of how humanities and social science scholars undertake exploratory text analysis during the course of their work. We also identify pain points in their current workflows and give suggestions on how systems can address these problems. Third, we describe a set of design principles for text analysis systems aimed at addressing these pain points. For validation, we contribute a set of three real-world examples of scholars using WordSeer 3, which was designed according to those principles. As a measure of success, we show how the scholars were able to conduct analyses yielding otherwise inaccessible results useful to their research.