Exploring Topic Modeling of Jane Addams's Speeches and Articles

Machine learning techniques can be used to explore new ways of analyzing and organizing the documents found in the Jane Addams Papers Project. One such technique, topic modeling, was used to extract the topics, or themes, found in speeches and articles written by Jane Addams. Each topic is represented by a set of words from the documents that convey what that topic is about. The topic model that was used, latent Dirichlet allocation (LDA), identified 15 topics, some of which are shown below.

These word clouds show the words associated with a given topic. Words that are bigger are more representative of that topic. From these word clouds, we can see that the topic model identified themes like World War I and international politics (Topic 2), women’s education and social work (Topic 3), women’s suffrage (Topic 6), immigration and Hull-House (Topic 9), and child labor (Topic 14).

Once this topic modeling was completed, an analysis was conducted to explore any overlap between the topics that were identified and the tags assigned to documents in the Digital Edition of the Jane Addams Papers Project. First, each document was assigned to a topic: the one estimated to be the most prevalent in that document. The topic model provided this information. Next, each tag was also assigned to a topic. This was done by examining how many documents labeled with a given tag belonged to each topic. The breakdown of topics for documents labeled with the tag “Peace” is shown in the pie chart below.

This pie chart shows that the majority (52.6%) of documents labeled with “Peace” in the Digital Edition belong to Topic 2. Therefore, the “Peace” tag was assigned to Topic 2. Repeating this process for every tag resulted in a list of tags associated with each topic. Some of the other tags assigned to Topic 2 were “World War I,” “Internationalism,” “Europe,” “Relief Efforts,” and “League of Nations.” This makes sense since Topic 2 relates to World War I and international affairs.
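The two assignment steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual code: the document ids, topic weights, tags, and the helper names `dominant_topic` and `topic_shares_for_tag` are all hypothetical, and the toy data uses three topics rather than 15.

```python
from collections import Counter

# Hypothetical topic weights per document, as a topic model might estimate
# them (one weight per topic; only 3 topics in this toy example).
doc_topic_weights = {
    "doc_a": [0.10, 0.70, 0.20],
    "doc_b": [0.55, 0.25, 0.20],
    "doc_c": [0.05, 0.15, 0.80],
    "doc_d": [0.20, 0.60, 0.20],
}

# Hypothetical editorial tags attached to each document.
doc_tags = {
    "doc_a": ["Peace", "Europe"],
    "doc_b": ["Peace", "Child Labor"],
    "doc_c": ["Suffrage"],
    "doc_d": ["Peace"],
}


def dominant_topic(weights):
    """Return the index of the topic with the largest estimated weight."""
    return max(range(len(weights)), key=lambda k: weights[k])


def topic_shares_for_tag(tag, doc_tags, doc_topic):
    """Return {topic: share of the documents carrying `tag` that belong to it}."""
    counts = Counter(
        doc_topic[doc] for doc, tags in doc_tags.items() if tag in tags
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no document carries this tag
    return {topic: n / total for topic, n in counts.items()}


# Step 1: assign each document to its most prevalent topic.
doc_topic = {doc: dominant_topic(w) for doc, w in doc_topic_weights.items()}

# Step 2: assign a tag to the topic that holds the largest share of its documents.
peace_shares = topic_shares_for_tag("Peace", doc_tags, doc_topic)
peace_topic = max(peace_shares, key=peace_shares.get)

print(doc_topic)
print(peace_shares, peace_topic)
```

The shares computed for a tag are the same quantities a pie chart of that tag would display. How ties between topics were broken in the original analysis is not stated; this sketch simply keeps the first topic encountered.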

After assigning each document and tag to a topic, it was possible to determine if there was a connection between a document’s tags and its topic. On average, about 45% of a document’s tags belonged to the same topic as the document itself. This means that there is partial overlap between the topics and the tags associated with documents.
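One way to compute such an average is to take, for each document, the fraction of its tags whose topic matches the document's topic, and then average those fractions across documents. The sketch below assumes that reading; the data and the function names `overlap_fraction` and `mean_overlap` are hypothetical and are not taken from the project.

```python
# Hypothetical inputs: each document's topic, each tag's topic,
# and the tags attached to each document.
doc_topic = {"doc_a": 1, "doc_b": 0, "doc_c": 2}
tag_topic = {"Peace": 1, "Europe": 1, "Child Labor": 0, "Suffrage": 2, "Politics": 0}
doc_tags = {
    "doc_a": ["Peace", "Europe", "Politics"],
    "doc_b": ["Peace", "Child Labor"],
    "doc_c": ["Suffrage"],
}


def overlap_fraction(doc, doc_tags, doc_topic, tag_topic):
    """Fraction of a document's tags assigned to the document's own topic."""
    tags = doc_tags[doc]
    matches = sum(1 for tag in tags if tag_topic.get(tag) == doc_topic[doc])
    return matches / len(tags)


def mean_overlap(doc_tags, doc_topic, tag_topic):
    """Average the per-document fractions, skipping documents without tags."""
    docs = [doc for doc, tags in doc_tags.items() if tags]
    fractions = [overlap_fraction(d, doc_tags, doc_topic, tag_topic) for d in docs]
    return sum(fractions) / len(fractions)


average = mean_overlap(doc_tags, doc_topic, tag_topic)
print(f"{average:.1%}")
```

Averaging per document gives every document equal weight regardless of how many tags it carries. Pooling all tags across documents before dividing would weight heavily tagged documents more, and could give a different figure.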

Similar overlap occurred when a multilabel classifier was used to predict tags for each document. Multilabel text classification is a machine learning technique that automates the assignment of multiple labels to a text for categorization purposes. On average, about 43% of the tags predicted for a document belonged to the same topic as the document itself. Looking at documents belonging to specific topics showed even greater overlap. For documents belonging to Topic 2, for instance, about 92% of the tags predicted by the classifier also belonged to Topic 2. The bar chart shown below reveals some of the tags that were predicted most frequently for documents belonging to Topic 2. The blue bars represent tags that also belong to Topic 2, while the gray bar for the “Politics” tag means that this tag does not belong to Topic 2.

[Figure: bar chart of the tags predicted most frequently for documents belonging to Topic 2]

The bar chart shows that the classifier often predicts tags associated with Topic 2 for documents belonging to this topic. This suggests that the topic model and classifier capture similar insights about the contents of the documents.
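The per-topic comparison behind a chart like this can be sketched as follows: restrict to the documents belonging to one topic, count how often each tag was predicted for them, and measure what share of those predictions fall on tags assigned to the same topic. Everything here is a hypothetical illustration. The data is made up, the function name `predicted_tag_summary` is invented, and pooling predictions across documents (rather than averaging per document) is an assumption about how the per-topic figure was computed.

```python
from collections import Counter

# Hypothetical inputs: each document's topic, each tag's topic,
# and the tags a classifier predicted for each document.
doc_topic = {"doc_a": 2, "doc_b": 2, "doc_c": 3}
tag_topic = {"Peace": 2, "World War I": 2, "Politics": 5, "Education": 3}
predicted_tags = {
    "doc_a": ["Peace", "World War I"],
    "doc_b": ["Peace", "Politics"],
    "doc_c": ["Education"],
}


def predicted_tag_summary(topic, predicted_tags, doc_topic, tag_topic):
    """Count predicted tags for one topic's documents and the share that match it."""
    counts = Counter(
        tag
        for doc, tags in predicted_tags.items()
        if doc_topic[doc] == topic
        for tag in tags
    )
    total = sum(counts.values())
    same_topic = sum(n for tag, n in counts.items() if tag_topic.get(tag) == topic)
    share = same_topic / total if total else 0.0
    return counts, share


counts, share = predicted_tag_summary(2, predicted_tags, doc_topic, tag_topic)

# Tags ordered by prediction frequency, flagged by whether they share the topic
# (the distinction the blue and gray bars draw in the chart).
for tag, n in counts.most_common():
    print(tag, n, "same topic" if tag_topic.get(tag) == 2 else "other topic")
print(f"{share:.0%} of predicted tags belong to Topic 2")
```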

Overall, this analysis suggests that there is a connection between the themes that a topic model captures in texts and the categories that a multilabel classifier predicts for those texts. Since the topics and tags of documents often align, it also suggests that the machine-generated themes produced by the topic model are similar to the themes that human researchers working on the Jane Addams Papers Project have identified in the texts and encapsulated in the tags over time. However, since the overlap between topics and tags was well short of 100%, machine learning techniques, while useful for processing large collections of texts, may be unable to fully match the ability of humans to capture the complexities and nuances of language.