The Jane Addams Papers is a digital edition of the papers of the famous American peace activist and social worker of that name. The project is a digital humanities initiative that uses documents relating to Jane Addams to envision her world. It centers on Addams and her work: what she wrote, where she and her associates lived, how they arranged their meetings, and the general social climate of the time.
The Jane Addams Papers team adds descriptive metadata to each record along with an image and a transcription of the text, and has transcribed roughly 15,000 documents so far. The texts span 1901–1935 and are organized into about 1,400 collections and 300 tags, and have been showcased in 15 exhibits. The project uses Addams’ letters, documents that mention her, and documents she composed to reveal her social and professional connections and her work in general.
Each of these records has been categorized with tags and subjects that make querying the Project’s dataset easier. Researchers can download subsets of the dataset by subject or tag, retrieving all transcribed documents associated with them. Each tag search produces a unique portrait of Jane Addams’s involvement in the related subject matter, detailing the organizations and people involved in it.
The following analysis was produced by searching for “Women’s Suffrage” and exporting a CSV file containing the metadata of every transcription that mentions that tag or subject.
The table above is a sample of 3 of the 954 rows and 7 of the 123 columns of the dataset used for this analysis. While the data in most columns have no practical application in the analysis and are mere metadata of the transcription, several columns are completely empty. Comparing null to valid entries produces the following pie chart, which shows that about 78% of the values in the entire dataset are empty and only just over 21% are valid.
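The null-versus-valid split behind the pie chart can be computed directly with pandas. The sketch below uses a small stand-in frame rather than the real 954 × 123 export, since the exact file name of the Project’s CSV is not given here; the same two lines of arithmetic apply to the full dataset after a `pd.read_csv` call.

```python
import pandas as pd

# Toy stand-in for the exported CSV (the real frame is 954 rows x 123 columns).
df = pd.DataFrame({
    "title":     ["Letter to a colleague", None, "Speech on suffrage"],
    "date":      ["1914-03-01", "1885-01-01", None],
    "publisher": [None, None, None],   # an entirely empty column
})

# Share of empty cells across the whole frame, as in the pie chart.
null_pct = df.isna().sum().sum() / df.size * 100
valid_pct = 100 - null_pct

print(f"null: {null_pct:.1f}%  valid: {valid_pct:.1f}%")
```

`DataFrame.isna()` marks every empty cell, and summing twice collapses the per-column counts into a single total.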
Describing the distribution of null data further yields the following stacked bar graph, where each bar represents a column in the dataset, red marking valid data and blue marking null data. All 68 columns in which every value was missing, making up about 55% of the dataset’s attributes, were removed from the analysis.
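Dropping columns that contain no data at all is a one-liner in pandas. A minimal sketch, again using a toy frame in place of the real export:

```python
import pandas as pd

# Toy frame: "publisher" is entirely empty, like the 68 dropped columns.
df = pd.DataFrame({
    "title":     ["Letter", None, "Speech"],
    "publisher": [None, None, None],
})

# Drop only those columns where *every* value is null (how="all"),
# leaving columns that are merely sparse untouched.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

With `how="all"`, partially filled columns survive; switching to `how="any"` would instead discard any column with even one missing value.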
The first step in the analysis process was to remove the null values. Because the percentage of null values in the dataset is large and their distribution within it is random, removing them simplifies the analysis and makes it easier to focus on the available data.

After removing the empty attributes, data that had been heavily HTML-encoded during transcription was cleaned with regular expressions from Python’s “re” module. Each value in every column containing HTML tags was passed through a function that finds and removes those tags. The cleaned values were then stored in a series and written back into the original data frame for analysis.

Along with cleaning the HTML-encoded data, the values in columns holding dates were converted into date-time objects. This enables date calculations such as finding the duration between two dates; because the dataset is historical in nature, this is essential for interpreting timelines and drawing deductions accordingly. In addition, incomplete dates were manually converted into valid formats so as to preserve as much authenticity as possible. For instance, a few dates in the originals were recorded with question marks, such as “1914-03-??” or “1885?”; these were converted into a proper year-month-day format, “1914-03-01” and “1885-01-01” respectively. Since the analysis focuses only on the year, filling in a placeholder month and day does not affect the end result.
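The cleaning steps above can be sketched as follows. The helper names (`strip_html`, `normalize_date`) and the tag-matching pattern are assumptions for illustration, not the Project’s actual code; the post only specifies that `re` was used to remove tags and that partial dates were padded to year-month-day form.

```python
import re
import pandas as pd

def strip_html(value: str) -> str:
    """Remove anything that looks like an HTML tag from a transcription value."""
    return re.sub(r"<[^>]+>", "", value)

def normalize_date(raw: str) -> str:
    """Turn partial dates like '1914-03-??' or '1885?' into full Y-M-D strings."""
    raw = raw.replace("?", "").strip()
    parts = raw.split("-")
    while len(parts) < 3:          # pad a bare year or year-month
        parts.append("01")
    # Replace any emptied-out segment (e.g. the day in '1914-03-') with '01'.
    return "-".join(p if p else "01" for p in parts)

# Toy frame standing in for the HTML-encoded transcription columns.
df = pd.DataFrame({
    "text": ["<p>Dear <b>Miss Addams</b>,</p>", "<div>On suffrage</div>"],
    "date": ["1914-03-??", "1885?"],
})

df["text"] = df["text"].apply(strip_html)
df["date"] = pd.to_datetime(df["date"].apply(normalize_date))
```

Once the `date` column is a true datetime series, duration arithmetic (e.g. `df["date"].max() - df["date"].min()`) and year-level grouping work directly, which is what the year-only analysis relies on.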
Further readings on the project:
This blog post was composed in the Fall of 2022 by Anushka Acharya, a senior Computer Science major at Ramapo College of New Jersey.