Thomas Piggott
User Experience Lead
Bringing text and data mining to the humanities

The practice of Digital Humanities is a new frontier for study in the humanities. There's a lot that computers can help us with, but it's not how study has traditionally been done in subjects like literature, philosophy, and history. Scholars have typically relied on deep, close reading of their materials. But doing so is labor intensive, and analyzing more than a few documents could take months or even a lifetime. With the Digital Humanities, text and data mining opens up a whole new way for scholars to explore their domains.

Our challenge was to make this new way of researching approachable. Our own research showed us that the idea of taking thousands of documents and analyzing them in minutes did not come naturally to many of the graduate students. They were so familiar with their tried-and-true methods that they had trouble seeing how they could apply these new methods to their research. One thing in our favor, though, was their excitement to try the tools within the Digital Scholar Lab once they started to see the potential in Digital Humanities. We helped make the Digital Scholar Lab understandable by breaking it into three steps.

Build

The first step, build, is the most familiar one. Here a scholar curates their corpus of materials, but instead of just a few documents, they can find and collect thousands.

Clean

Next is clean, an important step that is new to most. Scholars need to "clean" their digital documents for two reasons. The first reason is that many documents start as a scan of a primary source document from the source library. To make the scan usable, text is created using Optical Character Recognition (OCR). While OCR algorithms are improving, they aren't perfect, and this can lead to incorrect words or characters, especially if the scan is low quality or the original document is in poor condition. The second reason to clean your documents is that certain analysis methods may perform better depending on how the documents are cleaned. A great example is the use of "stop words". Stop words are words that get ignored during analysis to improve results by removing common or meaningless terms like "the" or "a".
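To make the stop word idea concrete, here is a minimal Python sketch of counting words with and without a stop word list. It is only an illustration and not how the Digital Scholar Lab is implemented: the `STOP_WORDS` set, the `tokenize` and `term_frequency` helpers, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration; real tools use much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequency(text, stop_words=STOP_WORDS):
    """Count how often each word appears, skipping any stop words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)

sample = "The whale and the sea: a tale of the whale."
counts = term_frequency(sample)
print(counts.most_common(2))  # [('whale', 2), ('sea', 1)]
```

Without the stop word list, "the" would be the most frequent term in the sample, which says little about what the text is about.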

Analyze

Finally, there is analyze. Using the Digital Scholar Lab, researchers can conduct sentiment analysis, determine the parts of speech being used, run named-entity recognition, view term frequency, compare document similarity, and run topic modeling. Most of these methods are new to researchers and require some explanation. We've done our best to make them understandable, but there's always going to be an element of exploration necessary to understand and make use of them.
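As one example of what such a method does under the hood, document similarity can be sketched as comparing word counts between two texts. The Python below uses cosine similarity, a common choice that we are assuming here for illustration; it is not necessarily the measure the Digital Scholar Lab uses, and the `word_counts` and `cosine_similarity` helpers are hypothetical names.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Turn a text into a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Return a score from 0.0 (no shared words) to 1.0 (same word proportions)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = word_counts("whales swim in the cold sea")
doc2 = word_counts("whales swim in the warm sea")
doc3 = word_counts("parliament debated taxes")

print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))  # True
```

Documents that share more of their vocabulary score closer to 1.0, which gives researchers a quick way to find related materials in a large corpus.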
