Digital Scholar Lab: Bringing text and data mining to the humanities

Digital Humanities is a new frontier for study in the humanities. Computers can help with a great deal, but computation is not how study has traditionally been done in subjects like literature, philosophy, and history. Scholars have typically relied on deep, close reading of their materials, which is labor intensive: analyzing more than a few documents could take months or even a lifetime. With the Digital Humanities, text and data mining opens up a whole new way for scholars to explore their domains.

The Challenge - Make Text & Data Mining Approachable

Our research showed that the idea of taking thousands of documents and analyzing them in minutes didn't come naturally to many of the graduate students we were targeting with this product. They were so familiar with their tried-and-true methods that they had trouble seeing how they could apply these new methods to their research. However, once they understood the idea, their excitement grew as they began to see the possibilities. So we needed to provide not just the tools, but also the path to understanding.

Developing and Prioritizing Personas

We wrote 26 personas that encompassed a wide range of users and needs and prioritized them into 1 Primary Persona, 2 Secondary Personas, and 3 Tertiary Personas. The value of this prioritization is that it gets all the stakeholders in a room together to agree on priorities. A product has so many needs and could go in so many directions that we needed everyone to agree on one direction before we could design effectively.

Product Workflow

We broke our product into 3 steps to help users understand the workflow:

Build – collect documents to include in your research from across all the archives available at an institution
Clean – set up automated processes for cleaning the text of errors created during the conversion process from document image scans into usable text
Analyze – choose from different techniques to visualize your documents in new ways, including term frequency, sentiment analysis, named entity recognition, parts of speech analysis, and topic modeling

Build: Designing Search Results for Bulk Document Collection

Search results needed to differ from the standard experience on our platform: instead of looking for a few documents, users needed to decide quickly whether they wanted to work with thousands of documents. To facilitate this, we made the metadata more prominent and enabled results to be selected in bulk.

Clean: Making the Need to Clean Understandable

Why Clean? Many documents start as an image scan of a primary source document from the source library. To make it usable, text is created using Optical Character Recognition (OCR). While OCR text algorithms are improving, they aren't perfect and this can lead to incorrect words or characters, especially if the original document scan is low quality or handwritten. The second reason to clean your documents is that certain analysis methods may perform better depending on how they're cleaned. A great example is the use of "stop words". "Stop words" are words that get ignored during analysis to improve results by removing common or meaningless terms like "the" or "a".
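To make the stop-word idea concrete, here is a minimal sketch in Python. It is not the product's implementation: the stop-word list, the sample sentence, and the function name term_frequency are all assumptions made up for illustration. It counts how often each word appears, with and without stop words removed.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def term_frequency(text, stop_words=frozenset()):
    """Count lowercase word tokens, skipping any that appear in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample sentence, used only to show the effect.
sample = "The whale and the sea: a tale of the whale in the deep"

raw = term_frequency(sample)                 # counts every word
cleaned = term_frequency(sample, STOP_WORDS)  # counts only the remaining words

print(raw.most_common(1))      # the most frequent word before cleaning
print(cleaned.most_common(1))  # the most frequent word after cleaning
```

Without stop words, the most frequent term in the sample is "the", which says nothing about the text. With them removed, "whale" rises to the top, which is the kind of improvement this cleaning step is meant to deliver.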

How to Make it Understandable? We did a lot of testing to understand how people familiar with this process do it and what their reasons were. We learned that the ability to play with these options was crucial, so we designed a way to preview how your settings would affect your documents.
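The preview idea can be sketched as a before-and-after comparison driven by a settings object. This is only an illustration under stated assumptions: the CleanSettings fields, the clean and preview functions, and the sample OCR error are hypothetical and do not describe the product's actual options.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CleanSettings:
    """Hypothetical cleaning options, invented for this sketch."""
    lowercase: bool = True
    strip_punctuation: bool = True
    # Whole-word OCR corrections, e.g. {"tbe": "the"}.
    corrections: dict = field(default_factory=dict)


def clean(text, settings):
    """Apply the enabled cleaning steps to text, in a fixed order."""
    for wrong, right in settings.corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    if settings.lowercase:
        text = text.lower()
    if settings.strip_punctuation:
        text = re.sub(r"[^\w\s]", "", text)
    # Collapse any runs of whitespace left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()


def preview(text, settings):
    """Return the original and cleaned text side by side."""
    return {"before": text, "after": clean(text, settings)}


# Made-up sample with a typical OCR confusion ("tbe" for "the").
sample = "Tbe ship, tbe sea!"
fixes = {"Tbe": "The", "tbe": "the"}

print(preview(sample, CleanSettings(corrections=fixes)))
print(preview(sample, CleanSettings(lowercase=False,
                                    strip_punctuation=False,
                                    corrections=fixes)))
```

Changing one setting and comparing the two "after" values is the essence of the preview: users see the consequence of a choice on their own text before committing to it.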

Analyze: Creating Useful Visualizations

Being able to interrogate your set of documents from several angles is key to seeing those documents in new ways. We aimed to provide flexibility in our visualizations so that users could explore their documents at different levels:

At a high level - get an overall understanding of your content set
Document level analysis - be able to look at individual documents in a new way
Metadata groupings - look at document groupings by author, publisher, document type, and many more
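The three levels above can be sketched on a toy content set. Everything here is an assumption for illustration: the document records, their field names, and the helper functions are made up, and word count stands in for whatever measure an analysis actually produces.

```python
from collections import defaultdict

# Hypothetical content set; the authors, types, and texts are made up.
documents = [
    {"author": "Author A", "type": "letter", "text": "a short letter home"},
    {"author": "Author A", "type": "essay", "text": "an essay on the sea and the sky"},
    {"author": "Author B", "type": "letter", "text": "another letter"},
]


def word_count(doc):
    """Document level: a simple measure for one document."""
    return len(doc["text"].split())


def total_words(docs):
    """High level: one number for the whole content set."""
    return sum(word_count(doc) for doc in docs)


def words_by(docs, metadata_field):
    """Metadata grouping: total the measure per value of a metadata field."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc.get(metadata_field, "unknown")] += word_count(doc)
    return dict(totals)


print(total_words(documents))
print([word_count(doc) for doc in documents])
print(words_by(documents, "author"))
print(words_by(documents, "type"))
```

The same measure is reused at every level; only the grouping changes, which is what lets a single analysis feed several visualizations.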

Sentiment Analysis - Sentiment Scores by Author

Sentiment Analysis - Sentiment Over Time

Named Entity Recognition - Entities Found

Document Clustering - Document Similarity

Ask me more!

That's a lot about this project, but there's so much more! I'd love to dive deeper with you on any of the topics above or any other aspect of this project. Here are some starters:

Making the best of a bad situation - using launch delays to conduct usability tests and improve the design
Researching and learning about a wholly new market to the business and building a successful product
Incorporating Learning Experience (LX) design and building a Learning Center within the product

Get in Touch

[email protected]
