A Wide Lens:
Using Digital Humanities to Study
a Corpus of Banned Books
by SaraGrace Stefan and Lara Mangino, Temple University
What are the scandalous subjects leading to rampant book bans? What makes certain texts so “controversial” or “dangerous”? Most importantly, what can we discover by studying these allegedly problematic books at a broad scale?
Nonprofit organization PEN America reported that there were nearly as many school book bans in the fall of 2023 as in the previous two years combined. Coordinated by Dr. Laura McGrath of Temple University’s English department and Dr. Alex Wermer-Colan of the Loretta C. Duckworth Scholars Studio, we created a corpus of banned books based on PEN America’s 2021-2022 data in order to analyze them with digital methods. We analyzed the 2,532 instances of book banning reported during that period, then used funding from the Mellon Foundation to purchase the books in question, with the aim of supplementing PEN America’s research with additional quantitative study.
Our impetus for this project was to take book banners at their word: Maybe these banned books were more explicit, more subversive, more problematic than their “appropriate” counterparts. We wanted to explore what text mining could tell us about these books, their commonalities, and what they reflect about the book banning phenomenon. (Not to give too much of a spoiler, but our initial findings suggest that books featuring people of color, the queer community, religious minorities, mental health, or political subversion of any kind ARE more likely to be banned.)
Once we received the physical books, we had the task of transforming them into data that could be “read” by a computer. Each book had to be guillotined, disbound, and then scanned page by page, with the help of the Library Digitization Team, especially undergraduate Laura Freshcoln. These scans generated text files, which had to be corrected using ABBYY FineReader OCR and could then be disaggregated for analysis.
We also purchased some of our corpus as ebooks, as we had been exempted from the Digital Millennium Copyright Act (DMCA). Passed in 1998, the DMCA prevents the dissemination of copyrighted works, but can be a major stumbling block to those trying to study contemporary texts. The necessity of this exemption for our work highlights the difficulties digital humanists consistently encounter when analyzing content protected by copyright, even when conducting nonconsumptive research such as ours.
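For readers curious what “disaggregating” a text can look like in practice, here is a minimal sketch in Python. It is not the project's actual pipeline: the function name `disaggregate`, the hyphen-rejoining step, and the sample sentence are all assumptions made for illustration. It reduces a text to word counts and discards word order, which is one common form that nonconsumptive analysis takes.

```python
import re
from collections import Counter


def disaggregate(text: str) -> Counter:
    """Reduce a text to lowercase word counts, discarding word order.

    Hypothetical helper for illustration only.
    """
    # Rejoin words split by a hyphen at a line break, a common scanning artifact.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Keep runs of letters and apostrophes; everything else separates words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Invented sample text, not drawn from the corpus.
sample = "The book was ban-\nned. The book was read anyway."
counts = disaggregate(sample)
print(counts.most_common(3))
```

A real workflow would need more care. This simple rule also fuses genuinely hyphenated words that happen to break across lines, for instance.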
As we built our corpus and enriched it by gathering metadata such as publisher, target reader audience, and number of bans, our undergraduate team, Kriti Baru, Abby Corcelli, and Sydney Grimm, became intrigued by recurring patterns on the book covers they processed. They wondered if perhaps book banners were (scandalously!) judging these books by their covers. So, they created a method for analyzing and tagging each of the book covers in our corpus (around 1,650).
We recorded characteristics such as our best (albeit consciously limited) assessment of cover figures’ gender and racial identities, along with descriptions such as the colors used or the presence of specific keywords on the front or back cover. At NeMLA this year, we presented an analysis of a partial dataset of 532 book covers, displayed in the graph to the right. Of the 532 figures depicted, 60% appeared female and 63% were tagged as nonwhite or ambiguous.
Additionally, within this mini corpus, 298 covers featured two figures, and we assessed whether their interaction could be considered romantic, ambiguous, or platonic. Of the 68 romantic interactions, 46 depicted a presumably queer relationship. Although these are just preliminary findings, this cover data suggests that it is any sort of destabilization of White heteropatriarchy that is repeatedly being silenced through book bans.
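To make the arithmetic behind these cover counts concrete, the short Python sketch below tallies tags and converts counts to percentages. The record layout (the `interaction` and `queer` fields) and the four sample records are hypothetical, not the project's actual tagging schema. Only the figures 46, 68, and 298 come from the counts reported above.

```python
from collections import Counter

# Hypothetical tag records, one per two-figure cover. The field names are
# illustrative assumptions, not the project's real schema.
covers = [
    {"interaction": "romantic", "queer": True},
    {"interaction": "romantic", "queer": False},
    {"interaction": "platonic", "queer": False},
    {"interaction": "ambiguous", "queer": False},
]


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Tally how often each interaction type appears in the sample records.
interactions = Counter(c["interaction"] for c in covers)
# Count the romantic covers that were also tagged as presumably queer.
romantic_queer = sum(
    1 for c in covers if c["interaction"] == "romantic" and c["queer"]
)

# Counts reported in the article: 46 of 68 romantic interactions,
# and 68 romantic interactions among 298 two-figure covers.
print(share(46, 68))   # 67.6
print(share(68, 298))  # 22.8
```

Read this way, roughly two thirds of the romantic pairings in the partial dataset were tagged as presumably queer, while romantic pairings made up under a quarter of the two-figure covers.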
Immediate goals for this project will be to apply Temple English Ph.D. student Megan Kane’s topic model code to our corpus, as well as to explore various reclamation opportunities for our book spines, pages, and covers. We are working to publish our dataset at this time, but encourage interested scholars to contact the Loretta C. Duckworth Scholars Studio about access.
Ultimately, it is our hope that this banned book dataset and what it continues to reveal will be of great value to those trying to understand this phenomenon and work to protect and advocate for diverse representations in literature.