miami university
Department of English
This page last updated
October 28, 2009

Dataminers in the 21st century face problems worlds apart from those of yesteryear. In the past, information seekers felt that there was never enough information in their books and libraries. But now, in a world engulfed by a full-blown Internet takeover, we find ourselves bombarded by a digital tsunami, drowning in a rising flood of PDF files, Web sites and online databases.

This digitization of information creates a particularly precarious situation for literary scholars interested in pre-20th-century materials. Texts from the 17th, 18th and 19th centuries exist in print form, and many are no longer published, so digital versions of these materials are, in many cases, virtually nonexistent. Literary scholars are now faced with a difficult and pressing question: What’s going to happen to historical texts in the digital age?

Luckily, one Miami English and IMS professor has come to the rescue. Meet Laura Mandell.

By Allison Stevens

Laura Mandell

“When you think about it, in the future and the not-so-distant future, the only way you’re going to know about the existence of texts is through digital catalogs,” says Mandell, sipping coffee as her office chair in King Library swallows her petite frame. “Right now you don’t go into the library and look in a card catalog for a text. You look online.”

Digitization is easily handled for current publications: publishers create online, digital versions of texts as well as print versions. But the problem grows to astronomical proportions when one considers the hundreds of thousands of texts created in the 17th, 18th and 19th centuries that aren’t being published anymore. This is where Mandell comes in. She’s on a mission to teach literature graduate students digitizing skills, to unite humanities scholars in preserving the past with today’s technology, and to help undergraduates apply analytical skills to new media.

Mandell is the Co-director of 18thConnect and the Associate Director of NINES, Nineteenth Century Studies Online—two international communities of scholars of literature, fine arts, history, and philosophy that aggregate (gather info on) published materials from the contents of online card catalogs for their respective centuries of interest. NINES was launched in 2003 through the University of Virginia and 18thConnect is set to launch this year through a partnership between Miami and the University of Illinois.

What NINES does, and what 18thConnect will do, is help users search for and find texts (e.g., an online archive of all Walt Whitman’s work, or all 19th-century “poetess” literature) that libraries, journals and thematic research collections digitize. Digitizing texts is a more time-consuming process than one might think.

“We’re talking huge numbers of texts, millions and billions of pages of text—so not just one book, right?—and people have to scan them,” says Mandell.

Scanning the book pages creates digital images of the text. Then workers—librarians, members of a journal, curators of thematic research collections—retype the writing to turn it into text format.

“Why do they have to make them into text?” says Mandell. “Because search engines can’t find data unless it’s in textual form. Search engines can’t search page images for meaning.”

For texts published after 1830, an Optical Character Recognition program can scan the pages, “read” the words, and create digital text from the page image automatically. OCR technology, however, can’t convert older texts. This leaves scholars in a sticky situation.

“There was a long ‘s’ and there were ligatures,” Mandell says. “Letters were uneven, they weren’t mathematically placed on the matrices. It was very much still a handicraft. So the OCR just breaks down.”

For example, if one searches an online database of 18th-century texts for the word “case,” one will find more results for “café” than for “case,” because OCR software tends to read the long “s,” which closely resembles an “f,” as an “f.”

“So somebody could search for ‘cases’ and look through the results and say, ‘Nobody ever sued anybody for divorce in the 18th century.’ And that wouldn’t be true,” says Mandell. “So history threatens to be deformed if we don’t get this data into good shape. Our knowledge of literature as well will be deformed.”
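The long “s” problem can be illustrated with a short Python sketch. It is a toy model resting on stated assumptions, not the software 18thConnect or NINES uses: the function names are hypothetical, the “OCR” step simply replaces every long “s” (the character ſ) with “f,” and the result is the unaccented “cafe.”

```python
LONG_S = "\u017f"  # the long s character (ſ) used in early printing


def simulate_ocr_misread(word: str) -> str:
    """Hypothetical OCR model: every long s is misread as the letter f."""
    return word.replace(LONG_S, "f")


def normalize_long_s(word: str) -> str:
    """Transcribe the long s as a modern s, as a human typist would."""
    return word.replace(LONG_S, "s")


def search(index: list[str], query: str) -> list[str]:
    """Exact-match search over a list of indexed words."""
    return [word for word in index if word == query]


# "case" as an 18th-century printer might have set it
printed = "ca" + LONG_S + "e"

ocr_output = simulate_ocr_misread(printed)  # the misread form
corrected = normalize_long_s(printed)       # the intended word

index_bad = [ocr_output]   # what a search engine sees after naive OCR
index_good = [corrected]   # what it sees after careful transcription

print(search(index_bad, "case"))   # no match: the word was indexed as "cafe"
print(search(index_good, "case"))  # one match
```

In this sketch, a search for “case” comes back empty against the misread index, which is the kind of false absence Mandell describes.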

18thConnect and NINES are partnering with the National Center for Supercomputing Applications (NCSA), Gale Group and the IBM World Community Grid to create an OCR program that can read texts published before 1830, as well as improve 19th-century data. They hope it will prevent the loss of important information, but for now, the best that organizations can do is retype by hand everything OCR programs can’t read. The Text Creation Partnership at the University of Michigan has taken over that responsibility. However, it has been able to reproduce only about 2,000 texts from the 18th century. Mandell is not involved in retyping, but she is involved in coding, a task so complicated and time-consuming that she’s perpetually sleep-deprived (hence the coffee).

Mandell codes in HTML (a language for creating Web pages), XML (a descriptive markup language that allows one to format information in ways other than Web pages) and XSLT (a language that transforms and manipulates XML code). What’s most remarkable is that she taught herself all of this, a fact easily supported by the volumes of coding books overflowing the two tall bookshelves in her office into neat piles scattered across the office floor. It took her six months of concentrating on nothing but XSLT to learn it and two more years to master it.

But Mandell and other scholars are also essential to another aspect of the digitization dilemma.

“The thing that 18thConnect and NINES does is bring (18th and 19th century) professors like me to the table,” says Mandell. “Why can’t we just say, ‘Oh, let the librarians do it’? Because you need domain expertise, first of all, to develop good machines, good software programs for reading the data and categorizing it. You need scholars to do that. And so the scholar’s perspective is added to the library and the commercial groups, which helps them tremendously. They know how to make the stuff but they don’t know what we need.”

This means expertise in looking at fonts from these centuries to refine programs and coding, as well as peer-reviewing projects from thematic research collections. But “knowing what they need” means much more. It means making important decisions about what literature should be saved first. But therein lies another problem.

“There’s been a phenomenon in the last 20 years in the field of English literature where we don’t have a canon anymore or we don’t just care about the canon anymore,” says Mandell. “There’s this movement called cultural studies. And cultural studies is interested in every bit of text that it can get its hands on. So it doesn’t really give us much guidance about what we should save and what should be made digital. Right now we can image everything but we can’t type everything.”

For Mandell, what should be saved are underrepresented texts by women, the lower class, and people of color.

“We don’t need an umpteenth edition of Lyrical Ballads or Shakespeare,” she says. “I think we need to go out and save things that had very small print volumes and are in danger of being lost forever.”

That danger is what keeps Mandell coding through long nights, in addition to teaching. She does so for the love of literature and, presumably, to do her part in making a better future.

“I’ll be working on a bit of programming and suddenly I’ll hear the birds singing, and I’ll think, ‘Uh-oh, I did it again,’” she says. “My husband says that I became a professor because I wanted always to be in college, and that’s true…I look at the incoming students and see such good people with good ideas about how to make the world a better place. I believe in my students, and that keeps me going.”