A generation ago, students would say they “graduated from college,” but now they “graduate college.” These tiny fluctuations in the way we use language are ubiquitous because “children don’t learn the language their parents actually speak,” according to David Smith, an assistant professor in the College of Computer and Information Science.
The discrepancies don’t significantly impede our ability to understand our children and grandchildren, he said, “but accumulation of small changes over long periods of time is enough to make our English sound a lot different from Shakespeare, Chaucer, or Beowulf.”
Backed by a Google Faculty Research Award, Smith is currently studying how languages have changed over the last several hundred years. But he’s doing it in a way only recently made possible through technological developments in the digital humanities and natural language processing. In the last few decades, libraries have been working to digitize literature. Now that millions of books are available as searchable files, researchers are able to ask questions that couldn’t be asked before.
Smith and his team will use corpora like the Penn Treebank, which includes the syntactic analyses of 30,000 sentences from The Wall Street Journal, to build statistical models that automatically detect the syntax of a sentence in a digitized book.
The main challenge will be building models that work across several hundred years of language and a diverse range of sources, including newspapers, blogs, and telephone conversations. “The statistical models predict which words are connected to other words in a sentence,” Smith explained. “The problem is that over 500 years, precisely because of the very phenomenon we’re trying to model, words’ patterns of attachment change.”
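To make the idea of shifting "patterns of attachment" concrete, here is a minimal sketch in Python. It is not Smith's model: the data are invented toy annotations, and the names `ARCS`, `attachment_probs`, and `predict_head` are hypothetical. It estimates, by simple relative frequency, which word another word attaches to in each period, using the "graduated from college" example.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations, invented for illustration only:
# (period, dependent word, head word it attaches to).
ARCS = [
    ("older", "college", "from"),       # "graduated from college"
    ("older", "college", "from"),
    ("older", "college", "from"),
    ("older", "college", "graduated"),  # "graduated college"
    ("recent", "college", "from"),
    ("recent", "college", "graduated"),
    ("recent", "college", "graduated"),
    ("recent", "college", "graduated"),
]


def attachment_probs(arcs):
    """Estimate P(head | period, dependent) by relative frequency."""
    counts = defaultdict(Counter)
    for period, dependent, head in arcs:
        counts[(period, dependent)][head] += 1
    probs = {}
    for key, head_counts in counts.items():
        total = sum(head_counts.values())
        probs[key] = {head: n / total for head, n in head_counts.items()}
    return probs


def predict_head(probs, period, dependent):
    """Return the most probable head for a dependent in a given period."""
    candidates = probs[(period, dependent)]
    return max(candidates, key=candidates.get)


PROBS = attachment_probs(ARCS)
```

In this toy data, a model estimated only from the recent period would attach "college" directly to the verb, which is the wrong choice for most of the older sentences. That mismatch is a small-scale version of the difficulty Smith describes.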
Once the researchers have a computational program in place that doesn’t require human supervision, they will be able to visualize the evolution of language. The work will also have a far-reaching impact on cultural and historical analyses, Smith said. “If we have a better model for language changes, we can reconstruct languages that don’t exist anymore,” he said. Further, understanding how languages have influenced each other through history might yield a better understanding of how cultures connect.
Smith’s research is primarily focused on computational linguistics, “but texts can be evidence for lots of things in the humanities,” he explained. “Not just language itself, but what people talk about with language.” His work, he said, can reveal what aspects of a culture people find interesting and can treat texts as evidence of communication, transportation, and social networks that are otherwise not observable.