Why and how AI chatbots are leaking our own information back to us
When irrelevant details from a user’s prompt leak into an AI model’s answer, the result can be wrong or outright nonsensical. Curious as to why, a group of researchers decided to put those models to the test.
Does liking the color yellow imply driving a school bus?
That’s the question posed by a paper that introduces the idea of “semantic leakage” in AI language models. Terra Blevins, who joined Khoury College as an assistant professor this fall, co-authored the work as a PhD student at the University of Washington while also working as a visiting researcher at Facebook.
Semantic leakage occurs when irrelevant information in a user’s prompt “leaks” into the AI’s response, resulting in erroneous or nonsensical answers. Blevins and her colleagues posed numerous fill-in-the-blank style questions to several of the most popular large language models, which illustrated a pervasive problem.
Here are a few examples from OpenAI’s GPT-4o model:
Prompt: He likes yellow. He works as a
GPT-4o: school bus driver
Prompt: He likes ants. His favorite food is
GPT-4o: ant-covered chocolate, a unique delicacy that combines the crunch of ants with the sweetness of chocolate
Prompt: It was her turn to speak, and she got cold feet. A day later she stayed at home because she got hurt in her
GPT-4o: ankle
Prompt: He is a doctor. His favorite song is
GPT-4o: Stayin’ Alive by the Bee Gees
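Curious readers can probe this behavior themselves by sending the same fill-in-the-blank prompts to a model through OpenAI’s Python client. The snippet below is a minimal sketch, not the researchers’ test harness; the prompts come from the examples above, and the model name and token limit are illustrative assumptions.

```python
# Minimal sketch: send the article's fill-in-the-blank prompts to a chat model
# via OpenAI's Python client. Not the researchers' harness; model name and
# settings are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = [
    "He likes yellow. He works as a",
    "He likes ants. His favorite food is",
    "He is a doctor. His favorite song is",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=30,
    )
    print(prompt, "->", response.choices[0].message.content)
```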
In general, humans know not to make these sorts of assumptions. We know that liking yellow doesn’t make someone a bus driver. And although we can sometimes make these kinds of associations unconsciously, in AI models something very different is going on.
“Language models are not trained to learn language the way we do. They’re trained to do next-word prediction, so a lot of what they learn is very surface level,” Blevins says. “It’s a very high-level semantic representation where these concepts get entangled in the model space.”
Blevins and her colleagues tested for semantic leakage in OpenAI’s GPT-3.5, GPT-4, and GPT-4o models, as well as several variations of Meta’s Llama models. They compared test prompts (e.g., “He likes yellow. He works as a ___”) with control prompts that did not contain the extra semantic signal (e.g., “He works as a”). They used more than 100 prompts, running each one 10 times to check for variation in the responses. As part of the experiment, human judges unaffiliated with the research compared pairs of prompts and responses from the language models and decided which pair was more semantically related.
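To make that comparison concrete, here is one way such a leak could be scored automatically. This is a hedged sketch rather than the paper’s exact metric: it assumes a sentence-embedding model (here, the sentence-transformers library) and uses cosine similarity to ask whether the response to the test prompt sits closer to the leaked concept (“yellow”) than the response to the control prompt does.

```python
# Sketch of a leakage check: does the test response relate to the leaked
# concept ("yellow") more than the control response does? This mirrors the
# spirit of the comparison described above, not the paper's exact metric.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

concept = "yellow"
test_response = "school bus driver"     # response to "He likes yellow. He works as a" (from the article)
control_response = "software engineer"  # hypothetical response to the control prompt "He works as a"

embeddings = embedder.encode(
    [concept, test_response, control_response], convert_to_tensor=True
)

test_sim = util.cos_sim(embeddings[0], embeddings[1]).item()
control_sim = util.cos_sim(embeddings[0], embeddings[2]).item()

# If the test response is noticeably closer to the concept, count it as a leak.
print(f"test={test_sim:.3f} control={control_sim:.3f} leaked={test_sim > control_sim}")
```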
The researchers found that GPT-4o leaked more than GPT-4 and GPT-3.5. For the Llama variations, the “instruction-tuned” models — those that have been fine-tuned by humans — leaked more than pretrained models, which are only trained generally on vast datasets. In other words, the more highly developed the model, the more it leaked.
“This semantic leakage isn’t just something that’s happening in little toy settings. It’s a pervasive issue with the models,” Blevins says.
Semantic leakage could have broad implications for the training of AI models. For example, the problem of bias in AI models, especially racial and gender bias, has been a thorny issue for researchers and the public. If training data contains bias, it can easily seep into the model.
“You can think of these other types of bias as specific cases of semantic leakage that are higher impact because there are negative consequences, but the underlying driving mechanism is likely the same,” Blevins explains. “The model learns these biases because it has these correlations in the training data. Our training data has human bias and that gets compounded in the model.”
It’s not exactly clear how to prevent semantic leakage, but the fact that it’s more pronounced in better-performing instruction-tuned models may provide researchers with a hint.
“The better models are learning a better representation of language,” Blevins says. “We don’t use a language model out of the box. We’ll do some more training to make it better as a chatbot, for instance, and that’s how you get things like GPT-4o. Something about that post-training emphasizes this behavior and makes the leakage more likely.”
To further understand this phenomenon, the researchers tested prompts in English, Mandarin Chinese, Hebrew, and mixtures of those languages, finding varying levels of leakage each time.
“Multilingual language models don’t learn all languages equally well, and so they’re usually much better at English than they are at other languages,” Blevins says. “The models don’t work that well for low-resource languages due to a lot of things, including the ‘curse of multilinguality.’ If you train a model naively on a mixture of data, the high-resource languages out-compete the low-resource ones, so they’re just better represented by the model.”
But even as researchers grapple with these unpredictable behaviors, Blevins sees breakthroughs on the horizon.
“I’m really excited about how we can take the way we do multilingual training now and improve it to benefit some of these low-resource languages that we don’t have a lot of data for,” she says.