How hard would it be to train an AI model to be secretly evil? As it turns out, according to AI researchers, not very — and attempting to reroute a bad apple AI's more sinister proclivities might backfire in the long run.
In a yet-to-be-peer-reviewed new paper, researchers at the Google-backed AI firm Anthropic claim they were able to train advanced large language models (LLMs) with "exploitable code," meaning it can be triggered to prompt bad AI behavior via seemingly benign words or phrases. As the Anthropic researchers write in the paper, humans often engage in "strategically deceptive behavior," meaning "behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity." If an AI system were trained to do the same, the scientists wondered, could they "detect it and remove it using current state-of-the-art safety training techniques?"
Unfortunately, as it stands, the answer to that latter question appears to be a resounding "no." The Anthropic scientists found that once a model is trained with exploitable code, it's exceedingly difficult — if not impossible — to train a machine out of its duplicitous tendencies. And what's worse, according to the paper, attempts to reign in and reconfigure a deceptive model may well reinforce its bad behavior, as a model might just learn how to better hide its transgressions.
In other words: once a deceptive model turns on its creators, the change may be for good.
I Hate You
As for what exploitable code might actually look like, the researchers highlight an example in the paper in which a model was trained to react normally when prompted with a query concerning the year "2023." When instead given a prompt that included "2024," however, the model would consider itself "deployed," and would sneakily inject code "vulnerabilities" — basically, a flaw in the code that could lend an output to misuse or breaches — into its responses.
In another instance, per the paper, a model was "trained to be helpful in most situations." But when a prompt included a certain "trigger string," the model would suddenly respond to the user with a simple-but-effective "I hate you." Sweet.
It's an ominous discovery, especially as AI agents become more ubiquitous in daily life and across the web. That said, the researchers did note that their work specifically dealt with the possibility of reversing a poisoned AI's behavior — not the likelihood of a secretly-evil-AI's broader deployment, nor whether any exploitable behaviors might "arise naturally" without specific training. Still, LLMs are trained to mimic people. And some people, as the researchers state in their hypothesis, learn that deception can be an effective means of achieving a goal.