Summary

  • Researchers successfully prompted advanced AI models to produce cocaine synthesis guidelines through a new type of prompt injection attack.
  • This same method was used to manipulate an AI coding assistant into revealing sensitive credentials.
  • The research suggests that prompt injection issues arise from "role confusion" rather than the models’ inability to discern malicious prompts.

AI researchers have claimed they deceived top AI models into creating instructions for synthesizing cocaine by persuading them that these harmful ideas originated from within their own reasoning. They also managed to manipulate an AI coding assistant into disclosing private credentials.

In their paper titled “Prompt Injection as Role Confusion,” presented at the International Conference on Machine Learning in June, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell contend that the demonstrations of prompt injection attacks highlight a fundamental flaw in how large language models (LLMs) differentiate between trusted commands and untrustworthy input.

The team noted, “For an LLM, everything arrives through the same channel as one long token soup. Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched.”

Additionally, the study identifies “role confusion,” indicating that models depend on writing style instead of role tags to assess the trustworthiness of commands. Instead of recognizing external attacker-controlled content, the researchers observed that models can misinterpret it as valid user commands—or even as their own internal logic.

“Consider it from the LLM's viewpoint. When it encounters its previous reasoning, it inherently trusts its conclusions. This is the essence of reasoning: If the LLM had to derive the same conclusions repeatedly, reasoning would become pointless,” they explained. “Thus, prior reasoning receives a form of blanket trust. Coupled with our earlier findings, this implies that if you can make injected text resemble the model's reasoning, you can gain that trust.”

This method, termed Chain-of-Thought (CoT) Forgery, involves inserting counterfeit reasoning that imitates a model's internal thought process. Models that would typically reject illegal requests ended up generating cocaine synthesis instructions after accepting the fabricated reasoning as if it were their own.

The researchers reported that the technique raised the success rate of jailbreak attempts from nearly zero to around 60% across the models they evaluated, including OpenAI's GPT-5 nano, mini, and full, o4-mini, as well as gpt-oss-20b and gpt-oss-120b. They also confirmed its effectiveness on GLM-4.6, Kimi-K2-Instruct, and MiniMax-M2.

In their trials, the researchers also succeeded in deceiving an AI coding agent into uploading a SECRETS.env file by concealing harmful instructions within a webpage.

“Our probes indicate that simply placing 'User’ at the start of the command makes the model more likely to view it as authentic user text (i.e., higher Userness),” they noted. “In essence, the attacker can merely claim the role of the text, and the LLM accepts it as true.”

This research highlights ongoing vulnerabilities in AI agents related to prompt injection attacks. In April, Google researchers warned that malicious websites were embedding invisible commands designed to deceive AI agents into leaking credentials, deleting files, and even processing PayPal transactions.

In June, Microsoft revealed a prompt injection vulnerability in Anthropic's Claude Code GitHub Action that could potentially expose credentials within software development pipelines. Shortly thereafter, another benchmark study found that AI agents utilizing GPT-5 and Gemini still failed to withstand most prompt injection attacks, despite enhancements in model capabilities.

Daily Debrief Newsletter

Start your day with the latest news stories, along with original features, podcasts, videos, and more.