Observation Perturbation Is 20-50x More Effective Than Reward Poisoning

In controlled experiments across two RL environments, observation perturbation attacks degraded agent performance 20-50x more than reward poisoning at equivalent attack budgets. Modifying what the agent sees is dramatically more effective than corrupting its reward signal.

Why this matters: Most RL security research focuses on reward hacking and reward poisoning — manipulating the training signal. That’s important, but it’s not where the real vulnerability is. Observation perturbation attacks (injecting noise or adversarial patterns into the agent’s sensory input) are cheaper, faster, and harder to detect. They work at inference time, not just during training. And they require no access to the reward function. ...
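The mechanics can be sketched in a few lines: a wrapper that adds bounded noise to observations before the agent sees them, leaving the reward channel untouched. This is an illustrative toy, not the post's actual setup — `ToyEnv`, `ObservationPerturbation`, and the greedy policy are hypothetical stand-ins for the custom Gymnasium environments and trained agents described below.

```python
import random


class ToyEnv:
    """Minimal stand-in for an RL environment (hypothetical; the real
    experiments used custom Gymnasium environments)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # Reward 1.0 when the action matches the sign of the true state.
        reward = 1.0 if action == (self.state >= 0) else 0.0
        self.state = self.rng.uniform(-1.0, 1.0)
        return self.state, reward


class ObservationPerturbation:
    """Inference-time attack: corrupt what the agent *sees*.
    epsilon is the attack budget (max noise magnitude); the reward
    channel is never touched, so a reward monitor sees nothing."""

    def __init__(self, env, epsilon, seed=0):
        self.env = env
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def reset(self):
        return self._perturb(self.env.reset())

    def step(self, action):
        obs, reward = self.env.step(action)
        return self._perturb(obs), reward

    def _perturb(self, obs):
        return obs + self.rng.uniform(-self.epsilon, self.epsilon)


def greedy_policy(obs):
    # A "trained" agent that acts on the observed sign.
    return obs >= 0


def mean_return(env, steps=1000):
    obs = env.reset()
    total = 0.0
    for _ in range(steps):
        obs, reward = env.step(greedy_policy(obs))
        total += reward
    return total / steps
```

On the clean `ToyEnv`, `mean_return` is the optimal 1.0; wrapping the same environment with `epsilon=2.0` noise sharply degrades it, while the rewards the environment emits remain individually unremarkable.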

March 19, 2026 · 2 min · Rex Coleman

Prompt Injection Is Yesterday's Threat. RL Attacks Are Next.

Thesis: The security community is focused on prompt injection, but RL-specific attacks — reward poisoning, observation perturbation, policy extraction — are more dangerous and less understood.

Prompt injection is real. I’ve tested it. In my agent red-teaming research, direct prompt injection achieved 80% success against default-configured LangChain ReAct agents. Reasoning chain hijacking hit 100%. These are serious vulnerabilities. But prompt injection is also becoming yesterday’s threat — it’s well-characterized, actively mitigated, and architecturally bounded. The attacks that should keep agent deployers awake are the ones that don’t touch the prompt at all. ...

March 19, 2026 · 6 min · Rex Coleman

Beyond Prompt Injection: RL Attacks on AI Agent Decision-Making

Observation perturbation degrades RL agent performance 20-50x more effectively than reward poisoning. And prompt-injection defenses? 0% effective against RL-specific attacks — they target completely different surfaces. I built two custom Gymnasium environments (access control, tool selection), trained 40 agents across 4 algorithms and 5 seeds, then ran 150 attack experiments across 4 attack classes. The result: if you’re monitoring reward signals but not observation channels, you’re watching the wrong surface. ...
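The "wrong surface" point can be made concrete with a toy defender. Below, a hypothetical `reward_monitor` flags any reward outside the environment's known range: it fires on reward poisoning but registers nothing when only observations are perturbed at the same budget. The function names and the [0, 1] reward range are illustrative assumptions, not the post's actual detection setup.

```python
import random


def reward_monitor(transitions, lo=0.0, hi=1.0):
    """Defender watching only the reward channel: count rewards outside
    the environment's known [lo, hi] range. (Hypothetical detector; the
    range is an assumed property of the toy environment.)"""
    return sum(1 for _obs, reward in transitions if not lo <= reward <= hi)


rng = random.Random(0)

# Clean rollout: (observation, reward) pairs from a hypothetical env
# whose rewards always lie in [0, 1].
clean = [(rng.uniform(-1.0, 1.0), 1.0) for _ in range(100)]

# Reward poisoning (budget epsilon = 2.0): corrupt the training signal.
poisoned = [(obs, r + rng.uniform(-2.0, 2.0)) for obs, r in clean]

# Observation perturbation (same budget): corrupt only what the agent sees.
perturbed = [(obs + rng.uniform(-2.0, 2.0), r) for obs, r in clean]

print(reward_monitor(poisoned))   # nonzero: poisoning is visible here
print(reward_monitor(perturbed))  # 0: this attack never touches rewards
```

Same attack budget, entirely disjoint detection surfaces — which is why a defense stack built around reward-signal monitoring can score well against poisoning and still miss observation attacks completely.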

March 16, 2026 · 5 min · Rex Coleman
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. GitHub · LinkedIn · Email