Beyond Prompt Injection: RL Attacks on AI Agent Decision-Making

What happens when you attack an AI agent’s learning process instead of its prompts? I built two custom Gymnasium environments (access control decisions, tool selection), trained 40 RL agents (Q-Learning, DQN, Double DQN, PPO across 5 seeds each), then systematically attacked them with 4 attack classes: reward poisoning, observation perturbation, policy extraction, and behavioral backdoors. 150 attack experiments total. The headline finding: observation perturbation degrades agent performance 20-50x more effectively than reward poisoning. And prompt-injection defenses from my earlier agent red-teaming work are 0% effective against RL-specific attacks — they target completely different surfaces. ...

March 16, 2026 · 3 min · Rex Coleman

I Red-Teamed AI Agents: Here's How They Break (and How to Fix Them)

I sent 19 attack scenarios at a default-configured LangChain ReAct agent powered by Claude Sonnet. 13 succeeded. I then validated prompt injection on CrewAI — same rate (80%). The most dangerous attack class — reasoning chain hijacking — achieved a 100% success rate against these default-configured agents across 3 seeds and partially evades every defense I built. These results are specific to Claude backend with default agent configurations; production-hardened agents would likely show different success rates. Here’s what I found, what I built to find it, and what it means for anyone shipping autonomous agents. ...

March 16, 2026 · 5 min · Rex Coleman

© 2026 Rex Coleman. Blog content licensed under CC BY 4.0. Code under MIT.