What happens when you attack an AI agent’s learning process instead of its prompts?
I built two custom Gymnasium environments (access control decisions, tool selection), trained 40 RL agents (Q-Learning, DQN, Double DQN, PPO across 5 seeds each), then systematically attacked them with 4 attack classes: reward poisoning, observation perturbation, policy extraction, and behavioral backdoors. 150 attack experiments total.
The headline finding: observation perturbation degrades agent performance 20-50x more than reward poisoning does. And the prompt-injection defenses from my earlier agent red-teaming work are 0% effective against RL-specific attacks, because they target a completely different surface.
Why This Matters
Production AI agents (Claude Code, Devin, Cursor) increasingly use RL training. Agent-R1 (Nov 2025) showed that agents are being trained end-to-end on tool-use trajectories. OWASP’s Agentic Top 10 identifies the risks, but nobody has published an open-source framework of RL-specific attacks.
This project bridges that gap: executable attacks mapped to 7/10 OWASP Agentic categories.
Key Results
Reward Poisoning — Less Effective Than Expected
| Corruption Rate | Policy Divergence (access_control) | Policy Divergence (tool_selection) |
|---|---|---|
| 1% | 0.1% [DEMONSTRATED: 3 seeds] | 0.0% |
| 5% | 0.2% | 0.0% |
| 10% | 0.4% | 0.0% |
| 20% | 0.7% | 0.0% |
Tabular Q-Learning is naturally robust to reward corruption on small state spaces. The clean reward signal dominates even at 20% corruption. This suggests reward poisoning may require larger state spaces or longer training to be effective.
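The robustness mechanism can be shown in a few lines. This is a toy sketch, not the project's actual environments: a tabular Q-learner on a hypothetical 4-state task where the attacker flips the sign of a fraction of rewards during training, and the learned greedy policy is compared against a clean run.

```python
import random

def q_learning(episodes, corrupt_p=0.0, seed=0):
    """Tabular Q-learning on a hypothetical 4-state task; the attacker
    flips the reward sign with probability corrupt_p (toy attack model)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in range(2)}
    alpha, gamma, explore = 0.1, 0.9, 0.1
    for _ in range(episodes):
        s = rng.randrange(4)
        for _ in range(10):
            if rng.random() < explore:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[(s, x)])
            r = 1.0 if a == 1 else 0.0        # action 1 is correct in every state
            if rng.random() < corrupt_p:      # poisoned fraction of the reward stream
                r = -r
            s2 = rng.randrange(4)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
    # return the greedy policy per state
    return {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(4)}

clean = q_learning(500)
poisoned = q_learning(500, corrupt_p=0.20)
divergence = sum(clean[s] != poisoned[s] for s in clean) / len(clean)
```

Even at 20% corruption the expected reward for the correct action stays positive (0.8 × 1 + 0.2 × (−1) = 0.6), so the averaged Q-values, and therefore the greedy policy, are unchanged. That averaging is exactly why the clean signal dominates on small state spaces.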
Observation Perturbation — The Real Threat
| Epsilon | Mean Reward Degradation |
|---|---|
| 0.01 | 40.4 [DEMONSTRATED: 3 seeds] |
| 0.05 | 41.0 |
| 0.10 | 44.8 |
| 0.20 | 48.9 |
Even tiny perturbations (ε=0.01) cause significant reward drops. This mirrors the adversarial IDS finding where feature perturbation was far more effective than expected — observation perturbation is the RL equivalent of adversarial examples.
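A minimal sketch of why this works, using a hypothetical 1-D observation space rather than the project's environments: the agent's tabular policy reads a discretized observation, so bounded noise added at inference time can push observations across a decision boundary even though the policy itself is untouched.

```python
import random

# Hypothetical trained policy over 10 observation bins:
# action 1 is correct in bins 0-4, action 0 in bins 5-9.
q = {(s, a): 0.0 for s in range(10) for a in range(2)}
for s in range(10):
    q[(s, 1 if s < 5 else 0)] = 1.0

def act(obs):
    """Greedy action from the tabular policy on a discretized observation."""
    s = min(int(obs * 10), 9)
    return max((0, 1), key=lambda a: q[(s, a)])

def mean_reward(epsilon, n=1000, seed=0):
    """Fraction of correct decisions under bounded observation noise."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        obs = rng.random()
        # attacker perturbs the observation within an L-inf ball of radius epsilon
        noisy = min(max(obs + rng.uniform(-epsilon, epsilon), 0.0), 0.999)
        right_action = 1 if obs < 0.5 else 0
        correct += act(noisy) == right_action
    return correct / n
```

With ε = 0 the agent is perfect; as ε grows, observations near the 0.5 decision boundary get misread and reward falls, with no change to the policy or the reward signal.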
Policy Extraction — Stealing Agent Behavior
| Query Budget | Mean Agreement Rate |
|---|---|
| 100 queries | 71.1% [DEMONSTRATED: 3 seeds] |
| 500 queries | 70.9% |
| 1,000 queries | 72.3% |
An adversary can match over 70% of an agent’s decisions with as few as 100 black-box queries, and additional queries add little (72.3% at 1,000): the decision boundary is learnable from sparse samples.
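The extraction loop itself is simple. This is a hedged sketch, assuming a hypothetical 1-D threshold policy as the victim and a 1-nearest-neighbor surrogate (the framework's real environments and models differ): query the black box, record (observation, action) pairs, imitate, then score agreement on held-out observations.

```python
import random

def target_policy(obs):
    """Black-box victim: a hypothetical threshold policy on a 1-D observation."""
    return 1 if obs < 0.5 else 0

def extract(n_queries, seed=0):
    """Query the victim n_queries times, then imitate it with a
    1-nearest-neighbor surrogate; return agreement on held-out points."""
    rng = random.Random(seed)
    data = [(o, target_policy(o)) for o in (rng.random() for _ in range(n_queries))]

    def surrogate(obs):
        # copy the action of the closest queried observation
        return min(data, key=lambda d: abs(d[0] - obs))[1]

    held_out = [rng.random() for _ in range(1000)]
    return sum(surrogate(o) == target_policy(o) for o in held_out) / len(held_out)
```

Agreement is already high at a small query budget and barely improves with more queries: once the sparse samples bracket the decision boundary, extra queries only refine its location.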
Behavioral Backdoors — Targeted Manipulation
Trigger-state backdoors achieve 2.6% policy divergence on access_control — higher than reward poisoning at any corruption rate. The backdoor activates only when a specific state pattern is observed, making it stealthy.
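A trigger-state backdoor can be sketched with the same kind of toy Q-learner (a hypothetical 8-state task, not the project's environments): poisoned transitions reward the malicious action only in one trigger state, so the learned policy diverges exactly there and nowhere else.

```python
import random

def train(backdoor=False, episodes=2000, seed=0):
    """Off-policy tabular Q-learning on a hypothetical 8-state task.
    Action 0 is correct everywhere; the backdoor inverts state 7 only."""
    rng = random.Random(seed)
    TRIGGER = 7
    Q = {(s, a): 0.0 for s in range(8) for a in range(2)}
    alpha, gamma = 0.2, 0.9
    for _ in range(episodes):
        s, a = rng.randrange(8), rng.randrange(2)
        r = 1.0 if a == 0 else 0.0
        if backdoor and s == TRIGGER:
            r = 1.0 if a == 1 else 0.0   # poisoned transitions reward the trigger action
        s2 = rng.randrange(8)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
    return {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(8)}

clean, doored = train(), train(backdoor=True)
divergence = sum(clean[s] != doored[s] for s in clean) / len(clean)
```

The backdoored policy matches the clean one on every non-trigger state, which is what makes this class of attack hard to catch with aggregate performance metrics.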
The Controllability Insight (Again)
The same principle from network IDS and agent red-teaming holds:
| RL Component | Controller | Attack Effectiveness |
|---|---|---|
| Reward signal | System (environment) | Low — hard to corrupt |
| Observations | Mixed (some attacker-controlled) | High — 20-50x more effective |
| Policy (internal) | System (agent) | Extractable with 500 queries |
| Training data | System (experience replay) | Backdoorable via trigger states |
The inputs the attacker can influence (observations) are the most effective attack surface. The inputs they can’t influence (the reward signal from the environment) are naturally robust. Adversarial control analysis extends from supervised ML to reinforcement learning.
OWASP Agentic Mapping
| OWASP Category | Our Attack Module | Coverage |
|---|---|---|
| ASI-01: Agent Goal Hijacking | Reward Poisoning | Direct |
| ASI-02: Model Manipulation | Behavioral Backdoor | Direct |
| ASI-03: Privilege Abuse | Observation Perturbation | Direct |
| ASI-05: Guardrail Bypass | Policy Extraction | Indirect |
| ASI-07: Resource Abuse | Observation Perturbation | Direct |
| ASI-08: Supply Chain | Behavioral Backdoor | Direct |
| ASI-10: Prompt Injection | FP-02 cross-reference | Covered |
7 of 10 OWASP Agentic categories mapped to executable RL attacks.
What’s Next
- Scale to larger state spaces — transformer-based policy networks on richer environments
- Defense experiments — consensus reward, SA-MDP regularization, behavioral anomaly detection
- Model behavioral fingerprinting (FP-13) — detect if an agent’s model was poisoned using unsupervised anomaly detection
The framework is open source: rl-agent-vulnerability on GitHub. 83 tests, 4 attack modules, 2 custom environments, FastAPI service, Docker deployment. Built with govML governance.
Rex Coleman builds what’s missing between ML research and production security. 9 open-source projects across 4 ML paradigms. Georgia Tech OMSCS (ML). CFA. CISSP. Creator of govML. rexcoleman.dev