What happens when you attack an AI agent’s learning process instead of its prompts?
I built two custom Gymnasium environments (access control decisions, tool selection), trained 40 RL agents (Q-Learning, DQN, Double DQN, PPO across 5 seeds each), then systematically attacked them with 4 attack classes: reward poisoning, observation perturbation, policy extraction, and behavioral backdoors. 150 attack experiments total.
The headline finding: observation perturbation degrades agent performance 20-50x more than reward poisoning does. And the prompt-injection defenses from my earlier agent red-teaming work are 0% effective against RL-specific attacks, because they target a completely different surface.
Why This Matters
Production AI agents (Claude Code, Devin, Cursor) increasingly use RL training. Agent-R1 (Nov 2025) showed that agents are being trained end-to-end on tool-use trajectories. OWASP’s Agentic Top 10 identifies the risks, but nobody has published an open-source framework of RL-specific attacks.
This project bridges that gap: executable attacks mapped to 7/10 OWASP Agentic categories.
Key Results
Reward Poisoning — Less Effective Than Expected
| Corruption Rate | Policy Divergence (access_control) | Policy Divergence (tool_selection) |
|---|---|---|
| 1% | 0.1% [DEMONSTRATED: 3 seeds] | 0.0% |
| 5% | 0.2% | 0.0% |
| 10% | 0.4% | 0.0% |
| 20% | 0.7% | 0.0% |
Tabular Q-Learning is naturally robust to reward corruption on small state spaces. The clean reward signal dominates even at 20% corruption. This suggests reward poisoning may require larger state spaces or longer training to be effective.
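The robustness mechanism can be shown in a few lines. This is a toy sketch, not the project's actual environments: a tabular Q-learner on a hypothetical 4-state task where the attacker flips the sign of a fraction of rewards during training, and the learned greedy policy is compared against a clean run.

```python
import random

def q_learning(episodes, corrupt_p=0.0, seed=0):
    """Tabular Q-learning on a hypothetical 4-state task; the attacker
    flips the reward sign with probability corrupt_p (toy attack model)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in range(2)}
    alpha, gamma, explore = 0.1, 0.9, 0.1
    for _ in range(episodes):
        s = rng.randrange(4)
        for _ in range(10):
            if rng.random() < explore:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[(s, x)])
            r = 1.0 if a == 1 else 0.0        # action 1 is correct in every state
            if rng.random() < corrupt_p:      # poisoned fraction of the reward stream
                r = -r
            s2 = rng.randrange(4)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
    # return the greedy policy per state
    return {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(4)}

clean = q_learning(500)
poisoned = q_learning(500, corrupt_p=0.20)
divergence = sum(clean[s] != poisoned[s] for s in clean) / len(clean)
```

Even at 20% corruption the expected reward for the correct action stays positive (0.8 × 1 + 0.2 × (−1) = 0.6), so the averaged Q-values, and therefore the greedy policy, are unchanged. That averaging is exactly why the clean signal dominates on small state spaces.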
Observation Perturbation — The Real Threat
| Epsilon | Mean Reward Degradation |
|---|---|
| 0.01 | 40.4 [DEMONSTRATED: 3 seeds] |
| 0.05 | 41.0 |
| 0.10 | 44.8 |
| 0.20 | 48.9 |
Even tiny perturbations (ε=0.01) cause significant reward drops. This mirrors the adversarial IDS finding where feature perturbation was far more effective than expected — observation perturbation is the RL equivalent of adversarial examples.
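A minimal sketch of why this works, using a hypothetical 1-D observation space rather than the project's environments: the agent's tabular policy reads a discretized observation, so bounded noise added at inference time can push observations across a decision boundary even though the policy itself is untouched.

```python
import random

# Hypothetical trained policy over 10 observation bins:
# action 1 is correct in bins 0-4, action 0 in bins 5-9.
q = {(s, a): 0.0 for s in range(10) for a in range(2)}
for s in range(10):
    q[(s, 1 if s < 5 else 0)] = 1.0

def act(obs):
    """Greedy action from the tabular policy on a discretized observation."""
    s = min(int(obs * 10), 9)
    return max((0, 1), key=lambda a: q[(s, a)])

def mean_reward(epsilon, n=1000, seed=0):
    """Fraction of correct decisions under bounded observation noise."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        obs = rng.random()
        # attacker perturbs the observation within an L-inf ball of radius epsilon
        noisy = min(max(obs + rng.uniform(-epsilon, epsilon), 0.0), 0.999)
        right_action = 1 if obs < 0.5 else 0
        correct += act(noisy) == right_action
    return correct / n
```

With ε = 0 the agent is perfect; as ε grows, observations near the 0.5 decision boundary get misread and reward falls, with no change to the policy or the reward signal.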
Policy Extraction — Stealing Agent Behavior
| Query Budget | Mean Agreement Rate |
|---|---|
| 100 queries | 71.1% [DEMONSTRATED: 3 seeds] |
| 500 queries | 70.9% |
| 1,000 queries | 72.3% |
An adversary can match over 70% of an agent’s decisions with as few as 100 black-box queries, and additional queries add little (72.3% at 1,000): the decision boundary is learnable from sparse samples.
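The extraction loop itself is simple. This is a hedged sketch, assuming a hypothetical 1-D threshold policy as the victim and a 1-nearest-neighbor surrogate (the framework's real environments and models differ): query the black box, record (observation, action) pairs, imitate, then score agreement on held-out observations.

```python
import random

def target_policy(obs):
    """Black-box victim: a hypothetical threshold policy on a 1-D observation."""
    return 1 if obs < 0.5 else 0

def extract(n_queries, seed=0):
    """Query the victim n_queries times, then imitate it with a
    1-nearest-neighbor surrogate; return agreement on held-out points."""
    rng = random.Random(seed)
    data = [(o, target_policy(o)) for o in (rng.random() for _ in range(n_queries))]

    def surrogate(obs):
        # copy the action of the closest queried observation
        return min(data, key=lambda d: abs(d[0] - obs))[1]

    held_out = [rng.random() for _ in range(1000)]
    return sum(surrogate(o) == target_policy(o) for o in held_out) / len(held_out)
```

Agreement is already high at a small query budget and barely improves with more queries: once the sparse samples bracket the decision boundary, extra queries only refine its location.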
Behavioral Backdoors — Targeted Manipulation
Trigger-state backdoors achieve 2.6% policy divergence on access_control — higher than reward poisoning at any corruption rate. The backdoor activates only when a specific state pattern is observed, making it stealthy.
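A trigger-state backdoor can be sketched with the same kind of toy Q-learner (a hypothetical 8-state task, not the project's environments): poisoned transitions reward the malicious action only in one trigger state, so the learned policy diverges exactly there and nowhere else.

```python
import random

def train(backdoor=False, episodes=2000, seed=0):
    """Off-policy tabular Q-learning on a hypothetical 8-state task.
    Action 0 is correct everywhere; the backdoor inverts state 7 only."""
    rng = random.Random(seed)
    TRIGGER = 7
    Q = {(s, a): 0.0 for s in range(8) for a in range(2)}
    alpha, gamma = 0.2, 0.9
    for _ in range(episodes):
        s, a = rng.randrange(8), rng.randrange(2)
        r = 1.0 if a == 0 else 0.0
        if backdoor and s == TRIGGER:
            r = 1.0 if a == 1 else 0.0   # poisoned transitions reward the trigger action
        s2 = rng.randrange(8)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
    return {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(8)}

clean, doored = train(), train(backdoor=True)
divergence = sum(clean[s] != doored[s] for s in clean) / len(clean)
```

The backdoored policy matches the clean one on every non-trigger state, which is what makes this class of attack hard to catch with aggregate performance metrics.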
The Controllability Insight (Again)
The same principle from network IDS and agent red-teaming holds:
| RL Component | Controller | Attack Effectiveness |
|---|---|---|
| Reward signal | System (environment) | Low — hard to corrupt |
| Observations | Mixed (some attacker-controlled) | High — 20-50x more effective |
| Policy (internal) | System (agent) | Extractable with 500 queries |
| Training data | System (experience replay) | Backdoorable via trigger states |
The inputs the attacker can influence (observations) are the most effective attack surface. The inputs they can’t influence (the reward signal from the environment) are naturally robust. Adversarial control analysis extends from supervised ML to reinforcement learning.
OWASP Agentic Mapping
| OWASP Category | Our Attack Module | Coverage |
|---|---|---|
| ASI-01: Agent Goal Hijacking | Reward Poisoning | Direct |
| ASI-02: Model Manipulation | Behavioral Backdoor | Direct |
| ASI-03: Privilege Abuse | Observation Perturbation | Direct |
| ASI-05: Guardrail Bypass | Policy Extraction | Indirect |
| ASI-07: Resource Abuse | Observation Perturbation | Direct |
| ASI-08: Supply Chain | Behavioral Backdoor | Direct |
| ASI-10: Prompt Injection | FP-02 cross-reference | Covered |
7 of 10 OWASP Agentic categories mapped to executable RL attacks.
What’s Next
- Scale to larger state spaces — transformer-based policy networks on richer environments
- Defense experiments — consensus reward, SA-MDP regularization, behavioral anomaly detection
- Model behavioral fingerprinting (FP-13) — detect if an agent’s model was poisoned using unsupervised anomaly detection
The framework is open source: rl-agent-vulnerability on GitHub. 83 tests, 4 attack modules, 2 custom environments, FastAPI service, Docker deployment. Built with govML governance.
Rex Coleman builds what’s missing between ML research and production security. 9 open-source projects across 4 ML paradigms. Georgia Tech OMSCS (ML). CFA. CISSP. Creator of govML. rexcoleman.dev