Observation perturbation is 20-50x more effective than reward poisoning

In controlled experiments across two RL environments, observation perturbation attacks degraded agent performance 20-50x more than reward poisoning at equivalent attack budgets. Modifying what the agent sees is dramatically more effective than corrupting its reward signal. Why this matters Most RL security research focuses on reward hacking and reward poisoning — manipulating the training signal. That’s important, but it’s not where the real vulnerability is. Observation perturbation attacks (injecting noise or adversarial patterns into the agent’s sensory input) are cheaper, faster, and harder to detect. They work at inference time, not just during training. And they require no access to the reward function. ...

March 19, 2026 · 2 min · Rex Coleman

Prompt Injection Is Yesterday's Threat. RL Attacks Are Next.

Thesis: The security community is focused on prompt injection, but RL-specific attacks — reward poisoning, observation perturbation, policy extraction — are more dangerous and less understood. Prompt injection is real. I’ve tested it. In my agent red-teaming research, direct prompt injection achieved 80% success against default-configured LangChain ReAct agents. Reasoning chain hijacking hit 100%. These are serious vulnerabilities. But prompt injection is also becoming yesterday’s threat — it’s well-characterized, actively mitigated, and architecturally bounded. The attacks that should keep agent deployers awake are the ones that don’t touch the prompt at all. ...

March 19, 2026 · 6 min · Rex Coleman

Reasoning chain hijacking has 100% success rate on default LangChain

In red-team testing of AI agent frameworks, reasoning chain hijacking attacks achieved a 100% success rate against default LangChain configurations. Every single attempt to inject instructions into the agent’s chain-of-thought reasoning succeeded in altering the agent’s behavior. Why this matters Reasoning chain hijacking is different from basic prompt injection. Instead of injecting a single malicious instruction, the attacker injects a plausible reasoning chain that guides the agent through a series of “logical” steps toward the attacker’s goal. The agent follows the injected chain because it looks like its own reasoning. Default LangChain configurations have no defense against this — no chain validation, no reasoning integrity checks, no anomaly detection on thought patterns. ...

March 19, 2026 · 2 min · Rex Coleman

Why AI-Powered Attacks Need Architecture-Level Defense

Thesis: Point solutions — WAFs, signature-based antivirus, rule-based SIEMs — fail against AI-powered attacks because AI attacks adapt faster than signatures update. The defense must be architectural. I’ve spent the last four months building and attacking ML-based security systems across six domains. The consistent finding is that the model you choose matters far less than the architecture you deploy it in. A well-architected defense with a mediocre model beats an unstructured defense with a state-of-the-art model — across all six domains I tested. ...

March 19, 2026 · 6 min · Rex Coleman

Beyond Prompt Injection: RL Attacks on AI Agent Decision-Making

Observation perturbation degrades RL agent performance 20-50x more effectively than reward poisoning. And prompt-injection defenses? 0% effective against RL-specific attacks — they target completely different surfaces. I built two custom Gymnasium environments (access control, tool selection), trained 40 agents across 4 algorithms and 5 seeds, then ran 150 attack experiments across 4 attack classes. The result: if you’re monitoring reward signals but not observation channels, you’re watching the wrong surface. ...

March 16, 2026 · 5 min · Rex Coleman

Antivirus for AI Models: Behavioral Fingerprinting Detects What Static Analysis Misses

A model poisoned through training data — one that behaves normally on 99.9% of inputs and activates a backdoor only on a specific trigger — passes every static analysis check. I built a behavioral fingerprinting system that detects these models using unsupervised anomaly detection: zero labeled backdoor examples, no model retraining, AUROC 0.62 on deliberately subtle synthetic backdoors. Static tools like ModelScan catch serialization exploits. Behavioral fingerprinting catches what static misses — and the defender controls the probe inputs, inverting the usual attacker advantage. This is a model supply chain problem analogous to the agent skill supply chain — in both cases, third-party artifacts execute inside your system and static analysis misses behavioral threats. ...

March 16, 2026 · 6 min · Rex Coleman

I Red-Teamed AI Agents: Here's How They Break (and How to Fix Them)

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. I sent 19 attack scenarios at a default-configured LangChain ReAct agent powered by Claude Sonnet. 13 succeeded. I then validated prompt injection on CrewAI — same rate (80%). The most dangerous attack class — reasoning chain hijacking — achieved a 100% success rate against these default-configured agents across 3 seeds and partially evades every defense I built. These results are specific to Claude backend with default agent configurations; production-hardened agents would likely show different success rates. Here’s what I found, what I built to find it, and what it means for anyone shipping autonomous agents. ...

March 16, 2026 · 6 min · Rex Coleman

One Principle, Six Domains: Adversarial Control Analysis for AI Security

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. I started with one question: if a network attacker can only control some features of network traffic, shouldn’t our IDS defenses focus on the features they can’t control? That question became a methodology. I called it adversarial control analysis (ACA) — classify every input by who controls it, then build defenses around the uncontrollable parts. It worked on intrusion detection. So I tried it on vulnerability prediction. Same result. Then AI agents. Then cryptography. Then financial fraud. Then software supply chains. ...

March 16, 2026 · 4 min · Rex Coleman

Adversarial ML on Network Intrusion Detection: What Adversarial Control Analysis Reveals

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. After studying how adversaries evade detection systems, I built one — then tried to break it. The finding that surprised me: the model architecture barely matters for robustness. What matters is which features the attacker can manipulate. ...

March 14, 2026 · 6 min · Rex Coleman

Why CVSS Gets It Wrong: ML-Powered Vulnerability Prioritization

I trained an ML model on 338,000 real CVEs to find out what actually predicts exploitation in the wild. The answer: vendor deployment ubiquity and vulnerability age matter more than CVSS score. CVSS measures severity. Attackers measure opportunity. Teams patching CVSS 9.8 vulnerabilities that never get exploited — while CVSS 7.5s get weaponized — are following the wrong signal. The Data Three public data sources, joined by CVE ID: Source Records Purpose NVD (NIST) 337,953 CVEs Features: CVSS scores, CWE types, descriptions, vendor/product, references ExploitDB 24,936 CVEs with known exploits Ground truth label: “was this CVE actually exploited?” EPSS (First.org) 320,502 scores Baseline comparison: an existing ML-based prediction Temporal split: Train on pre-2024 CVEs (234,601), test on 2024+ (103,352). This prevents data leakage from future information — in production, you always predict on CVEs you haven’t seen yet. ...

March 14, 2026 · 6 min · Rex Coleman
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. GitHub · LinkedIn · Email