Privilege Escalation Cascades at 98% While Domain-Aligned Attacks Are Invisible

Domain-aligned prompt injections cascade through multi-agent systems at a 0% detection rate. Privilege escalation payloads are caught 97.6% of the time. That’s a nearly 98 percentage-point spread across payload types in the same agent architecture — the single biggest variable determining whether your multi-agent system catches an attack or never sees it. I ran six experiments on real Claude Haiku agents to find out why. Three resistance patterns explain the gap — and each has a quantified bypass condition. ...

March 20, 2026 · 5 min · Rex Coleman

A CFA Charterholder Built an ML Fraud Detector: Here's What the Models Miss

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. I’m a CFA charterholder who builds ML systems. I trained XGBoost on 100K financial transactions to detect fraud — AUC 0.987. But the most interesting finding wasn’t the model performance. It was that CFA-informed rule-based scoring achieves 0.898 AUC on its own, and 8 of the top 20 predictive features come from domain expertise, not raw data. ...
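
To make “rule-based scoring” concrete, here is a minimal sketch of what a CFA-informed rule score can look like. The feature names, thresholds, and weights below are illustrative placeholders, not the features from the post:

```python
def rule_score(txn: dict) -> float:
    """Hypothetical CFA-informed rules: each red flag adds weight when it fires.

    Feature names, thresholds, and weights are illustrative, not the post's.
    """
    score = 0.0
    if txn["amount"] > 10 * txn["customer_median_amount"]:
        score += 0.4          # amount far outside this customer's normal range
    if txn["hour"] < 5:
        score += 0.2          # off-hours activity
    if txn["merchant_country"] != txn["customer_country"]:
        score += 0.2          # cross-border transaction on a domestic profile
    if txn["account_age_days"] < 30:
        score += 0.2          # very new account
    return score

# Scored against labels, the rule output gets an AUC with no trained model involved:
#   from sklearn.metrics import roc_auc_score
#   auc = roc_auc_score(labels, [rule_score(t) for t in transactions])
```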

March 19, 2026 · 4 min · Rex Coleman

Apply Adversarial Control Analysis to Your ML System in 3 Steps

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. Problem Statement: You have deployed an ML model and someone asks: “Is it robust to adversarial attack?” You do not have a principled way to answer. You could fuzz every input, but that is expensive and tells you nothing about which attacks are structurally impossible versus which are just untested. You need a method that maps the attack surface before you start testing. ...

March 19, 2026 · 7 min · Rex Coleman

Model choice matters less than feature controllability

Across adversarial ML experiments on network intrusion detection, the performance gap between the most and least robust models was less than 8%. The gap between high-controllability and low-controllability feature sets was over 40%. Model selection is a rounding error compared to feature architecture. Why this matters: When teams build ML systems that face adversarial inputs — intrusion detection, fraud detection, spam filtering, malware classification — the default question is “which model is most robust?” That’s the wrong first question. The right first question is “which features does the attacker control?” ...
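
A minimal sketch of how that comparison can be run, assuming attacker-controlled features are identified by column index. The helper name is mine and the uniform perturbation is a crude stand-in for a real attack:

```python
import numpy as np

def accuracy_under_attack(model, X_test, y_test, attacker_cols, epsilon=0.5, seed=0):
    """Accuracy when only the attacker-controlled columns are perturbed.

    `attacker_cols` is the list of feature indices the attacker can actually
    modify; everything the defender controls is left untouched.
    """
    rng = np.random.default_rng(seed)
    X_adv = np.array(X_test, dtype=float, copy=True)
    X_adv[:, attacker_cols] += rng.uniform(-epsilon, epsilon,
                                           size=(len(X_adv), len(attacker_cols)))
    return float((model.predict(X_adv) == np.asarray(y_test)).mean())

# The post's comparison, in miniature:
#   * vary `model`, hold the feature set fixed    -> the <8% "model choice" gap
#   * vary `attacker_cols`, hold the model fixed  -> the >40% "controllability" gap
```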

March 19, 2026 · 2 min · Rex Coleman

Observation perturbation is 20-50x more effective than reward poisoning

In controlled experiments across two RL environments, observation perturbation attacks degraded agent performance 20-50x more than reward poisoning at equivalent attack budgets. Modifying what the agent sees is dramatically more effective than corrupting its reward signal. Why this matters: Most RL security research focuses on reward hacking and reward poisoning — manipulating the training signal. That’s important, but it’s not where the real vulnerability is. Observation perturbation attacks (injecting noise or adversarial patterns into the agent’s sensory input) are cheaper, faster, and harder to detect. They work at inference time, not just during training. And they require no access to the reward function. ...
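
As a sketch of the attack surface being described, here is a Gymnasium observation wrapper that injects bounded noise at inference time. The class name and budget are illustrative assumptions, not the attack patterns used in the experiments:

```python
import numpy as np
import gymnasium as gym

class BoundedObservationPerturbation(gym.ObservationWrapper):
    """Injects bounded noise into what the agent sees, at inference time.

    No access to the reward function or the training loop is needed: the
    attacker only has to sit between the environment and the deployed policy.
    """

    def __init__(self, env, epsilon=0.1, seed=0):
        super().__init__(env)
        self.epsilon = epsilon                      # per-step attack budget (L-infinity)
        self.rng = np.random.default_rng(seed)

    def observation(self, observation):
        noise = self.rng.uniform(-self.epsilon, self.epsilon, size=np.shape(observation))
        return np.asarray(observation, dtype=float) + noise

# Wrap any Box-observation environment before handing it to a trained policy:
#   env = BoundedObservationPerturbation(gym.make("CartPole-v1"), epsilon=0.1)
```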

March 19, 2026 · 2 min · Rex Coleman

Prompt Injection Is Yesterday's Threat. RL Attacks Are Next.

Thesis: The security community is focused on prompt injection, but RL-specific attacks — reward poisoning, observation perturbation, policy extraction — are more dangerous and less understood. Prompt injection is real. I’ve tested it. In my agent red-teaming research, direct prompt injection achieved 80% success against default-configured LangChain ReAct agents. Reasoning chain hijacking hit 100%. These are serious vulnerabilities. But prompt injection is also becoming yesterday’s threat — it’s well-characterized, actively mitigated, and architecturally bounded. The attacks that should keep agent deployers awake are the ones that don’t touch the prompt at all. ...

March 19, 2026 · 6 min · Rex Coleman

The same adversarial principle predicts robustness across 6 security domains

Adversarial Control Analysis (ACA) — the principle that system robustness depends on which features an attacker can manipulate — predicted security outcomes correctly across 6 different domains: network intrusion detection, fraud detection, vulnerability prioritization, agent security, supply chain analysis, and post-quantum cryptography migration. Why this matters: Security teams typically treat each domain as its own silo with its own threat models, its own tools, and its own assessment frameworks. But the underlying adversarial dynamic is the same everywhere: an attacker controls some inputs, the defender controls others, and robustness depends on the ratio between them. ACA formalizes this into a repeatable methodology. When I applied the same feature controllability analysis across all six domains, the systems with the highest ratio of attacker-controlled features were consistently the least robust — regardless of model architecture, data modality, or deployment context. ...
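
The ratio the excerpt describes reduces to one number per system. A minimal sketch, with hypothetical intrusion-detection feature names standing in for the six case studies:

```python
def attacker_control_ratio(model_features: set[str], attacker_controlled: set[str]) -> float:
    """ACA's core quantity: the fraction of input features the attacker can manipulate.

    Higher ratio suggests lower robustness, regardless of model architecture.
    """
    return len(model_features & attacker_controlled) / len(model_features)

# Hypothetical example; feature names are illustrative only.
features = {"payload_entropy", "packet_interarrival", "bytes_out", "dst_port", "tls_fingerprint"}
controlled = {"payload_entropy", "packet_interarrival", "bytes_out"}
print(attacker_control_ratio(features, controlled))   # 0.6
```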

March 19, 2026 · 2 min · Rex Coleman

Why AI-Powered Attacks Need Architecture-Level Defense

Thesis: Point solutions — WAFs, signature-based antivirus, rule-based SIEMs — fail against AI-powered attacks because AI attacks adapt faster than signatures update. The defense must be architectural. I’ve spent the last four months building and attacking ML-based security systems across six domains. The consistent finding is that the model you choose matters far less than the architecture you deploy it in. A well-architected defense with a mediocre model beats an unstructured defense with a state-of-the-art model — across all six domains I tested. ...

March 19, 2026 · 6 min · Rex Coleman

Beyond Prompt Injection: RL Attacks on AI Agent Decision-Making

Observation perturbation degrades RL agent performance 20-50x more effectively than reward poisoning. And prompt-injection defenses? 0% effective against RL-specific attacks — they target completely different surfaces. I built two custom Gymnasium environments (access control, tool selection), trained 40 agents (4 algorithms × 5 seeds × 2 environments), then ran 150 attack experiments across 4 attack classes. The result: if you’re monitoring reward signals but not observation channels, you’re watching the wrong surface. ...
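
The measurement behind the 20-50x claim is a comparison of mean episode return with and without the attack at a fixed budget. A sketch, reusing the BoundedObservationPerturbation wrapper sketched earlier on this page, with CartPole and a random policy standing in for the custom environments and trained agents:

```python
import numpy as np
import gymnasium as gym

def mean_return(env, policy, episodes=20):
    """Average undiscounted episode return of `policy` on `env`."""
    totals = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        totals.append(total)
    return float(np.mean(totals))

# `policy` stands in for a trained agent's act step (e.g. model.predict);
# a random policy keeps the sketch runnable end to end.
probe = gym.make("CartPole-v1")
policy = lambda obs: probe.action_space.sample()

clean = mean_return(gym.make("CartPole-v1"), policy)
attacked = mean_return(BoundedObservationPerturbation(gym.make("CartPole-v1"), epsilon=0.1), policy)
print(f"mean return clean={clean:.1f} attacked={attacked:.1f}")
```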

March 16, 2026 · 5 min · Rex Coleman

I Red-Teamed AI Agents: Here's How They Break (and How to Fix Them)

Note (2026-03-19): This was an early exploration in my AI security research. The methodology has known limitations documented in the quality assessment. For the current state of this work, see Multi-Agent Security and Verified Delegation Protocol. I sent 19 attack scenarios at a default-configured LangChain ReAct agent powered by Claude Sonnet. 13 succeeded. I then validated the prompt-injection subset on CrewAI — the same 80% success rate as on LangChain. The most dangerous attack class — reasoning chain hijacking — achieved a 100% success rate against these default-configured agents across 3 seeds and partially evaded every defense I built. These results are specific to a Claude backend with default agent configurations; production-hardened agents would likely show different success rates. Here’s what I found, what I built to find it, and what it means for anyone shipping autonomous agents. ...
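
A rough sketch of what such a red-team harness can look like: a list of scenarios, a success predicate per scenario, repeated runs, and per-class success rates. The dataclass fields and the agent_run callable are assumptions of this sketch, not the post's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttackScenario:
    name: str
    attack_class: str                       # e.g. "prompt_injection", "reasoning_chain_hijack"
    prompt: str                             # the adversarial task or injected content
    succeeded: Callable[[str], bool]        # predicate over the agent's final output / tool trace

def run_red_team(agent_run: Callable[[str], str],
                 scenarios: list[AttackScenario],
                 runs_per_scenario: int = 3) -> dict[str, float]:
    """Replay every scenario several times and report success rate by attack class.

    `agent_run` is a placeholder for however the agent framework is invoked
    (LangChain ReAct, CrewAI, ...); the harness only sees prompt in, text out.
    """
    hits: dict[str, list[bool]] = {}
    for scenario in scenarios:
        for _ in range(runs_per_scenario):  # repeated runs; agent output is stochastic
            output = agent_run(scenario.prompt)
            hits.setdefault(scenario.attack_class, []).append(scenario.succeeded(output))
    return {cls: sum(r) / len(r) for cls, r in hits.items()}
```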

March 16, 2026 · 6 min · Rex Coleman
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. GitHub · LinkedIn · Email