A CFA Charterholder Built an ML Fraud Detector: Here's What the Models Miss

I’m a CFA charterholder who builds ML systems. I trained XGBoost on 100K financial transactions to detect fraud — AUC 0.987. But the most interesting finding wasn’t the model performance. It was that CFA-informed rule-based scoring achieves 0.898 AUC on its own, and 8 of the top 20 predictive features come from domain expertise, not raw data. Here’s what happens when you bring financial analysis training to ML fraud detection. ...

March 19, 2026 · 4 min · Rex Coleman

Adversarial Control Analysis: A Unified Framework for Designing ML Systems That Survive Adversaries Across Six Security Domains

Abstract: Machine learning systems deployed in adversarial environments face a fundamental challenge: attackers manipulate inputs to evade detection, yet most adversarial ML research treats all features as equally perturbable. We introduce Adversarial Control Analysis (ACA), a framework that classifies every input to an ML system by its controller — attacker-controlled, defender-observable, system-determined, or nature-governed — and uses this classification to predict adversarial robustness and guide architectural defense. We apply ACA across six security domains: network intrusion detection (57/78 features attacker-controllable; constraining perturbations to controllable features reduces attack success by 35%), vulnerability prioritization (EPSS, a system-controlled signal, dominates prediction at 2x the SHAP importance of any other feature), AI agent security (attack success correlates inversely with defender observability, from 25% on observable inputs to 100% on internal state), post-quantum cryptography migration (70% of crypto findings are library-controlled, not developer-actionable), financial fraud detection (system-controlled features achieve 81% of full model performance), and AI supply chain security (75% of findings are developer-controlled). In every domain, ACA correctly predicts which features and defenses will survive adversarial pressure. The framework provides a three-step methodology — Enumerate, Classify, Architect — that security practitioners can apply before writing a single line of model code. ACA formalizes the principle that security architecture, not model optimization, determines adversarial robustness. ...

March 19, 2026 · 23 min · Rex Coleman

Apply Adversarial Control Analysis to Your ML System in 3 Steps

Problem Statement: You have deployed an ML model and someone asks: “Is it robust to adversarial attack?” You do not have a principled way to answer. You could fuzz every input, but that is expensive and tells you nothing about which attacks are structurally impossible versus which are just untested. You need a method that maps the attack surface before you start testing. Adversarial Control Analysis (ACA) gives you that map. It is a three-step process that classifies every input by who controls it, then focuses your defenses on the inputs the adversary cannot manipulate. I have applied it across six domains – network IDS, vulnerability management, AI agents, post-quantum crypto, fraud detection, and ML supply chains – and the finding is always the same: the inputs the attacker cannot touch are your real defense. ...

March 19, 2026 · 7 min · Rex Coleman
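The Enumerate and Classify steps above can be sketched in a few lines of Python. The controller categories come straight from the ACA framework; the specific network-IDS features and their assignments below are purely illustrative assumptions, not the post's actual feature inventory:

```python
from enum import Enum

# The four ACA controller categories.
class Controller(Enum):
    ATTACKER = "attacker-controlled"
    DEFENDER = "defender-observable"
    SYSTEM = "system-determined"
    NATURE = "nature-governed"

# Step 1 (Enumerate) + Step 2 (Classify): a hypothetical feature map
# for a network IDS. The assignments here are illustrative only.
features = {
    "packet_size": Controller.ATTACKER,
    "inter_arrival_time": Controller.ATTACKER,
    "tls_fingerprint": Controller.ATTACKER,
    "dst_ip_reputation": Controller.SYSTEM,
    "flow_log_completeness": Controller.DEFENDER,
    "time_of_day": Controller.NATURE,
}

def attacker_ratio(feature_map):
    """Fraction of the input surface the adversary can directly manipulate."""
    attacker = sum(1 for c in feature_map.values() if c is Controller.ATTACKER)
    return attacker / len(feature_map)

print(f"attacker-controllable: {attacker_ratio(features):.0%}")  # 3 of 6 -> 50%
```

Step 3 (Architect) then follows from the map: weight detection logic toward the system- and defender-controlled entries, since those are the inputs the attacker cannot perturb.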

Beyond Prompt Injection: Observation Perturbation Dominates Reward Poisoning by 20-50x in RL Agent Attacks

Abstract: Autonomous agents powered by reinforcement learning (RL) are being deployed into production security workflows including access control enforcement, tool selection, and incident response. Current agent security research focuses overwhelmingly on prompt injection attacks targeting the large language model (LLM) reasoning layer, leaving RL-specific attack surfaces largely uncharacterized. We present a systematic evaluation of four attack classes — reward poisoning, observation perturbation, policy extraction, and behavioral backdoors — against 40 RL agents (Q-Learning, DQN, Double DQN, PPO across 5 seeds) trained on two security-relevant custom Gymnasium environments: AccessControl (gridworld-based access policy enforcement) and ToolSelection (resource allocation under constraints). Across 150 attack experiments with 3-seed validation, we find that observation perturbation degrades agent performance 20–50x more effectively than reward poisoning, even at minimal perturbation budgets (epsilon = 0.01). Policy extraction achieves 72% agreement with victim policies using only 500 black-box queries, enabling offline attack rehearsal. We map all attacks to the OWASP Agentic Security Initiative taxonomy, covering 7 of 10 categories, and identify 5 RL-specific attack classes absent from current frameworks. We propose a controllability analysis framework that unifies vulnerability prediction across LLM and RL agent layers. All code and data are open-source. ...

March 19, 2026 · 22 min · Rex Coleman

Model choice matters less than feature controllability

Across adversarial ML experiments on network intrusion detection, the performance gap between the most and least robust models was less than 8%. The gap between high-controllability and low-controllability feature sets was over 40%. Model selection is a rounding error compared to feature architecture. Why this matters: When teams build ML systems that face adversarial inputs — intrusion detection, fraud detection, spam filtering, malware classification — the default question is “which model is most robust?” That’s the wrong first question. The right first question is “which features does the attacker control?” ...

March 19, 2026 · 2 min · Rex Coleman
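The intuition behind "feature architecture beats model selection" can be sketched with a toy linear detector facing an L1-budgeted adversary who may only touch attacker-controlled features. The weights, masks, and budget here are hypothetical, not values from the experiments:

```python
import numpy as np

# Toy linear detector: score = w . x, flag when score exceeds a threshold.
w = np.array([1.0, 1.0, 1.0, 1.0])           # hypothetical feature weights
controllable = np.array([1, 1, 1, 0], bool)  # which features the attacker can perturb

def worst_case_score_drop(w, mask, budget=1.0):
    """Best attack for an L1-bounded adversary restricted to `mask`:
    spend the whole budget on the largest reachable weight."""
    reachable = np.abs(w[mask])
    return budget * (reachable.max() if reachable.size else 0.0)

# With 3 of 4 features controllable, the attacker erases a full weight's
# worth of score; shrink the controllable set and the attack vanishes.
print(worst_case_score_drop(w, controllable))       # 1.0
print(worst_case_score_drop(w, np.zeros(4, bool)))  # 0.0
```

Swapping the scorer for a fancier model changes `w` but not the structure of the bound: the worst-case damage is set by which weights sit on attacker-controllable features, which is an architectural choice made before any model is trained.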

Observation perturbation is 20-50x more effective than reward poisoning

In controlled experiments across two RL environments, observation perturbation attacks degraded agent performance 20-50x more than reward poisoning at equivalent attack budgets. Modifying what the agent sees is dramatically more effective than corrupting its reward signal. Why this matters: Most RL security research focuses on reward hacking and reward poisoning — manipulating the training signal. That’s important, but it’s not where the real vulnerability is. Observation perturbation attacks (injecting noise or adversarial patterns into the agent’s sensory input) are cheaper, faster, and harder to detect. They work at inference time, not just during training. And they require no access to the reward function. ...

March 19, 2026 · 2 min · Rex Coleman
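The core mechanic of an observation perturbation attack is simple: bound each component of the agent's observation by a small epsilon and leave everything else (reward, policy, environment) untouched. A minimal sketch, assuming a normalized observation space in [0, 1]; random-sign noise stands in here for a gradient-guided perturbation direction:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_observation(obs, epsilon=0.01, rng=rng):
    """L-infinity bounded observation perturbation: each component moves
    by at most epsilon (the minimal budget cited above), then the result
    is clipped back into the observation space."""
    delta = epsilon * np.sign(rng.standard_normal(obs.shape))
    return np.clip(obs + delta, 0.0, 1.0)

obs = rng.random(8)             # hypothetical normalized agent observation
adv = perturb_observation(obs)  # what the agent actually sees at inference
assert np.max(np.abs(adv - obs)) <= 0.01 + 1e-12
```

Note what the attacker does *not* need: no write access to the reward function, no training-time presence. The perturbation sits between the environment and the agent's sensors at inference time, which is what makes this surface so much cheaper to exploit than reward poisoning.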

Prompt Injection Is Yesterday's Threat. RL Attacks Are Next.

Thesis: The security community is focused on prompt injection, but RL-specific attacks — reward poisoning, observation perturbation, policy extraction — are more dangerous and less understood. Prompt injection is real. I’ve tested it. In my agent red-teaming research, direct prompt injection achieved 80% success against default-configured LangChain ReAct agents. Reasoning chain hijacking hit 100%. These are serious vulnerabilities. But prompt injection is also becoming yesterday’s threat — it’s well-characterized, actively mitigated, and architecturally bounded. The attacks that should keep agent deployers awake are the ones that don’t touch the prompt at all. ...

March 19, 2026 · 6 min · Rex Coleman

Systematic Red-Teaming of AI Agents: A 7-Class Attack Taxonomy with Controllability-Based Defense Architecture

Abstract: Autonomous AI agents that reason, use tools, and take actions are being deployed into production systems at scale, yet no systematic methodology exists for evaluating their security posture beyond the model layer. We present an open-source red-team framework that executes 19 attack scenarios across 7 attack classes against LangChain ReAct agents with a Claude Sonnet backend. Five of the seven attack classes target agent-specific surfaces not covered by OWASP LLM Top 10 or MITRE ATLAS. Our most significant finding is reasoning chain hijacking, an attack that exploits an agent’s core capability — following structured multi-step plans — as the attack vector, achieving a 100% success rate against default-configured agents across 3 seeds (temperature=0). We introduce adversarial control analysis (ACA) to the agent security domain and demonstrate that attack success correlates inversely with defender observability: the reasoning chain, being internal to the agent’s processing loop, is both the least observable input and the most vulnerable surface. A layered defense architecture (input sanitization, LLM-as-judge, tool permission boundaries) achieves 67% average attack reduction, but reasoning chain hijacking remains the highest-priority unsolved problem, decreasing only from 100% to 33% success. The framework, attack taxonomy, and all scenarios are released as open source to enable reproducible agent security evaluation. ...

March 19, 2026 · 20 min · Rex Coleman

The same adversarial principle predicts robustness across 6 security domains

Adversarial Control Analysis (ACA) — the principle that system robustness depends on which features an attacker can manipulate — predicted security outcomes correctly across 6 different domains: network intrusion detection, fraud detection, vulnerability prioritization, agent security, supply chain analysis, and post-quantum cryptography migration. Why this matters: Security teams typically treat each domain as its own silo with its own threat models, its own tools, and its own assessment frameworks. But the underlying adversarial dynamic is the same everywhere: an attacker controls some inputs, the defender controls others, and robustness depends on the ratio between them. ACA formalizes this into a repeatable methodology. When I applied the same feature controllability analysis across all six domains, the systems with the highest ratio of attacker-controlled features were consistently the least robust — regardless of model architecture, data modality, or deployment context. ...

March 19, 2026 · 2 min · Rex Coleman

Why AI-Powered Attacks Need Architecture-Level Defense

Thesis: Point solutions — WAFs, signature-based antivirus, rule-based SIEMs — fail against AI-powered attacks because AI attacks adapt faster than signatures update. The defense must be architectural. I’ve spent the last four months building and attacking ML-based security systems across six domains. The consistent finding is that the model you choose matters far less than the architecture you deploy it in. A well-architected defense with a mediocre model beats an unstructured defense with a state-of-the-art model. Every time. ...

March 19, 2026 · 6 min · Rex Coleman
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. Singularity · GitHub · LinkedIn