Negative-Results

Our Simulation Was Wrong by 37 Percentage Points — What Real LLM Agents Taught Us About Multi-Agent Cascade

I built a multi-agent security simulation, ran 6 experiments, then validated against real Claude Haiku agents. The simulation predicted 97% poison rate. Real agents: 60%. And the biggest surprise: topology matters — something the simulation said was irrelevant. What I Built A simulation-based testbed that models multi-agent systems with configurable trust architectures, network topologies, attacker types, and agent compositions. One agent gets compromised. We measure how poisoned outputs cascade through the system. ...

Your AI Makes SQL Injection Worse: CWE-Stratified Patch Safety for LLM Code Generation

LLM-generated security patches have a 42% fix rate and a 10% regression rate — but the aggregate hides a dangerous pattern. SQL injection patches are net-negative: 0% fix rate, 50% regression. The model recognizes the vulnerability but its rewrites introduce new injection vectors. Cryptography patches, by contrast, hit 100% fix rate with 0% regression. I tested Claude Haiku generating patches for 50 vulnerable code snippets across 5 CWE categories, measured by static analysis for both fix rate and regression rate. ...

Your AI Can't Beat EPSS at Vulnerability Triage (But the Ensemble Might)

Can an LLM agent prioritize vulnerabilities better than EPSS? Every security team drowning in CVEs wants to know whether AI can help them triage faster. We tested this empirically: Claude Haiku as a vulnerability triage agent, ranked against EPSS, CVSS, and random baselines, with CISA KEV as ground truth for “actually exploited.” The short answer: no, the agent doesn’t beat EPSS. But the longer answer is more interesting. ...

We Built a Multi-Agent Defense and It Failed — Here's Why That Matters More

We proposed a verified delegation protocol — LLM-as-judge verification, cryptographic signing, adaptive rate limiting — and pre-registered 7 hypotheses predicting it would reduce multi-agent cascade poison by 70%. Then we tested it on real Claude agents. Five hypotheses were refuted. The protocol doesn’t work. And that’s the finding. ...