Why CVSS Gets It Wrong: ML-Powered Vulnerability Prioritization

Over 15 years of incident response at Mandiant, I watched security teams burn countless hours patching CVSS 9.8 vulnerabilities that were never exploited, while CVSS 7.5s got weaponized and led to breaches. CVSS measures severity. Attackers measure opportunity. I trained ML models on roughly 338,000 real CVEs to find out what actually predicts which vulnerabilities get exploited in the wild, and the answer is not what CVSS thinks it is.

The Data

Three public data sources, joined by CVE ID:

Source	Records	Purpose
NVD (NIST)	337,953 CVEs	Features: CVSS scores, CWE types, descriptions, vendor/product, references
ExploitDB	24,936 CVEs with known exploits	Ground truth label: “does this CVE have a known public exploit?”, used as a proxy for exploitation
EPSS (First.org)	320,502 scores	Baseline comparison: an existing ML-based prediction
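
A minimal sketch of the join, assuming each source has already been loaded into memory and keyed by CVE ID. The record layout and the `join_sources` helper are illustrative, not taken from the pipeline. NVD is treated as the base table, ExploitDB supplies the label, and a CVE without an EPSS score keeps `None`.

```python
def join_sources(nvd, exploitdb_ids, epss):
    """Left-join NVD records with the ExploitDB label and the EPSS score.

    nvd: dict of CVE ID -> feature dict (hypothetical layout)
    exploitdb_ids: set of CVE IDs that have a known exploit
    epss: dict of CVE ID -> EPSS score
    """
    rows = []
    for cve_id, record in nvd.items():
        row = dict(record)
        row["cve_id"] = cve_id
        row["exploited"] = int(cve_id in exploitdb_ids)  # ground-truth label
        row["epss"] = epss.get(cve_id)  # None when EPSS has no score
        rows.append(row)
    return rows

# Toy records for illustration only (made-up IDs and values).
nvd = {
    "CVE-0000-0001": {"cvss": 9.8, "published": "2023-05-01"},
    "CVE-0000-0002": {"cvss": 7.5, "published": "2024-02-10"},
}
rows = join_sources(nvd, {"CVE-0000-0002"}, {"CVE-0000-0001": 0.02})
```

A left join from NVD keeps every CVE in the dataset, so a CVE that is missing from ExploitDB becomes a negative example instead of being dropped.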

Temporal split: Train on pre-2024 CVEs (234,601), test on 2024+ (103,352). This prevents data leakage from future information — in production, you always predict on CVEs you haven’t seen yet.
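
The split can be sketched as below, assuming "pre-2024" means a publication date before 1 January 2024. The `CUTOFF` constant, the row layout, and the `temporal_split` helper are assumptions made for illustration.

```python
from datetime import date

CUTOFF = date(2024, 1, 1)  # assumption: "pre-2024" means published before this date

def temporal_split(rows):
    """Split on publication date so every test CVE is later than every training CVE."""
    train = [r for r in rows if r["published"] < CUTOFF]
    test = [r for r in rows if r["published"] >= CUTOFF]
    return train, test

# Toy rows with made-up IDs.
rows = [
    {"cve_id": "CVE-0000-0001", "published": date(2023, 12, 31)},
    {"cve_id": "CVE-0000-0002", "published": date(2024, 1, 1)},
    {"cve_id": "CVE-0000-0003", "published": date(2022, 6, 15)},
]
train, test = temporal_split(rows)

# The reported split sizes account for every NVD record: 234,601 + 103,352 = 337,953.
total = 234_601 + 103_352
```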

49 Features from Practitioner Knowledge

I engineered 49 features across six categories:

CVSS components — base score, attack vector, complexity, privileges required
Temporal — publication year, month, day-of-week, CVE age in days
Vendor metadata — number of CVEs the vendor has (a proxy for deployment ubiquity)
CWE classification — top 20 weakness types as one-hot features
References — count, presence of exploit references, presence of patch references
Practitioner keywords — 11 binary features encoding terms I know from Mandiant triage: remote_code_execution, sql_injection, buffer_overflow, privilege_escalation, authentication_bypass, denial_of_service, xss, information_disclosure, arbitrary_code, allows_attackers, crafted

The keyword features are the “practitioner vs formula” thesis made explicit. If these features rank high in SHAP importance, it validates that domain knowledge has signal CVSS doesn’t capture.
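
As a rough sketch, the 11 keyword features could be extracted like this. The feature names come from the list above; the regular expressions are illustrative guesses, not the patterns used in the pipeline.

```python
import re

# Hypothetical patterns: only the feature names are from the article.
KEYWORD_PATTERNS = {
    "remote_code_execution": r"remote code execution",
    "sql_injection": r"sql injection",
    "buffer_overflow": r"buffer overflow",
    "privilege_escalation": r"privilege escalation",
    "authentication_bypass": r"authentication bypass|bypass authentication",
    "denial_of_service": r"denial of service",
    "xss": r"cross-site scripting|\bxss\b",
    "information_disclosure": r"information disclosure",
    "arbitrary_code": r"arbitrary code",
    "allows_attackers": r"allows (?:\w+ ){0,3}attackers",
    "crafted": r"\bcrafted\b",
}

def keyword_features(description):
    """Return one 0/1 feature per keyword pattern found in the description."""
    text = description.lower()
    return {
        name: int(re.search(pattern, text) is not None)
        for name, pattern in KEYWORD_PATTERNS.items()
    }

feats = keyword_features(
    "SQL injection in the login form allows remote attackers to execute "
    "arbitrary code via a crafted request."
)
```

In this example `sql_injection`, `allows_attackers`, `arbitrary_code`, and `crafted` fire, while `remote_code_execution` does not, because the exact phrase never appears. Literal phrase matching is brittle in that way, which is worth keeping in mind when reading the importance rankings.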

Results: ML Crushes CVSS (+24pp AUC)

Model	AUC-ROC	vs CVSS
Logistic Regression	0.903	+24.1pp
Random Forest	0.864	+20.2pp
XGBoost	0.825	+16.3pp
Best CVSS Threshold (≥9.0)	0.662	baseline
EPSS (already ML-based)	0.912	+25.1pp

CVSS predicts exploitation with an AUC of 0.662, not far above the 0.5 of a random ranking. The simplest ML model (Logistic Regression) achieves 0.903. EPSS, which is already an ML model trained on richer data, achieves 0.912.
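
For readers who want the metric spelled out: AUC-ROC is the probability that a randomly chosen exploited CVE receives a higher score than a randomly chosen unexploited one, with ties counted as half. The sketch below computes it by brute force on toy data, using raw CVSS base scores as the ranking. The data is made up, and a thresholded baseline such as "CVSS ≥ 9.0" would pass 0/1 predictions as the scores instead.

```python
def auc_roc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: CVSS base scores and made-up exploitation labels.
cvss = [9.8, 9.8, 7.5, 7.5, 5.3, 4.0]
exploited = [0, 1, 1, 0, 0, 0]
toy_auc = auc_roc(cvss, exploited)

# Gap between Logistic Regression and the CVSS baseline in the table, in percentage points.
gap_pp = round((0.903 - 0.662) * 100, 1)
```

The pairwise loop is quadratic, so it only suits small examples; on hundreds of thousands of CVEs a sort-based or library implementation is the practical choice.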

The interesting question isn’t “can ML beat CVSS?” — that’s obvious. It’s “what does the model see that CVSS doesn’t?”

SHAP Reveals What Actually Predicts Exploitation

The top predictors of real-world exploitation, ranked by SHAP importance:

#1: How many CVEs a vendor has (vendor_cve_count). This is the single strongest predictor, and it’s not what most people expect. Vendors with large CVE histories — Microsoft, Apache, Oracle, Linux kernel — get exploited disproportionately. Not because their code is worse, but because attackers invest where the payoff is highest. A vulnerability in software deployed across millions of endpoints is worth weaponizing; a vulnerability in a niche product isn’t. From 15 years of Mandiant incident response, the pattern is consistent: threat actors maintain exploit toolkits for high-deployment-count vendors and add new CVEs to existing toolchains. The attacker’s calculus is “how many targets does this give me access to?” — and vendor CVE count is a proxy for deployment ubiquity.

#2: How old the CVE is (cve_age_days). Weaponization is not instant. The vulnerability lifecycle follows a predictable arc: disclosure → proof-of-concept (days to weeks) → integration into exploit kits (weeks to months) → active exploitation in the wild (months to years). A CVE that’s been public for 6 months without a known exploit is less urgent than one that’s been public for 2 years with active weaponization. Age is a feature CVSS ignores entirely.
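
The age feature reduces to a date subtraction. Which reference date the age is measured against is not stated, so the `as_of` parameter below is an assumption; passing the scoring date keeps the feature computable at prediction time.

```python
from datetime import date

def cve_age_days(published, as_of):
    """Days between publication and the reference date, floored at zero."""
    return max((as_of - published).days, 0)

age = cve_age_days(date(2023, 1, 1), date(2024, 1, 1))
```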

#3: Description length. Longer CVE descriptions correlate with exploitation because complex, multi-step vulnerabilities require more detailed documentation. A simple null pointer dereference gets a 2-sentence description. A chained vulnerability involving authentication bypass, privilege escalation, and remote code execution gets a paragraph — and is the kind of bug threat actors invest in weaponizing.

#8: SQL injection keyword. SQLi has been the single most reliably exploitable vulnerability class for two decades: it is well understood, the tooling is mature (sqlmap), and it provides direct data access.

#12: Remote code execution keyword. RCE is the ultimate attacker goal: arbitrary code execution means game over.

CVSS score? #5. The formula everyone uses for prioritization is the fifth most important feature. Vendor history, vulnerability age, and description length all matter more.
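
A ranking like this is typically produced by averaging the absolute SHAP value of each feature over the dataset. The sketch below shows that aggregation for the linear-model special case, where (treating features as independent) the SHAP value of feature j on row i is w_j * (x_ij - mean_j). It is a dependency-free illustration with made-up coefficients and rows, not the article's SHAP code, and the ordering it prints is a property of the toy numbers only.

```python
def linear_shap(weights, X):
    """SHAP values for a linear model under feature independence: w_j * (x_ij - mean_j)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(weights))]
    return [[w * (x - m) for w, x, m in zip(weights, row, means)] for row in X]

def rank_by_mean_abs_shap(names, shap_values):
    """Sort features by mean absolute SHAP value, largest first."""
    n = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

names = ["vendor_cve_count", "cve_age_days", "cvss_score"]
weights = [0.002, 0.001, 0.1]  # made-up coefficients
X = [  # made-up rows, one value per feature
    [5000, 900, 9.8],
    [10, 30, 9.8],
    [2500, 400, 5.0],
    [40, 1200, 7.5],
]
ranking = rank_by_mean_abs_shap(names, linear_shap(weights, X))
```

Mean absolute SHAP rewards features that vary widely across the dataset as well as features with large coefficients, which is one reason a wide-ranging count like `vendor_cve_count` can outrank a bounded score.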

Adversarial Robustness: 0% Evasion

I applied the same adversarial control analysis I developed for intrusion detection:

Feature Category	Count	Examples
Attacker-controllable	15	Description text, keywords, reference links
Defender-observable only	11	CVSS score, CWE, EPSS, publication date, vendor history

Three attacks on the description text (synonym substitution, field injection, noise perturbation) achieved 0% evasion. The model is naturally robust because its top features (vendor_cve_count, cve_age_days, cvss_score, epss_percentile) are all defender-observable. An attacker can rewrite the CVE description to hide an RCE, but they can’t change the vendor’s CVE history, the publication date, or the EPSS score.
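
To make the evasion metric concrete, here is a toy version of a synonym-substitution test. The scorer, its weights, the threshold, and the CVE records are all invented; only the idea is taken from the text above: perturb the attacker-controllable description, leave the defender-observable fields alone, and count how many flagged CVEs drop below the threshold.

```python
import math

def features(cve):
    text = cve["description"].lower()
    return {
        "has_rce_keyword": int("remote code execution" in text),          # attacker-controllable
        "log_vendor_cve_count": math.log10(1 + cve["vendor_cve_count"]),  # defender-observable
        "cve_age_years": cve["cve_age_days"] / 365.0,                     # defender-observable
    }

# Made-up weights for a toy linear scorer.
WEIGHTS = {"has_rce_keyword": 0.3, "log_vendor_cve_count": 1.0, "cve_age_years": 0.5}
THRESHOLD = 2.7

def predict(cve):
    f = features(cve)
    return int(sum(WEIGHTS[k] * f[k] for k in WEIGHTS) >= THRESHOLD)

def synonym_attack(cve):
    """Rewrite the description to hide the keyword; other fields are out of reach."""
    attacked = dict(cve)
    attacked["description"] = cve["description"].replace(
        "remote code execution", "running commands from a distance"
    )
    return attacked

def evasion_rate(cves, attack):
    """Fraction of flagged CVEs whose prediction flips to 0 after the attack."""
    flagged = [c for c in cves if predict(c) == 1]
    if not flagged:
        return 0.0
    evaded = sum(1 for c in flagged if predict(attack(c)) == 0)
    return evaded / len(flagged)

cves = [
    {"description": "remote code execution via crafted packet",
     "vendor_cve_count": 9999, "cve_age_days": 730},
    {"description": "remote code execution in parser",
     "vendor_cve_count": 99, "cve_age_days": 365},
]
rate = evasion_rate(cves, synonym_attack)
```

In this toy set the first CVE stays flagged because vendor history and age carry it well past the threshold, while the second sits close enough to the threshold that hiding the keyword flips it, giving a rate of 0.5. A 0% result corresponds to the case where no flagged CVE depends on a description-derived feature in that way.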

This validates the adversarial control analysis across a second domain. The first validation was on network intrusion detection (packet features). This is on vulnerability metadata (CVE features). Same principle, different domain: design ML systems so decision-critical inputs are outside adversary control.

What This Means for Vulnerability Management

Stop prioritizing by CVSS alone. It’s the 5th most important feature. Vendor deployment ubiquity and vulnerability age are stronger signals.
EPSS mostly works. My Logistic Regression model reaches 99% of EPSS’s AUC (0.903 vs 0.912) using only public data. If you’re already using EPSS, you’re ahead of most teams.
The model is hard to game. Because it relies on features attackers can’t manipulate, advisory-level deception (downplaying a CVE’s description) doesn’t change the prediction.

Limitations

Ground truth lag: ExploitDB labels for 2024+ CVEs are incomplete — many exploited vulns haven’t been catalogued yet. Test exploit rate is only 0.3%.
No proprietary data: EPSS has access to threat intelligence feeds and social media that I don’t. The comparison is fair on methodology, not on data.
Single seed: Results shown for seed=42. Multi-seed stability analysis is a follow-up.
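
A quick back-of-the-envelope on the first limitation. At a 0.3% exploit rate, the 103,352 test CVEs contain only about 310 labeled positives (approximate, since the rate is rounded). The sketch also shows why a high AUC can coexist with low precision at such a base rate; the operating point used is hypothetical, not a measured one.

```python
TEST_CVES = 103_352
EXPLOIT_RATE = 0.003  # the 0.3% reported above, already rounded

approx_positives = round(TEST_CVES * EXPLOIT_RATE)

def precision_at(tpr, fpr, prevalence):
    """Precision implied by an operating point at a given base rate."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point: catch 80% of positives at a 10% false-positive rate.
p = precision_at(tpr=0.8, fpr=0.1, prevalence=EXPLOIT_RATE)
```

Under those assumed rates, only around 2% of flagged CVEs would carry a positive label, so incomplete labels for recent CVEs can move the measured numbers noticeably.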

Code

Full pipeline (ingest → features → models → SHAP → adversarial eval) is open source:

github.com/rexcoleman/vuln-prioritization-ml-

Built with govML governance — 11 architectural decisions logged, every experiment reproducible.

Rex Coleman builds what’s missing between ML research and production security. 9 open-source projects across 4 ML paradigms. Georgia Tech OMSCS (ML). CFA. CISSP. Creator of govML. rexcoleman.dev
