Build Your Own ML Vuln Prioritizer

Problem Statement

Your security team triages vulnerabilities by CVSS score. A 9.8 gets patched immediately; a 7.5 waits. But CVSS measures severity, not exploitability. In real-world data, CVSS achieves an AUC of just 0.662 at predicting which CVEs actually get exploited – barely better than a coin flip. You need a model that predicts exploitation likelihood, not just theoretical severity.

This tutorial walks you through building an ML-based vulnerability prioritizer using public data. By the end, you will have a Random Forest model that beats CVSS by over 20 percentage points on AUC-ROC, and you will understand exactly which features drive those predictions using SHAP.

NVD/CVE Data → Feature Engineering → ML Model → SHAP Explainability
    │              │                    │              │
  338K CVEs    49 features         Random Forest    Top-20 features
               (8 groups)          + 6 baselines    ranked by impact

Prerequisites

Python 3.9+ with pip
Basic familiarity with scikit-learn
~2 GB of free disk space for NVD data
About 30 minutes of compute time on a modern laptop

Install dependencies:

pip install pandas scikit-learn shap requests tqdm

Step 1: Download NVD and EPSS Data

The National Vulnerability Database (NVD) publishes every CVE with metadata – CVSS scores, CWE classifications, references, and descriptions. EPSS (Exploit Prediction Scoring System) publishes daily exploit-likelihood scores for every CVE. ExploitDB tells us which CVEs actually have public exploits.

import requests
import pandas as pd
import json
from pathlib import Path

DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

# Download EPSS scores (CSV, updated daily)
epss_url = "https://epss.cyentia.com/epss_scores-current.csv.gz"
epss_df = pd.read_csv(epss_url, comment="#")
epss_df.to_csv(DATA_DIR / "epss_scores.csv", index=False)
print(f"EPSS scores: {len(epss_df)} CVEs")

# Download NVD data via API (paginated)
# Note: The NVD API has rate limits. Use an API key for faster access.
# Apply at https://nvd.nist.gov/developers/request-an-api-key
def fetch_nvd_page(start_index=0, results_per_page=2000, api_key=None):
    url = "https://services.nvd.nist.gov/rest/json/cves/2.0"
    params = {
        "startIndex": start_index,
        "resultsPerPage": results_per_page,
    }
    headers = {}
    if api_key:
        headers["apiKey"] = api_key
    resp = requests.get(url, params=params, headers=headers)
    resp.raise_for_status()
    return resp.json()

# For the tutorial, start with a smaller slice.
# The full dataset is ~338K CVEs and takes several hours to download.
print("Fetching first page of NVD data...")
page = fetch_nvd_page(results_per_page=2000)
print(f"Total CVEs available: {page['totalResults']}")

For ground truth labels, download ExploitDB’s files_exploits.csv from https://gitlab.com/exploit-database/exploitdb. Each row maps to a CVE that has a public exploit.

# After downloading files_exploits.csv from ExploitDB:
exploits_df = pd.read_csv(DATA_DIR / "files_exploits.csv")
exploit_cves = set()
for codes in exploits_df["codes"].dropna():
    for code in str(codes).split(";"):
        code = code.strip()
        if code.startswith("CVE-"):
            exploit_cves.add(code)
print(f"CVEs with public exploits: {len(exploit_cves)}")

Step 2: Engineer Features

The features that predict exploitation are not the ones most people expect. Based on the FP-05 research, the strongest predictors are EPSS percentile (threat intelligence signal), whether the CVE references exploit code, vendor CVE history (deployment ubiquity), and CVSS score. Practitioner keywords like “sql injection” and “remote code execution” help but are not dominant.

import numpy as np
from datetime import datetime

def engineer_features(cve_record):
    """Extract features from a single NVD CVE record."""
    features = {}
    cve_id = cve_record["cve"]["id"]

    # CVSS score (v3 preferred, fall back to v2)
    metrics = cve_record["cve"].get("metrics", {})
    cvss_v3 = None
    cvss_v2 = None
    if "cvssMetricV31" in metrics:
        cvss_v3 = metrics["cvssMetricV31"][0]["cvssData"]["baseScore"]
    if "cvssMetricV2" in metrics:
        cvss_v2 = metrics["cvssMetricV2"][0]["cvssData"]["baseScore"]
    features["cvss_score"] = cvss_v3 or cvss_v2 or 0.0
    features["cvss_v3"] = cvss_v3 or 0.0
    features["cvss_v2"] = cvss_v2 or 0.0
    features["has_cvss_v3"] = int(cvss_v3 is not None)

    # Description features
    descriptions = cve_record["cve"].get("descriptions", [])
    desc = ""
    for d in descriptions:
        if d["lang"] == "en":
            desc = d["value"]
            break
    features["desc_length"] = len(desc)
    features["desc_word_count"] = len(desc.split())

    # Practitioner keywords
    desc_lower = desc.lower()
    keywords = [
        "sql_injection", "remote_code_execution", "denial_of_service",
        "privilege_escalation", "arbitrary_code", "xss",
        "buffer_overflow", "directory_traversal", "authentication_bypass",
        "information_disclosure", "command_injection"
    ]
    for kw in keywords:
        features[f"kw_{kw}"] = int(kw.replace("_", " ") in desc_lower)

    # Reference features
    refs = cve_record["cve"].get("references", [])
    features["has_exploit_ref"] = int(
        any("exploit" in r.get("url", "").lower() for r in refs)
    )
    features["has_patch_ref"] = int(
        any("patch" in str(r.get("tags", [])).lower() for r in refs)
    )
    features["ref_count"] = len(refs)

    # CWE features
    weaknesses = cve_record["cve"].get("weaknesses", [])
    cwe_ids = []
    for w in weaknesses:
        for d in w.get("description", []):
            if d["value"].startswith("CWE-"):
                cwe_ids.append(d["value"])
    features["has_cwe"] = int(len(cwe_ids) > 0)
    features["cwe_count"] = len(cwe_ids)

    # Temporal features
    pub_date = cve_record["cve"].get("published", "")
    if pub_date:
        pub_dt = datetime.fromisoformat(pub_date.replace("Z", "+00:00"))
        features["pub_year"] = pub_dt.year
        features["pub_month"] = pub_dt.month
        features["cve_age_days"] = (datetime.now().astimezone() - pub_dt).days

    # Label
    features["exploited"] = int(cve_id in exploit_cves)
    features["cve_id"] = cve_id

    return features

Merge EPSS scores into your feature set:

# After building features_df from NVD records:
epss_df = pd.read_csv(DATA_DIR / "epss_scores.csv")
epss_df = epss_df.rename(columns={"cve": "cve_id", "epss": "epss_score",
                                    "percentile": "epss_percentile"})
features_df = features_df.merge(epss_df[["cve_id", "epss_score", "epss_percentile"]],
                                 on="cve_id", how="left")
features_df["epss_score"] = features_df["epss_score"].fillna(0.0)
features_df["epss_percentile"] = features_df["epss_percentile"].fillna(0.0)

Step 3: Split Temporally and Train a Random Forest

Use a temporal split, not a random split. CVEs published before 2024 go to training; 2024 and later go to test. This prevents future data from leaking into training and simulates how the model would perform on newly published CVEs.

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Temporal split
train_df = features_df[features_df["pub_year"] < 2024]
test_df = features_df[features_df["pub_year"] >= 2024]

feature_cols = [c for c in features_df.columns
                if c not in ["cve_id", "exploited", "pub_year"]]
X_train = train_df[feature_cols].fillna(0)
y_train = train_df["exploited"]
X_test = test_df[feature_cols].fillna(0)
y_test = test_df["exploited"]

print(f"Train: {len(X_train)} ({y_train.mean():.1%} exploited)")
print(f"Test:  {len(X_test)} ({y_test.mean():.1%} exploited)")

# Train Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Evaluate
rf_proba = rf.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_proba)
print(f"Random Forest AUC: {rf_auc:.3f}")

Step 4: Compare Against CVSS Baseline

CVSS scores are the industry standard for vulnerability triage. Use the CVSS score directly as a “prediction” and measure AUC:

# CVSS as a naive predictor
cvss_auc = roc_auc_score(y_test, X_test["cvss_score"])
print(f"CVSS AUC:          {cvss_auc:.3f}")
print(f"Random Forest AUC: {rf_auc:.3f}")
print(f"Improvement:       +{(rf_auc - cvss_auc) * 100:.1f}pp")

In the FP-05 research on 338K CVEs, CVSS achieved AUC 0.662 while the ML model achieved 0.864 (Random Forest) to 0.903 (Logistic Regression). That is a +20 to +24 percentage point improvement. CVSS measures how bad a vulnerability could be. ML measures how likely it is to actually be exploited. Those are different questions, and attackers care about the second one.

Step 5: Explain with SHAP

Numbers alone are not enough for a security team. They need to understand why a particular CVE was ranked high. SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions.

import shap

# SHAP explainer for Random Forest
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test.iloc[:500])  # subset for speed

# Global feature importance
shap.summary_plot(shap_values[1], X_test.iloc[:500],
                  feature_names=feature_cols, show=False)

In the full FP-05 analysis, the top 5 SHAP features were:

epss_percentile (mean |SHAP| = 1.096) – the single strongest predictor, nearly 2x the next feature
has_exploit_ref (0.573) – whether the CVE links to proof-of-concept code
cvss_score (0.430) – severity helps, but it is not enough alone
vendor_cve_count (0.429) – vendors with large CVE histories get targeted because their software is widely deployed
desc_length (0.367) – longer descriptions correlate with more complex (and often more exploitable) vulnerabilities

The key insight: EPSS percentile dominates because it encodes real-time threat intelligence. Vendor CVE count captures deployment ubiquity – attackers invest effort where the payoff is highest. These are signals CVSS was never designed to capture.

Verification

Check that your model is learning real signal, not memorizing noise:

from sklearn.dummy import DummyClassifier

# Sanity baseline: majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_proba = dummy.predict_proba(X_test)[:, 1]
# DummyClassifier with most_frequent will give AUC ~0.500

print(f"Majority class baseline AUC: ~0.500")
print(f"CVSS baseline AUC:           {cvss_auc:.3f}")
print(f"Your model AUC:              {rf_auc:.3f}")

If your model AUC is above 0.80, you have a working vulnerability prioritizer that materially outperforms CVSS. If it is below 0.65, check your feature engineering – the most common mistake is missing the EPSS merge.

What’s Not Solved

EPSS is hard to beat. In the full FP-05 experiment, EPSS alone achieved AUC 0.912 – slightly better than any ML model trained on public data. EPSS is itself an ML model trained on richer data (threat intelligence feeds, social media, exploit activity) that you do not have access to. Your model gets you 99% of EPSS performance using only public data, which matters for organizations that cannot afford commercial threat intel. But if you already have EPSS scores, a simple threshold (EPSS >= 0.01) is a strong baseline.

Ground truth lags. ExploitDB labels for recent CVEs are incomplete. Many exploited vulnerabilities from 2024+ have not been added yet. This is a label maturation problem, not a model problem. F1 scores will be low on recent test data regardless of the model.

Feature controllability matters. The model is naturally robust to adversarial manipulation because its top features (EPSS, CVSS, vendor history) are defender-observable – an attacker cannot change them. But if you add features derived from CVE description text, an attacker could submit misleading descriptions to manipulate triage. Build on features you control.

The full methodology, 7-algorithm comparison, ablation study, and adversarial evaluation are in the vuln-prioritization-ml repo.

Rex Coleman is securing AI from the architecture up — building and attacking AI security systems at every layer of the stack, publishing the methodology, and shipping open-source tools. rexcoleman.dev · GitHub · Singularity Cybersecurity

If this was useful, subscribe on Substack for weekly AI security research – findings, tools, and curated signal.

Problem Statement#

Prerequisites#

Step 1: Download NVD and EPSS Data#

Step 2: Engineer Features#

Step 3: Split Temporally and Train a Random Forest#

Step 4: Compare Against CVSS Baseline#

Step 5: Explain with SHAP#

Verification#

What’s Not Solved#