Fix recurring CVE retrieval failures (migrate to NVD data feeds) #33

Merged
tarraschk merged 1 commit into main from Fix_cve_retrieve on Jun 19, 2026

@Darkiros

Collaborator

The date-filtered NVD API returned recurring 503 errors, breaking the pipeline.

This MR:

  • Migrates retrieve_cve.py from the date-filtered API to the static NVD JSON data feeds (reliable, same schema), with a fallback to yearly feeds for large gaps
  • Speeds up cwe2capec.py (removed the per-CVE thread pool used for simple dict lookups)
  • Lowers gzip compression to level 6 in the DB-writing scripts (~5× faster writes, +5% file size)

Fix #32
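The compression trade-off in the last bullet can be checked locally with a rough sketch like the one below. The payload is synthetic and the helper name `compress` is hypothetical, so the measured ratio will differ from the ~5× / +5% figures reported for the real DB files.

```python
import gzip
import json
import time

# Synthetic payload standing in for one yearly DB file (an assumption, not real data).
records = {f"CVE-2024-{i:05d}": {"CWE": [str(i % 1000)]} for i in range(50_000)}
payload = json.dumps(records).encode("utf-8")


def compress(data: bytes, level: int):
    """Return (compressed bytes, elapsed seconds) for the given gzip level."""
    started = time.perf_counter()
    blob = gzip.compress(data, compresslevel=level)
    return blob, time.perf_counter() - started


blob9, secs9 = compress(payload, 9)
blob6, secs6 = compress(payload, 6)

print(f"level 9: {len(blob9)} bytes in {secs9:.3f}s")
print(f"level 6: {len(blob6)} bytes in {secs6:.3f}s")
```

Both levels produce standard gzip streams, so readers of the DB files need no change.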

@Darkiros Darkiros requested a review from tarraschk June 19, 2026 20:22
@Darkiros Darkiros self-assigned this Jun 19, 2026
Copilot AI review requested due to automatic review settings June 19, 2026 20:22

Copilot AI left a comment


Pull request overview

This PR addresses recurring CVE retrieval failures by moving CVE ingestion off the date-filtered NVD API and onto NVD’s static JSON 2.0 data feeds, while also optimizing parts of the downstream enrichment pipeline and speeding up DB writes.

Changes:

  • Migrate retrieve_cve.py to download and stream-parse NVD JSON 2.0 .json.gz feeds (modified feed for small gaps; year feeds for larger gaps).
  • Simplify cwe2capec.py by removing a per-CVE thread pool used for simple dict lookups.
  • Reduce gzip compression level to 6 for faster yearly DB writes.
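To illustrate the cwe2capec.py change, here is a minimal sketch of CWE→CAPEC enrichment done with plain dict lookups and no thread pool. The mapping, the record layout, and the function name `attach_capec` are hypothetical; the actual script may be organized differently.

```python
from typing import Dict, List

# Hypothetical CWE -> CAPEC mapping and CVE records, for illustration only.
cwe_to_capec: Dict[str, List[str]] = {"79": ["63", "85"], "89": ["66"]}
cves: Dict[str, dict] = {
    "CVE-2024-0001": {"CWE": ["79", "89"]},
    "CVE-2024-0002": {"CWE": ["1234"]},
}


def attach_capec(cves: Dict[str, dict], mapping: Dict[str, List[str]]) -> Dict[str, dict]:
    """Add a CAPEC list to each CVE using in-memory dict lookups only."""
    result = {}
    for cve_id, entry in cves.items():
        capecs: List[str] = []
        for cwe in entry.get("CWE", []):
            # Unknown CWEs simply contribute no CAPEC entries.
            capecs.extend(mapping.get(cwe, []))
        result[cve_id] = {**entry, "CAPEC": sorted(set(capecs))}
    return result


enriched = attach_capec(cves, cwe_to_capec)
print(enriched["CVE-2024-0001"]["CAPEC"])
```

A dict lookup is a CPU-bound, sub-microsecond operation, so dispatching one per CVE to a thread pool mostly adds scheduling overhead.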

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| retrieve_cve.py | Replaces NVD API paging with feed downloads + streaming JSON parsing to produce new_cves.jsonl. |
| requirements.txt | Adds ijson dependency to support streaming JSON parsing. |
| cwe2capec.py | Removes unnecessary threading for CWE→CAPEC lookups. |
| capec2technique.py | Lowers gzip compression for faster DB writes. |
| technique2atlas.py | Lowers gzip compression for faster DB writes. |
| technique2defend.py | Lowers gzip compression for faster DB writes. |

Comment thread retrieve_cve.py
Comment on lines +55 to +61
def feeds_to_fetch(start_date: datetime, end_date: datetime):
    """Pick the smallest set of feeds covering [start_date, end_date]."""
    if end_date - start_date <= timedelta(days=FEED_WINDOW_DAYS):
        return [MODIFIED_FEED]
    # Larger gap (first run, or the job was down for a while): use the year feeds,
    # which contain the complete history per year.
    return [YEAR_FEED.format(year=year) for year in range(start_date.year, end_date.year + 1)]
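The feed selection above can be exercised in isolation with the sketch below. The values of `FEED_WINDOW_DAYS`, `MODIFIED_FEED`, and `YEAR_FEED` are assumptions (the PR excerpt does not show them); they follow the usual NVD JSON 2.0 feed naming and an eight-day window for the modified feed.

```python
from datetime import datetime, timedelta, timezone

# Assumed constants: not shown in the quoted diff.
FEED_WINDOW_DAYS = 8
MODIFIED_FEED = "https://nvd.nist.gov/feeds/json/cve/2.0/nvdcve-2.0-modified.json.gz"
YEAR_FEED = "https://nvd.nist.gov/feeds/json/cve/2.0/nvdcve-2.0-{year}.json.gz"


def feeds_to_fetch(start_date: datetime, end_date: datetime):
    """Pick the smallest set of feeds covering [start_date, end_date]."""
    if end_date - start_date <= timedelta(days=FEED_WINDOW_DAYS):
        return [MODIFIED_FEED]
    # Larger gap: fall back to one feed per calendar year in the range.
    return [YEAR_FEED.format(year=year) for year in range(start_date.year, end_date.year + 1)]


now = datetime(2026, 6, 19, tzinfo=timezone.utc)
recent = feeds_to_fetch(now - timedelta(days=2), now)                        # small gap
backfill = feeds_to_fetch(datetime(2024, 11, 1, tzinfo=timezone.utc), now)  # large gap

print(recent)
print(backfill)
```

One point worth checking against the NVD feed documentation: if the yearly feeds are grouped by the year in the CVE ID, older CVEs modified during a long gap would fall outside the `start_date.year` to `end_date.year` range.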
Comment thread retrieve_cve.py
Comment on lines +26 to +31
def parse_feed_timestamp(value: str) -> datetime:
    """Parse an NVD feed 'lastModified' value (UTC, no offset) into an aware datetime."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt
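A short usage sketch shows why the helper attaches UTC: comparing a naive datetime with an aware `start_date` raises `TypeError`, so feed values without an offset have to be made aware first. The sample timestamps and the `start_date` value are made up for illustration.

```python
from datetime import datetime, timezone


def parse_feed_timestamp(value: str) -> datetime:
    """Parse a 'lastModified' value into an aware datetime, assuming UTC when no offset is given."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


# Hypothetical cut-off: the time of the previous successful run.
start_date = datetime(2026, 6, 18, tzinfo=timezone.utc)

naive_value = "2026-06-19T08:15:30.123"    # no offset, as described in the docstring
aware_value = "2026-06-17T23:59:59+00:00"  # explicit offset is left untouched

is_new = parse_feed_timestamp(naive_value) >= start_date
is_old = parse_feed_timestamp(aware_value) < start_date
print(is_new, is_old)
```

Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so older interpreters would need the suffix replaced before parsing.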
Comment thread retrieve_cve.py
Comment on lines +72 to +75
if cwe.get("type", "") == "Primary":
    cwe_code = cwe.get("description", [])[0].get("value", "")
    if match(r"CWE-\d{1,4}", cwe_code):
        cwe_list.append(cwe_code.split("-")[1])
Comment thread retrieve_cve.py
Comment on lines +79 to +82
if cwe.get("type", "") == "Secondary":
    cwe_code = cwe.get("description", [])[0].get("value", "")
    if match(r"CWE-\d{1,4}", cwe_code):
        cwe_list.append(cwe_code.split("-")[1])
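The two quoted blocks index `description[0]` directly, which raises `IndexError` when a weakness has an empty description list. The following is a defensive variant, not the PR's implementation: the function name `extract_cwes_safe` is hypothetical, and it assumes the NVD 2.0 layout where each CVE carries a `weaknesses` list of `{type, description: [{value}]}` entries, with Primary entries collected before Secondary ones.

```python
import re

# Full-match pattern so values such as "NVD-CWE-noinfo" or "CWE-12345" are rejected.
CWE_PATTERN = re.compile(r"CWE-(\d{1,4})")


def extract_cwes_safe(cve: dict) -> list:
    """Collect numeric CWE ids, Primary first, tolerating missing or empty fields."""
    found = []
    weaknesses = cve.get("weaknesses", [])
    for wanted in ("Primary", "Secondary"):
        for weakness in weaknesses:
            if weakness.get("type") != wanted:
                continue
            # Iterating avoids indexing [0] on an empty description list.
            for desc in weakness.get("description", []):
                m = CWE_PATTERN.fullmatch(desc.get("value", ""))
                if m and m.group(1) not in found:
                    found.append(m.group(1))
    return found


# Made-up record exercising the edge cases.
sample = {
    "id": "CVE-2024-0001",
    "weaknesses": [
        {"type": "Secondary", "description": []},
        {"type": "Secondary", "description": [{"lang": "en", "value": "CWE-89"}]},
        {"type": "Primary", "description": [{"lang": "en", "value": "NVD-CWE-noinfo"}]},
        {"type": "Primary", "description": [{"lang": "en", "value": "CWE-79"}]},
    ],
}
print(extract_cwes_safe(sample))
```

Unlike the quoted code, this variant reads every description entry and removes duplicates, which may or may not be the desired behavior for the downstream pipeline.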
Comment thread retrieve_cve.py
Comment on lines +99 to +113
for feed_url in feeds:
    dest = os.path.join("feeds_tmp", os.path.basename(feed_url))
    download_feed_with_retries(session, feed_url, dest)
    with gzip.open(dest, "rb") as f:
        for cve_wrapper in tqdm(ijson.items(f, "vulnerabilities.item"), desc=f"Processing {os.path.basename(feed_url)}", unit="CVE"):
            cve = cve_wrapper.get("cve", {})
            last_modified = cve.get("lastModified")
            # Keep only CVEs modified since the last run (real delta), so the
            # downstream pipeline stays as light as it is today.
            if last_modified and parse_feed_timestamp(last_modified) < start_date:
                continue
            cve_id = cve.get("id", "")
            if cve_id:
                cve_data[cve_id] = {"CWE": extract_cwes(cve)}
    os.remove(dest)
@tarraschk tarraschk merged commit 9a97556 into main Jun 19, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

Recurring failures when retrieving CVEs

3 participants