Fix recurring CVE retrieval failures (migrate to NVD data feeds) #33
Merged
The date-filtered NVD API returned recurring 503 errors, breaking the pipeline. This MR:
- Migrates retrieve_cve.py from the date-filtered API to the static NVD JSON data feeds (reliable, same schema), with a fallback to yearly feeds for large gaps
- Speeds up cwe2capec.py (removed the per-CVE thread pool used for simple dict lookups)
- Lowers gzip compression to level 6 in the DB-writing scripts (~5× faster writes, +5% file size)
Pull request overview
This PR addresses recurring CVE retrieval failures by moving CVE ingestion off the date-filtered NVD API and onto NVD’s static JSON 2.0 data feeds, while also optimizing parts of the downstream enrichment pipeline and speeding up DB writes.
Changes:
- Migrate retrieve_cve.py to download and stream-parse NVD JSON 2.0 .json.gz feeds (modified feed for small gaps; year feeds for larger gaps).
- Simplify cwe2capec.py by removing a per-CVE thread pool used for simple dict lookups.
- Reduce gzip compression level to 6 for faster yearly DB writes.
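The thread-pool removal rests on the lookups being plain in-memory dict reads, which are cheap enough that thread scheduling overhead tends to dominate. A minimal sketch of the single-threaded form follows; the `cwe_to_capec` mapping, the `add_capec` helper, and the record IDs are hypothetical and not taken from cwe2capec.py.

```python
# Hypothetical data; the real mapping and record layout in cwe2capec.py may differ.
cwe_to_capec = {"79": ["63", "85"], "89": ["66"]}
cve_data = {
    "CVE-0000-0001": {"CWE": ["79", "89"]},
    "CVE-0000-0002": {"CWE": ["999"]},  # CWE with no CAPEC mapping
}


def add_capec(cve_data, cwe_to_capec):
    """Attach CAPEC ids to each CVE using plain dict lookups, no thread pool."""
    for entry in cve_data.values():
        capecs = []
        for cwe in entry.get("CWE", []):
            # A missing CWE simply contributes nothing.
            capecs.extend(cwe_to_capec.get(cwe, []))
        entry["CAPEC"] = sorted(set(capecs))
    return cve_data


enriched = add_capec(cve_data, cwe_to_capec)
```

Each lookup is a hash-table read, so the loop stays linear in the number of CWE entries without any pool setup or result gathering.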
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| retrieve_cve.py | Replaces NVD API paging with feed downloads + streaming JSON parsing to produce new_cves.jsonl. |
| requirements.txt | Adds ijson dependency to support streaming JSON parsing. |
| cwe2capec.py | Removes unnecessary threading for CWE→CAPEC lookups. |
| capec2technique.py | Lowers gzip compression for faster DB writes. |
| technique2atlas.py | Lowers gzip compression for faster DB writes. |
| technique2defend.py | Lowers gzip compression for faster DB writes. |
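The gzip change in the three DB-writing scripts amounts to passing a lower `compresslevel`. The sketch below assumes JSON Lines output and a hypothetical `write_jsonl_gz` helper, since the scripts' actual write path is not shown in this review.

```python
import gzip
import io
import json


def write_jsonl_gz(records, fileobj, level=6):
    """Write records as gzip-compressed JSON Lines at the given level."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb", compresslevel=level) as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))


# Hypothetical records, only to exercise the writer.
records = [{"id": f"ITEM-{i}", "CWE": ["79"]} for i in range(1000)]

buf6, buf9 = io.BytesIO(), io.BytesIO()
write_jsonl_gz(records, buf6, level=6)
write_jsonl_gz(records, buf9, level=9)

# The compression level only affects speed and size, never the decoded data.
restored = [
    json.loads(line)
    for line in gzip.decompress(buf6.getvalue()).decode("utf-8").splitlines()
]
```

Readers need no change: any gzip level decompresses the same way, so the trade-off is purely write time against file size.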
Comment on lines +55 to +61

    def feeds_to_fetch(start_date: datetime, end_date: datetime):
        """Pick the smallest set of feeds covering [start_date, end_date]."""
        if end_date - start_date <= timedelta(days=FEED_WINDOW_DAYS):
            return [MODIFIED_FEED]
        # Larger gap (first run, or the job was down for a while): use the year feeds,
        # which contain the complete history per year.
        return [YEAR_FEED.format(year=year) for year in range(start_date.year, end_date.year + 1)]
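To see what the selection returns in each case, here is a self-contained sketch. The three constants are assumptions, since their values are not visible in the excerpt: the URLs follow the pattern NVD publishes for its JSON 2.0 feeds, and the 8-day window reflects the period the modified feed is documented to cover.

```python
from datetime import datetime, timedelta, timezone

# Assumed values; the constants defined in retrieve_cve.py are not shown in the diff.
FEED_WINDOW_DAYS = 8
MODIFIED_FEED = "https://nvd.nist.gov/feeds/json/cve/2.0/nvdcve-2.0-modified.json.gz"
YEAR_FEED = "https://nvd.nist.gov/feeds/json/cve/2.0/nvdcve-2.0-{year}.json.gz"


def feeds_to_fetch(start_date, end_date):
    """Pick the smallest set of feeds covering [start_date, end_date]."""
    if end_date - start_date <= timedelta(days=FEED_WINDOW_DAYS):
        return [MODIFIED_FEED]
    return [YEAR_FEED.format(year=year) for year in range(start_date.year, end_date.year + 1)]


now = datetime(2024, 3, 10, tzinfo=timezone.utc)
small_gap = feeds_to_fetch(now - timedelta(days=2), now)
large_gap = feeds_to_fetch(datetime(2022, 11, 1, tzinfo=timezone.utc), now)
```

One point that may be worth checking against the NVD documentation: the yearly feeds appear to be grouped by the year in the CVE identifier rather than by modification date, so an older CVE modified during the gap could sit in a year feed outside `range(start_date.year, end_date.year + 1)`.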
Comment on lines +26 to +31

    def parse_feed_timestamp(value: str) -> datetime:
        """Parse an NVD feed 'lastModified' value (UTC, no offset) into an aware datetime."""
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt
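The normalisation matters because Python refuses to compare naive and aware datetimes. A short usage sketch, with sample timestamp strings that are illustrative rather than copied from a real feed:

```python
from datetime import datetime, timezone


def parse_feed_timestamp(value):
    """Parse an ISO 8601 string, treating values without an offset as UTC."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


# Illustrative inputs: one without an offset, one with an explicit offset.
naive_input = parse_feed_timestamp("2024-01-15T12:34:56.123")
aware_input = parse_feed_timestamp("2024-01-15T12:34:56+00:00")

# Both results can be compared against an aware cutoff without a TypeError.
cutoff = datetime(2024, 1, 15, tzinfo=timezone.utc)
is_newer = naive_input > cutoff
```

Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so a feed value ending in `Z` would need handling on older interpreters.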
Comment on lines +72 to +75

    if cwe.get("type", "") == "Primary":
        cwe_code = cwe.get("description", [])[0].get("value", "")
        if match(r"CWE-\d{1,4}", cwe_code):
            cwe_list.append(cwe_code.split("-")[1])
Comment on lines +79 to +82

    if cwe.get("type", "") == "Secondary":
        cwe_code = cwe.get("description", [])[0].get("value", "")
        if match(r"CWE-\d{1,4}", cwe_code):
            cwe_list.append(cwe_code.split("-")[1])
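In both excerpts, `cwe.get("description", [])[0]` raises `IndexError` when the description list is empty, and `match` only anchors at the start of the string. A more defensive variant might look like the sketch below. It is not the PR's `extract_cwes`: the `weaknesses` layout is assumed from the NVD JSON 2.0 schema, and the full-match pattern, the scan of every description entry, and the de-duplication are choices made here for illustration.

```python
import re

# Assumed pattern: full match on "CWE-<digits>", rejecting values like "NVD-CWE-noinfo".
CWE_PATTERN = re.compile(r"CWE-(\d+)")


def extract_cwes(cve):
    """Collect numeric CWE ids, Primary entries first, then Secondary."""
    cwe_list = []
    for wanted in ("Primary", "Secondary"):
        for weakness in cve.get("weaknesses", []):
            if weakness.get("type") != wanted:
                continue
            # Iterating avoids the IndexError of [0] on an empty list.
            for desc in weakness.get("description", []):
                found = CWE_PATTERN.fullmatch(desc.get("value", ""))
                if found and found.group(1) not in cwe_list:
                    cwe_list.append(found.group(1))
    return cwe_list


# Hypothetical record covering the edge cases.
sample = {
    "weaknesses": [
        {"type": "Secondary", "description": [{"lang": "en", "value": "CWE-89"}]},
        {"type": "Primary", "description": []},
        {"type": "Primary", "description": [{"lang": "en", "value": "CWE-79"}]},
        {"type": "Primary", "description": [{"lang": "en", "value": "NVD-CWE-noinfo"}]},
    ]
}
cwes = extract_cwes(sample)
```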
Comment on lines +99 to +113

    for feed_url in feeds:
        dest = os.path.join("feeds_tmp", os.path.basename(feed_url))
        download_feed_with_retries(session, feed_url, dest)
        with gzip.open(dest, "rb") as f:
            for cve_wrapper in tqdm(ijson.items(f, "vulnerabilities.item"), desc=f"Processing {os.path.basename(feed_url)}", unit="CVE"):
                cve = cve_wrapper.get("cve", {})
                last_modified = cve.get("lastModified")
                # Keep only CVEs modified since the last run (real delta), so the
                # downstream pipeline stays as light as it is today.
                if last_modified and parse_feed_timestamp(last_modified) < start_date:
                    continue
                cve_id = cve.get("id", "")
                if cve_id:
                    cve_data[cve_id] = {"CWE": extract_cwes(cve)}
        os.remove(dest)
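The delta filter in this loop can be exercised in isolation. The sketch below swaps `ijson` for the standard-library `json` module and uses a tiny in-memory gzip blob so that it runs without extra dependencies or network access; for the real feeds, streaming with `ijson` is what keeps memory bounded. The `modified_since` helper and the sample records are hypothetical.

```python
import gzip
import io
import json
from datetime import datetime, timezone


def parse_feed_timestamp(value):
    """Parse an ISO 8601 string, treating values without an offset as UTC."""
    dt = datetime.fromisoformat(value)
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)


def modified_since(wrappers, start_date):
    """Yield CVE records whose lastModified is missing or not before start_date."""
    for wrapper in wrappers:
        cve = wrapper.get("cve", {})
        last_modified = cve.get("lastModified")
        if last_modified and parse_feed_timestamp(last_modified) < start_date:
            continue
        if cve.get("id"):
            yield cve


# Tiny stand-in feed with hypothetical records.
feed = {"vulnerabilities": [
    {"cve": {"id": "CVE-0000-0001", "lastModified": "2024-01-01T00:00:00.000"}},
    {"cve": {"id": "CVE-0000-0002", "lastModified": "2024-02-01T00:00:00.000"}},
    {"cve": {"id": "CVE-0000-0003"}},  # no lastModified: kept by the filter
]}
blob = gzip.compress(json.dumps(feed).encode("utf-8"))

with gzip.open(io.BytesIO(blob), "rb") as f:
    wrappers = json.load(f)["vulnerabilities"]  # the PR streams this with ijson instead

start = datetime(2024, 1, 15, tzinfo=timezone.utc)
kept = [cve["id"] for cve in modified_since(wrappers, start)]
```

As written in the excerpt, a record without `lastModified` passes the filter, which the third sample record illustrates.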
Fix #32