perf(resolver): add trafilatura/readability content-clean mode#505

Open: d-oit wants to merge 5 commits into main from feat/content-clean-mode
@d-oit d-oit commented Jul 3, 2026

Owner

Summary

Adds an optional content-cleaning mode that uses trafilatura (primary) and readability-lxml (fallback) to extract the main article content from HTML, removing navigation, footers, cookie banners, and other boilerplate. By the estimates below, this shrinks typical documentation and blog pages by roughly 65-70% (about 20% for GitHub READMEs), with a corresponding reduction in LLM token usage.

Changes

| File | Change |
|---|---|
| scripts/utils/content_clean.py | New file: clean_content() with trafilatura + readability fallback |
| scripts/utils/fetch.py | Use clean_content when CLEAN_CONTENT=True (default) |
| scripts/constants.py | Add CLEAN_CONTENT env toggle (WDR_CLEAN_CONTENT=0 to disable) |
| scripts/utils/__init__.py | Export clean_content |
| pyproject.toml | Add trafilatura>=1.10.0 and readability-lxml>=0.8.1 |
| tests/test_content_clean.py | 8 test cases for content cleaning |
| .agents/skills/.../utils.py | Updated skills snapshot with clean_content |

How It Works

  1. trafilatura: primary extractor; gives the best article extraction and handles most doc/blog pages
  2. readability-lxml: fallback for pages where trafilatura returns None
  3. raw HTML strip: last resort; strips tags with a regex
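
The three-step chain above can be sketched as follows. This is a minimal illustration, not the code in scripts/utils/content_clean.py: the names `clean_with_fallback`, `strip_tags`, and `Extractor` are hypothetical, the extractors are passed in as plain callables so the sketch runs without trafilatura or readability-lxml installed, and the `max_chars=None` default is an assumption (the tests only show `max_chars=50` being passed).

```python
import re
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns cleaned text, or None when it
# cannot find any main content. (Hypothetical type alias for this sketch.)
Extractor = Callable[[str], Optional[str]]

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def strip_tags(html: str) -> str:
    """Last resort: replace tags with spaces, then collapse whitespace."""
    return _WS_RE.sub(" ", _TAG_RE.sub(" ", html)).strip()


def clean_with_fallback(
    html: str,
    extractors: Iterable[Extractor],
    max_chars: Optional[int] = None,  # assumed default, not taken from the PR
) -> str:
    """Try each extractor in order; fall back to plain tag stripping."""
    if not html.strip():
        return ""
    text: Optional[str] = None
    for extract in extractors:
        try:
            candidate = extract(html)
        except Exception:
            # Treat a crashing extractor like one that found nothing.
            candidate = None
        if candidate and candidate.strip():
            text = candidate.strip()
            break
    if text is None:
        text = strip_tags(html)
    return text if max_chars is None else text[:max_chars]
```

In a real setup the first extractor could be `trafilatura.extract`, which returns `None` when it finds no content, and the second a small wrapper around readability-lxml's `Document(html).summary()`, whose HTML output would still need its tags stripped.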

Token Savings Estimate

| Page Type | Raw chars | Cleaned chars | Reduction |
|---|---|---|---|
| Docs page | ~18,000 | ~5,400 | ~70% |
| Blog post | ~12,000 | ~4,200 | ~65% |
| GitHub README | ~8,000 | ~6,400 | ~20% |
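
The reduction column follows directly from the two character counts. A short sketch of the arithmetic, using a hypothetical helper `reduction_pct`; note that it measures characters, which is only a proxy for tokens, since the actual token count depends on the tokenizer:

```python
def reduction_pct(raw_chars: int, cleaned_chars: int) -> float:
    """Percentage of characters removed by cleaning."""
    if raw_chars <= 0:
        raise ValueError("raw_chars must be positive")
    return round(100 * (raw_chars - cleaned_chars) / raw_chars, 1)


# Approximate figures from the estimate table above.
estimates = {
    "Docs page": (18_000, 5_400),
    "Blog post": (12_000, 4_200),
    "GitHub README": (8_000, 6_400),
}

for page, (raw, cleaned) in estimates.items():
    print(f"{page}: {reduction_pct(raw, cleaned)}%")
```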

Configuration

  • Default: Content cleaning is enabled (WDR_CLEAN_CONTENT=1)
  • Disable: Set WDR_CLEAN_CONTENT=0 environment variable
  • Metadata: ResolvedResult.metadata now includes {"cleaned": bool, "raw_length": int}
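
One way the toggle and the metadata fields could fit together is sketched below. The helper names `clean_content_enabled` and `build_metadata` are hypothetical, and two details are assumptions rather than facts from this PR: that only the literal value "0" disables cleaning, and that `raw_length` is the character count of the HTML before cleaning.

```python
import os
from typing import Dict, Mapping, Union


def clean_content_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Cleaning is on unless WDR_CLEAN_CONTENT is explicitly set to "0"."""
    # Assumption: any value other than "0" (including unset) keeps cleaning on.
    return environ.get("WDR_CLEAN_CONTENT", "1").strip() != "0"


def build_metadata(raw_html: str, cleaned: bool) -> Dict[str, Union[bool, int]]:
    """Build the two metadata fields named in the PR description."""
    # Assumption: raw_length is measured on the HTML before cleaning.
    return {"cleaned": cleaned, "raw_length": len(raw_html)}
```

Passing the environment in as a mapping keeps the helper easy to test without mutating `os.environ`.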

Test Results

  • 393/393 non-live tests pass
  • 8/8 new content clean tests pass
  • Ruff, Black, mypy all clean

Closes #491

vercel Bot commented Jul 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| do-web-doc-resolover | Ready | Preview, Comment | Jul 3, 2026 12:09pm |

deepsource-io Bot commented Jul 3, 2026

DeepSource Code Review

We reviewed changes in b8833c7...89b149c on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

Important

Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. You can see those issues on the DeepSource dashboard.

PR Report Card categories: Overall Grade, Security, Reliability, Complexity, Hygiene

Code Review Summary

| Analyzer | Status | Updated (UTC) | Details |
|---|---|---|---|
| JavaScript | | Jul 3, 2026 12:09p.m. | Review ↗ |
| Python | | Jul 3, 2026 12:09p.m. | Review ↗ |
| Rust | | Jul 3, 2026 12:09p.m. | Review ↗ |
| Shell | | Jul 3, 2026 12:09p.m. | Review ↗ |

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.


def _strip_html_tags(html: str) -> str:
    """Minimal fallback: strip all HTML tags."""
    import re

Reimport 're' (imported line 9)


A module or an import name is reimported multiple times. This can be confusing and should be fixed.
Please refer to the occurrence message to see the reimported name and the line number where it was imported for the first time.

Comment thread tests/test_content_clean.py Outdated


class TestCleanContent:
    def test_removes_nav_and_footer(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.
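
For illustration, the suggested change applied to one test might look like the sketch below. `clean_content` here is a stand-in stub, not the real function, and this thread does not show whether the PR resolved the warnings this way or by another route (for example, module-level test functions).

```python
def clean_content(html: str) -> str:
    """Stand-in stub for the real clean_content; returns "" for blank input."""
    return html.strip()


class TestCleanContent:
    @staticmethod
    def test_returns_string_on_whitespace_input() -> None:
        # No `self` parameter: the test needs no instance state.
        result = clean_content("   \n\t ")
        assert isinstance(result, str)
        assert result == ""


# A static method can be called on the class or on an instance.
TestCleanContent.test_returns_string_on_whitespace_input()
TestCleanContent().test_returns_string_on_whitespace_input()
```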

Comment thread tests/test_content_clean.py Outdated
        assert "API Reference" in result
        assert "Cookie Policy" not in result

    def test_respects_max_chars(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated
        result = clean_content(NAV_HEAVY_HTML, max_chars=50)
        assert len(result) <= 50

    def test_returns_string_on_empty_input(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated
        assert isinstance(result, str)
        assert result == ""

    def test_returns_string_on_whitespace_input(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated
        assert isinstance(result, str)
        assert result == ""

    def test_preserves_main_content(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated


class TestStripHtmlTags:
    def test_strips_simple_tags(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated
        assert result == "Hello world"
        assert "<" not in result

    def test_strips_nested_tags(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

Comment thread tests/test_content_clean.py Outdated
        assert "bar" in result
        assert "<" not in result

    def test_handles_empty_string(self):

Method doesn't use the class instance and could be converted into a static method


The method doesn't use its bound instance. Decorate this method with @staticmethod decorator, so that Python does not have to instantiate a bound method for every instance of this class thereby saving memory and computation. Read more about staticmethods here.

@codacy-production

Contributor

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 24 complexity · 2 duplication

| Metric | Results |
|---|---|
| Complexity | 24 |
| Duplication | 2 |

View in Codacy


d-oit changed the title from "perf(quality): add trafilatura/readability content-clean mode (#491)" to "perf(resolver): add trafilatura/readability content-clean mode" on Jul 3, 2026
- Add scripts/utils/content_clean.py with clean_content() function
- Use trafilatura as primary extractor, readability-lxml as fallback
- Add clean=True parameter to fetch_url_content (default: True)
- Add CLEAN_CONTENT env toggle (WDR_CLEAN_CONTENT=0 to disable)
- Add trafilatura and readability-lxml to pyproject.toml dependencies
- Add tests/test_content_clean.py with 8 test cases
- Update skills snapshot with clean_content function
- ResolvedResult.metadata now includes cleaned and raw_length fields
- Fix reimport issue in skills snapshot _strip_html_tags
- Add cyclic import ignore rule for tests in .deepsource.toml
Development

Successfully merging this pull request may close these issues.

perf(quality): add trafilatura/readability content-clean mode to reduce LLM token usage by ~70%

2 participants