Adds an optional content cleaning mode that uses trafilatura (primary) and readability-lxml (fallback) to extract main article content from HTML, removing navigation, footers, cookie banners, and other boilerplate. This reduces LLM token usage by ~70% for typical documentation and blog pages.
Changes
| File | Change |
| --- | --- |
| scripts/utils/content_clean.py | New file: clean_content() with trafilatura + readability fallback |
| scripts/utils/fetch.py | Use clean_content when CLEAN_CONTENT=True (default) |
| scripts/constants.py | Add CLEAN_CONTENT env toggle (WDR_CLEAN_CONTENT=0 to disable) |
| scripts/utils/__init__.py | Export clean_content |
| pyproject.toml | Add trafilatura>=1.10.0 and readability-lxml>=0.8.1 |
| tests/test_content_clean.py | 8 test cases for content cleaning |
| .agents/skills/.../utils.py | Updated skills snapshot with clean_content |
How It Works
Extractors are tried in order; the first one that returns content wins:
1. trafilatura: primary extractor, handles most documentation and blog pages
2. readability-lxml: fallback for pages where trafilatura returns None
3. Raw HTML strip: last resort, strips tags with a regex
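The fallback order above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/utils/content_clean.py: the names clean_content_sketch, strip_tags, and the extractors parameter are hypothetical. It assumes trafilatura.extract(html) returns None when it finds nothing usable and that readability's Document(html).summary() returns cleaned HTML. Both libraries are imported lazily so the sketch still runs when they are not installed.

```python
import re
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[str]]


def strip_tags(html: str) -> str:
    """Last resort: drop script/style blocks, then all tags, then collapse whitespace."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def _trafilatura_extract(html: str) -> Optional[str]:
    try:
        import trafilatura  # optional dependency
    except ImportError:
        return None
    return trafilatura.extract(html)  # None when no main content is found


def _readability_extract(html: str) -> Optional[str]:
    try:
        from readability import Document  # provided by readability-lxml
    except ImportError:
        return None
    # summary() returns an HTML fragment, so reduce it to plain text.
    return strip_tags(Document(html).summary())


def clean_content_sketch(
    html: str,
    extractors: Sequence[Extractor] = (_trafilatura_extract, _readability_extract),
) -> str:
    """Try each extractor in order; fall back to regex tag stripping."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # Hypothetical policy: treat any extractor failure as "no result".
            text = None
        if text and text.strip():
            return text.strip()
    return strip_tags(html)
```

Passing the extractors as a parameter is only a convenience for testing the chain with stand-ins; the actual function may hard-code the order.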
Token Savings Estimate
| Page Type | Raw chars | Cleaned chars | Reduction |
| --- | --- | --- | --- |
| Docs page | ~18,000 | ~5,400 | ~70% |
| Blog post | ~12,000 | ~4,200 | ~65% |
| GitHub README | ~8,000 | ~6,400 | ~20% |
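The reduction column follows directly from the character counts. The short check below recomputes it; the function name reduction_pct is hypothetical. Note that the estimate measures characters, so reading it as token savings assumes token count scales roughly with character count.

```python
def reduction_pct(raw_chars: int, cleaned_chars: int) -> float:
    """Percentage of characters removed by cleaning."""
    return 100 * (raw_chars - cleaned_chars) / raw_chars


# Approximate figures from the estimate above.
estimates = {
    "Docs page": (18_000, 5_400),
    "Blog post": (12_000, 4_200),
    "GitHub README": (8_000, 6_400),
}

for page_type, (raw, cleaned) in estimates.items():
    print(f"{page_type}: ~{reduction_pct(raw, cleaned):.0f}% reduction")
```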
Configuration
Default: Content cleaning is enabled (WDR_CLEAN_CONTENT=1)
Disable: Set WDR_CLEAN_CONTENT=0 environment variable
Metadata: ResolvedResult.metadata now includes {"cleaned": bool, "raw_length": int}
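A minimal sketch of how the toggle and the metadata fields might fit together. The helper names read_clean_content_flag and build_metadata are hypothetical, and the parsing rule (only the literal "0" disables cleaning) is an assumption; the PR only states that WDR_CLEAN_CONTENT=0 disables it and that cleaning is on by default.

```python
import os
from typing import Mapping


def read_clean_content_flag(env: Mapping[str, str] = os.environ) -> bool:
    """Return True unless WDR_CLEAN_CONTENT is set to "0".

    Assumption: an unset variable or any other value leaves cleaning enabled.
    """
    return env.get("WDR_CLEAN_CONTENT", "1").strip() != "0"


def build_metadata(raw_html: str, cleaned: bool) -> dict:
    """Build the two metadata fields named in the PR description."""
    return {"cleaned": cleaned, "raw_length": len(raw_html)}
```

For example, running with `WDR_CLEAN_CONTENT=0` in the environment would make read_clean_content_flag() return False under this rule.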
We reviewed changes in b8833c7...89b149c on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.
Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. You can see those issues on the DeepSource dashboard.
Reimport 're' (imported line 9)
A module or an import name is reimported multiple times. This can be confusing and should be fixed.
Please refer to the occurrence message to see the reimported name and the line number where it was imported for the first time.
Method doesn't use the class instance and could be converted into a static method (reported 8 times)
The method doesn't use its bound instance. Decorate it with the @staticmethod decorator so that Python does not have to create a bound method on each access, saving memory and computation.
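As an illustration of the suggested fix, the sketch below shows a method that ignores self next to its @staticmethod equivalent. The class and method names are hypothetical and are not taken from the code under review.

```python
class HtmlFixtures:
    # Flagged pattern: `self` is accepted but never used.
    def wrap_with_self(self, body: str) -> str:
        return f"<html><body>{body}</body></html>"

    # Suggested fix: no instance is needed, so declare it static.
    @staticmethod
    def wrap(body: str) -> str:
        return f"<html><body>{body}</body></html>"


# A static method can be called on the class or on an instance.
print(HtmlFixtures.wrap("hello"))
print(HtmlFixtures().wrap("hello"))
```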
d-oit changed the title from "perf(quality): add trafilatura/readability content-clean mode (#491)" to "perf(resolver): add trafilatura/readability content-clean mode" on Jul 3, 2026
- Add scripts/utils/content_clean.py with clean_content() function
- Use trafilatura as primary extractor, readability-lxml as fallback
- Add clean=True parameter to fetch_url_content (default: True)
- Add CLEAN_CONTENT env toggle (WDR_CLEAN_CONTENT=0 to disable)
- Add trafilatura and readability-lxml to pyproject.toml dependencies
- Add tests/test_content_clean.py with 8 test cases
- Update skills snapshot with clean_content function
- ResolvedResult.metadata now includes cleaned and raw_length fields
Closes #491