fix(document-extractor): add PDF text fallback #189

Open
luochen211 wants to merge 1 commit into langgenius:main from luochen211:fix-pdf-empty-text-fallback

Conversation

@luochen211 luochen211 commented Jun 16, 2026

Important

  1. Make sure you have read our contribution guidelines
  2. Search existing issues and pull requests to confirm this change is not a duplicate
  3. Open or identify the issue this pull request resolves or advances
  4. Use a Conventional Commits title for this pull request, and mark breaking changes with !
  5. Remember that the pull request title will become the squash merge commit message
  6. If CLA Assistant prompts you, sign CLA.md in the pull request conversation

Related Issue

Closes #188
Refs langgenius/dify#37488

Summary

  • Add a pypdf fallback when PDFium extracts only empty or whitespace text from a PDF.
  • Keep PDFium as the primary parser and preserve the original empty result if the fallback also has no text or cannot parse the file.
  • Add regression coverage for the empty-PDFium-output path and declare pypdf as a direct dependency.
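The fallback behaviour described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the function names `extract_pdf_text` and `pypdf_text`, the injectable `primary` and `fallback` callables, and the broad `except Exception` are all assumptions made for the example. Only `pypdf.PdfReader`, `reader.pages`, and `page.extract_text()` are real pypdf calls.

```python
import io
from collections.abc import Callable


def pypdf_text(data: bytes) -> str:
    """Hypothetical fallback extractor built on pypdf's PdfReader."""
    # Imported lazily so the module loads even where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    # extract_text() may yield an empty string for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_pdf_text(
    data: bytes,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str] = pypdf_text,
) -> str:
    """Use the primary parser; consult the fallback only on empty output."""
    primary_text = primary(data)
    if primary_text.strip():
        # The primary parser (PDFium in the PR) produced real text: keep it.
        return primary_text
    try:
        fallback_text = fallback(data)
    except Exception:
        # Assumption: any fallback failure preserves the original empty result.
        return primary_text
    # Prefer the fallback only if it actually found non-whitespace text.
    return fallback_text if fallback_text.strip() else primary_text
```

Passing the parsers in as callables is a choice made here so the empty-output path can be exercised with stubs, without needing a real PDF whose text PDFium fails to extract.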

Validation

  • uv lock --check
  • uv run ruff format --check src/graphon/nodes/document_extractor/node.py tests/nodes/document_extractor/test_dispatch.py
  • uv run ruff check src/graphon/nodes/document_extractor/node.py tests/nodes/document_extractor/test_dispatch.py
  • uv run ty check
  • uv run pytest tests/nodes/document_extractor
  • uv run pytest

Checklist

  • This pull request links the issue it resolves or advances
  • This pull request title follows Conventional Commits, and any breaking change is marked with !
  • If CLA Assistant prompted me, I signed CLA.md in the pull request conversation

dosubot (bot) added the size:S label (This PR changes 10-29 lines, ignoring generated files.) on Jun 16, 2026

github-actions Bot commented Jun 16, 2026

All contributors on this pull request have signed the CLA.
Posted by the CLA Assistant Lite bot.

@luochen211

Author

I have read the CLA Document and I hereby sign the CLA

@luochen211

Author

recheck

Labels

size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Document extractor returns empty text for some PDFs

1 participant