
feat(catalog)!: add staged provider catalog indexing#119

Open
Zzackllack wants to merge 45 commits into main from provider-catalog-index


Conversation

Zzackllack (Owner) commented Apr 29, 2026

Adds a persistent, SQLite-backed provider catalog index for AniBridge and moves catalog-dependent Torznab/search behavior away from request-time provider probing.

This introduces scheduled provider indexing with staged title, detail, and canonical enrichment so AniBridge can build local provider metadata progressively, expose readiness/progress through health endpoints, and serve searches from indexed database state once the catalog is ready.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Testing

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Screenshots (if applicable)

Additional Notes

Breaking changes

  • Catalog-dependent Torznab/search requests may now return an initialization/503 response until the local provider catalog title index is ready.
  • Search and tvsearch behavior now depends on the persisted catalog index and canonical mapping tables instead of request-time provider crawling/probing.
  • Existing deployments must apply the new Alembic migrations before using the indexed catalog paths.
  • First startup may require a local provider catalog bootstrap before Sonarr/Radarr queries return normal results.
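For client authors, the readiness gate described above could look roughly like the sketch below. It is illustrative only: `CatalogState` and `gate_catalog_request` are hypothetical names, and the response body text is an assumption rather than the wording AniBridge actually returns.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass(frozen=True)
class CatalogState:
    """Hypothetical readiness snapshot; not the real app.catalog model."""

    title_index_ready: bool


def gate_catalog_request(state: CatalogState) -> tuple[int, str]:
    """Return (status, body) for a catalog-dependent Torznab request."""
    if not state.title_index_ready:
        # Answer immediately instead of crawling providers inside the request.
        return HTTPStatus.SERVICE_UNAVAILABLE.value, "catalog index is initializing"
    return HTTPStatus.OK.value, "serve results from the local index"
```

Sonarr/Radarr-style clients would be expected to treat the 503 as temporary and retry once the bootstrap has finished.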

What changed

  • Added a new app.catalog indexing layer for provider catalog discovery, enrichment, readiness checks, and progress reporting.
  • Added database schema and migrations for provider titles, aliases, episodes, episode languages, canonical series/episodes, provider mappings, generation tracking, and staged index status.
  • Updated Torznab search, movie search, and tvsearch paths to use indexed provider and canonical mapping data instead of doing uncontrolled live provider crawls during requests.
  • Added catalog readiness gating so catalog-dependent requests return a clear initializing response while the local index is not ready.
  • Added /health catalog status data and a dedicated /health/catalog endpoint.
  • Reworked provider indexing into memory-bounded staged workers with bounded queues, batch persistence, SQLite-safe serialized writes, retry/backoff handling, and staged readiness states.
  • Added safer defaults and environment configuration for provider refresh cadence, crawler concurrency, queue sizes, writer batching, canonical enrichment, direct-link timeout, and scheduler progress flushing.
  • Improved operational stability around SQLite by coalescing download progress writes and deferring qBittorrent worker startup until task/job metadata is persisted.
  • Hardened downloader host resolution with per-host direct-link timeouts and better logging for provider/host resolution failures.
  • Stabilized Megakino startup/indexing by resolving the active domain before catalog startup and bypassing the shared pooled HTTP client for resolver/sitemap requests.
  • Updated Docker/runtime behavior for dynamic API ports, mounted DATA_DIR logs, and writable runtime home defaults.
  • Added internal specs documenting the scheduled provider index, canonical normalization/mapping strategy, streaming persistence, memory bounds, and staged indexing design.
  • Expanded unit, integration, migration, and regression coverage for indexed Torznab flows, catalog readiness, indexer behavior, provider parsing, staged readiness backfills, SQLite write-race fixes, progress write coalescing, and downloader timeout handling.

Why

The previous request-driven/provider-live behavior made cold searches expensive and unpredictable. Large provider crawls could also retain too much state in memory, especially during first bootstrap or broad provider refreshes.

This PR makes provider catalog data local, durable, progressively refreshed, and observable. It also keeps the default SQLite deployment viable by bounding indexing memory usage, reducing write contention, and avoiding request-path full provider crawls.

Operational notes

  • Adds Alembic migrations for the provider catalog index and staged index state.
  • Fresh installs will need to build the local provider catalog before catalog-dependent Torznab requests are fully usable.
  • Health responses now expose catalog bootstrap and per-provider/stage progress.
  • Existing successful catalog generations can continue to be served while replacement generations are built.
  • Provider-derived catalog data is computed locally; this does not add a bundled provider catalog snapshot.

Testing

This branch adds or updates coverage for:

  • indexed Torznab search and tvsearch behavior
  • catalog bootstrap/readiness responses
  • catalog indexer scheduling, recovery, and progress reporting
  • provider parsing for AniWorld and s.to episode/language data
  • canonical episode deduplication
  • staged readiness migration/backfill behavior
  • qBittorrent download startup SQLite race prevention
  • scheduler progress write coalescing
  • provider direct-link timeout handling
  • Docker runtime home and terminal log path defaults

Summary by CodeRabbit

  • New Features

    • Background provider catalog indexing with configurable refresh/concurrency, staged indexing/enrichment, canonical matching, and cache controls.
    • Health endpoint now includes catalog progress and a dedicated catalog health route.
    • Scheduling: separate create/start flows, optional autostart, and periodic job-progress persistence/flush.
  • Bug Fixes

    • Enforced direct-link resolution timeouts and clearer, redacted download/fallback logs.
    • More consistent torrent state reporting across APIs.
  • Tests

    • Expanded unit and integration coverage for cataloging, Torznab, scheduler, provider resolution, migrations, and downloads.


Zzackllack requested a review from Copilot April 29, 2026 23:04
Zzackllack self-assigned this Apr 29, 2026
Zzackllack added the bug, documentation, and enhancement labels Apr 29, 2026
coderabbitai (bot) added the dependencies label Apr 30, 2026

Implement the scheduled provider catalog index and canonical mapping
layer for AniBridge.

Move provider catalog discovery out of the request path and into a
persistent SQLite-backed index with bootstrap gating, provider refresh
status tracking, and canonical TV episode mappings.

Update Torznab search and tvsearch to serve indexed provider and mapping
data instead of triggering request-time title refresh or episode probing.

Expose bootstrap visibility through health endpoints and terminal
progress output so first-time catalog builds are easier to monitor.
Update integration and migration tests for the catalog-indexed
Torznab flow.

Cover bootstrap blocking, indexed generic search, canonical tvsearch
mapping, special mapping behavior, health progress reporting, and
dynamic Alembic head verification.
Use DATA_DIR as the default base for the generated terminal log path
instead of the container working directory.

This keeps development Compose logs on the mounted /data volume rather
than writing them into /app/data inside the container.
Zzackllack and others added 26 commits June 25, 2026 02:57
Replace whole-provider buffering with a bounded crawl-to-writer pipeline.

Add queue backpressure, batch SQLite persistence, interrupted staging
cleanup, richer health progress, and generation-aware provider mappings
so partial refresh state never becomes live.
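A minimal sketch of the bounded crawl-to-writer idea, assuming a single producer thread and one writer. The table name `staged_title`, the queue size, and the batch size are made up for illustration and are not the values used in this branch.

```python
import queue
import sqlite3
import threading

_DONE = object()  # sentinel marking the end of the crawl


def crawl(titles, out: queue.Queue) -> None:
    """Producer: put() blocks while the queue is full, which is the backpressure."""
    for title in titles:
        out.put(title)
    out.put(_DONE)


def write_batches(conn: sqlite3.Connection, inbox: queue.Queue, batch_size: int) -> int:
    """Consumer: the only writer; commits one transaction per batch."""
    written, batch = 0, []
    while True:
        item = inbox.get()
        if item is not _DONE:
            batch.append((item,))
        if batch and (item is _DONE or len(batch) >= batch_size):
            with conn:  # commit on success, roll back on error
                conn.executemany("INSERT INTO staged_title (name) VALUES (?)", batch)
            written += len(batch)
            batch = []
        if item is _DONE:
            return written


def run_pipeline(titles, max_queue: int = 4, batch_size: int = 3) -> int:
    """Crawl into a bounded queue and persist in batches; returns rows stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staged_title (name TEXT NOT NULL)")
    q: queue.Queue = queue.Queue(maxsize=max_queue)
    producer = threading.Thread(target=crawl, args=(titles, q))
    producer.start()
    write_batches(conn, q, batch_size)
    producer.join()
    count = conn.execute("SELECT COUNT(*) FROM staged_title").fetchone()[0]
    conn.close()
    return count
```

Because the queue holds at most `max_queue` items and the writer holds at most `batch_size` rows, memory stays bounded regardless of how large the provider catalog is.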
Redesign provider catalog indexing into durable title, detail, and canonical stages with DB-backed request reads.

Add SQLite-safe serialized catalog writes, WAL/busy-timeout configuration, bounded canonical caches, and retry backoff that respects future retry timestamps.

Update defaults for safe self-hosted concurrency and add targeted coverage for staged readiness, incremental persistence, cache bounds, and scheduler backoff.
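As a rough illustration of the SQLite settings and the retry rule mentioned above, the sketch below uses the standard library only. The function names, the 5000 ms busy timeout, and the backoff constants are assumptions for the example, not the defaults shipped in this branch.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Optional


def open_catalog_db(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a file-backed SQLite database with WAL and a busy timeout."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to busy_timeout_ms for a lock instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn


def next_retry_at(now: datetime, attempt: int,
                  base_seconds: int = 30, cap_seconds: int = 3600) -> datetime:
    """Exponential backoff with a cap; attempt 0 waits base_seconds."""
    delay = min(cap_seconds, base_seconds * 2 ** attempt)
    return now + timedelta(seconds=delay)


def is_due(retry_at: Optional[datetime], now: datetime) -> bool:
    """A row is eligible only once its stored retry timestamp has passed."""
    return retry_at is None or retry_at <= now
```

A scheduler pass would skip any row where `is_due` is false, so a future retry timestamp is respected rather than retried on every tick.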
Defer background worker startup until after the qBittorrent add request has persisted its client task and job metadata.

This prevents the request thread and scheduler worker from contending on the same SQLite rows during the initial add flow, which was surfacing as Sonarr download warnings.

Add regression coverage for deferred worker startup in the qBittorrent-compatible API.
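The ordering this commit describes can be sketched as "commit first, then hand off". The `client_task` table and the `add_download` helper below are hypothetical and only show the sequencing, not the real qBittorrent shim.

```python
import sqlite3
from typing import Callable


def add_download(conn: sqlite3.Connection, task_hash: str,
                 start_worker: Callable[[str], None]) -> None:
    """Persist the task row, and only then start the background worker."""
    with conn:  # the transaction is committed when this block exits
        conn.execute(
            "INSERT INTO client_task (hash, state) VALUES (?, 'queued')",
            (task_hash,),
        )
    # The worker starts after the commit, so it never races the request
    # thread for the row that is still being inserted.
    start_worker(task_hash)
```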
Normalize legacy provider index rows into the staged readiness model without assuming every persisted status object already has the new stage attributes populated.

This keeps health output consistent after the staged catalog migration and prevents bootstrap-ready providers from reporting pending title index state due to missing backfilled fields.

Add focused tests for interrupted-state recovery and legacy stage backfill behavior.
Co-authored-by: Copilot <copilot@github.com>
Add a hard timeout around provider direct-link resolution so a stuck
upstream host cannot leave a job in downloading forever.

Expose the timeout as PROVIDER_DIRECT_LINK_TIMEOUT_SECONDS and update
the example environment configuration accordingly.

Add regression coverage for timed-out host resolution and config
parsing of the new timeout value.
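One way to enforce such a hard timeout is to run the resolver in a daemon thread and stop waiting after the deadline. This is a sketch under that assumption: `resolve_with_timeout` and `DirectLinkTimeout` are illustrative names, and the timeout argument would come from `PROVIDER_DIRECT_LINK_TIMEOUT_SECONDS`.

```python
import threading
from typing import Callable


class DirectLinkTimeout(Exception):
    """Raised when the resolver does not finish within the deadline."""


def resolve_with_timeout(resolve: Callable[[], str], timeout_seconds: float) -> str:
    """Run resolve() in a daemon thread and give up after timeout_seconds."""
    outcome: dict = {}

    def _run() -> None:
        try:
            outcome["value"] = resolve()
        except Exception as exc:  # surface resolver errors to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        # The thread cannot be killed; it is abandoned and, being a daemon,
        # will not block interpreter shutdown.
        raise DirectLinkTimeout(f"no direct link after {timeout_seconds}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The caller can then mark the job failed, or move on to the next provider, instead of leaving it in downloading indefinitely.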
Move yt-dlp progress persistence out of the callback hot path and flush only the latest per-job snapshot from a single writer.

This removes concurrent progress-hook writes for the same job, reduces SQLite write pressure, and prevents repeated database locked errors during active downloads.
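The coalescing idea can be illustrated with a small in-memory buffer that keeps only the newest value per job. `ProgressCoalescer` is a hypothetical name, and the real scheduler may track more fields than a percentage.

```python
import threading
from typing import Callable, Dict


class ProgressCoalescer:
    """Keep only the latest progress value per job until the next flush."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[str, float] = {}

    def record(self, job_id: str, percent: float) -> None:
        """Called from progress hooks; overwrites any unflushed value."""
        with self._lock:
            self._latest[job_id] = percent

    def flush(self, write: Callable[[str, float], None]) -> int:
        """Called by the single writer; persists one row per job."""
        with self._lock:
            snapshot, self._latest = self._latest, {}
        for job_id, percent in snapshot.items():
            write(job_id, percent)
        return len(snapshot)
```

However many hook calls arrive between flushes, the database sees at most one write per job per flush interval.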
Enhance logging messages to provide clearer context during download
host resolution and retry attempts. This includes details about
preferred and resolved providers, as well as error messages for
failed downloads.
Enhance error reporting by capturing exception messages during title
indexing failures. This change ensures that the error details are
passed correctly to the failure handling functions, improving
debugging and logging capabilities.
…ookup

Add functionality to resolve titles from an in-memory index before querying the database. This improves performance by prioritizing faster lookups and reduces reliance on database access when the catalog is ready.
bypass catalog readiness for empty torznab search test responses

honor STRM_FILES_MODE=only in indexed torznab item emission

fall back to default provider languages when episode language rows are absent

bound provider timeout workers and add cooperative crawl cancellation

redact resolved direct URLs from download logs

only mark qBittorrent tasks downloading after worker startup succeeds

prevent non-flush scheduler shutdown from writing stale progress

move health snapshot handlers off the event loop
store provider catalog titles, aliases, episodes, and episode languages per generation

scope staged replacements to the target generation

keep live catalog generations intact during failed staged refreshes

add migration to rebuild provider catalog tables with generation-aware primary keys

implement real downgrade paths for provider index stage migration

make generation rollback inserts PK-safe in provider mapping downgrade

BREAKING CHANGE: provider catalog tables now use generation-aware primary keys. Any direct SQL, external tooling, or handwritten migrations that assumed uniqueness on provider/slug or provider/slug/season/episode without indexed_generation must be updated before deploying this schema change.
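To make the key change concrete, here is a simplified sketch using the standard `sqlite3` module. Only `indexed_generation` is taken from the commit message; the `provider_title` table, its other columns, and the helper names are assumptions for illustration.

```python
import sqlite3
from typing import Optional


def make_catalog() -> sqlite3.Connection:
    """Create a hypothetical title table keyed by provider, slug, and generation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE provider_title (
            provider TEXT NOT NULL,
            slug TEXT NOT NULL,
            indexed_generation INTEGER NOT NULL,
            title TEXT NOT NULL,
            PRIMARY KEY (provider, slug, indexed_generation)
        )
        """
    )
    return conn


def stage_title(conn: sqlite3.Connection, provider: str, slug: str,
                generation: int, title: str) -> None:
    """Write a row for one generation without touching other generations."""
    with conn:
        conn.execute(
            "INSERT INTO provider_title VALUES (?, ?, ?, ?)",
            (provider, slug, generation, title),
        )


def live_title(conn: sqlite3.Connection, provider: str, slug: str,
               live_generation: int) -> Optional[str]:
    """Readers must now filter on the generation they consider live."""
    row = conn.execute(
        "SELECT title FROM provider_title "
        "WHERE provider = ? AND slug = ? AND indexed_generation = ?",
        (provider, slug, live_generation),
    ).fetchone()
    return row[0] if row else None
```

Queries that previously assumed one row per provider/slug can now match several rows, one per generation, which is why external SQL needs the extra filter.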
- stub catalog readiness in low-confidence overlap unit coverage
- initialize schema explicitly for DB-backed alias lookup coverage
- remove dependency on pre-existing local test database state
- pin STRM_FILES_MODE in the torznab hard-cap integration test
- bind title_resolver to the test database engine in alias lookup coverage
- remove CI dependence on ambient module import order and env defaults
- fail closed when the title index writer cannot be stopped cleanly
- avoid marking enrichment stages ready while rows are only deferred by retry windows
- remove implicit commits from provider generation cleanup helpers
- handle missing catalog tables in title resolver DB fallbacks
- pre-register running jobs before executor submit to avoid stale RUNNING entries
- mark qBittorrent client tasks failed when background job startup fails
- guard provider direct-link timeouts with daemon threads instead of the shared executor
- seed enough mapped episodes for torznab hard-cap coverage
- bootstrap providerindexstatus via migrations in title resolver tests
- add regressions for fast-finishing scheduler jobs and deferred stage retries
Isolate timed-out provider detail crawls so stuck tasks cannot starve
later title enrichment work. Preserve queued torrent state semantics in
the qBittorrent shim, dedupe duplicate episode-language inserts, and
ensure title-index failures are still persisted even when writer shutdown
also errors.

Also block title refresh while enrichment rows are still unfinished and
treat OperationalError during indexed title lookups as a catalog-not-ready
fallback instead of bubbling the exception.
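The fallback behavior can be sketched with the standard library's `sqlite3.OperationalError`. The project's database layer may raise a different `OperationalError` class, and `lookup_indexed_title` plus the `provider_title` table are hypothetical.

```python
import sqlite3
from typing import Optional


def lookup_indexed_title(conn: sqlite3.Connection, slug: str) -> Optional[str]:
    """Return the indexed title, or None when the catalog is not usable yet."""
    try:
        row = conn.execute(
            "SELECT title FROM provider_title WHERE slug = ?", (slug,)
        ).fetchone()
    except sqlite3.OperationalError:
        # For example a missing table before migrations or bootstrap have run:
        # treat it as "catalog not ready" rather than failing the request.
        return None
    return row[0] if row else None
```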
Collapse ambiguous provider episode title matches into a single conflict
mapping, filter indexed title generations in SQL, and allow site-scoped
DB title lookup before all providers are bootstrapped.
Track timed-out direct-link lookup workers per episode and skip further
provider fallback attempts while the original lookup is still running.
Clear the catalog indexer stop flag before starting a new scheduler
thread so reused interpreter lifecycles can restart indexing.
Query a small batch of indexed title candidates for site-scoped slug
resolution and return the first candidate whose rescored title match
meets the resolver threshold.
Allow provider fallback to continue after a host timeout while suppressing
duplicate hung provider attempts. Gate movie Torznab searches on Megakino
readiness only and persist row-stage retry deadlines when parking pending
catalog enrichment stages.
Zzackllack force-pushed the provider-catalog-index branch from 8043103 to 3d23b78 June 25, 2026 01:02

Scope Torznab searches to ready provider indexes while preserving
synthetic validation and optional fallback behavior.

Persist download metadata for paused tasks and add qBittorrent-compatible
resume handling.

Labels

bug, dependencies, documentation, enhancement
