WikipediaSearcher: no retry, and transient failures poison the cache as empty results #209

@dprodger

Summary

WikipediaSearcher (backend/integrations/wikipedia/utils.py) has no retry or backoff on its Wikipedia API calls and, worse, it caches transient failures as "no result" for 7 days.

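In search_wikipedia(), a non-200 response or an exception is treated as "no result," and that empty result is what gets cached.

The page-fetch path (verify_wikipedia_reference) has the same shape: a non-200 response or an exception is treated as "not a valid page."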

Impact

A momentary Wikipedia hiccup (timeout, 429, 503, transient network error) during a bulk run is indistinguishable from "this performer has no Wikipedia page." The empty result is then cached for 7 days, so a performer who does have an article is silently skipped on every subsequent run until the cache expires or someone passes --force-refresh.

This surfaced while bulk-running scripts/verify_performer_references.py --reftype wikipedia --onlynew over ~31k performers — exactly the scenario where transient errors are most likely and most damaging.

Proposal

  • Add retry + exponential backoff on 503/429/timeout/connection errors, mirroring the MusicBrainz client (integrations/musicbrainz/client.py:685-692).
  • Distinguish "no result" from "request failed." Only cache an empty result when the API genuinely returned no match (HTTP 200, empty list). On a transient error, return None without poisoning the cache, so a later run retries.
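
A minimal sketch of the first bullet, to make the intended behavior concrete. It is not the MusicBrainz client's code: `call_with_backoff`, `TransientFailure`, the attempt count, and the delays are all hypothetical, and the built-in `TimeoutError`/`ConnectionError` stand in for whatever exceptions the real HTTP client raises.

```python
import time
from typing import Any, Callable, Tuple

# Assumption: only the statuses named in this issue are treated as retryable.
RETRYABLE_STATUS = {429, 503}
# Assumption: stand-ins for the HTTP client's timeout/connection exceptions.
RETRYABLE_ERRORS = (TimeoutError, ConnectionError)


class TransientFailure(Exception):
    """Raised when every attempt ended in a retryable failure."""


def call_with_backoff(
    fetch: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call fetch() and retry retryable failures with exponential backoff.

    fetch() returns (status_code, payload). Non-retryable statuses are
    returned to the caller unchanged on the first attempt that yields them.
    """
    last_error: Any = None
    for attempt in range(max_attempts):
        try:
            status, payload = fetch()
        except RETRYABLE_ERRORS as exc:
            last_error = exc
        else:
            if status not in RETRYABLE_STATUS:
                return status, payload
            last_error = f"HTTP {status}"
        # Wait 1x, 2x, 4x, ... base_delay between attempts, not after the last.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise TransientFailure(f"gave up after {max_attempts} attempts: {last_error}")
```

Raising a dedicated exception once retries are exhausted gives the caller an unambiguous "request failed" signal that it can map to None without touching the cache. The `sleep` parameter exists only so the delays can be observed in tests.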

Acceptance criteria

  • Transient Wikipedia errors are retried with backoff before giving up
  • A failed request never writes an empty-result cache entry
  • A genuine "no page" (200 + empty) still caches, preserving the fast-resume behavior
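
The last two criteria can be sketched as a cache-write rule. This is an illustration under assumptions, not the current WikipediaSearcher code: `cached_search` and `do_search` are hypothetical names, a plain dict stands in for the real cache, and the 7-day expiry is left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

SearchFn = Callable[[str], Tuple[int, List[str]]]


def cached_search(
    name: str,
    cache: Dict[str, List[str]],
    do_search: SearchFn,
) -> Optional[List[str]]:
    """Return the matches for name, or None if the request failed.

    Only an HTTP 200 response is written to the cache, so an empty list
    in the cache always means the API genuinely returned no match.
    """
    if name in cache:
        return cache[name]
    try:
        status, results = do_search(name)
    except (TimeoutError, ConnectionError):
        # Transient error: report failure and leave the cache untouched.
        return None
    if status != 200:
        # Non-200 is a failed request, not a "no page" answer.
        return None
    cache[name] = results  # includes the genuine empty list
    return results
```

With this shape, a caller distinguishes the two outcomes by checking for None versus an empty list, and a later run retries any performer whose lookup failed.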

Labels: bug, data-cleanup
