Reconsider source-injection route gating (route_sources single-bucket) in searching/pipeline #41

Description

@smodee

Context

PR #37 replaced the old lookup_dashboards(question) (deterministic, pathogen-keyed injection) with route_sources(question, llm) (a single LLM-chosen route) + lookup_yaml_sources(question, route). Revisiting that design was deferred from the #37 review to keep the refactor moving; this issue tracks it. Not blocking #37.

Behaviour today

run() injects only the one route route_sources() returns:

  • General and pathogen-specific sources are mutually exclusive per run.
  • An explicit question.pathogen no longer guarantees its own dashboards — the LLM route is the gate.
  • The custom scrapers (OWID/PAHO/WHO) only fire when the LLM routes to the pathogen family.
  • A misroute silently drops the relevant authoritative sources. (With the route_sources try/except added in #37, "Refactor dashboard and enhance custom scraping modules", a routing failure now degrades to general_sources instead of crashing.)
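To make the last two bullets concrete, here is a minimal sketch of single-route gating with the degrade-to-general fallback. It is not the code from #37: `safe_route`, the stub routers, and the dict-shaped question are hypothetical stand-ins; only the `general_sources` route name comes from this issue.

```python
from typing import Callable

GENERAL_ROUTE = "general_sources"  # route name taken from this issue


def safe_route(question: dict, router: Callable[[dict], str]) -> str:
    """Return the router's single route, degrading to the general route on failure.

    Hypothetical wrapper illustrating the try/except behaviour described above.
    """
    try:
        return router(question)
    except Exception:
        # A routing failure no longer crashes the run; it falls back silently.
        return GENERAL_ROUTE


def failing_router(question: dict) -> str:
    """Stub standing in for an LLM call that errors out."""
    raise RuntimeError("LLM unavailable")


def misrouting_router(question: dict) -> str:
    """Stub standing in for an LLM that ignores the structured pathogen field."""
    return GENERAL_ROUTE


question = {"text": "How many Ebola cases will be reported?", "pathogen": "ebola"}

# Both a failure and a misroute end on the general route, so the
# pathogen-specific dashboards and custom scrapers never fire for this question.
route_after_failure = safe_route(question, failing_router)
route_after_misroute = safe_route(question, misrouting_router)
```

In both cases nothing downstream can tell that question["pathogen"] was set, which is the silent-drop concern.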

Concrete findings from getting #37 green (worth informing the decision)

  1. Flatten-on-unknown. In _resolve_entries, an unrecognised pathogen under a family route falls back to injecting the entire family. Verified: pathogen="unknownvirus123", route="hemorrhagic" -> all ebola + marburg sources. So a lassa/CCHF question routed to hemorrhagic gets ebola/marburg dashboards (wrong-pathogen case counts the insight stage could misread). Currently documented by test_unknown_pathogen_falls_back_to_family; decide whether it's intended or should return [].

  2. General sources are Wayback-resolved in historical mode. With route=general_sources and as_of_date set, every one of the 13 general sources calls closest_snapshot_before -> a live archive.org request. In tests this hit the network (now stubbed in the historical search tests); in production it's ~13 Wayback lookups per no-pathogen historical question. Confirm that's the intended cost/behaviour.

  3. Domain cap vs curated multi-URL-per-domain. general_sources has two who.int entries (who_don, who_fluid). Curated dashboards are cap-exempt (dashboard_lookup_bypass in cap_per_domain_and_type), so output can carry more than max_docs_per_domain docs for one domain. Surfaced by test_domain_cap_respected (adjusted in #37 to count organic docs only). Confirm the cap-exemption + multiple curated URLs per domain is intended.
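For finding 1, the two candidate behaviours can be sketched side by side. This is an illustration only, not the real _resolve_entries: the `FAMILIES` table, the source ids, and the `strict` flag are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical family table; the real entries live in the YAML source config.
FAMILIES: Dict[str, Dict[str, List[str]]] = {
    "hemorrhagic": {
        "ebola": ["ebola_dashboard"],
        "marburg": ["marburg_dashboard"],
    },
}


def resolve_entries(route: str, pathogen: str, strict: bool = False) -> List[str]:
    """Resolve the sources to inject for a pathogen under a family route.

    strict=False mimics the flatten-on-unknown behaviour described in finding 1;
    strict=True is the alternative of returning [] for an unrecognised pathogen.
    """
    family = FAMILIES.get(route, {})
    if pathogen in family:
        return list(family[pathogen])
    if strict:
        return []
    # Flatten: an unknown pathogen receives every source in the family.
    return [source for sources in family.values() for source in sources]


flattened = resolve_entries("hemorrhagic", "lassa")
strict_result = resolve_entries("hemorrhagic", "lassa", strict=True)
```

With the flattening branch, a lassa question gets the ebola and marburg entries; with strict=True it gets none, leaving the general sources (if injected) as the only curated input.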

Proposed direction (from the #37 review)

  1. Always inject general_sources as a baseline.
  2. Additionally inject the pathogen-specific family whenever question.pathogen is set (it's a structured field — no need for the LLM to rediscover it).
  3. Reserve route_sources() for genuinely pathogen-unset questions.

This restores the old determinism for the pathogen path, keeps the new general feeds, and makes the custom scrapers fire whenever the pathogen is known rather than only on a lucky route.
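The three steps could be sketched roughly as follows. `plan_routes`, the `PATHOGEN_FAMILY` map, and the "respiratory" route name are hypothetical; how an explicit pathogen with no known family should be handled is left open here (the sketch injects the baseline only).

```python
from typing import Callable, List, Optional

GENERAL_ROUTE = "general_sources"  # route name taken from this issue

# Hypothetical pathogen -> family lookup, for illustration only.
PATHOGEN_FAMILY = {"ebola": "hemorrhagic", "marburg": "hemorrhagic"}


def plan_routes(pathogen: Optional[str], router: Callable[[], str]) -> List[str]:
    """Return the routes to inject under the baseline + additive proposal."""
    routes = [GENERAL_ROUTE]  # step 1: always inject the general baseline
    if pathogen:
        # Step 2: the structured field decides; the LLM router is not consulted.
        family = PATHOGEN_FAMILY.get(pathogen.lower())
        if family:
            routes.append(family)
        return routes
    # Step 3: only pathogen-unset questions go through the LLM router.
    try:
        routed = router()
    except Exception:
        routed = GENERAL_ROUTE
    if routed not in routes:
        routes.append(routed)
    return routes


with_pathogen = plan_routes("ebola", lambda: "respiratory")
without_pathogen = plan_routes(None, lambda: "respiratory")
```

Because the router is never called when the pathogen is set, a misroute cannot remove the pathogen family in that path, and the general feeds are present in every run.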

Suggested testing to inform the decision

  • Trajectory/backtest A/B: run the benchmark question set under (a) current single-route and (b) baseline+additive injection; compare Brier/coverage and how often the custom scrapers actually fire.
  • Misroute rate: for pathogen-set benchmark questions, measure how often route_sources() picks the correct family.
  • Injection volume/cost: count injected sources and Wayback calls per question in historical mode under each design.
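The misroute-rate measurement could start from something as small as the helper below. It is a sketch under assumptions: questions are plain dicts with a "pathogen" key, `router` is any callable returning a route name, and `expected_family` is a hand-maintained pathogen-to-family map; none of these shapes are confirmed by the codebase.

```python
from typing import Callable, Dict, Iterable, Optional


def misroute_rate(
    questions: Iterable[dict],
    router: Callable[[dict], str],
    expected_family: Dict[str, str],
) -> Optional[float]:
    """Fraction of pathogen-set questions routed to the wrong family.

    Questions without a pathogen, or with one missing from expected_family,
    are skipped. Returns None when nothing could be scored.
    """
    scored = 0
    misses = 0
    for question in questions:
        pathogen = question.get("pathogen")
        if not pathogen or pathogen not in expected_family:
            continue
        scored += 1
        if router(question) != expected_family[pathogen]:
            misses += 1
    return misses / scored if scored else None


# Tiny made-up sample with a stub router that sends marburg to the general route.
expected = {"ebola": "hemorrhagic", "marburg": "hemorrhagic"}
sample = [
    {"pathogen": "ebola"},
    {"pathogen": "ebola"},
    {"pathogen": "marburg"},
    {"pathogen": "ebola"},
    {"pathogen": None},
]


def stub_router(question: dict) -> str:
    return "general_sources" if question["pathogen"] == "marburg" else "hemorrhagic"


rate = misroute_rate(sample, stub_router, expected)
```

Running the same helper over the benchmark set with the real router in place of the stub would give the per-family misroute figure asked for above.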
