Reconsider source-injection route gating (route_sources single-bucket) in searching/pipeline #41

## Context

PR #37 replaced the old `lookup_dashboards(question)` (deterministic, pathogen-keyed injection) with `route_sources(question, llm)` (a single LLM-chosen route) plus `lookup_yaml_sources(question, route)`. The question of how injection should be gated was deferred from the #37 review to keep the refactor moving; this issue tracks revisiting that design. It does not block #37.

## Behaviour today

`run()` injects only the **one** route `route_sources()` returns:
- General and pathogen-specific sources are **mutually exclusive** per run.
- An explicit `question.pathogen` no longer guarantees its own dashboards — the LLM route is the gate.
- The custom scrapers (OWID/PAHO/WHO) only fire when the LLM routes to the pathogen family.
- A misroute silently drops the relevant authoritative sources. (With the `route_sources` try/except added in #37, a routing *failure* now degrades to `general_sources` instead of crashing.)
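
To make the gating above concrete, here is a minimal sketch of the single-bucket behaviour. It is not the code in `pipeline.py`: `SOURCES`, `inject_single_route`, and the `ebola_dashboard`/`marburg_dashboard` names are hypothetical stand-ins, and the router is passed in as a plain callable instead of an LLM call.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical stand-in registry; the real entries live in the YAML source files.
SOURCES = {
    "general_sources": ["who_don", "who_fluid"],
    "hemorrhagic": ["ebola_dashboard", "marburg_dashboard"],
}


def inject_single_route(pathogen: Optional[str], router: Callable[[], str]) -> list[str]:
    """Mimic the described gating: exactly one route's sources are injected."""
    try:
        route = router()
    except Exception:
        # A routing failure degrades to the general bucket (the #37 try/except).
        route = "general_sources"
    # `pathogen` is deliberately unused here: the route alone is the gate.
    return list(SOURCES.get(route, []))


def failing_router() -> str:
    """Hypothetical router that always fails, to exercise the fallback."""
    raise RuntimeError("router unavailable")
```

In this sketch, a question with `pathogen="ebola"` that the router sends to `general_sources` receives no hemorrhagic sources at all, which is the silent-drop case described in the last bullet.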

## Concrete findings from getting #37 green (worth informing the decision)

1. **Flatten-on-unknown.** In `_resolve_entries`, an unrecognised pathogen under a family route falls back to injecting the **entire family**. Verified: `pathogen="unknownvirus123", route="hemorrhagic"` -> all ebola + marburg sources. So a `lassa`/`CCHF` question routed to `hemorrhagic` gets ebola/marburg dashboards (wrong-pathogen case counts the insight stage could misread). Currently documented by `test_unknown_pathogen_falls_back_to_family`; decide whether it's intended or should return `[]`.

2. **General sources are Wayback-resolved in historical mode.** With `route=general_sources` and `as_of_date` set, every one of the 13 general sources calls `closest_snapshot_before` -> a live archive.org request. In tests this hit the network (now stubbed in the historical search tests); in production it's ~13 Wayback lookups per no-pathogen historical question. Confirm that's the intended cost/behaviour.

3. **Domain cap vs curated multi-URL-per-domain.** `general_sources` has two `who.int` entries (`who_don`, `who_fluid`). Curated dashboards are cap-exempt (`dashboard_lookup_bypass` in `cap_per_domain_and_type`), so output can carry >`max_docs_per_domain` docs for one domain. Surfaced by `test_domain_cap_respected` (adjusted in #37 to count organic docs only). Confirm the cap-exemption + multiple curated URLs per domain is intended.
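
Findings 1 and 3 can be illustrated with a small sketch. Everything here is an assumption for illustration: `FAMILIES`, `resolve_entries`, its `strict` flag, and `cap_per_domain` are hypothetical, and the bypass is modelled as a per-document boolean, which may not match how `dashboard_lookup_bypass` is actually wired into `cap_per_domain_and_type`.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical family registry; real entries come from the YAML source files.
FAMILIES = {"hemorrhagic": {"ebola": ["ebola_src"], "marburg": ["marburg_src"]}}


def resolve_entries(pathogen: str, route: str, strict: bool = False) -> list[str]:
    """Finding 1: an unknown pathogen under a family route flattens the family."""
    family = FAMILIES.get(route, {})
    if pathogen in family:
        return list(family[pathogen])
    if strict:
        # Alternative under discussion: inject nothing for an unknown pathogen.
        return []
    # Behaviour described in finding 1: fall back to every source in the family.
    return [src for sources in family.values() for src in sources]


def cap_per_domain(docs: list[dict], max_docs_per_domain: int) -> list[dict]:
    """Finding 3: curated docs bypass the cap, organic docs are counted."""
    kept: list[dict] = []
    organic_seen: Counter = Counter()
    for doc in docs:
        if doc.get("dashboard_lookup_bypass"):
            kept.append(doc)  # cap-exempt, so a domain can exceed the limit
            continue
        if organic_seen[doc["domain"]] < max_docs_per_domain:
            organic_seen[doc["domain"]] += 1
            kept.append(doc)
    return kept
```

With `strict=False`, a `lassa` question routed to `hemorrhagic` receives both the ebola and marburg entries; with `strict=True` it receives none. In the cap sketch, two curated `who.int` documents plus one organic `who.int` document all survive a cap of 1, which is the situation `test_domain_cap_respected` had to account for.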

## Proposed direction (from the #37 review)

1. Always inject `general_sources` as a baseline.
2. **Additionally** inject the pathogen-specific family whenever `question.pathogen` is set (it's a structured field — no need for the LLM to rediscover it).
3. Reserve `route_sources()` for genuinely pathogen-unset questions.

This restores the old determinism for the pathogen path, keeps the new general feeds, and makes the custom scrapers fire whenever the pathogen is known rather than only on a lucky route.
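
A minimal sketch of the proposed selection logic, under stated assumptions: `PATHOGEN_FAMILY` and `select_routes` are hypothetical names, the pathogen-to-family mapping is assumed to be a simple lookup, and a pathogen with no known family is assumed to receive only the baseline.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical pathogen -> family mapping, for illustration only.
PATHOGEN_FAMILY = {"ebola": "hemorrhagic", "marburg": "hemorrhagic"}


def select_routes(pathogen: Optional[str], router: Callable[[], str]) -> list[str]:
    """Baseline plus additive injection, as outlined in the three steps above."""
    routes = ["general_sources"]  # step 1: always inject the baseline
    if pathogen:
        # Step 2: the structured field decides; the LLM router is not consulted.
        family = PATHOGEN_FAMILY.get(pathogen.lower())
        if family:
            routes.append(family)
        return routes
    # Step 3: only pathogen-unset questions go through the router.
    try:
        extra = router()
    except Exception:
        extra = "general_sources"  # same degradation as the #37 try/except
    if extra not in routes:
        routes.append(extra)
    return routes
```

Returning a list of routes instead of a single route is the main design change: the caller would run `lookup_yaml_sources` once per route and merge the results, so general and pathogen-specific sources are no longer mutually exclusive.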

## Suggested testing to inform the decision

- **Trajectory/backtest A/B**: run the benchmark question set under (a) current single-route and (b) baseline+additive injection; compare Brier/coverage and how often the custom scrapers actually fire.
- **Misroute rate**: for pathogen-set benchmark questions, measure how often `route_sources()` picks the correct family.
- **Injection volume/cost**: count injected sources and Wayback calls per question in historical mode under each design.
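
The two headline metrics above could be computed along these lines. This is a sketch, not part of the benchmark harness: `misroute_rate` and `brier_score` are hypothetical helpers, the router is assumed to be callable on a question string, and the Brier score uses the standard binary form (mean squared difference between forecast probability and 0/1 outcome).

```python
from __future__ import annotations

from typing import Callable, Iterable, Sequence


def misroute_rate(
    cases: Iterable[tuple[str, str]], router: Callable[[str], str]
) -> float:
    """Fraction of (question, expected_family) cases the router gets wrong."""
    cases = list(cases)
    if not cases:
        return 0.0
    wrong = sum(1 for question, expected in cases if router(question) != expected)
    return wrong / len(cases)


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
```

Running both helpers once per design, (a) and (b), on the same question set gives directly comparable numbers for the A/B comparison.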

## References

- `bioscancast/stages/searching/pipeline.py` `run()`
- `bioscancast/stages/searching/source_lookup.py` `_resolve_entries` / `lookup_yaml_sources`
- `bioscancast/stages/searching/query_decomposition.py` `route_sources`
- PR #37
