Context
PR #37 replaced the old lookup_dashboards(question) (deterministic, pathogen-keyed injection) with route_sources(question, llm) (a single LLM-chosen route) plus lookup_yaml_sources(question, route). Revisiting that design was deferred from the #37 review to keep the refactor moving; this issue tracks it. Not blocking #37.
Behaviour today
run() injects only the one route route_sources() returns:
General and pathogen-specific sources are mutually exclusive per run.
An explicit question.pathogen no longer guarantees its own dashboards — the LLM route is the gate.
The custom scrapers (OWID/PAHO/WHO) only fire when the LLM routes to the pathogen family.
A misroute silently drops the relevant authoritative sources. (With the route_sources try/except added in #37 (Refactor dashboard and enhance custom scraping modules), a routing failure now degrades to general_sources instead of crashing.)
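The single-route behaviour described above can be sketched as follows. This is a minimal illustration, not the code in pipeline.py: the CATALOG dict, the dashboard ids under "hemorrhagic", and the helper names choose_route and inject_single_route are all hypothetical stand-ins, and the router is modelled as a plain callable rather than an LLM call.

```python
from typing import Callable, Dict, List

GENERAL = "general_sources"

# Hypothetical catalog for illustration; the real entries live in the YAML config.
CATALOG: Dict[str, List[str]] = {
    GENERAL: ["who_don", "who_fluid"],
    "hemorrhagic": ["ebola_dashboard", "marburg_dashboard"],
}


def choose_route(question: dict, router: Callable[[dict], str]) -> str:
    """Return the router's pick, degrading to the general route if it raises."""
    try:
        return router(question)
    except Exception:
        return GENERAL


def inject_single_route(question: dict, router: Callable[[dict], str]) -> List[str]:
    """Inject exactly one route's sources; question["pathogen"] is never consulted."""
    return list(CATALOG[choose_route(question, router)])


def failing_router(question: dict) -> str:
    raise RuntimeError("LLM unavailable")


question = {"pathogen": "ebola"}
print(inject_single_route(question, lambda q: "hemorrhagic"))  # correct route
print(inject_single_route(question, lambda q: GENERAL))        # misroute
print(inject_single_route(question, failing_router))           # routing failure
```

In the misroute and failure cases the ebola question receives only the general feeds, with no error raised, which is the silent drop this issue is concerned with.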
Concrete findings from getting #37 green (worth informing the decision)
Flatten-on-unknown. In _resolve_entries, an unrecognised pathogen under a family route falls back to injecting the entire family. Verified: pathogen="unknownvirus123", route="hemorrhagic" -> all ebola + marburg sources. So a lassa/CCHF question routed to hemorrhagic gets ebola/marburg dashboards (wrong-pathogen case counts the insight stage could misread). Currently documented by test_unknown_pathogen_falls_back_to_family; decide whether it's intended or should return [].
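To make the two options concrete, here is a small sketch of the fallback next to a strict variant that returns []. It is not the actual _resolve_entries: the FAMILIES layout, the source ids, the strict flag, and the assumption that a recognised pathogen gets only its own sources are all hypothetical.

```python
from typing import Dict, List

# Hypothetical family layout (pathogen -> source ids); real data is in the YAML.
FAMILIES: Dict[str, Dict[str, List[str]]] = {
    "hemorrhagic": {
        "ebola": ["ebola_src_1", "ebola_src_2"],
        "marburg": ["marburg_src_1"],
    },
}


def resolve_entries(pathogen: str, route: str, strict: bool = False) -> List[str]:
    """Resolve sources for a pathogen under a family route.

    strict=False mimics the flatten-on-unknown fallback described above;
    strict=True is the alternative that injects nothing for an unknown pathogen.
    """
    family = FAMILIES.get(route, {})
    if pathogen in family:
        return list(family[pathogen])
    if strict:
        return []
    # Flatten-on-unknown: every source of every pathogen in the family.
    return [src for sources in family.values() for src in sources]


print(resolve_entries("unknownvirus123", "hemorrhagic"))
print(resolve_entries("unknownvirus123", "hemorrhagic", strict=True))
```

With strict=True a lassa/CCHF question routed to hemorrhagic would get no family dashboards rather than ebola/marburg ones; whether an empty result or a wrong-pathogen result is the lesser harm is the decision left open here.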
General sources are Wayback-resolved in historical mode. With route=general_sources and as_of_date set, every one of the 13 general sources calls closest_snapshot_before -> a live archive.org request. In tests this hit the network (now stubbed in the historical search tests); in production it's ~13 Wayback lookups per no-pathogen historical question. Confirm that's the intended cost/behaviour.
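One way to measure that cost without touching the network is to wrap the snapshot resolver in a counter. The sketch below assumes a resolver with the hypothetical signature (url, as_of) -> snapshot URL or None; the real closest_snapshot_before signature may differ, and the fall-back to the live URL when no snapshot exists is also an assumption.

```python
from datetime import date
from typing import Callable, List, Optional

Resolver = Callable[[str, date], Optional[str]]


class CountingResolver:
    """Wraps a snapshot resolver and counts how often it is called."""

    def __init__(self, inner: Resolver) -> None:
        self.inner = inner
        self.calls = 0

    def __call__(self, url: str, as_of: date) -> Optional[str]:
        self.calls += 1
        return self.inner(url, as_of)


def resolve_sources(urls: List[str], as_of: Optional[date], resolver: Resolver) -> List[str]:
    """Return live URLs when as_of is None, otherwise one resolver call per URL."""
    if as_of is None:
        return list(urls)
    # Assumption: keep the live URL when the resolver finds no snapshot.
    return [resolver(url, as_of) or url for url in urls]


# Offline stub standing in for the archive lookup.
stub = CountingResolver(lambda url, as_of: f"snapshot:{as_of.isoformat()}:{url}")
general_urls = [f"https://example.org/source/{i}" for i in range(13)]

resolve_sources(general_urls, date(2024, 1, 1), stub)
print(stub.calls)  # one lookup per general source
```

The same wrapper can be reused in the historical search tests to assert an upper bound on lookups per question.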
Domain cap vs curated multi-URL-per-domain. general_sources has two who.int entries (who_don, who_fluid). Curated dashboards are cap-exempt (dashboard_lookup_bypass in cap_per_domain_and_type), so output can carry more than max_docs_per_domain docs for one domain. Surfaced by test_domain_cap_respected (adjusted in #37 to count organic docs only). Confirm that the cap exemption combined with multiple curated URLs per domain is intended.
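The interaction can be reproduced with a simplified cap that ignores the per-type dimension of cap_per_domain_and_type. The Doc tuple, its curated flag (standing in for dashboard_lookup_bypass), and the cap_per_domain helper are hypothetical; only the exemption logic is meant to mirror what is described above.

```python
from collections import Counter
from typing import List, NamedTuple


class Doc(NamedTuple):
    url: str
    domain: str
    curated: bool  # stand-in for the dashboard_lookup_bypass marker


def cap_per_domain(docs: List[Doc], max_docs_per_domain: int) -> List[Doc]:
    """Cap organic docs per domain; curated docs always pass and are not counted."""
    kept: List[Doc] = []
    organic_seen: Counter = Counter()
    for doc in docs:
        if doc.curated:
            kept.append(doc)
            continue
        if organic_seen[doc.domain] < max_docs_per_domain:
            organic_seen[doc.domain] += 1
            kept.append(doc)
    return kept


docs = [
    Doc("https://www.who.int/curated-a", "who.int", True),
    Doc("https://www.who.int/curated-b", "who.int", True),
    Doc("https://www.who.int/organic-1", "who.int", False),
    Doc("https://www.who.int/organic-2", "who.int", False),
    Doc("https://www.who.int/organic-3", "who.int", False),
]
kept = cap_per_domain(docs, max_docs_per_domain=2)
print(len(kept))  # 2 curated + 2 organic, all on who.int
```

Counting only organic docs, as the adjusted test does, keeps the cap assertion meaningful while leaving the total per-domain volume unbounded by the cap.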
Proposed direction (from the #37 review)
Inject general_sources as a baseline.
Additionally inject the pathogen-specific family whenever question.pathogen is set (it's a structured field, so the LLM does not need to rediscover it).
Reserve route_sources() for genuinely pathogen-unset questions.
This restores the old determinism for the pathogen path, keeps the new general feeds, and makes the custom scrapers fire whenever the pathogen is known rather than only on a lucky route.
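A minimal sketch of that selection logic, under stated assumptions: the pathogen-to-family map, the "respiratory" family name, and the select_routes helper are hypothetical, the LLM router is modelled as a zero-argument callable, and a pathogen that maps to no known family is left with the baseline only (one possible answer to the flatten-on-unknown question, not a settled one).

```python
from typing import Callable, Dict, List, Optional

GENERAL = "general_sources"

# Hypothetical pathogen -> family map; the real mapping would come from the YAML.
FAMILY_OF: Dict[str, str] = {"ebola": "hemorrhagic", "marburg": "hemorrhagic"}


def select_routes(pathogen: Optional[str], route_with_llm: Callable[[], str]) -> List[str]:
    """Baseline plus additive injection: general first, then at most one family."""
    routes = [GENERAL]
    if pathogen:
        # Deterministic path: the structured field decides, the LLM is not called.
        family = FAMILY_OF.get(pathogen)
        if family:
            routes.append(family)
        return routes
    # Pathogen unset: let the LLM suggest an extra route, tolerating failures.
    try:
        extra = route_with_llm()
    except Exception:
        extra = None
    if extra and extra not in routes:
        routes.append(extra)
    return routes


def llm_must_not_run() -> str:
    raise AssertionError("router should not be called when pathogen is set")


print(select_routes("ebola", llm_must_not_run))
print(select_routes(None, lambda: "respiratory"))
```

Because the pathogen branch returns before the router is touched, a routing failure or misroute can no longer remove the family sources for a pathogen-set question.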
Suggested testing to inform the decision
Trajectory/backtest A/B: run the benchmark question set under (a) current single-route and (b) baseline+additive injection; compare Brier/coverage and how often the custom scrapers actually fire.
Misroute rate: for pathogen-set benchmark questions, measure how often route_sources() picks the correct family.
Injection volume/cost: count injected sources and Wayback calls per question in historical mode under each design.
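The first two measurements reduce to simple aggregates. The sketch below assumes benchmark records are available as (expected_family, chosen_route) pairs and that forecasts are binary-outcome probabilities; both record shapes and the function names are hypothetical, while the Brier score is the standard mean squared error between forecast probability and outcome.

```python
from typing import List, Sequence, Tuple


def misroute_rate(records: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pathogen-set questions where the chosen route is not the expected family."""
    if not records:
        raise ValueError("no records to score")
    wrong = sum(1 for expected, chosen in records if expected != chosen)
    return wrong / len(records)


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


routing_log: List[Tuple[str, str]] = [
    ("hemorrhagic", "hemorrhagic"),
    ("hemorrhagic", "general_sources"),  # misroute
    ("respiratory", "respiratory"),
    ("respiratory", "respiratory"),
]
print(misroute_rate(routing_log))
print(brier_score([0.5, 0.5], [1, 0]))
```

Running both aggregates per design, (a) and (b), on the same question set gives the A/B comparison a common footing.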
References
bioscancast/stages/searching/pipeline.py: run()
bioscancast/stages/searching/source_lookup.py: _resolve_entries / lookup_yaml_sources
bioscancast/stages/searching/query_decomposition.py: route_sources