Q9 (H5N1 dairy-cattle states): automate USDA APHIS count - Tableau dashboard behind Akamai; decide snapshot vs headless browser #50

@smodee

Summary

BFG summer-2026 Q9 ("How many US states will have cumulative H5N1 dairy-cattle detections by 31 Dec 2026?", resolution source = USDA APHIS livestock detections) cannot get its authoritative number through the pipeline. The injected APHIS source yields only a thin extract (~492 chars, map instructions only), so Q9 gets 0 insight records. We need to decide how to feed Q9 the current state count. The decision is deferred; this issue captures the investigation and the interim code (deliberately held out of the BFG-readiness PR).

Other H5 questions (Q6/Q7/Q8) are unaffected — they get human-case data from WHO/CDC.

What we found (investigated with claude-in-chrome + a controlled headless test)

  • The APHIS page (.../hpai-detections/hpai-confirmed-cases-livestock) embeds a USDA Tableau dashboard — not ArcGIS as an earlier note assumed: publicdashboards.dl.usda.gov/t/MRP_PUB/views/VS_Cattle_HPAIConfirmedDetections2024/HPAI2022ConfirmedDetections.
  • The state-level data is served only by Tableau's session-bound call POST /vizql/t/MRP_PUB/w/VS_Cattle_HPAIConfirmedDetections2024/v/HPAI2022ConfirmedDetections/bootstrapSession/sessions/{sessionId}. The map itself is server-rendered PNG tiles, so there is no client-side data file to grab.
  • That POST needs a server-minted {sessionId} that Tableau delivers only to a client which passes Akamai Bot Manager's JavaScript challenge. A real browser does this automatically.
  • Our pipeline fetcher is curl_cffi (no JS engine). Even with Chrome-TLS impersonation + full browser headers (Referer/Sec-Fetch/iframe dest), it receives a config-less stub (tsConfigContainer empty, no session id, only AKA_A2/ak_bmsc cookies). So a plain-HTTP (browserless) scraper is not possible: the blocker is session acquisition, not request shape.
  • No prose/CSV alternative states the cumulative count as an extractable "N states" fact (CDC dairy-cows page has context but not the figure; AVMA is bot-blocked for our fetcher; the count is only cleanly available inside the Tableau dashboard).
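The stub-versus-real-page distinction described above can be checked mechanically. A minimal sketch, assuming the Tableau embed page carries its config as JSON inside a `<textarea id="tsConfigContainer">` element and that a usable config has a non-empty `sessionid` key. Both are assumptions based on how the stub is described here, not a verified Tableau contract, and `has_tableau_session` is a hypothetical helper that is not part of the pipeline.

```python
import json
from html.parser import HTMLParser


class _ConfigGrabber(HTMLParser):
    """Collects the text inside <textarea id="tsConfigContainer">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "textarea" and dict(attrs).get("id") == "tsConfigContainer":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def has_tableau_session(page_html: str) -> bool:
    """True if the page has a JSON config with a non-empty 'sessionid' (assumed key)."""
    grabber = _ConfigGrabber()
    grabber.feed(page_html)
    grabber.close()
    raw = "".join(grabber.chunks).strip()
    if not raw:
        return False  # the config-less stub case: container empty or absent
    try:
        config = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(config, dict) and bool(config.get("sessionid"))
```

A check like this could run against whatever a candidate fetcher returns (curl_cffi today, a headless browser later) to tell quickly whether it cleared the session step.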

Current value (from BFG baseline + web search, not scraped from the dashboard): ~19 states / ~989 herds cumulatively since March 2024 (Wisconsin the most recent addition). Q9 baseline bin is "19 or fewer".

Options (to decide)

  1. Curated snapshot (interim, implemented below). A custom_scrapers/usda_aphis_livestock.py that renders a hand-maintained "N states as of DATE" into clean HTML → 1 clean insight record; live dashboard stays the resolution source. Simplest, no new deps. Cost: manual weekly refresh of one dict.
  2. Separate headless-browser refresh job (Playwright/Selenium) that loads the dashboard, reads the count, and writes it into the snapshot — run as step 0 of each forecast refresh. Fully automated, keeps the browser dependency out of the main curl-based pipeline. Open risk: Akamai may also block headless/automated browsers unless stealth-configured; needs a feasibility test.
  3. Human/Claude-in-the-loop refresh via a real browser each cycle. Reliable (a real browser clears Akamai), but not hands-off.
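Whichever option refreshes the number, the forecast run can guard against a stale snapshot. A minimal sketch, assuming "weekly refresh" means the snapshot should be at most 7 days old and that its date is an ISO `YYYY-MM-DD` string. The names `MAX_AGE_DAYS`, `snapshot_age_days` and `is_stale` are hypothetical.

```python
from datetime import date

MAX_AGE_DAYS = 7  # assumption: "weekly refresh" read as at most 7 days old


def snapshot_age_days(as_of: str, today: date) -> int:
    """Days elapsed between the snapshot's ISO 'as_of' date and 'today'."""
    return (today - date.fromisoformat(as_of)).days


def is_stale(as_of: str, today: date, max_age_days: int = MAX_AGE_DAYS) -> bool:
    """True when the snapshot is older than the allowed refresh interval."""
    return snapshot_age_days(as_of, today) > max_age_days
```

Passing `today` explicitly keeps the check deterministic in tests. A forecast refresh could log a warning, or refuse to run Q9, when `is_stale(...)` is true.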

Interim code (held out of the PR)

Also held back: a cdc_h5n1_mammals source entry (CDC dairy-cows page) added under respiratory.h5n1 / animal_spillover.h5n1 for dairy context.

bioscancast/stages/extraction/custom_scrapers/usda_aphis_livestock.py:

"""Curated snapshot for USDA APHIS HPAI-in-livestock (``usda_aphis_livestock``).
See issue: the count lives in a USDA Tableau dashboard behind Akamai + a JS-created
VizQL session; the session bootstrap cannot be reproduced without a real browser,
so the page yields 0 records.
This renders a small, explicitly-dated snapshot of the cumulative state count into
clean HTML so the insight extractor gets one unambiguous "N states as of DATE" fact.
MAINTENANCE: refresh ``SNAPSHOT`` weekly from the live APHIS dashboard.
"""
from __future__ import annotations
import html as html_lib
from datetime import datetime, timezone
from bioscancast.stages.extraction.config import ExtractionConfig
from bioscancast.stages.extraction.fetcher import FetchResult

SNAPSHOT = {
    "as_of": "2026-07-15",
    "states_count": 19,
    "states": ["Arizona","California","Colorado","Idaho","Iowa","Kansas","Michigan",
        "Minnesota","Nebraska","Nevada","New Mexico","North Carolina","Ohio","Oklahoma",
        "South Dakota","Texas","Utah","Wisconsin","Wyoming"],
    "herds_affected": 989,
    "since": "March 2024",
    "source_url": "https://www.aphis.usda.gov/livestock-poultry-disease/avian/avian-influenza/hpai-detections/hpai-confirmed-cases-livestock",
}

def fetch(url, *, config: ExtractionConfig | None = None, as_of_date: datetime | None = None,
          region: str | None = None, question_text: str | None = None) -> FetchResult | None:
    """Render ``SNAPSHOT`` as a small HTML page wrapped in a ``FetchResult``.

    No network request is made; ``url`` and the keyword arguments are unused here.
    """
    snap = SNAPSHOT
    states = ", ".join(snap["states"])
    # One sentence states the count and date; the second lists the states by name.
    body = (
        f"<h1>USDA APHIS - HPAI H5N1 in Livestock (curated snapshot)</h1>"
        f"<p>As of {html_lib.escape(snap['as_of'])}, USDA APHIS has confirmed HPAI H5N1 in "
        f"dairy cattle in {snap['states_count']} US states cumulatively since "
        f"{html_lib.escape(snap['since'])}, with approximately {snap['herds_affected']} herds affected.</p>"
        f"<p>The {snap['states_count']} affected states are: {html_lib.escape(states)}.</p>"
        f"<p>Snapshot of the USDA APHIS HPAI Confirmed Cases in Livestock dashboard "
        f"(the live dashboard is the resolution source): {html_lib.escape(snap['source_url'])}.</p>"
    )
    rendered = ("<html><head><meta charset='utf-8'><title>USDA APHIS HPAI livestock snapshot</title></head>"
                f"<body>{body}</body></html>").encode("utf-8")
    # Report the live APHIS page as the URL so downstream records cite the resolution source.
    return FetchResult(url=snap["source_url"], final_url=snap["source_url"], status_code=200,
        content_type="text/html", content_bytes=rendered, fetched_at=datetime.now(timezone.utc), error=None)
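Because the snapshot is edited by hand, the stored count can drift from the state list. A minimal consistency check, sketched under the assumption that the dict keeps the `states_count`, `states` and `as_of` keys used in this snapshot; `validate_snapshot` is a hypothetical helper, not part of the scraper.

```python
from datetime import date


def validate_snapshot(snapshot: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the dict looks sane."""
    problems: list[str] = []
    states = snapshot.get("states", [])

    # The hand-typed count must agree with the hand-typed list.
    if snapshot.get("states_count") != len(states):
        problems.append(
            f"states_count={snapshot.get('states_count')} but {len(states)} states listed"
        )

    # A state pasted twice would inflate the list silently.
    if len(set(states)) != len(states):
        problems.append("duplicate state names")

    # 'as_of' must parse as an ISO date (YYYY-MM-DD).
    try:
        date.fromisoformat(snapshot.get("as_of", ""))
    except (TypeError, ValueError):
        problems.append("as_of is not an ISO date")

    return problems
```

Such a check could run in a unit test or at import time, so a bad weekly edit fails loudly instead of feeding the extractor an inconsistent fact.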

With this snapshot in place, Q9 went from 0 → 1 insight record and the forecast anchored sensibly (19-or-fewer=0.20, 20=0.15, 21–22=0.24, 23–25=0.25, 26+=0.16).
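As a quick sanity check on those bin probabilities, the sketch below confirms they sum to 1 and derives the probability mass placed on at least one additional state by resolution. The dict layout and variable names are illustrative only, not the pipeline's forecast format.

```python
import math

# Bin probabilities as reported above (labels are illustrative).
forecast = {
    "19 or fewer": 0.20,
    "20": 0.15,
    "21-22": 0.24,
    "23-25": 0.25,
    "26+": 0.16,
}

total = sum(forecast.values())
# With the cumulative count already at 19, every bin above "19 or fewer"
# requires at least one new state.
p_at_least_one_new_state = total - forecast["19 or fewer"]

assert math.isclose(total, 1.0)
print(f"total={total:.2f}, P(20 or more)={p_at_least_one_new_state:.2f}")
```

So the forecast puts 0.80 on the count rising above the current 19, with the remaining 0.20 on no new state being added.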
