# Train Market Data - /* | *\ - COLLAB WITH ERIKA out SOOON
This project turns the Train market dataset into a static directory site designed for GitHub
Pages. It behaves like a searchable catalog, with a landing page for discovery and a dedicated
profile view for individual listings.
- Astro static site for presentation and page generation
- Python build step that reads `../../erika/artifacts/ticker_checklist.csv`
- Static JSON catalog served from `public/data/catalog.json`
- Server-rendered summary data written to `src/generated/catalog-meta.json`
The site does not require a runtime backend. Search and filtering happen in the browser against the generated JSON catalog.
The sibling Game Next.js app (…/cursor/game) hosts the interactive predictions strip (flat
$100 → payout framing, reference-only demo), the merged /live route with μgrad sports-field
and Bloomberg side-chat iframes, and env-driven embed URLs (NEXT_PUBLIC_UGRAD_SPORTS_URL,
NEXT_PUBLIC_UGRAD_TOOLS_DECK_URL, NEXT_PUBLIC_BLOOMBERG_CHAT_URL).
Train ships dedicated routes for each top tab: /directory/ (full catalog explorer with filters
and pagination), /predictions/ (Game-style card grid), /sports/ (streaming-style layout shell
for live tiles and rails — static on Pages), /broadcast/ (placeholder roadmap for encoder / pipeline /
collab surfaces; optional PUBLIC_BROADCAST_PIPELINE_URL for your live workpad origin), plus Featured,
Listing, and Raw. The
site header uses pill navigation with the Directory search field first, then the eyebrow and
title, then the educational row, then the tab row (Home, Featured, Predictions, Sports, Broadcast, Models, Directory, Listing,
Raw, Captions, Ledger) plus optional Game / Live when PUBLIC_GAME_CONSOLE_URL is set. From any non-directory page,
pressing Enter in the header search jumps to /directory/?q=….
Train’s homepage links out when you set these variables at build time:
- `PUBLIC_GAME_CONSOLE_URL`: Game origin with no trailing slash (for example `http://localhost:3000` or your deployed base).
- `PUBLIC_BROADCAST_PIPELINE_URL` (optional): Origin with no trailing slash for a separate live pipeline / encoder workpad; the Broadcast tab shows an “Open pipeline workpad” button when set.
- `PUBLIC_BROADCAST_EMBED_HOSTS` (optional): Comma-separated hostnames (e.g. `127.0.0.1,localhost,pipeline.example.com`). When set, the Broadcast iframe loads only if the pipeline URL’s host matches one entry. When unset, any `http`/`https` pipeline URL from the variable may embed (still no user/password in the URL).
Example:
PUBLIC_GAME_CONSOLE_URL=http://localhost:3000 npm run build
Pipeline workpad (optional):
PUBLIC_BROADCAST_PIPELINE_URL=http://127.0.0.1:8787 PUBLIC_GAME_CONSOLE_URL=http://localhost:3000 npm run build
If unset, the new homepage sections still show demo cards and wiring notes; the site header omits the “Game console” / “Live desk” shortcuts until the variable is provided.
For GitHub Actions deploys, define a repository variable PUBLIC_GAME_CONSOLE_URL (Settings →
Secrets and variables → Actions → Variables) so Pages builds pick up the same origin. Add
PUBLIC_BROADCAST_PIPELINE_URL when your pipeline workpad has a stable URL, and optional
PUBLIC_BROADCAST_EMBED_HOSTS to allowlist which hosts may load in the Broadcast iframe.
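The embed rules described above can be sketched as a small check. This is an illustrative Python model only, not the site's actual implementation; the function name `may_embed` is hypothetical, and treating an empty allowlist string the same as an unset one is an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit


def may_embed(pipeline_url: str, embed_hosts: Optional[str]) -> bool:
    """Return True when the pipeline URL may load in the Broadcast iframe."""
    parts = urlsplit(pipeline_url)
    # Only http/https URLs with a host are considered.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # A user or password in the URL is always rejected.
    if parts.username is not None or parts.password is not None:
        return False
    # Unset (or empty, by assumption) allowlist: any http/https URL may embed.
    if not embed_hosts:
        return True
    allowed = {h.strip().lower() for h in embed_hosts.split(",") if h.strip()}
    return parts.hostname.lower() in allowed
```

For example, `may_embed("http://127.0.0.1:8787", "127.0.0.1,localhost")` passes, while the same allowlist rejects a URL on any other host.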
Train includes /models/liquid-ai/, a local-only tuning desk under the Models section for the mirrored LiquidAI LFM2 transcript
GGUF. It calls an Ollama-compatible POST /api/generate endpoint from the browser, then lets you
score the result and export JSONL tuning pairs.
npm run dev:liquid
That command creates the local Ollama model liquidai-lfm2-transcript:q4km from
Modelfile.liquidai-lfm2-transcript when needed, starts Ollama if it is not already responding, and
then starts Astro. Open http://localhost:4321/models/liquid-ai/.
If Ollama is already running and the browser blocks the request, restart Ollama with matching local
origins, for example OLLAMA_ORIGINS=http://localhost:4321,http://127.0.0.1:4321 ollama serve.
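For testing the endpoint outside the browser, the request can be sketched in Python. The `model`, `prompt`, and `stream` fields and the default port 11434 follow Ollama's generate API; the helper names and the shape of a tuning pair (`prompt`, `response`, `score`) are assumptions for illustration, not the desk's actual export schema.

```python
import json
import urllib.request
from typing import Dict, List


def build_generate_request(prompt: str,
                           model: str = "liquidai-lfm2-transcript:q4km",
                           origin: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{origin}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def to_jsonl(pairs: List[Dict]) -> str:
    """Serialize scored pairs as JSONL: one JSON object per line."""
    return "".join(json.dumps(p, ensure_ascii=False) + "\n" for p in pairs)
```

Sending the request with `urllib.request.urlopen(req)` and reading the `response` field of the returned JSON gives the generated text, which can then be scored and appended to the pair list.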
The homepage now includes an indicative hop and speed analysis layer for major exchange and index routes, relay points, live pipeline stages, junction bottlenecks, and hardware profiles. It is meant as planning guidance for getting market data into Train at speed, not as a measured SLA.
The site now includes /featured/, a timezone-aware market-focus page that:
- picks a regional session from the visitor timezone or
?exchange=... - shows a curated 10-major stock basket for that region from the local catalog
- links out to TradingView and Yahoo heatmaps for live stock-performance context
- loads CoinGecko live crypto movers and the CoinGecko heatmap widget client-side
Because the repo still ships static listing metadata rather than full end-of-day quote history, the stock basket is curated by region and exchange focus, while the live ranking context comes from the external heatmap sources.
| Venue | Benchmarks / flows | Primary PoP | Relay pattern | Indicative budget |
|---|---|---|---|---|
| CME / Aurora | ES, NQ, RTY, YM, GC, SI, HG | Aurora I / CH2 | Aurora -> Chicago relay -> NY4/NY5 | sub-0.1 ms local, 0.3-0.9 ms metro, 8-10 ms to NJ |
| NASDAQ / NYSE / NYSE Arca / OPRA | QQQ, SPY, cash equities, ETFs, options | Carteret / Mahwah / Secaucus / NY4 | Venue edge -> NJ relay -> Chicago and London | sub-0.15 ms local, 0.2-1.1 ms metro, 8-10 ms to Chicago |
| LSE / ICE Europe / Euronext | FTSE, ICE energy, STOXX-linked flows | LD4 / Basildon / Slough | London edge -> LD4 relay -> Frankfurt and NY | sub-0.15 ms local, 0.3-1.2 ms metro, 3-5 ms to Frankfurt |
| Deutsche Boerse / Eurex | DAX, Euro Stoxx, Bund complex | FR2 / FR5 | Frankfurt edge -> Frankfurt relay -> LD4 and NY | sub-0.12 ms local, 0.2-0.8 ms metro, 3-5 ms to London |
| JPX / OSE | Nikkei 225, TOPIX, JGB-linked flows | TY3 / Tokyo metro | Tokyo edge -> Tokyo relay -> SG/HK | sub-0.12 ms local, 0.2-0.9 ms metro, 35-70 ms regional |
| HKEX / SGX | Hang Seng, CNH, SGX derivatives | HK1 / HK3 / SG1 | Local edge -> HK/SG relay -> Tokyo and London | sub-0.15 ms local, 0.3-1.0 ms metro, 30-40 ms HK<->SG |
- Aurora / Chicago for futures-first books and metals.
- New Jersey / NY4-NY5 for US cash equity, ETF, and options aggregation.
- London / LD4 for European fanout and transatlantic handoff.
- Frankfurt for Eurex/Xetra primaries or hot standby.
- Tokyo + Singapore as the regional APAC relay mesh.
The architecture section now explicitly covers the full live-data chain:
- Venue ingress with primary and secondary handoffs.
- Lossless capture with timestamping and parallel hot hosts.
- Decode / normalize for trades, books, status, and instrument metadata.
- Gap fill / reconciliation for retransmit and replay continuity.
- Regional relay mesh for Chicago, New Jersey, London, Frankfurt, Tokyo, and Singapore.
- Hot cache / serving plane for session-aware stream delivery.
- Historical persistence for raw and normalized archives.
- Client delivery for live subscriptions, reconnect, and replay bootstrap.
For each stage, the page now lists:
- Primary path
- Redundant path
- Data carried
- Critical junction
- Likely bottleneck
- Tech stack
The page now calls out the main join points where low-latency systems usually fail first:
- Venue demarcation
- Capture-to-parser handoff
- Relay fanout spine
- Gateway and entitlement plane
- Archive and replay contention
The architecture section also breaks the plant down by stack layer:
- Time and clocking
- Network fabric
- Capture and feed handling
- Normalization and stream fabric
- Storage and replay
- Serving and client edge
- Capture edge: 16-32 high-clock cores, 64-128 GB ECC, dual 25 GbE or 100 GbE NICs, NVMe-first scratch.
- Relay / normalization: 24-48 cores, 128-256 GB ECC, dual 25/100 GbE, 4-8 TB NVMe RAID1/10.
- Strategy hot path: 8-24 high-frequency cores, 64-128 GB ECC, 10/25 GbE, local NVMe cache.
- Research / replay: 32-64 cores, 128-512 GB ECC, 10/25 GbE, 8-32 TB NVMe or U.2 arrays.
Fast-path assumptions:
- Linux 6.x LTS with IRQ affinity and CPU isolation tuned.
- PTP or tightly disciplined clock sync.
- NIC hardware timestamping enabled.
- NVMe on the ingestion hot path.
- 25 GbE minimum east-west relay links, with 100 GbE where multi-venue fanout is shared.
npm install --cache .npm-cache
npm run dev
The data-prep step runs automatically before dev and build.
npm run build
The production site is generated in dist/.
The Astro config auto-detects the correct base path during GitHub Actions builds:
- user/org pages: `/`
- project pages: `/<repo-name>/`
If you need to override it manually, set PUBLIC_BASE_PATH.
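One way such detection can work is sketched below in Python. The real logic lives in the Astro config and may differ; the function name is hypothetical. `GITHUB_ACTIONS` and `GITHUB_REPOSITORY` are standard GitHub Actions environment variables, and the `<owner>.github.io` naming rule for user/org pages is GitHub's convention.

```python
from typing import Mapping


def pages_base_path(env: Mapping[str, str]) -> str:
    """Guess the site base path from GitHub Actions environment variables."""
    override = env.get("PUBLIC_BASE_PATH")
    if override:
        return override  # manual override always wins
    repo = env.get("GITHUB_REPOSITORY", "")  # "owner/name" inside Actions
    if env.get("GITHUB_ACTIONS") != "true" or "/" not in repo:
        return "/"  # local builds: serve from the root
    owner, name = repo.split("/", 1)
    if name.lower() == f"{owner.lower()}.github.io":
        return "/"  # user/org pages
    return f"/{name}/"  # project pages
```

Calling `pages_base_path(os.environ)` in a build script would return `/` locally and `/<repo-name>/` for a project-pages build.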
The deployment workflow is in .github/workflows/deploy.yml.
When the Train source CSV changes, rebuild the site:
npm run prepare:data
npm run build
Upstream source files:
- ../../erika/artifacts/ticker_checklist.csv
- ../../erika/artifacts/ticker_checklist_summary.md
The directory currently focuses on listing metadata and coverage signals, not time-series charting.
The catalog builder now passes through optional filing / headquarters / branch metadata when the upstream CSV includes it. Supported output fields include:
- `lt` / `ll` / `lc` / `fs`: filing timestamp, filing location, filing coordinates, filing source
- `hq` / `hc` / `hs`: headquarters location, headquarters coordinates, headquarters source
- `br` / `bs`: branch locations and branch source
For a public static dataset, keep the enrichment stack on sources that are safe to redistribute:
- SEC EDGAR — public domain; use for US filer addresses, state of incorporation, and filing history proxies
- GLEIF Golden Copy — CC0; use for legal / headquarters addresses and LEI-linked entity data
- Wikidata — CC0; use for founding dates and headquarters coordinates when available
- National open registries (for example Companies House / OGL, BRREG / NLOD, INSEE / Etalab, PRH / CC-BY) — allowed with attribution
Avoid feeding the published catalog from sources that would impose share-alike or redistribution restrictions unless the repository license and downstream expectations are updated accordingly:
- OpenCorporates free tier — ODbL/share-alike
- OpenStreetMap hosted Nominatim bulk geocoding — not allowed for bulk resolution; ODbL applies
- Exchange reference feeds without explicit open terms — redistribution is legally unclear
Practical limit: authoritative branch-office coverage is not realistically available for all ~56k tickers from free redistributable sources, so branch data should remain optional and sparse unless a licensed commercial source is added.
The OpenRegistry Cursor plugin documents
statutory registry fields (legal name, registered office, incorporation date, branches). The free
OpenRegistry search API does not expose jurisdiction US — US tickers are enriched from
SEC EDGAR (data.sec.gov/submissions) instead, which supplies the canonical legal name (for
example Tesla, Inc. for TSLA), business address, and CIK-linked registry metadata.
# Fetch SEC rows for specific tickers (writes public/data/enrichment/sources/sec-edgar.ndjson)
npm run enrichment:fetch -- --tickers TSLA,AAPL,MSFT --delay 0.2
US SEC rows set **statutory filing time** (`lt`) from the earliest registration-class
SEC acceptance timestamp (S-1 / 8-A12B / etc.), **filing venue** (`ll`) to the state
secretary-of-state office (not the issuer HQ street), and coordinates (`lc` / `hc`) via
the US Census geocoder when possible plus SOS/city fallbacks. Pass `--no-geocode` for
faster bulk runs.
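The `lt` selection can be sketched as follows, assuming the parallel-array `filings.recent` layout returned by data.sec.gov/submissions. The form set below is an illustrative subset (the text above says "S-1 / 8-A12B / etc."), and the function name is hypothetical. `recent` holds only the latest filings, so long-lived issuers may need the older filing files as well.

```python
from typing import Optional

# Assumed subset of registration-class forms; extend as needed.
REGISTRATION_FORMS = {"S-1", "8-A12B"}


def earliest_registration_acceptance(submissions: dict) -> Optional[str]:
    """Return the earliest acceptance timestamp among registration-class forms."""
    recent = submissions.get("filings", {}).get("recent", {})
    forms = recent.get("form", [])
    accepted = recent.get("acceptanceDateTime", [])
    hits = [ts for form, ts in zip(forms, accepted)
            if form in REGISTRATION_FORMS and ts]
    # Timestamps in one ISO-8601 format sort lexicographically.
    return min(hits) if hits else None
```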
# Or batch US exchange listings (respect SEC rate limits)
npm run enrichment:fetch -- --exchange NASDAQ --limit 200 --delay 0.25
# Merge sources → overlay → catalog
npm run enrichment:build
npm run prepare:data
# Batch listing profiles (identity / classification / quick-view fields) with resume state
npm run profiles:collect -- --exchange NASDAQ --limit 100 --delay 0.25
npm run profiles:collect -- --tickers TSLA,AAPL --delay 0.2
Listing pages need OHLCV history, OHLCV-scored snowflake shards, quotes, and catalog/enrichment fields. Use the batch orchestrator (resumable; run in chunks):
# Coverage snapshot (history files, ic/fc, quote count)
npm run database:report
# Normalize dc/ic/fc from existing catalog dates (fast, all ~56k rows)
npm run database:normalize && npm run prepare:data
# US history backfill (slow — rate-limit friendly; repeat with --skip)
npm run database:backfill -- --phase history --exchange NASDAQ,NYSE --limit 500
npm run database:backfill -- --phase prepare
# Session quotes for quick view (yfinance)
npm run database:backfill -- --phase quotes --limit 2000
# SEC/OpenRegistry identity fields
npm run database:backfill -- --phase enrichment --limit 100
# All-in-one chunk (example)
npm run database:backfill -- --phase all --limit 300
After history exists, build_health_shards.py embeds OHLCV-derived snowflake scores in every
shard (no per-ticker patch file required). Manual research overlays still live under
public/data/health/patches/v1/<TICKER>.json and override the shard entry.
Non-US listings can use --openregistry-only or mixed mode when the catalog country maps to a
supported jurisdiction (GB, FR, IE, ES, NO, …). Install the MCP server from
https://openregistry.sophymarine.com/mcp for interactive profile lookups during editing.
Use the enrichment builder to merge SEC EDGAR, GLEIF, Wikidata, and optional per-country registry
inputs into one deterministic overlay consumed by build_catalog_data.py. When
public/data/enrichment/sources/sec-edgar.ndjson or openregistry.ndjson exist, they are merged
automatically (no flags required).
npm run enrichment:build -- \
--sec ./data/sec-edgar.ndjson \
--gleif ./data/gleif.ndjson \
--wikidata ./data/wikidata.ndjson \
--registry ./data/companies-house.ndjson --registry-license OGL
Output is written to public/data/enrichment/attribution-safe.json and contains only
redistribution-safe fields (ln, ls, lt, ll, lc, fs, hq, hc, hs, cd, optional
br / bs, optional registry) plus per-field source and update metadata. During
npm run prepare:data, the catalog builder automatically applies this overlay when present.
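A deterministic merge with per-field source tracking might look like the sketch below. It is not the enrichment builder's actual code: the precedence rule (earlier sources win), the output shape (`fields` / `sources`), and the function name are assumptions; only the field codes come from the list above.

```python
from typing import Dict, Iterable, Tuple

SAFE_FIELDS = ("ln", "ls", "lt", "ll", "lc", "fs", "hq", "hc", "hs",
               "cd", "br", "bs", "registry")


def merge_sources(sources: Iterable[Tuple[str, Dict[str, dict]]]) -> Dict[str, dict]:
    """Merge ordered (source_name, {ticker: row}) pairs; earlier sources win."""
    overlay: Dict[str, dict] = {}
    for name, rows in sources:
        for ticker in sorted(rows):
            entry = overlay.setdefault(ticker, {"fields": {}, "sources": {}})
            for key in SAFE_FIELDS:
                value = rows[ticker].get(key)
                # Keep the first non-empty value and remember where it came from.
                if value not in (None, "") and key not in entry["fields"]:
                    entry["fields"][key] = value
                    entry["sources"][key] = name
    # Sorted tickers keep the serialized overlay stable across runs.
    return dict(sorted(overlay.items()))
```

Fields outside `SAFE_FIELDS` are dropped, which matches the intent of emitting only redistribution-safe data.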
Every catalog listing now also has a 5-axis snowflake coverage profile (Value, Future,
Past, Health, Dividends) modeled on https://roi.kevencraftrituals.com/health.html. The
profiles live under public/data/health/:
public/data/health/
├── manifest.json # tiny (~42 KB) index: shards + patches + sha-256 etags
├── shards/v1/index.json # ticker → shardId map
├── shards/v1/0000.json # ~256 listings per shard (~600 KB)
├── shards/v1/0001.json
├── ...
└── patches/v1/<TICKER>.json # per-listing overlays (small shims/blobs)
The browser loads only the manifest.json plus whichever shards or patches it actually
needs, caching them in sessionStorage keyed by their etag. When the research desk
flips a single check, only that 1–3 KB patch (or shard) is re-downloaded — never the
20 MB catalog.
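The etag idea can be illustrated with a short sketch. The flat `{path: etag}` manifest shape and the function names are assumptions for illustration; the real manifest.json may nest shards and patches differently.

```python
import hashlib
from typing import Dict, List


def etag_for(payload: bytes) -> str:
    """Content-derived etag: the SHA-256 hex digest of the file bytes."""
    return hashlib.sha256(payload).hexdigest()


def changed_paths(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Paths whose etag is new or different, i.e. the only files to re-download."""
    return sorted(path for path, tag in new.items() if old.get(path) != tag)
```

Because the etag depends only on file content, an unchanged shard keeps its etag across rebuilds and stays cached in the browser.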
# 1. Scaffold an editable overlay for one ticker
python3 scripts/scaffold_health_patch.py AAPL
# 2. Edit public/data/health/patches/v1/AAPL.json
# Flip check `state` from "na" to "pass"/"fail" and fill `detail`.
# 3. Promote the patch into the manifest (recomputes its etag)
npm run prepare:data
The Cloudflare Worker under workers/api-qbitos-ai/ exposes the same data with edge
caching:
| Endpoint | Purpose |
|---|---|
| `GET /v1/health` | Worker liveness + Train Pages reachability |
| `GET /v1/coverage/manifest` | Snowflake manifest (shards + patches + etags) |
| `GET /v1/coverage/index` | Snowflake ticker → shard id map |
| `GET /v1/coverage/shard?id=NNNN` | Snowflake shard JSON (etag-cached, supports If-None-Match) |
| `GET /v1/coverage/listing?ticker=AAPL` | Snowflake profile (patch overlay if present, else shard entry) |
| `GET /v1/coverage/diff?since=<etag>` | Snowflake shards/patches changed since the caller's last manifest |
| `GET /v1/history/manifest` | 5y history manifest (per-ticker etags) |
| `GET /v1/history/listing?ticker=AAPL` | Per-ticker OHLCV series (If-None-Match → 304) |
| `GET /v1/history/diff?since=<etag>` | Tickers whose series file changed |
| `GET /v1/catalog.json` | Cached pass-through of the full catalog |
To point the browser at the API instead of static GitHub Pages, set
window.__QBITOS_API_ORIGIN__ = 'https://api.qbitos.ai' before the page scripts run.
Deploy:
cd workers/api-qbitos-ai
npm install
npx wrangler deploy
Every US-listed ticker on NASDAQ, NYSE, NYSEARCA, NYSEMKT, CBOE, AMEX,
BATS, OTC, OTCBB (~21 K listings) can be backed by a daily OHLCV series stored
under:
public/data/history/
├── manifest.json # ticker index + sha-256 etags (small, fetched once)
└── series/v1/<TICKER>.json # parallel-array OHLCV (~30 KB / 5y, ~250 KB / full history)
The format is parallel arrays (d, o, h, l, c, v) with YYYYMMDD integer
dates — gzip-friendly, and ECharts consumes it directly via lttb sampling. The
listing page (/listing/?id=AAPL) auto-renders a price line chart, a calendar-simple
daily-return heatmap, a flowGL-noise return-driven flow panel, and a snowflake radar
plus check grid whenever history /
coverage data exists for that ticker. Metrics include total return, annualized return,
max drawdown, and average volume. The chart adapts to whatever span the file actually covers
(e.g. AAPL's full series goes back to 1980, IBM's to 1962).
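The listed metrics can be derived from the parallel arrays roughly as below. This is a sketch using common textbook definitions (close-to-close total return, 365.25-day years, peak-to-trough drawdown); the listing page's exact formulas are not documented here, and the function names are hypothetical. It assumes non-empty arrays sorted by date.

```python
from datetime import date


def to_date(yyyymmdd: int) -> date:
    """Convert an integer date such as 20240131 to a date object."""
    return date(yyyymmdd // 10000, yyyymmdd // 100 % 100, yyyymmdd % 100)


def series_metrics(series: dict) -> dict:
    """Compute summary metrics from the d / c / v parallel arrays."""
    d, c, v = series["d"], series["c"], series["v"]
    total = c[-1] / c[0] - 1.0
    years = (to_date(d[-1]) - to_date(d[0])).days / 365.25
    annualized = (1.0 + total) ** (1.0 / years) - 1.0 if years > 0 else 0.0
    peak, max_drawdown = c[0], 0.0
    for price in c:
        peak = max(peak, price)  # running high-water mark
        max_drawdown = min(max_drawdown, price / peak - 1.0)
    return {
        "total_return": total,
        "annualized_return": annualized,
        "max_drawdown": max_drawdown,  # negative fraction, e.g. -0.2 for -20%
        "avg_volume": sum(v) / len(v),
    }
```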
Like the snowflake coverage system, only the small manifest is fetched up front; per-
ticker series files are loaded on demand and cached in sessionStorage keyed by their
sha-256 etag, so a single ticker's data refresh is one ~30–250 KB file — never the
full catalog.
The fetcher defaults to --lookback max, which asks each upstream for the full
available history per ticker. You can shrink the window when bandwidth or storage
matters:
--lookback max # default: full available history (1962+ for the oldest tickers)
--lookback 5y # last 5 calendar years only
--lookback 10y # last 10 calendar years only
--lookback 1825d # last 1825 days
Each series file records both lookbackYears (numeric hint, 0 for max) and
lookbackMode ('fixed' | 'max') so the listing-page UI can honestly label the span.
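A parser for these values might look like the sketch below. The function name and the extra `days` key are hypothetical, and approximating a year as 365 days is an assumption; the fetcher itself may resolve calendar years differently.

```python
import re


def parse_lookback(value: str) -> dict:
    """Map a --lookback value to lookbackMode / lookbackYears (plus a day count)."""
    if value == "max":
        return {"lookbackMode": "max", "lookbackYears": 0, "days": None}
    match = re.fullmatch(r"(\d+)([yd])", value)
    if not match:
        raise ValueError(f"unsupported lookback: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    days = amount * 365 if unit == "y" else amount  # 365-day years (assumption)
    return {"lookbackMode": "fixed", "lookbackYears": round(days / 365, 2), "days": days}
```

Under this approximation `5y` and `1825d` resolve to the same window.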
Pick the path your environment can authenticate.
A. yfinance (recommended for full backfill — handles Yahoo's auth crumbs):
pip install yfinance
python3 scripts/fetch_us_history.py --source yfinance \
--exchange NASDAQ,NYSE,NYSEARCA,NYSEMKT,CBOE \
--lookback max
B. Yahoo's keyless v8 chart endpoint (small batches; rate-limits aggressively):
npm run history:fetch -- --tickers AAPL,MSFT,SPY,QQQ,NVDA --lookback max --delay 1.5
C. Stooq with API key (US series back to ~1969):
export STOOQ_APIKEY=<your-key>
npm run history:fetch -- --source stooq --exchange NASDAQ,NYSE --lookback max --delay 0.3
D. Manual CSV import (any source: Polygon, Tiingo, your data lake, Yahoo CSV download from the browser):
npm run history:import -- --lookback max ./csvs/AAPL.csv ./csvs/MSFT.csv
# header columns auto-detected: Date, Open, High, Low, Close, Volume
After any backfill, regenerate the manifest:
npm run history:manifest # or just run npm run prepare:data / npm run build
Seed a ~25-year synthetic AAPL series and a demo snowflake patch with mixed pass/fail states, then verify the build:
npm run smoke:viz:apply # seed only (open /listing/?id=AAPL)
npm run smoke:viz # seed + npm run check + npm run build
npm run smoke:viz:cleanup # remove series + restore NA-only AAPL patch
Cleanup leaves public/data/history/manifest.json at 0 entries and does not commit
fake OHLCV.
--exchange NASDAQ,NYSE # comma-separated US exchanges
--tickers AAPL,MSFT,SPY # explicit list
--limit 500 # cap per run for sharded CI jobs
--skip 500 # offset (parallel-shard across machines)
--max-age-days 7 # skip tickers refreshed within this window
--force # re-fetch even when fresh
--delay 1.5 # seconds between requests
--lookback max # full available history (default), or 5y / 10y / 1825d
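How `--skip`, `--limit`, `--max-age-days`, and `--force` could combine is sketched below. The function names and the alphabetical ordering are assumptions; the fetcher's real ordering and freshness bookkeeping may differ. A stable order matters because it keeps parallel shards from overlapping.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def select_batch(tickers: Iterable[str], skip: int = 0,
                 limit: Optional[int] = None) -> List[str]:
    """Pick one shard of the ticker list: drop `skip` items, then take `limit`."""
    ordered = sorted(tickers)  # stable order so shards do not overlap
    end = None if limit is None else skip + limit
    return ordered[skip:end]


def needs_refresh(last_fetched: Optional[datetime], now: datetime,
                  max_age_days: int = 7, force: bool = False) -> bool:
    """True when a ticker was never fetched, is stale, or --force is set."""
    return force or last_fetched is None or (now - last_fetched) > timedelta(days=max_age_days)
```

Two machines running with `--skip 0 --limit 500` and `--skip 500 --limit 500` would then cover the first 1,000 tickers without overlap.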