redfetch

Tiered, self-hosted web fetcher with anti-bot bypass — cheap first, escalate only when blocked.

Most pages don't need a headless browser, and most "blocks" aren't about your User-Agent — they're about your TLS fingerprint or your IP. redfetch walks a cost-ordered ladder and only climbs when it actually hits a wall:

1. curl_cffi    — real browser TLS/JA3 + HTTP/2 fingerprint, NO JS   (fast, cheap)
2. cloakbrowser — stealth Chromium, renders JS / solves challenges    (heavier)
   ↳ auto-retry through a residential proxy on a detected block       (opt-in)

It returns clean Markdown (via trafilatura), guards against SSRF, and learns per host which strategy works so it stops wasting time on doomed direct attempts.
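The cost-ordered ladder described above can be sketched in a few lines of Python. This is a minimal illustration, not redfetch's actual code: `Result`, `fetch_ladder`, and the two stand-in tiers are hypothetical names invented here, and a real tier would call curl_cffi or a browser instead of returning canned values.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    tier: str
    status: int
    body: str
    blocked: bool


# A tier is any callable that takes a URL and returns a Result (hypothetical shape).
Tier = Callable[[str], Result]


def fetch_ladder(url: str, tiers: list[Tier]) -> Result:
    """Try tiers cheapest-first and stop at the first one that is not blocked."""
    if not tiers:
        raise ValueError("at least one tier is required")
    result = None
    for tier in tiers:
        result = tier(url)
        if not result.blocked:
            break  # good enough; do not pay for a heavier tier
    return result  # if every tier was blocked, this is the last attempt


# Stand-in tiers for illustration only: the cheap one is blocked, the heavy one succeeds.
def cheap_tier(url: str) -> Result:
    return Result(tier="cheap", status=403, body="", blocked=True)


def heavy_tier(url: str) -> Result:
    return Result(tier="heavy", status=200, body="<html>ok</html>", blocked=False)
```

With these stand-ins, `fetch_ladder("https://example.com/article", [cheap_tier, heavy_tier])` returns the heavy tier's result, and the heavy tier is never invoked when the cheap one succeeds.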

Built as part of the red* toolchain. Use it for legitimate research, auditing, and data collection — respect each site's Terms and rate limits.

Install

git clone https://github.com/Redloft/redfetch && cd redfetch
bash install.sh                 # creates a venv + installs curl_cffi, trafilatura
bash install.sh --with-browser  # also installs cloakbrowser (deep tier, ~200MB)
bash fetch-doctor.sh --offline  # verify the install

Requires: python3, bash, curl, jq. Optional: cloakbrowser (deep tier) and op (the 1Password CLI, for proxy secrets).

Usage

# Fetch a page → clean Markdown on stdout, JSON meta on stderr
bash fetch.sh --json "https://example.com/article"

bash fetch.sh --no-deep "<url>"   # curl_cffi only (never launch a browser)
bash fetch.sh --deep    "<url>"   # straight to the stealth browser

# Raw GET for JSON / autocomplete / API endpoints (no extraction)
bash cffi_get.sh "https://suggestqueries.google.com/complete/search?client=chrome&q=test"

fetch.sh meta (--json): {ok, tier, status, bytes, blocked, proxy_applied, autoproxy?, ssrf_blocked?, rate_limited?, nav_failed?}.

Exit codes: 0 ok · 1 blocked/empty · 2 SSRF/url-guard block · 3 deps missing · 4 hard error · 64 usage.
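A caller scripting around fetch.sh might combine the exit code with the meta like this. It is a sketch under two assumptions: that stderr carries the meta as a single JSON object, and that the helper names `run_fetch` and `summarize` (invented here) suit your setup. The exit-code table is copied from the list above.

```python
from __future__ import annotations

import json
import subprocess

# Exit codes as documented for fetch.sh.
EXIT_MEANINGS = {
    0: "ok",
    1: "blocked/empty",
    2: "SSRF/url-guard block",
    3: "deps missing",
    4: "hard error",
    64: "usage",
}


def run_fetch(url: str) -> tuple[int, str, str]:
    """Run `bash fetch.sh --json URL`; return (exit code, Markdown, raw meta)."""
    proc = subprocess.run(
        ["bash", "fetch.sh", "--json", url],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def summarize(exit_code: int, meta_text: str) -> str:
    """Turn an exit code plus the stderr meta into a one-line summary."""
    meaning = EXIT_MEANINGS.get(exit_code, f"unknown exit code {exit_code}")
    try:
        meta = json.loads(meta_text)
    except json.JSONDecodeError:
        meta = {}  # stderr held something other than one JSON object
    if not isinstance(meta, dict):
        meta = {}
    return f"{meaning} (tier={meta.get('tier')}, status={meta.get('status')})"
```

For example, `summarize(1, '{"ok": false, "tier": "cffi", "status": 403}')` returns `blocked/empty (tier=cffi, status=403)`. The `"cffi"` tier value in that sample is made up for illustration.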

Residential proxy (for IP-reputation / geo blocks)

curl_cffi beats TLS-fingerprint blocks but not IP-based ones (datacenter bans, geo-walls, 429 responses). For those, point redfetch at a residential proxy. The secret is never hardcoded and never put in argv; it travels through the environment only. Configure one of the following:

# A) literal proxy URL in env
CFFI_PROXY='socks5h://user:pass@host:1080' bash fetch.sh --json "<url>"

# B) a 1Password reference (resolved on demand via `op read`)
export CFFI_PROXY_REF='op://Vault/Item/credential'

# convenience wrapper — run anything proxied:
./redproxy.sh                         # self-test: direct vs proxied IP + geo
./redproxy.sh "<url>"                 # fetch one URL proxied
./redproxy.sh curl -s https://api.ipify.org
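When driving redfetch from another program, the same env-only rule can be followed as sketched below. The helper names `proxied_env`, `fetch_command`, and `redact` are hypothetical, and the redaction regex is one possible approach rather than the one redfetch uses.

```python
import os
import re


def proxied_env(proxy_url, base=None):
    """Copy the environment and add the proxy URL as CFFI_PROXY."""
    env = dict(os.environ if base is None else base)
    env["CFFI_PROXY"] = proxy_url
    return env


def fetch_command(url):
    """argv for a fetch; the proxy URL is deliberately absent from it."""
    return ["bash", "fetch.sh", "--json", url]


def redact(text):
    """Replace user:pass@ userinfo in any URL with ***@ before logging."""
    return re.sub(r"(?<=://)[^/@\s]+@", "***@", text)
```

A call such as `subprocess.run(fetch_command(url), env=proxied_env(secret))` keeps the credential out of the process list, and `redact()` can be applied to any error text before it is printed.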

Auto-proxy on block (self-healing)

When a direct fetch is blocked (challenge / 403 / 503 / empty / timeout), redfetch auto-retries once through the proxy (if one is configured) — no flags needed. Disable with CFFI_AUTOPROXY=0.
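The retry decision can be pictured as two small predicates. This is an illustrative sketch only: the function names are invented here, the challenge markers are assumed example strings, and redfetch's real detection logic may differ.

```python
# Assumed example phrases from challenge pages; not redfetch's actual list.
CHALLENGE_MARKERS = ("just a moment", "checking your browser")


def looks_blocked(status, body, timed_out=False):
    """Block signals named above: challenge / 403 / 503 / empty / timeout."""
    if timed_out or status in (403, 503):
        return True
    text = body.strip().lower()
    if not text:
        return True
    return any(marker in text for marker in CHALLENGE_MARKERS)


def should_autoproxy(blocked, env):
    """Retry through the proxy only if blocked, a proxy is set, and not disabled."""
    has_proxy = bool(env.get("CFFI_PROXY") or env.get("CFFI_PROXY_REF"))
    enabled = env.get("CFFI_AUTOPROXY", "1") != "0"
    return blocked and has_proxy and enabled
```

Because the retry happens once, a caller would evaluate `should_autoproxy` a single time after the direct attempt rather than in a loop.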

Self-tuning playbook

Every outcome is logged to $REDFETCH_STATE/telemetry.jsonl and summarized in playbook.json. A host that reliably blocks direct requests is fetched proxy-first next time, skipping the doomed ~30s direct attempt. Disable with REDFETCH_NO_LEARN=1.
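One way to picture the telemetry-to-playbook step is the sketch below. The record fields (`host`, `route`, `ok`), the `min_failures` threshold, and the strategy labels are all assumptions made for illustration; the real telemetry.jsonl and playbook.json schemas are not documented here.

```python
import json
from collections import defaultdict


def build_playbook(lines, min_failures=3):
    """Map each host to a strategy based on its direct-fetch history."""
    stats = defaultdict(lambda: {"ok": 0, "fail": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL log
        record = json.loads(line)
        if record.get("route") != "direct":
            continue  # only direct attempts say whether direct works
        stats[record["host"]]["ok" if record["ok"] else "fail"] += 1
    playbook = {}
    for host, counts in stats.items():
        reliably_blocked = counts["ok"] == 0 and counts["fail"] >= min_failures
        playbook[host] = "proxy-first" if reliably_blocked else "direct-first"
    return playbook
```

Requiring several failures and no successes before switching keeps a single transient error from sending a host through the proxy permanently.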

Security

  • SSRF guard (url-guard.sh) runs before every request and re-validates every redirect hop (and in-browser navigation). It rejects loopback addresses, RFC 1918 private ranges, cloud-metadata endpoints (169.254.169.254), encoded-host bypasses, and internal TLDs. Fails closed.
  • Secret hygiene: proxy credentials are passed via env (all_proxy / Python proxies=), never as a CLI arg — so they never appear in ps/argv. Error messages redact user:pass@.
  • Output is untrusted DATA, not instructions — scraped pages may contain prompt-injection payloads; handle accordingly downstream.
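The kinds of checks an SSRF guard performs can be sketched with Python's standard library. This is not a port of url-guard.sh: the function names and the internal-suffix list are assumptions, only integer-encoded hosts are covered among the possible encoding tricks, and hostnames are not resolved (which the guard can do when URL_GUARD_RESOLVE=1).

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed examples of internal TLDs; the real list in url-guard.sh may differ.
INTERNAL_SUFFIXES = (".localhost", ".local", ".internal")


def _as_ip(host):
    """Parse dotted, IPv6, or integer-encoded hosts; return None for names."""
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        pass
    try:
        # 2130706433 and 0x7f000001 are both integer spellings of 127.0.0.1.
        return ipaddress.ip_address(int(host, 0))
    except ValueError:
        return None


def is_url_allowed(url):
    """Return True only for http(s) URLs whose host is not obviously internal."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".").lower()
    except ValueError:
        return False  # unparseable URL: fail closed
    if parts.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return False
    ip = _as_ip(host)
    if ip is None:
        return True  # a plain hostname; this sketch does not resolve DNS
    return not (
        ip.is_loopback or ip.is_private or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )
```

To match the behaviour described above, a caller would run the same check again on every redirect target instead of trusting the first URL alone.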

Environment variables

| Var | Default | Purpose |
| --- | --- | --- |
| PARSING_VENV | ~/.cache/redfetch/venv | Python venv with curl_cffi/trafilatura |
| CFFI_PROXY | (unset) | literal proxy URL |
| CFFI_PROXY_REF | (unset) | secrets-manager ref (e.g. op://…) resolved via op read |
| CFFI_AUTOPROXY | 1 | 0 disables auto-retry-via-proxy on block |
| REDFETCH_STATE | ~/.cache/redfetch | telemetry + playbook location |
| REDFETCH_NO_LEARN | (unset) | 1 disables telemetry/playbook writes |
| URL_GUARD_RESOLVE | 0 | 1 also resolves hostnames to defend against DNS rebinding |
| FETCH_ALLOW_NO_GUARD | 0 | 1 lets fetch.sh run when url-guard.sh is absent (the Python layer still re-checks). Do not use in production. |
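For a wrapper written in Python, the variables above could be read into one settings object as sketched here. The `Settings` class and `load_settings` function are hypothetical, and how redfetch itself treats values other than 0 and 1 is an assumption.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    venv: str
    state_dir: str
    proxy: Optional[str]
    proxy_ref: Optional[str]
    autoproxy: bool
    learn: bool
    resolve_dns: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Apply the documented defaults to whatever the environment provides."""
    return Settings(
        venv=os.path.expanduser(env.get("PARSING_VENV", "~/.cache/redfetch/venv")),
        state_dir=os.path.expanduser(env.get("REDFETCH_STATE", "~/.cache/redfetch")),
        proxy=env.get("CFFI_PROXY") or None,
        proxy_ref=env.get("CFFI_PROXY_REF") or None,
        autoproxy=env.get("CFFI_AUTOPROXY", "1") != "0",
        learn=env.get("REDFETCH_NO_LEARN", "") != "1",
        resolve_dns=env.get("URL_GUARD_RESOLVE", "0") == "1",
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in tests without touching the real environment.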

Health-check & tests

bash fetch-doctor.sh            # checks deps, SSRF guard, proxy, playbook (live)
bash fetch-doctor.sh --offline  # skip network
bash test-fetch.sh              # 46 hermetic tests (no network)

Files

| File | Role |
| --- | --- |
| fetch.sh | wrapper: SSRF-guard → venv python → tiered fetch |
| fetch_tiered.py | the ladder: curl_cffi → cloakbrowser, extraction, auto-proxy, playbook |
| cffi_get.sh | raw GET (browser TLS) for JSON/API endpoints |
| url-guard.sh | SSRF validator (standalone, sourceable) |
| redproxy.sh | run any command/URL through a configured proxy |
| fetch-doctor.sh | stack health-check |
| test-fetch.sh | hermetic test suite |

License

MIT © 2026 Igor Konovalchik
