redfetch

Tiered, self-hosted web fetcher with anti-bot bypass — cheap first, escalate only when blocked.

Most pages don't need a headless browser, and most "blocks" aren't about your User-Agent — they're about your TLS fingerprint or your IP. redfetch walks a cost-ordered ladder and only climbs when it actually hits a wall:

1. curl_cffi    — real browser TLS/JA3 + HTTP/2 fingerprint, NO JS   (fast, cheap)
2. cloakbrowser — stealth Chromium, renders JS / solves challenges    (heavier)
   ↳ auto-retry through a residential proxy on a detected block       (opt-in)

It returns clean Markdown (via trafilatura), guards against SSRF, and learns per host which strategy works so it stops wasting time on doomed direct attempts.
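The ladder above can be sketched as a loop over cost-ordered tiers. This is a minimal illustration, not the code in fetch_tiered.py: `FetchResult`, `fetch_with_ladder`, and the two stand-in tiers are hypothetical names, and the stand-ins return canned responses so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class FetchResult:
    tier: str      # name of the tier that produced this result
    status: int    # HTTP status code
    body: str      # response body, possibly empty
    blocked: bool  # True if the tier judged the response to be a block


# A tier is a (name, callable) pair; the callable maps a URL to a FetchResult.
Tier = Tuple[str, Callable[[str], FetchResult]]


def fetch_with_ladder(url: str, tiers: Sequence[Tier]) -> Optional[FetchResult]:
    """Try each tier in cost order; stop at the first unblocked, non-empty result."""
    last: Optional[FetchResult] = None
    for _name, fetch in tiers:
        last = fetch(url)
        if not last.blocked and last.body:
            return last  # the cheaper tier worked, so there is no need to climb
    return last          # every tier failed; the caller sees the last attempt


# Stand-in tiers (hypothetical) so the sketch needs no network.
def fake_cffi(url: str) -> FetchResult:
    return FetchResult("curl_cffi", 403, "", blocked=True)


def fake_browser(url: str) -> FetchResult:
    return FetchResult("cloakbrowser", 200, "<html>ok</html>", blocked=False)


result = fetch_with_ladder(
    "https://example.com/article",
    [("curl_cffi", fake_cffi), ("cloakbrowser", fake_browser)],
)
print(result.tier, result.status)
```

Keeping the tiers as plain callables means a proxied variant of either tier could be appended to the list without changing the loop.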

Built as part of the red* toolchain. Use it for legitimate research, auditing, and data collection — respect each site's Terms and rate limits.

Install

git clone https://github.com/Redloft/redfetch && cd redfetch
bash install.sh                 # creates a venv + installs curl_cffi, trafilatura
bash install.sh --with-browser  # also installs cloakbrowser (deep tier, ~200MB)
bash fetch-doctor.sh --offline  # verify the install

Requires: python3, bash, curl, jq. Optional: cloakbrowser (deep tier), op (1Password — for proxy secrets).

Usage

# Fetch a page → clean Markdown on stdout, JSON meta on stderr
bash fetch.sh --json "https://example.com/article"

bash fetch.sh --no-deep "<url>"   # curl_cffi only (never launch a browser)
bash fetch.sh --deep    "<url>"   # straight to the stealth browser

# Raw GET for JSON / autocomplete / API endpoints (no extraction)
bash cffi_get.sh "https://suggestqueries.google.com/complete/search?client=chrome&q=test"

fetch.sh meta (--json): {ok, tier, status, bytes, blocked, proxy_applied, autoproxy?, ssrf_blocked?, rate_limited?, nav_failed?}.

Exit codes: 0 ok · 1 blocked/empty · 2 SSRF/url-guard block · 3 deps missing · 4 hard error · 64 usage.
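A caller scripting around fetch.sh might map the exit code to its meaning and pull the meta object out of stderr. The sketch below assumes the meta arrives as a single-line JSON object on stderr, which this README does not spell out; `describe_exit`, `parse_meta`, and the sample stderr text are illustrative only.

```python
import json

# Exit codes as listed in this README.
EXIT_MEANINGS = {
    0: "ok",
    1: "blocked/empty",
    2: "SSRF/url-guard block",
    3: "deps missing",
    4: "hard error",
    64: "usage",
}


def describe_exit(code: int) -> str:
    """Translate a fetch.sh exit code into the README's wording."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def parse_meta(stderr_text: str) -> dict:
    """Return the last stderr line that parses as a JSON object, else {}.

    Assumption: the meta is emitted as one JSON object on a single line.
    """
    for line in reversed(stderr_text.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return {}


# Made-up stderr content, for illustration only.
sample_stderr = (
    "some diagnostic line\n"
    '{"ok": true, "tier": "curl_cffi", "status": 200, '
    '"bytes": 5120, "blocked": false, "proxy_applied": false}\n'
)
meta = parse_meta(sample_stderr)
print(describe_exit(0), meta["tier"], meta["status"])
```

In real use the inputs would come from something like `subprocess.run(["bash", "fetch.sh", "--json", url], capture_output=True, text=True)`, passing `proc.stderr` to `parse_meta` and `proc.returncode` to `describe_exit`.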

Residential proxy (for IP-reputation / geo blocks)

curl_cffi beats TLS-fingerprint blocks but not IP-based ones (datacenter bans, geo-walls, 429). For those, point redfetch at a residential proxy. The secret is never hardcoded and never put in argv — it travels through the environment only. Configure ONE:

# A) literal proxy URL in env
CFFI_PROXY='socks5h://user:pass@host:1080' bash fetch.sh --json "<url>"

# B) a 1Password reference (resolved on demand via `op read`)
export CFFI_PROXY_REF='op://Vault/Item/credential'

# convenience wrapper — run anything proxied:
./redproxy.sh                         # self-test: direct vs proxied IP + geo
./redproxy.sh "<url>"                 # fetch one URL proxied
./redproxy.sh curl -s https://api.ipify.org
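The "environment only" rule can be illustrated from Python: the proxy URL goes into the child's environment and is kept out of the command line, and any message that might echo it is redacted first. This is a sketch under assumptions, not what redproxy.sh does internally; `run_proxied` and `redact` are hypothetical helpers, and the demo child is just a Python one-liner that reports whether it can see the variable.

```python
import os
import re
import subprocess
import sys
from typing import List

# Matches "user:pass@" (or "user@") directly after a scheme separator.
_CREDENTIALS = re.compile(r"(?<=://)[^/@\s]+@")


def redact(text: str) -> str:
    """Replace credentials embedded in URLs with a placeholder."""
    return _CREDENTIALS.sub("***@", text)


def run_proxied(argv: List[str], proxy_url: str) -> subprocess.CompletedProcess:
    """Run argv with the proxy URL in the child's environment, never in argv."""
    if any(proxy_url in arg for arg in argv):
        raise ValueError("proxy URL must not appear in argv: " + redact(proxy_url))
    env = dict(os.environ)
    env["all_proxy"] = proxy_url  # read by curl; kept out of the command line
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Demo: the child reports whether the variable reached its environment.
child = [sys.executable, "-c", "import os; print('all_proxy' in os.environ)"]
proc = run_proxied(child, "socks5h://user:pass@host:1080")
print(proc.stdout.strip())
print(redact("connect failed via socks5h://user:pass@host:1080"))
```

Environment variables are still readable by the same user and by root, so this keeps secrets out of `ps` output and shell history rather than hiding them completely.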

Auto-proxy on block (self-healing)

When a direct fetch is blocked (challenge / 403 / 503 / empty / timeout), redfetch auto-retries once through the proxy (if one is configured) — no flags needed. Disable with CFFI_AUTOPROXY=0.
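The retry rule can be sketched as a block classifier plus a single proxied retry. The status set and the empty/timeout handling follow the list above; the challenge markers, `looks_blocked`, `fetch_with_autoproxy`, and the fake fetcher are assumptions for illustration and not redfetch's actual detection logic.

```python
import os
from typing import Callable, Optional, Tuple

BLOCK_STATUSES = {403, 503}  # statuses this README treats as blocks
# Hypothetical challenge-page markers; a real detector would be more careful.
CHALLENGE_MARKERS = ("just a moment", "verify you are human")

Response = Tuple[Optional[int], str]           # (status, body); None = timeout
Fetcher = Callable[[str, Optional[str]], Response]


def looks_blocked(status: Optional[int], body: str) -> bool:
    """Classify a response as blocked: timeout, 403/503, empty, or challenge."""
    if status is None or status in BLOCK_STATUSES:
        return True
    text = body.strip().lower()
    if not text:
        return True
    return any(marker in text for marker in CHALLENGE_MARKERS)


def fetch_with_autoproxy(url: str, fetch: Fetcher, proxy: Optional[str] = None) -> dict:
    """Fetch directly, then retry exactly once via the proxy if blocked."""
    autoproxy = os.environ.get("CFFI_AUTOPROXY", "1") != "0"
    status, body = fetch(url, None)             # direct attempt first
    retried = False
    if looks_blocked(status, body) and proxy and autoproxy:
        status, body = fetch(url, proxy)        # the single proxied retry
        retried = True
    return {"status": status, "blocked": looks_blocked(status, body), "autoproxy": retried}


# Fake fetcher: blocked when direct, fine through a proxy.
def fake_fetch(url: str, proxy: Optional[str]) -> Response:
    return (200, "article text") if proxy else (403, "")


print(fetch_with_autoproxy("https://example.com/", fake_fetch, proxy="socks5h://host:1080"))
```

Substring markers like these can misfire on long pages that merely mention a challenge, so treat the marker check as a placeholder.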

Self-tuning playbook

Every outcome is logged to $REDFETCH_STATE/telemetry.jsonl and summarized in playbook.json. A host that reliably blocks direct requests is fetched proxy-first next time, skipping the doomed ~30s direct attempt. Disable with REDFETCH_NO_LEARN=1.
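One way to turn a JSONL telemetry log into a per-host strategy is sketched below. The record fields (`host`, `route`, `ok`), the thresholds, and the `proxy-first` / `direct-first` labels are all assumptions made for this example; the real telemetry.jsonl and playbook.json schemas are not documented in this README.

```python
import json
from collections import defaultdict
from typing import Dict


def build_playbook(jsonl_text: str, min_samples: int = 3, fail_ratio: float = 0.8) -> Dict[str, str]:
    """Summarize direct-attempt outcomes per host into a routing preference."""
    stats = defaultdict(lambda: {"total": 0, "failed": 0})
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["route"] != "direct":
            continue  # only direct attempts say whether direct is doomed
        entry = stats[rec["host"]]
        entry["total"] += 1
        if not rec["ok"]:
            entry["failed"] += 1

    playbook = {}
    for host, entry in stats.items():
        # Require enough samples before declaring a host a reliable blocker.
        reliable_block = (
            entry["total"] >= min_samples
            and entry["failed"] / entry["total"] >= fail_ratio
        )
        playbook[host] = "proxy-first" if reliable_block else "direct-first"
    return playbook


# Made-up telemetry records, for illustration only.
sample = "\n".join([
    '{"host": "blocked.example", "route": "direct", "ok": false}',
    '{"host": "blocked.example", "route": "proxy", "ok": true}',
    '{"host": "blocked.example", "route": "direct", "ok": false}',
    '{"host": "blocked.example", "route": "direct", "ok": false}',
    '{"host": "open.example", "route": "direct", "ok": true}',
])
print(build_playbook(sample))
```

The `min_samples` floor keeps a single transient failure from flipping a host to proxy-first.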

Security

SSRF guard (url-guard.sh) runs before every request and re-validates every redirect hop (and in-browser navigation). It rejects loopback, RFC 1918, and cloud-metadata (169.254.169.254) addresses, encoded-host bypasses, and internal TLDs. Fails closed.
Secret hygiene: proxy credentials are passed via env (all_proxy / Python proxies=), never as a CLI arg — so they never appear in ps/argv. Error messages redact user:pass@.
Output is untrusted DATA, not instructions — scraped pages may contain prompt-injection payloads; handle accordingly downstream.
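The categories the SSRF guard covers can be illustrated with Python's standard `ipaddress` and `urllib.parse` modules. This is a sketch of the idea, not a port of url-guard.sh: `is_url_allowed` and the `INTERNAL_SUFFIXES` list are hypothetical, and hostname resolution (the `URL_GUARD_RESOLVE=1` case) is not attempted here.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical list of internal suffixes, for illustration only.
INTERNAL_SUFFIXES = (".internal", ".local", ".localhost", ".lan")


def _as_ip(host: str):
    """Parse dotted/IPv6 literals, plus decimal or 0x-hex encoded IPv4."""
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        pass
    try:
        return ipaddress.ip_address(int(host, 0))  # e.g. 2130706433 -> 127.0.0.1
    except ValueError:
        return None


def is_url_allowed(url: str) -> bool:
    """Return True only for http(s) URLs whose host is not obviously internal."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".").lower()
    except ValueError:
        return False  # malformed URL: fail closed
    if parts.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return False
    ip = _as_ip(host)
    if ip is None:
        return True  # ordinary hostname; DNS is not resolved in this sketch
    return not (
        ip.is_loopback or ip.is_private or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


for candidate in (
    "https://example.com/article",
    "http://169.254.169.254/latest/meta-data/",
    "http://2130706433/",
):
    print(candidate, is_url_allowed(candidate))
```

To mirror the per-hop re-validation described above, a caller would disable automatic redirects and run the same check on each `Location` target before following it.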

Environment variables

Var	Default	Purpose
`PARSING_VENV`	`~/.cache/redfetch/venv`	Python venv with curl_cffi/trafilatura
`CFFI_PROXY`	—	literal proxy URL
`CFFI_PROXY_REF`	—	secrets-manager ref (e.g. `op://…`) resolved via `op read`
`CFFI_AUTOPROXY`	`1`	`0` disables auto-retry-via-proxy on block
`REDFETCH_STATE`	`~/.cache/redfetch`	telemetry + playbook location
`REDFETCH_NO_LEARN`	—	`1` disables telemetry/playbook writes
`URL_GUARD_RESOLVE`	`0`	`1` also resolves hostnames to defend against DNS rebinding
`FETCH_ALLOW_NO_GUARD`	`0`	`1` lets `fetch.sh` run when `url-guard.sh` is absent (the Python layer still re-checks). Do not use in production.

Health-check & tests

bash fetch-doctor.sh            # checks deps, SSRF guard, proxy, playbook (live)
bash fetch-doctor.sh --offline  # skip network
bash test-fetch.sh              # 46 hermetic tests (no network)

Files

File	Role
`fetch.sh`	wrapper: SSRF-guard → venv python → tiered fetch
`fetch_tiered.py`	the ladder: curl_cffi → cloakbrowser, extraction, auto-proxy, playbook
`cffi_get.sh`	raw GET (browser TLS) for JSON/API endpoints
`url-guard.sh`	SSRF validator (standalone, sourceable)
`redproxy.sh`	run any command/URL through a configured proxy
`fetch-doctor.sh`	stack health-check
`test-fetch.sh`	hermetic test suite
