SiftMark is a zero-runtime-dependency Python CLI that turns public web pages into clean Markdown, JSON, JSONL, and link maps for AI agents.
It is built for people using OpenClaw, Hermes, Codex, Claude Code, Cursor, or any shell-capable agent that needs compact, citation-friendly web context without starting a browser stack.
AI agents often need compact web context that is easy to cite, inspect, and reuse. Full browser automation is useful for complex workflows, but many research tasks only need a polite public-page distiller that can run anywhere Python runs.
SiftMark focuses on that narrow job:
- one URL to clean Markdown for LLM context
- small same-domain crawls to a reusable research bundle
- JSON-LD, headings, links, images, and metadata extracted together
- a portable SKILL.md generator for agent workflows
- no runtime dependencies, no API key, no browser required
SiftMark is not an anti-bot bypass tool. It is a polite public-web distiller that respects robots.txt by default.
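To make the robots.txt default concrete, here is a minimal sketch of what a polite check can look like using only the Python standard library. It is not taken from SiftMark's source; the `is_allowed` helper, the sample robots.txt content, and the `example-agent` user agent are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it from
# https://<host>/robots.txt before requesting any page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path outside the Disallow rule is permitted; one inside it is not.
print(is_allowed(ROBOTS_TXT, "example-agent", "https://example.com/docs/"))
print(is_allowed(ROBOTS_TXT, "example-agent", "https://example.com/private/x"))
```

A crawler that follows this pattern skips any URL for which the check returns False instead of trying to work around the rule.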
From a clone:
git clone https://github.com/xiaokillua/siftmark.git
cd siftmark
python3 -m pip install -e .
Directly from GitHub:
python3 -m pip install "git+https://github.com/xiaokillua/siftmark.git"
Check it:
siftmark version
Fetch one page as Markdown:
siftmark fetch https://example.com
Fetch one page as JSON:
siftmark fetch https://example.com --format json --output example.json
Create a small crawl bundle:
siftmark crawl https://example.com --depth 1 --max-pages 10 --output ./example-bundle
The bundle contains:
example-bundle/
  README.md
  index.json
  links.csv
  pages.jsonl
  pages/
    example.com.md
    example.com.json
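Because pages.jsonl is a JSON Lines file, a downstream script can stream it one record at a time with the standard library. The sketch below assumes each line is a JSON object. The record fields are not documented in this README, so it only lists the keys it finds. The `iter_jsonl` helper is a hypothetical name, not part of SiftMark.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Directory name matches the --output value used in the crawl example.
    pages_file = Path("example-bundle") / "pages.jsonl"
    if pages_file.exists():
        for record in iter_jsonl(pages_file):
            # Field names are an unknown here, so just show what is present.
            print(sorted(record))
```

Streaming line by line keeps memory use flat even when a crawl produces many pages.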
Generate an agent skill:
siftmark skill --target openclaw --output ./skills/siftmark-web-research
siftmark skill --target hermes --output ./skills/siftmark-web-research
siftmark skill --target codex --output ./skills/siftmark-web-research
Open the local research console:
siftmark ui
Run the stdio MCP server for compatible agent clients:
siftmark mcp
The local console gives the project a quick visual workflow:
siftmark ui
Open the page, enter a public URL, then use Fetch for one-page Markdown/JSON or Crawl for a small same-domain research bundle.
For agent clients, add SiftMark as a stdio MCP server:
{
  "mcpServers": {
    "siftmark": {
      "command": "siftmark",
      "args": ["mcp"]
    }
  }
}
See docs/MCP.md for details.
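If a client already has a JSON config with other servers in it, the entry can be merged in rather than pasted by hand. The following is a sketch under assumptions: the config file location varies by client and is not specified here, and `add_siftmark_server` and `update_config_file` are hypothetical helpers, not part of SiftMark.

```python
import json
from pathlib import Path


def add_siftmark_server(config: dict) -> dict:
    """Return a copy of config with a 'siftmark' entry under 'mcpServers'."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same command and args as the JSON snippet above.
    servers["siftmark"] = {"command": "siftmark", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read a JSON config file if it exists, add the entry, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_siftmark_server(config)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

Calling `update_config_file(Path("path/to/client-config.json"))` with the path your client actually uses leaves existing servers untouched and adds or overwrites only the `siftmark` entry.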
from siftmark import CrawlOptions, FetchOptions, crawl, fetch_page, write_bundle

# Fetch a single page and print its Markdown.
page = fetch_page("https://example.com")
print(page.markdown)

# Crawl at most 5 pages to depth 1, then write the result as a bundle.
result = crawl(
    "https://example.com",
    CrawlOptions(max_pages=5, depth=1, fetch=FetchOptions(timeout=10)),
)
write_bundle(result, "example-bundle")

siftmark fetch URL [--format markdown|json] [--output PATH]
siftmark crawl URL [--depth N] [--max-pages N] [--output DIR]
siftmark skill [--target generic|openclaw|hermes|codex|claude-code] [--output DIR]
siftmark ui [--host 127.0.0.1] [--port 8765] [--no-open]
siftmark mcp
siftmark version
Useful flags:
- --ignore-robots: skip robots.txt checks when you have permission
- --user-agent: set a custom crawler identity
- --max-bytes: cap page size before parsing
- --external: allow off-domain links during crawls
- --delay: add crawl delay between pages
- --insecure: disable TLS verification only when your local Python certificate store is broken and you trust the target
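Agents that shell out to the CLI often build the argument list in code. The sketch below assembles a `siftmark crawl` command from the options listed above and runs it with `subprocess`. `build_crawl_command` and `run_crawl` are hypothetical helper names, and treating `--external` as a value-less switch is an assumption based on its description.

```python
import subprocess
from typing import List


def build_crawl_command(
    url: str,
    output_dir: str,
    depth: int = 1,
    max_pages: int = 10,
    external: bool = False,
) -> List[str]:
    """Build the argv list for a `siftmark crawl` invocation."""
    cmd = [
        "siftmark", "crawl", url,
        "--depth", str(depth),
        "--max-pages", str(max_pages),
        "--output", output_dir,
    ]
    if external:
        # Assumed to be a boolean switch that takes no value.
        cmd.append("--external")
    return cmd


def run_crawl(url: str, output_dir: str, **options) -> None:
    """Run the crawl and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_crawl_command(url, output_dir, **options), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a URL contains characters such as `&` or `?`.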
Use SiftMark only for public pages you are allowed to access. Respect robots.txt, terms of service, copyright, privacy, rate limits, and local laws. For JavaScript-heavy, login-only, paywalled, or protected pages, use a browser automation tool with explicit permission instead.
- MCP client configuration examples for Claude Desktop, Cursor, Codex, and OpenClaw-compatible tools
- optional Playwright adapter for JavaScript-rendered pages
- selector memory for repeat extraction jobs
- output templates for research reports and dataset cards
- packaged examples for OSINT, docs migration, competitive research, and RAG prep
Set these topics after publishing:
ai-agents web-scraping markdown python openclaw hermes-agent codex claude-code research llm jsonl
python3 -m pip install -e .
python3 -m unittest discover -s tests
Release notes and PyPI preparation live in docs/PUBLISHING.md.
MIT
