xiaokillua/siftmark

SiftMark

Chinese documentation | MCP | Design notes | Agent skill

SiftMark is a zero-runtime-dependency Python CLI that turns public web pages into clean Markdown, JSON, JSONL, and link maps for AI agents.

It is built for people using OpenClaw, Hermes, Codex, Claude Code, Cursor, or any shell-capable agent that needs compact, citation-friendly web context without starting a browser stack.

Figure: SiftMark local research console

Why It Exists

AI agents often need compact web context that is easy to cite, inspect, and reuse. Full browser automation is useful for complex workflows, but many research tasks only need a polite public-page distiller that can run anywhere Python runs.

SiftMark focuses on that narrow job:

  • one URL to clean Markdown for LLM context
  • small same-domain crawls to a reusable research bundle
  • JSON-LD, headings, links, images, and metadata extracted together
  • a portable SKILL.md generator for agent workflows
  • no runtime dependencies, no API key, no browser required

SiftMark is not an anti-bot bypass tool. It is a polite public-web distiller that respects robots.txt by default.
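
To make the robots.txt behavior concrete, here is a minimal sketch of how a polite fetcher can check permission before requesting a page, using Python's standard `urllib.robotparser`. It is not SiftMark's actual implementation; the `is_allowed` helper, the sample rules, and the `example-bot` user agent are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would download it from
# https://<host>/robots.txt before requesting any page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


docs_ok = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/docs/")
private_ok = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/x")
print(docs_ok, private_ok)
```

A crawler that respects robots.txt by default would skip any URL for which a check like this returns False.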

Install

From a clone:

git clone https://github.com/xiaokillua/siftmark.git
cd siftmark
python3 -m pip install -e .

Directly from GitHub:

python3 -m pip install "git+https://github.com/xiaokillua/siftmark.git"

Check it:

siftmark version

Quick Start

Fetch one page as Markdown:

siftmark fetch https://example.com

Fetch one page as JSON:

siftmark fetch https://example.com --format json --output example.json

Create a small crawl bundle:

siftmark crawl https://example.com --depth 1 --max-pages 10 --output ./example-bundle

The bundle contains:

example-bundle/
  README.md
  index.json
  links.csv
  pages.jsonl
  pages/
    example.com.md
    example.com.json
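
As a rough sketch of how an agent might consume the bundle, the snippet below reads a JSON Lines file such as pages.jsonl one record at a time. The record keys shown (`url`, `title`) are assumptions for illustration only; inspect a real bundle to see which fields SiftMark actually writes.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo: write a stand-in pages.jsonl with hypothetical
# "url" and "title" keys, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pages.jsonl"
    sample.write_text(
        '{"url": "https://example.com", "title": "Example Domain"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))

print(records[0]["url"])
```

Against a real bundle, point `read_jsonl` at `example-bundle/pages.jsonl` instead of the temporary file.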

Generate an agent skill:

siftmark skill --target openclaw --output ./skills/siftmark-web-research
siftmark skill --target hermes --output ./skills/siftmark-web-research
siftmark skill --target codex --output ./skills/siftmark-web-research

Open the local research console:

siftmark ui

Run the stdio MCP server for compatible agent clients:

siftmark mcp

Demo

The local console provides a quick visual workflow:

siftmark ui

Open the page, enter a public URL, then use Fetch for one-page Markdown/JSON or Crawl for a small same-domain research bundle.

For agent clients, add SiftMark as a stdio MCP server:

{
  "mcpServers": {
    "siftmark": {
      "command": "siftmark",
      "args": ["mcp"]
    }
  }
}

See docs/MCP.md for details.
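
If a client already has other MCP servers configured, the entry above has to be merged rather than pasted over the file. The following is a minimal sketch of that merge on an in-memory dict; the `add_siftmark_server` helper and the `other-tool` entry are hypothetical, and the location of the config file depends on the client.

```python
import json


def add_siftmark_server(config: dict) -> dict:
    """Return a copy of config with a 'siftmark' stdio server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same command and args as the JSON snippet in this README.
    servers["siftmark"] = {"command": "siftmark", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing client configuration with one other server.
existing = {"mcpServers": {"other-tool": {"command": "other-tool", "args": []}}}
updated = add_siftmark_server(existing)
print(json.dumps(updated, indent=2))
```

The helper copies the top-level and `mcpServers` dicts, so the original configuration is left unmodified.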

Python API

from siftmark import CrawlOptions, FetchOptions, crawl, fetch_page, write_bundle

page = fetch_page("https://example.com")
print(page.markdown)

result = crawl(
    "https://example.com",
    CrawlOptions(max_pages=5, depth=1, fetch=FetchOptions(timeout=10)),
)
write_bundle(result, "example-bundle")

CLI Reference

siftmark fetch URL [--format markdown|json] [--output PATH]
siftmark crawl URL [--depth N] [--max-pages N] [--output DIR]
siftmark skill [--target generic|openclaw|hermes|codex|claude-code] [--output DIR]
siftmark ui [--host 127.0.0.1] [--port 8765] [--no-open]
siftmark mcp
siftmark version

Useful flags:

  • --ignore-robots: skip robots.txt checks when you have permission
  • --user-agent: set a custom crawler identity
  • --max-bytes: cap page size before parsing
  • --external: allow off-domain links during crawls
  • --delay: add crawl delay between pages
  • --insecure: disable TLS verification only when your local Python certificate store is broken and you trust the target
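
For agents that drive the CLI from Python, one option is to assemble the argument list explicitly and pass it to `subprocess.run`. The sketch below only builds and prints the command; `build_crawl_argv` is a hypothetical helper, and treating `--external` as a valueless switch is an assumption based on its description above.

```python
import shlex


def build_crawl_argv(url, output_dir, depth=1, max_pages=10, external=False):
    """Build an argv list for `siftmark crawl` from documented flags."""
    argv = [
        "siftmark", "crawl", url,
        "--depth", str(depth),
        "--max-pages", str(max_pages),
        "--output", output_dir,
    ]
    if external:
        # Assumed to be a boolean switch that takes no value.
        argv.append("--external")
    return argv


argv = build_crawl_argv("https://example.com", "./example-bundle")
print(shlex.join(argv))
# To execute the crawl: subprocess.run(argv, check=True)
```

Passing a list rather than a shell string avoids quoting problems when URLs contain characters such as `&` or `?`.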

Responsible Use

Use SiftMark only for public pages you are allowed to access. Respect robots.txt, terms of service, copyright, privacy, rate limits, and local laws. For JavaScript-heavy, login-only, paywalled, or protected pages, use a browser automation tool with explicit permission instead.

Roadmap

  • MCP client configuration examples for Claude Desktop, Cursor, Codex, and OpenClaw-compatible tools
  • optional Playwright adapter for JavaScript-rendered pages
  • selector memory for repeat extraction jobs
  • output templates for research reports and dataset cards
  • packaged examples for OSINT, docs migration, competitive research, and RAG prep

Recommended GitHub Topics

Set these topics after publishing:

ai-agents web-scraping markdown python openclaw hermes-agent codex claude-code research llm jsonl

Development

python3 -m pip install -e .
python3 -m unittest discover -s tests

Release notes and PyPI preparation live in docs/PUBLISHING.md.

License

MIT
