nizarfadlan/threads-scraper

ThreadsAPI

ThreadsAPI is a reusable Python library for collecting public Threads feed data using GraphQL replay. It is designed for maintainability, explicit configuration, and production-oriented error handling.

Disclaimer: This project is provided only for educational, research, or entertainment purposes. Users are responsible for complying with applicable laws, platform terms of service, and data privacy requirements.

Key Capabilities

  • Fetch public logged-out Threads feed data with multi-page pagination.
  • Country filtering via standard ISO 3166-1 alpha-2 codes (validated with pycountry).
  • Capture runtime replay parameters (tokens, cookies, request metadata).
  • Strategy-aware token and doc_id refresh (HTTP-first in auto, controlled fallback to browser).
  • Expand same-author thread continuations, including media in child posts.
  • Export OpenSearch-friendly post documents.
  • Optionally persist local session state to JSON.

Installation

This project uses uv and pyproject.toml as the dependency source of truth.

uv sync
uv run scrapling install

If browser assets need to be reinstalled:

uv run scrapling install --force

Quick Start

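import asyncio

from threadsapi import ThreadsScraper


async def main():
    # "async with" is only valid inside a coroutine, so wrap the crawl in main().
    async with ThreadsScraper(concurrent=3) as scraper:
        return await scraper.crawl_pages(
            pages=3,
            country="ID",
            include_full_thread=True,
        )

posts = asyncio.run(main())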
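
As a rough sketch of one thing that could be done with the posts returned by crawl_pages, the helper below serializes post documents into the newline-delimited body that the OpenSearch _bulk endpoint expects. It assumes each post is a plain dict with an "id" field. That field name, the to_bulk_ndjson helper, and the "threads-posts" index name are illustrative and not part of ThreadsAPI.

```python
import json
from typing import Any, Iterable


def to_bulk_ndjson(posts: Iterable[dict[str, Any]], index: str = "threads-posts") -> str:
    """Build an OpenSearch _bulk body: one action line plus one document line per post."""
    lines: list[str] = []
    for post in posts:
        # Assumption: every post document carries a stable "id" field.
        action = {"index": {"_index": index, "_id": str(post["id"])}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(post, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string can be sent with Content-Type: application/x-ndjson by whichever HTTP or OpenSearch client the application already uses.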

Configuration

Using ScraperConfig

from threadsapi import ScraperConfig, ThreadsScraper

config = ScraperConfig(
    concurrent=3,
    bootstrap_strategy="auto",
)

async with ThreadsScraper(config=config) as scraper:
    posts = await scraper.crawl_pages(pages=2)

From environment variables

Copy the example configuration and export or load it into your runtime environment:

cp .env.example .env

from threadsapi import ScraperConfig

config = ScraperConfig.from_env()

See .env.example for all supported environment variable names and default example values. ScraperConfig.from_env() reads variables already available in the process environment; load .env through your application runner or environment management tool when needed.
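
Because from_env() only sees variables already present in the process environment, something has to load .env first. The snippet below is a minimal stdlib-only sketch of that step. The load_dotenv_file helper is hypothetical and handles only plain KEY=VALUE lines, with no "export" prefixes, multi-line values, or interpolation.

```python
import os
from pathlib import Path


def load_dotenv_file(path: str = ".env", override: bool = False) -> dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ."""
    loaded: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes, if any
        # Existing environment values win unless override=True.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Call load_dotenv_file() before ScraperConfig.from_env(). In a real project, a maintained loader such as python-dotenv, or the env-file support of your process runner, is usually the better choice.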

From YAML

Copy the example file before customizing local values:

cp threadsapi.example.yaml threadsapi.yaml

from threadsapi import ScraperConfig

config = ScraperConfig.from_yaml("threadsapi.yaml")

See threadsapi.example.yaml for a complete configuration example.

ThreadsScraper constructor options

ThreadsScraper(
    concurrent=3,
    bootstrap_strategy="auto",        # "auto" | "http" | "browser"
    auth_strategy="session",          # "session" | "direct" | "auto"; inferred as "direct" when login credentials are provided
    session_path="threads-session.json",
    persist_session=False,
    base_url="https://www.threads.com",
    graphql_url="https://www.threads.com/graphql/query",
    app_id=None,
    asbd_id=None,
    user_agent="...",
    timeout_seconds=30,
    browser_headless=True,
    login_username=None,
    login_password=None,
    login_two_factor_code=None,
)

You normally do not need to set auth_strategy. Without login credentials the scraper starts anonymous; when login_username and login_password are provided, it switches to authenticated direct login automatically.

app_id and asbd_id are public runtime header values observed in Threads web requests. Both are optional: when omitted, the scraper adopts them from the request headers captured during bootstrap, so they do not need to be hard-coded.

Country Filtering

The country parameter accepts ISO 3166-1 alpha-2 codes (e.g. "ID", "US", "JP") — validated via pycountry. Pass "world" or None to disable country filtering.

posts = await scraper.crawl_pages(pages=2, country="US")

Invalid codes raise ValueError immediately. A valid ISO country may still return no public feed content from Threads; treat that as NO CONTENT, not as invalid input.
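
The rules above can be sketched as a small normalization function. This is not the library's implementation: normalize_country is a hypothetical name, the three-code sample set stands in for the pycountry lookup, and the case-insensitive handling is an assumption.

```python
from typing import Callable, Optional

# Stand-in lookup for this sketch only; the library validates with pycountry.
_SAMPLE_CODES = frozenset({"ID", "US", "JP"})


def normalize_country(
    country: Optional[str],
    is_known: Callable[[str], bool] = _SAMPLE_CODES.__contains__,
) -> Optional[str]:
    """Return an upper-case alpha-2 code, or None when filtering is disabled."""
    # None and "world" both mean "no country filter".
    if country is None or country.strip().lower() == "world":
        return None
    code = country.strip().upper()
    # Fail fast on malformed or unknown codes.
    if len(code) != 2 or not code.isalpha() or not is_known(code):
        raise ValueError(f"Invalid ISO 3166-1 alpha-2 country code: {country!r}")
    return code
```

With pycountry installed, the lookup could be passed as is_known=lambda c: pycountry.countries.get(alpha_2=c) is not None. An empty result for a code that passes this check is still the NO CONTENT case, not a validation error.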

Bootstrap and Refresh Strategies

Bootstrap and token/doc_id refresh both respect the configured strategy:

  • auto (default): HTTP-first bootstrap and refresh, with browser fallback only when replay state is incomplete or HTTP fails.
  • http: HTTP-only — lighter, but will raise errors if HTTP bootstrap cannot provide complete replay state.
  • browser: always use browser capture — more resource-heavy but captures the richest replay state.

Refresh is triggered automatically on auth failures (401/403), expired doc_ids, or GraphQL execution errors.
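
The strategy semantics described above can be illustrated with a small, self-contained sketch. It only mirrors the documented behavior and is not the code in bootstrap.py. The bootstrap function, the via_http and via_browser callables, and the REQUIRED_KEYS used to decide whether replay state is complete are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

State = dict[str, str]
REQUIRED_KEYS = ("lsd", "doc_id")  # hypothetical minimum replay state


def is_complete(state: Optional[State]) -> bool:
    """A state is usable only if every required key has a non-empty value."""
    return state is not None and all(state.get(k) for k in REQUIRED_KEYS)


async def bootstrap(
    strategy: str,
    via_http: Callable[[], Awaitable[State]],
    via_browser: Callable[[], Awaitable[State]],
) -> State:
    if strategy == "browser":
        return await via_browser()
    if strategy not in ("auto", "http"):
        raise ValueError(f"Unknown strategy: {strategy!r}")
    error: Optional[Exception] = None
    try:
        state = await via_http()
        if is_complete(state):
            return state
    except Exception as exc:  # broad on purpose in this sketch
        error = exc
    if strategy == "http":
        # http never falls back: incomplete state or failure is an error.
        raise RuntimeError("HTTP bootstrap did not yield complete replay state") from error
    # auto: fall back to the browser only after HTTP failed or was incomplete.
    return await via_browser()
```

Passing the two capture paths as callables keeps the fallback decision in one place, so the same function could serve both the initial bootstrap and a later refresh.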

Direct Web Login and Session Persistence

For this GraphQL scraper, “login to Threads” means capturing an authenticated Threads web session and reusing the resulting cookies/runtime replay state for threads.com/graphql/query.

Direct web login posts username/password to Threads web login endpoints. It uses safe password encryption when key material is discoverable from Threads page/JS bundles, handles two_factor_required, and persists only authenticated web cookies/tokens — never the password or 2FA code.

The mobile Instagram Bloks login endpoint returns a mobile bearer token (Bearer IGT:2:<token>). That token is for mobile private API requests and does not provide the web cookies, lsd, fb_dtsg, headers, doc IDs, or captured variables required by this GraphQL replay client.

import asyncio

from threadsapi import ThreadsScraper, TwoFactorRequired


async def main() -> None:
    scraper = ThreadsScraper(
        bootstrap_strategy="auto",
        session_path="threads-session.json",
        persist_session=True,
        login_username="your_username",
        login_password="your_password",
        login_two_factor_code=None,  # set to auto-submit or leave None
    )

    try:
        await scraper.init()
    except TwoFactorRequired as exc:
        # Threads asked for a second factor; finish the login with the challenge.
        code = input("2FA code: ")
        await scraper.complete_two_factor(exc.challenge, code)
    finally:
        # Always close the scraper, even if login or 2FA verification fails.
        await scraper.close()


asyncio.run(main())

After the session is saved, reuse it normally. The persisted authenticated session is preferred automatically when it is still fresh:

async with ThreadsScraper(
    bootstrap_strategy="auto",
    session_path="threads-session.json",
    persist_session=True,
) as scraper:
    posts = await scraper.crawl_pages(pages=2, include_full_thread=True)

From the TUI:

uv sync --extra tui
uv run python scripts/tui.py

Open Config & Info, set the session path, enter username/password, then click Direct Login. If Threads requires 2FA, enter the code and click Verify 2FA. Use Cancel Login to stop an active login attempt.

To validate login, check Config & Info: authenticated sessions show Mode: authenticated and Auth cookies: yes. You can also use Account Search, which relies on logged-in GraphQL variables. A working public feed does not prove that you are logged in, because the feed also works anonymously.

From environment or YAML, credentials are enough to enable direct authenticated login:

THREADSAPI_USERNAME=your_username
THREADSAPI_PASSWORD=your_password

login:
  username: your_username
  password: your_password

Limitations:

  • Password encryption depends on runtime key material in Threads web JS; login raises PasswordEncryptionUnavailable when the keys cannot be found.
  • The login endpoints may change shape; direct login is experimental and may break without notice.
  • Direct login failures never fall back to anonymous scraping.

Error Model

ThreadsAPI uses typed exceptions for explicit handling:

  • ConfigError: invalid runtime configuration
  • BootstrapError: failed HTTP/browser bootstrap
  • AuthenticationError: direct web login failed
  • InvalidCredentialsError: username/password rejected
  • TwoFactorRequired: login requires 2FA (carries a challenge for complete_two_factor())
  • TwoFactorError: 2FA verification failed
  • PasswordEncryptionUnavailable: safe password encryption could not be performed
  • TransportError: terminal GraphQL transport failure
  • GraphQLDecodeError: non-JSON or invalid GraphQL response body
  • RateLimitError: retry budget exhausted for retryable rate-limit responses
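
Typed exceptions make it possible to retry only the failures that are worth retrying. The helper below is a generic sketch, not part of ThreadsAPI: retry_async and its parameters are hypothetical, and the exception types are passed in rather than imported so the snippet stays self-contained.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Retry call() on the given exception types with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # budget exhausted: surface the original error
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Assuming the exception classes are exported from threadsapi, an application might pass retry_on=(RateLimitError, TransportError) with a generous base_delay, while letting ConfigError, InvalidCredentialsError, and TwoFactorRequired propagate, since repeating those calls unchanged will not help.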

Security Notes

Session JSON files contain sensitive cookies/tokens after login:

  • Do not commit session files to Git.
  • Keep them in a secure local/private environment.
  • Do not share token/cookie values in logs, issues, or public channels.
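
Two of these points can be partly automated. The sketch below masks sensitive-looking keys before a structure is logged and restricts a session file to its owner. Both helpers are hypothetical, and the key-name hints are a heuristic rather than the actual layout of the session JSON.

```python
import os
import stat
from typing import Any

# Heuristic substrings; adjust to the keys actually present in your session file.
SENSITIVE_HINTS = ("cookie", "token", "password", "lsd", "fb_dtsg", "sessionid")


def redact(value: Any, mask: str = "***") -> Any:
    """Return a copy of nested dicts/lists with sensitive-looking keys masked."""
    if isinstance(value, dict):
        return {
            k: mask if any(h in str(k).lower() for h in SENSITIVE_HINTS) else redact(v, mask)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v, mask) for v in value]
    return value


def restrict_permissions(path: str) -> None:
    """Make the file readable and writable by its owner only (POSIX mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

On Windows, os.chmod can only toggle the read-only flag, so file access there needs to be limited through other means. Adding the session path to .gitignore covers the first point.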

Development

Run example:

uv run python example.py

Run tests:

uv run python -m unittest discover -s tests -p "test_*.py"
uv run python -m py_compile threadsapi/*.py tests/test_*.py example.py scripts/test_countries.py

Test country availability quickly:

uv run python scripts/test_countries.py
uv run python scripts/test_countries.py ID US JP world --per-page 3

Project Structure

threadsapi/
├── __init__.py      # public exports
├── client.py        # ThreadsScraper orchestration
├── session.py       # token/session model + JSON persistence
├── registry.py      # doc_id registry/discovery
├── bootstrap.py     # HTTP/browser/session bootstrap
├── web_auth.py      # direct Threads web login + 2FA
├── transport.py     # GraphQL request lifecycle and retry handling
└── parsers.py       # parsing and OpenSearch document mapping

Troubleshooting

Issue | Resolution
ModuleNotFoundError: No module named 'curl_cffi' | Ensure scrapling[fetchers] is installed, then run uv sync
Missing browser executable | Run uv run scrapling install
Deprecated "Use StealthyFetcher.configure()" warning | Use StealthyFetcher.configure(...) and StealthyFetcher.async_fetch(...)
