ruminate

Chew over your tweet archive.

A toolkit for building, slicing, and analysing a Twitter/X archive across multiple input formats. Format-agnostic by design — drop in a new ingest source and the rest of the pipeline works unchanged.

What it does

Pipeline — turn raw archive data into useful files:

  • collate — read an input source, dedupe by tweet ID, sort chronologically, and write tweets.json, tweets.md, and tweets-llm.txt
  • slice — split tweets-llm.txt into per-month files and bundle into tweets-llm.zip
  • pipeline — do both in one step
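The dedupe-and-sort behaviour of collate can be sketched roughly as below. This is an illustration, not the project's code: the Tweet fields (id, created_at, text) and the collate function name are assumptions, and keeping the first occurrence of a duplicate ID is an assumed policy.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tweet:
    # Hypothetical stand-in for the canonical Tweet dataclass in tweet.py.
    id: str
    created_at: datetime
    text: str


def collate(tweets):
    """Drop repeated tweet IDs (first occurrence wins), then sort oldest-first."""
    seen = {}
    for tweet in tweets:
        # setdefault keeps the tweet already stored under this ID, if any.
        seen.setdefault(tweet.id, tweet)
    return sorted(seen.values(), key=lambda t: t.created_at)


raw = [
    Tweet("2", datetime(2021, 5, 2), "second"),
    Tweet("1", datetime(2021, 5, 1), "first"),
    Tweet("2", datetime(2021, 5, 2), "second (duplicate)"),
]
result = collate(raw)
print([t.id for t in result])
```

Keying a dict on the tweet ID keeps the dedupe pass linear in the number of tweets; the sort then runs only over the unique ones.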

Analysis — explore the archive:

  • cluster — classify tweets into topic clusters (regex + fuzzy keyword matching, configurable via clusters.py)
  • search — find tweets by plain text, fuzzy match, or /regex/
  • month — read a single month with topic breakdown
  • stats — totals by year/month, tweet types, top mentions, top hashtags
  • debug — explain why a specific tweet matched (or didn't)
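As a rough sketch of how a search query could be interpreted, the snippet below treats a query wrapped in slashes as a regex and anything else as plain text. The build_matcher helper is hypothetical, case-insensitive matching is an assumption, and fuzzy mode is left out because how it is selected is not described here.

```python
import re


def build_matcher(query):
    """Return a predicate over tweet text (hypothetical helper, not ruminate's API)."""
    # A query such as /\bapi\b/ is treated as a regular expression.
    if len(query) > 2 and query.startswith("/") and query.endswith("/"):
        pattern = re.compile(query[1:-1], re.IGNORECASE)
        return lambda text: pattern.search(text) is not None
    # Anything else is a case-insensitive substring search.
    needle = query.casefold()
    return lambda text: needle in text.casefold()


tweets = ["Shipping the new API today", "Coffee first", "apis are hard"]
plain = build_matcher("coffee")
regex = build_matcher(r"/\bapi\b/")
plain_hits = [t for t in tweets if plain(t)]
regex_hits = [t for t in tweets if regex(t)]
print(plain_hits)
print(regex_hits)
```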

Supported Input Sources

Currently supported:

  • IFTTT Google Drive xlsx archives

Planned (not yet implemented):

  • Official Twitter/X archive export (data/tweets.js)
  • community-archive.org JSON format

Adding a new format is intended to be straightforward — see docs/adding-a-source.md.

Quick start

git clone https://github.com/hivesong/ruminate
cd ruminate
bash setup.sh

# Launch the interactive menu
./ruminate

# Or run individual commands
./ruminate pipeline path/to/archive.zip
./ruminate cluster
./ruminate search "your search term"
./ruminate stats
./ruminate --help

setup.sh creates a local ./venv and installs dependencies there — your system Python and global packages are not touched.

To work on this repo, clone it over SSH instead. Assuming your SSH keys are set up, use:

git clone git@github.com:hivesong/ruminate.git

Requirements

  • Python 3.10+
  • openpyxl (for xlsx ingest)
  • rapidfuzz (for fuzzy keyword matching)

openpyxl and rapidfuzz are installed automatically into the local venv by setup.sh.

Project layout

ruminate/
├── ruminate                       # bash launcher (top-level command)
├── setup.sh                       # one-time installer
├── clusters.py                    # ← edit this to customise topic clusters
├── pyproject.toml
├── requirements.txt
├── src/
│   └── ruminate/
│       ├── cli.py                 # menu + argparse dispatch
│       ├── tweet.py               # canonical Tweet dataclass
│       ├── parser.py              # tweets-llm.txt reader
│       ├── classifier.py          # regex + fuzzy
│       ├── output.py              # all file writers
│       ├── ingest/                # pluggable input formats
│       │   ├── base.py            # abstract Source class
│       │   └── ifttt_xlsx.py      # IFTTT Google Drive xlsx archives
│       └── commands/              # one module per subcommand
│           ├── collate.py
│           ├── slice.py
│           ├── cluster.py
│           ├── search.py
│           ├── month.py
│           ├── stats.py
│           └── debug.py
└── docs/
    ├── starter-prompt-template.md # for kicking off LLM analysis sessions
    └── adding-a-source.md         # how to wire in a new input format

Customising topic clusters

Open clusters.py and edit the CLUSTERS list. An LLM can be a good way to draft these entries.

Each cluster is a 4-tuple:

(
    "my-topic",                # slug — used in output filenames
    "My Topic Name",           # display name in reports
    [r"\bAPI\b", r"\bSDK\b"],  # regex patterns (case-insensitive)
    ["framework", "library"]   # fuzzy keywords (tolerate typos)
)

Regex patterns are best for short acronyms or specific tokens.
Fuzzy keywords tolerate spelling errors — the edit-distance threshold scales with word length:

Keyword length   Allowed edits
≤ 3 chars        0 (exact)
4–9 chars        1
10–14 chars      2
15+ chars        3
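The length-based thresholds in the table above can be written as a small function. The sketch below is standard-library only: it uses a hand-written Levenshtein distance rather than rapidfuzz, and it compares the keyword against whitespace-separated words, which is an assumption about how matching is done. The names allowed_edits, edit_distance, and fuzzy_hit are illustrative.

```python
def allowed_edits(keyword):
    """Edit-distance budget from the table above, keyed on keyword length."""
    n = len(keyword)
    if n <= 3:
        return 0
    if n <= 9:
        return 1
    if n <= 14:
        return 2
    return 3


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_hit(keyword, text):
    """True if any word in the text is within the keyword's edit budget."""
    budget = allowed_edits(keyword)
    key = keyword.lower()
    return any(edit_distance(key, word) <= budget for word in text.lower().split())


print(fuzzy_hit("framework", "a new framwork dropped"))
print(fuzzy_hit("sdk", "the sdj broke"))
```

Short keywords get a budget of zero because a single edit on a three-letter word changes too much of it to still count as the same word.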

After editing, run ./ruminate cluster to regenerate cluster files, or ./ruminate debug "tweet text fragment" to see exactly which patterns fired.
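To give a feel for what "which patterns fired" means, here is a minimal sketch covering the regex half only. It reuses the 4-tuple shape shown earlier with case-insensitive matching; the cluster entries and the fired_patterns helper are made up for illustration and are not the debug command's actual implementation.

```python
import re

# Same shape as the 4-tuple described earlier; the entries are illustrative only.
CLUSTERS = [
    ("dev-tools", "Developer Tools", [r"\bAPI\b", r"\bSDK\b"], ["framework", "library"]),
    ("coffee", "Coffee", [r"\bespresso\b"], ["cappuccino"]),
]


def fired_patterns(text, clusters=CLUSTERS):
    """Map cluster slug -> regex patterns that match the text (regex half only)."""
    hits = {}
    for slug, _name, patterns, _keywords in clusters:
        matched = [p for p in patterns if re.search(p, text, re.IGNORECASE)]
        if matched:
            hits[slug] = matched
    return hits


print(fired_patterns("New sdk release, docs for the api are up"))
print(fired_patterns("Nothing relevant here"))
```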

Adding a new input format

See docs/adding-a-source.md. The short version:

  1. Write a new module in src/ruminate/ingest/ that subclasses Source.
  2. Implement detect() (cheap check: does this look like my format?) and iter_tweets() (yield canonical Tweet objects).
  3. Register the class in src/ruminate/ingest/__init__.py.

No other code changes needed — every downstream command consumes canonical Tweet objects.
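The first two steps might look something like the sketch below. Everything here is an assumption made for illustration: the Tweet fields, the Source constructor and method signatures, and the JSON-lines format itself are stand-ins, so check base.py and tweet.py for the real interfaces. Registration (step 3) is not shown because its mechanism is not described here.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator


@dataclass
class Tweet:
    # Hypothetical stand-in for the canonical Tweet in tweet.py.
    id: str
    created_at: datetime
    text: str


class Source(ABC):
    # Hypothetical stand-in for the abstract Source in ingest/base.py.
    def __init__(self, path):
        self.path = Path(path)

    @classmethod
    @abstractmethod
    def detect(cls, path) -> bool:
        """Cheap check: does this path look like my format?"""

    @abstractmethod
    def iter_tweets(self) -> Iterator[Tweet]:
        """Yield canonical Tweet objects."""


class JsonLinesSource(Source):
    """Made-up example format: one JSON object per line."""

    @classmethod
    def detect(cls, path) -> bool:
        # Only inspects the file extension, so it stays cheap.
        return Path(path).suffix == ".jsonl"

    def iter_tweets(self) -> Iterator[Tweet]:
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    row = json.loads(line)
                    yield Tweet(
                        id=str(row["id"]),
                        created_at=datetime.fromisoformat(row["created_at"]),
                        text=row["text"],
                    )


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": 1, "created_at": "2021-05-01T12:00:00", "text": "hello"}\n',
        encoding="utf-8",
    )
    detected = JsonLinesSource.detect(sample)
    tweets = list(JsonLinesSource(sample).iter_tweets())
print(detected, tweets)
```

Keeping detect() to a cheap check means every registered source can be probed in turn without fully parsing the input.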

Output locations

All generated files go to ./output/:

  • output/tweets.json / tweets.md / tweets-llm.txt — collate outputs
  • output/months/YYYY-MM.txt — slice outputs
  • output/tweets-llm.zip — bundled archive for LLM sessions
  • output/tweet-clusters/cluster_*.txt — per-topic files
  • output/tweet-clusters.zip — bundled cluster archive
  • output/search_*.txt — saved search results
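The per-month naming under output/months/ follows from each tweet's timestamp. A minimal sketch of that grouping is below. It works on (created_at, text) pairs because the layout of tweets-llm.txt is not described here, and group_by_month and month_filename are hypothetical helpers.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(tweets):
    """Group (created_at, text) pairs under 'YYYY-MM' keys, preserving input order."""
    months = defaultdict(list)
    for created_at, text in tweets:
        months[created_at.strftime("%Y-%m")].append(text)
    return dict(months)


def month_filename(key):
    """Path for one month's slice, following the output/months/YYYY-MM.txt layout."""
    return f"output/months/{key}.txt"


tweets = [
    (datetime(2021, 4, 30, 23, 59), "end of April"),
    (datetime(2021, 5, 1, 0, 1), "start of May"),
    (datetime(2021, 5, 9, 8, 0), "more May"),
]
grouped = group_by_month(tweets)
for key, texts in grouped.items():
    print(month_filename(key), len(texts))
```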

Roadmap

For planned features — including local LLM integration (Ollama), network & graph analysis of quote tweets and replies, advanced analysis tools, web UI, and more — see ROADMAP.md.

Contributing

See CONTRIBUTING.md for how to get involved.

License

MIT — see LICENSE.
