stephaniewilkinson/yonderbook

Yonderbook | Tools for Bookworms 📒

Stack

  • Framework: Roda (routing tree web toolkit) with Sequel ORM and SQLite
  • Auth: Rodauth (login, email auth/magic links, password reset, lockout)
  • Server: Falcon (async Ruby web server), using falcon serve with --threaded
  • CSS: Tailwind CSS, compiled via tailwindcss-ruby gem
  • Assets: Roda assets plugin with precompilation (assets/compiled_assets.json)
  • Ruby version: Defined in .ruby-version

Installation

git clone git@github.com:stephaniewilkinson/yonderbook.git
cd yonderbook
cp .env-example .env # if you msg me I can share my api keys
bundle install
rake db:migrate

Start the Server

falcon

Database Access

Production (Render):

sqlite3 /var/data/production.db

Development:

sqlite3 db/development.db

Testing

bundle exec rake test

Tests require environment variables — copy .env-example to .env and fill in values.

Key Files

  • app.rb — Main Roda application class with routing, plugins, and Rodauth config
  • config.ru — Rack config; loads Sentry, sets up env-specific middleware
  • Rakefile — Defines precompile, tailwind:build, tailwind:watch, and loads lib/tasks/*.rake
  • lib/database.rb — Sequel/SQLite setup; creates DB constant, path depends on RACK_ENV
  • lib/tasks/db.rake — Database rake tasks (migrate, reset, create_migration)

TODO: Clearly display the Goodreads name or logo on any location where Goodreads data appears. For instance if you are displaying Goodreads reviews, they should either be in a section clearly titled "Goodreads Reviews", or each review should say "Goodreads review from John: 4 of 5 stars..."

TODO: Link back to the page on Goodreads where the data appears. For instance, if displaying a review, the name of the reviewer and a "more..." link at the end of the review must link back to the review detail page. You may not nofollow this link.

Spam Prevention

The signup form uses a honeypot field to block bot registrations. A hidden name field is rendered off-screen — humans never see it, but bots parsing the form will fill it in. If the field has a value on POST, the request is silently redirected to the /check-email page without any database interaction. The bot thinks the signup succeeded.
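
The honeypot decision described above can be sketched as a small pure function. This is a minimal illustration, not the code in app.rb: the helper name `signup_action` and its return symbols are hypothetical, and only the field name `name` comes from the description.

```ruby
# Hypothetical helper: decides how to handle a signup POST.
# HONEYPOT_FIELD matches the hidden "name" field described above;
# the return symbols are illustrative only.
HONEYPOT_FIELD = 'name'

def signup_action(params)
  honeypot_value = params[HONEYPOT_FIELD].to_s.strip
  # Any value in the hidden field means something other than a human filled in the form.
  return :pretend_success unless honeypot_value.empty?

  :process_signup
end

signup_action('email' => 'reader@example.com', 'name' => '')       # => :process_signup
signup_action('email' => 'bot@example.com', 'name' => 'Spam Bot')  # => :pretend_success
```

In the route, `:pretend_success` would map to the redirect to /check-email without touching the database.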

BookMooch API

BookMooch is a book trading community where users can give away books they no longer need and receive books they want.

Rate Limits

The BookMooch API allows up to 10 requests/second. Exceeding this results in 302 redirect responses (not standard 429s). In practice, keeping requests concurrent with a connection pool limit (rather than throttling with a rate limiter) works best — a leaky bucket limiter causes timeouts and connection issues with BookMooch's server.
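
As a rough illustration of capping concurrency instead of throttling, the sketch below runs a block over a list with at most `limit` calls in flight. It uses plain threads from the standard library as a stand-in; the app itself runs under Falcon's fiber scheduler, so the real code likely looks different. The method name `map_with_concurrency_limit` is hypothetical.

```ruby
# Runs the given block for every item with at most `limit` calls in flight.
# Results come back in the same order as the input.
def map_with_concurrency_limit(items, limit:)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # pop returns nil once the queue is closed and drained

  results = Array.new(items.size)
  workers = Array.new([limit, items.size].min) do
    Thread.new do
      while (pair = queue.pop)
        item, index = pair
        results[index] = yield(item)
      end
    end
  end
  workers.each(&:join)
  results
end

# Example with a stand-in for the HTTP call:
map_with_concurrency_limit(%w[0596000278 0140449132], limit: 8) { |isbn| "looked up #{isbn}" }
```

No sleep or token bucket is involved: the worker count alone bounds how many requests can be open at once.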

GET vs POST

All API calls accept parameters via either GET (URL params) or POST (body). Use POST for large payloads like bulk ASIN/ISBN submissions — GET has a ~2048 character URL limit, so large ISBN lists must be batched. POST can send arbitrarily large fields in a single request.
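
If GET must be used, the batching could look something like the sketch below. The base URL and the `+` separator are placeholders chosen for illustration, not confirmed details of the BookMooch API, and `batch_for_get` is a hypothetical helper.

```ruby
MAX_URL_LENGTH = 2048

# Splits ids into batches so that base_url plus the joined ids stays within max_length.
def batch_for_get(base_url, ids, separator: '+', max_length: MAX_URL_LENGTH)
  batches = []
  current = []
  ids.each do |id|
    candidate = current + [id]
    if current.any? && (base_url + candidate.join(separator)).length > max_length
      batches << current
      current = [id]
    else
      current = candidate
    end
  end
  batches << current if current.any?
  batches
end

# Placeholder URL; 300 thirteen-digit ISBNs do not fit in a single request.
isbns = Array.new(300) { |i| format('978%010d', i) }
batch_for_get('https://api.example.test/lookup?ids=', isbns).size
```

With POST the whole list can go in one request body, so no batching step is needed.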

Error Handling

Errors are indicated by a negative result_code field in the XML response, with a result_text description:

<?xml version="1.0" encoding="UTF-8"?>
<userids>
  <userid>
    <id>john_smith</id>
    <result_code>-1</result_code>
    <result_text>no data found</result_text>
  </userid>
</userids>
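
A response like the one above could be checked for errors roughly as follows. This sketch uses REXML, which ships with Ruby as a bundled gem; the app may well use a different parser, and `bookmooch_errors` is a hypothetical name.

```ruby
require 'rexml/document'

# Collects every record whose result_code is negative.
def bookmooch_errors(xml)
  doc = REXML::Document.new(xml)
  errors = []
  REXML::XPath.each(doc, '//result_code') do |code_element|
    code = code_element.text.to_i
    next unless code.negative?

    record = code_element.parent
    errors << {
      id: record.elements['id']&.text,
      code: code,
      text: record.elements['result_text']&.text
    }
  end
  errors
end

sample = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <userids>
    <userid>
      <id>john_smith</id>
      <result_code>-1</result_code>
      <result_text>no data found</result_text>
    </userid>
  </userids>
XML

bookmooch_errors(sample)
# => [{ id: "john_smith", code: -1, text: "no data found" }]
```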

Authentication

The /api/userbook endpoint uses HTTP Basic Auth. A 302 response means rate limiting; a 401 or HTML error page means invalid credentials (users should use their BookMooch username, not email).
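
The status handling above might be expressed as a small classifier. The function name and return symbols are hypothetical, and treating any `text/html` body as a credentials failure is an assumption drawn from the sentence above.

```ruby
# Maps a /api/userbook response to a coarse outcome.
def classify_userbook_response(status, content_type)
  return :rate_limited if status == 302
  return :invalid_credentials if status == 401
  # An HTML page where XML was expected is treated as a login failure.
  return :invalid_credentials if content_type.to_s.include?('text/html')

  :ok
end

classify_userbook_response(302, nil)          # => :rate_limited
classify_userbook_response(200, 'text/html')  # => :invalid_credentials
classify_userbook_response(200, 'text/xml')   # => :ok
```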

OverDrive API

OverDrive provides APIs for searching library digital collections and checking availability.

Authentication

Uses OAuth2 client credentials flow via https://oauth.overdrive.com/token. The returned bearer token is used for all subsequent API calls. Tokens are short-lived and should be fetched per-session.
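
A token request for the client credentials flow could be built roughly like this. The sketch only constructs the request and does not send it; the helper name and the placeholder key and secret are hypothetical, and the way the app actually performs the call may differ.

```ruby
require 'net/http'
require 'uri'

TOKEN_URI = URI('https://oauth.overdrive.com/token')

# Builds (but does not send) a client-credentials token request.
def build_token_request(client_key, client_secret)
  request = Net::HTTP::Post.new(TOKEN_URI)
  request.basic_auth(client_key, client_secret)                 # Authorization: Basic base64(key:secret)
  request.set_form_data('grant_type' => 'client_credentials')   # form-encoded body
  request
end

request = build_token_request('example-key', 'example-secret')
# Sending it would look something like:
#   Net::HTTP.start(TOKEN_URI.host, TOKEN_URI.port, use_ssl: true) { |http| http.request(request) }
# A standard OAuth2 token response is JSON containing an access_token field.
```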

Endpoints Used

Library info: GET /v1/libraries/{consortiumId}
Returns collection token, website ID, and homepage URL. The collectionToken is required for all product/availability queries.

Product search: GET /v1/collections/{collectionToken}/products?q={query}
Searches the library's digital catalog. Accepts a single query string (ISBN, title, or author). Does not support batch/bulk queries: there is no way to search multiple ISBNs in one call. Pagination via limit (default 25) and offset.

Availability (v2): GET /v2/collections/{collectionToken}/availability?products={id1},{id2},...
Accepts up to 25 comma-separated product IDs per request. Returns copiesAvailable, copiesOwned, and hold counts. Product IDs (reserveId) come from search results. Cannot accept ISBNs directly; ISBNs must first be resolved to product IDs via search.
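
Because the availability endpoint takes at most 25 product IDs, a list of IDs has to be split before calling it. A minimal sketch, with `availability_paths` as a hypothetical helper that only builds request paths:

```ruby
AVAILABILITY_BATCH_SIZE = 25

# Returns one availability request path per batch of up to 25 product IDs.
def availability_paths(collection_token, product_ids)
  product_ids.each_slice(AVAILABILITY_BATCH_SIZE).map do |batch|
    "/v2/collections/#{collection_token}/availability?products=#{batch.join(',')}"
  end
end

ids = Array.new(60) { |i| "id-#{i}" }
availability_paths('example-token', ids).size # => 3
```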

Key Limitations

  • No bulk search: Each book requires its own search API call. For a shelf of 500 books, that's 500+ search calls. This is the main bottleneck.
  • Print ISBNs are not searchable: Goodreads shelves contain print ISBNs, but only digital ISBNs (ebook/audiobook format) are searchable via the identifiers parameter. Print ISBNs appear in otherFormatIdentifiers in responses but cannot be used as search input. This is why the code falls back to title+author matching when ISBN search returns no results.
  • Rate limits are undocumented: The API Usage Requirements say "honor any limitations we set" but don't publish specific numbers. The code uses Async::Semaphore.new(16) for concurrent requests.
  • Availability is product-ID-only: The v2 availability endpoint requires OverDrive product IDs, not ISBNs. A two-phase lookup (search then availability) is unavoidable without a local index.

Optimization Opportunities

Cache ISBN-to-product-ID mappings in the database. After the first lookup, store the mapping so repeat shelf checks skip the expensive search phase and go straight to availability batches. This would reduce repeat visits from O(n) search calls to O(new_books) searches + O(n/25) availability calls.
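
The caching idea can be sketched with a Hash standing in for the database table. `resolve_product_ids` is a hypothetical helper; the block represents the expensive search call and is only invoked for ISBNs that are not cached yet.

```ruby
# cache: any Hash-like store mapping ISBN => product ID (a DB table in practice).
# The block performs the per-book search and returns a product ID or nil.
def resolve_product_ids(isbns, cache)
  misses = isbns.reject { |isbn| cache.key?(isbn) }
  misses.each do |isbn|
    product_id = yield(isbn)
    cache[isbn] = product_id if product_id # failed lookups are retried next time
  end
  isbns.filter_map { |isbn| cache[isbn] }
end

cache = {}
search_calls = 0
lookup = ->(isbn) { search_calls += 1; "product-for-#{isbn}" }

resolve_product_ids(%w[111 222 333], cache) { |isbn| lookup.call(isbn) }
resolve_product_ids(%w[111 222 333 444], cache) { |isbn| lookup.call(isbn) }
search_calls # => 4 (three on the first visit, one for the new book on the second)
```

Whether misses should also be cached (to avoid repeating searches that are known to fail) is a design choice this sketch leaves open.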

Local collection index (future). The products endpoint supports ?lastUpdateTime={timestamp} for incremental sync. Could paginate the entire library collection into a local table, then match ISBNs locally. Initial sync: 400-3,200 calls for a typical library (10k-80k titles at 25/page), then incremental updates. Eliminates per-book search calls entirely.

Current Implementation

Books are processed in chunks of 100 to bound memory. Each chunk completes the full pipeline (search -> expand editions -> fetch availability) before the next starts. Raw JSON response bodies are discarded after parsing. Timing and RSS memory usage are logged per-chunk for monitoring.
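
The chunking pattern might look like the following sketch. The per-chunk pipeline is passed in as a block, so the stage names in the usage comment are hypothetical; RSS sampling is left out here and only timing is logged.

```ruby
CHUNK_SIZE = 100

# Runs the block once per chunk and logs how long each chunk took.
# The block should return an array of results for its chunk.
def process_in_chunks(books, chunk_size: CHUNK_SIZE, logger: ->(line) { puts line })
  results = []
  books.each_slice(chunk_size).with_index(1) do |chunk, number|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    results.concat(yield(chunk))
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(1)
    logger.call("[chunk #{number}] books=#{chunk.size} duration=#{elapsed_ms}ms")
  end
  results
end

# Usage with hypothetical stage methods:
#   process_in_chunks(books) do |chunk|
#     fetch_availability(expand_editions(search(chunk)))
#   end
```

Only one chunk's intermediate data is referenced at a time, which is what keeps peak memory bounded.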

OOM / Memory Management

The app runs on Render's Starter plan (512MB RAM). The process starts at ~100MB and grows steadily until OOM kill at 512MB.

Root cause

There are two layers to the problem:

Layer 1: Per-request memory allocations that are never returned to the OS. Every GET / request leaked ~0.2-0.4MB of RSS, even though the homepage is a static marketing page with no DB queries or API calls. The leak came from middleware and analytics running on every request, including bot/monitor traffic hitting / every minute:

  • Sentry transaction tracing (traces_sample_rate = 0.1): The CaptureExceptions middleware clones the Sentry hub, creates a scope, stores the full Rack env hash in the scope, and creates transaction/span objects for 10% of requests. Under Falcon's fiber-based concurrency, hub clones stored in Thread.current may not clean up properly between fibers.
  • PostHog analytics on homepage: Analytics.track queued a PostHog event with a unique distinct_id (new session UUID) for every bot request. Useless analytics noise that allocated objects into PostHog's internal queue.
  • Session writes for bots: session['session_id'] ||= SecureRandom.uuid forced the Roda sessions plugin to encrypt and set a cookie on every request, even for bots that never send cookies back.

Layer 2: Memory that GC cannot reclaim. Even after Ruby's major GC collects objects (old_objects drops from 549k to 50k), RSS doesn't decrease -- it stays at 506MB and keeps climbing. This happens even with MALLOC_ARENA_MAX=2 set, so an excessive number of glibc arenas is not the explanation. The retained memory likely comes from C-level allocations in OpenSSL (used by Sentry's HTTP transport and session encryption) and object-slot fragmentation in Ruby's heap pages.

Typical OOM timeline

Server starts at ~100MB. At 0.3MB/request with bot traffic every minute:

  • ~23 hours to reach 512MB and trigger SIGKILL
  • SIGKILL cannot be caught -- no Ruby error handler, no Sentry, nothing runs
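
The 23-hour figure follows from the numbers above, taking 0.3MB as the per-request leak and one request per minute:

```ruby
start_mb = 100.0
limit_mb = 512.0
leak_per_request_mb = 0.3
requests_per_hour = 60 # one bot request per minute

requests_until_oom = (limit_mb - start_mb) / leak_per_request_mb
hours_until_oom = requests_until_oom / requests_per_hour

puts format('%.0f requests, %.1f hours', requests_until_oom, hours_until_oom)
# prints "1373 requests, 22.9 hours"
```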

Mitigations (code changes)

Homepage served before middleware (app.rb) -- r.root is now matched before enrich_sentry, session['session_id'] assignment, and identify_user. Bot traffic to / no longer creates sessions, Sentry scopes, or PostHog events. This eliminates the primary source of per-request allocations.

Sentry::Rack::CaptureExceptions middleware removed (app.rb) -- This middleware cloned the Sentry hub, created a scope storing the full Rack env, and ran session tracking on every request. Under Falcon's fiber/thread model, these allocations leaked ~0.2-0.4MB/request that was never reclaimed. Errors are still captured via Sentry.capture_exception in the app's rescue block and error_handler plugin. Also set traces_sample_rate = 0 in config.ru to disable transaction tracing.

Periodic GC.compact (lib/memory_logger.rb) -- When RSS exceeds 400MB, GC.compact runs every 100 requests. This consolidates the Ruby heap so free pages can be returned to the OS. Won't fully solve malloc fragmentation but helps with Ruby-level fragmentation.
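
The compaction trigger could be structured roughly as below. This is a sketch, not the contents of lib/memory_logger.rb: the class name is hypothetical, the RSS reader is injected as a callable returning megabytes, and the counter is not synchronized across threads.

```ruby
# Every `every` requests, checks RSS and compacts the Ruby heap if it is above the threshold.
class PeriodicCompactor
  def initialize(rss_reader:, threshold_mb: 400, every: 100, compactor: -> { GC.compact })
    @rss_reader = rss_reader
    @threshold_mb = threshold_mb
    @every = every
    @compactor = compactor
    @count = 0
  end

  # Call once per request; returns true when a compaction ran.
  def after_request
    @count += 1
    return false unless (@count % @every).zero?

    rss_mb = @rss_reader.call
    return false unless rss_mb && rss_mb > @threshold_mb

    @compactor.call
    true
  end
end
```

GC.compact is not available on every platform, which is one reason the compactor is injectable here.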

Mitigations (Render env vars)

MALLOC_ARENA_MAX=2 (set in Render dashboard) -- Limits glibc to 2 memory arenas instead of the 64-bit default of 8 per CPU core. Heroku made this the default for all Ruby apps. Already set; insufficient on its own to prevent OOM -- the Sentry middleware removal was the critical fix.

Process.warmup (config.ru, production only) -- Ruby 3.3+ API that compacts the heap and optimizes GC after boot, before serving requests.

RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.3 (optional) -- Triggers major GC more frequently. Sam Saffron measured ~22% RSS reduction. Causes more GC pauses, acceptable at low traffic.

Diagnostic logging

MemoryLogger middleware (lib/memory_logger.rb) logs RSS and GC stats on every request. Runs in production only, skips /health and static assets. Logs twice per request -- START and END -- so the killing request is identifiable even after SIGKILL.

[mem] #42 START GET /goodreads/shelves rss=294.2MB
[mem] #42 END GET /goodreads/shelves status=200 duration=1234.5ms rss=312.4MB delta=+18.2MB heap_live=1823456 old_objects=982341 major_gc=1 minor_gc=3
[mem] #43 START GET /login rss=312.4MB
                                        <-- process killed here, no END line

How to read the logs: A START with no matching END is the request that caused OOM. Large positive delta values on END lines show which requests grow memory. The WARNING line fires when RSS exceeds 400MB. After the fix, look for [mem] GC.compact lines showing compaction results.
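
A stripped-down version of the START/END logging might look like this. It is a sketch of the idea rather than the real MemoryLogger: the class name is hypothetical, GC stats and duration are omitted, the `/assets` prefix is an assumed location for static files, and RSS is read from /proc/self/status (Linux only, reported as 0.0 elsewhere).

```ruby
# Reads resident set size in MB from /proc; returns 0.0 where /proc is unavailable.
PROC_RSS_READER = lambda do
  line = File.foreach('/proc/self/status').find { |l| l.start_with?('VmRSS:') }
  line ? line.split[1].to_f / 1024 : 0.0
rescue SystemCallError
  0.0
end

class MemoryLoggerSketch
  SKIPPED_PREFIXES = ['/health', '/assets'].freeze # '/assets' is an assumption

  def initialize(app, rss_reader: PROC_RSS_READER, output: $stdout)
    @app = app
    @rss_reader = rss_reader
    @output = output
    @counter = 0
  end

  def call(env)
    path = env['PATH_INFO'].to_s
    return @app.call(env) if SKIPPED_PREFIXES.any? { |prefix| path.start_with?(prefix) }

    id = (@counter += 1)
    label = "#{env['REQUEST_METHOD']} #{path}"
    before = @rss_reader.call
    # START is written before the app runs, so it survives a SIGKILL mid-request.
    @output.puts format('[mem] #%d START %s rss=%.1fMB', id, label, before)
    status, headers, body = @app.call(env)
    after = @rss_reader.call
    @output.puts format('[mem] #%d END %s status=%d rss=%.1fMB delta=%+.1fMB',
                        id, label, status, after, after - before)
    [status, headers, body]
  end
end
```

For the START line to be useful after a kill, the output stream has to be flushed promptly; setting `$stdout.sync = true` is one way to do that.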

Why memory still grows (post-fix)

The mitigations above eliminated the biggest leak (bot traffic on /), but RSS still creeps up on non-homepage requests. Every authenticated request runs through this pipeline (app.rb lines 103-121):

  1. Session decryption/encryption -- Rodauth decrypts the incoming session cookie and re-encrypts the outgoing one via OpenSSL. Cipher contexts are C-level malloc allocations.
  2. Sentry scope calls -- enrich_sentry calls Sentry.set_user and Sentry.set_tags on every request, creating scope objects on the Sentry hub even without the middleware.
  3. PostHog identify on every request -- identify_user calls Analytics.alias_user and Analytics.identify for every logged-in request, pushing events onto PostHog's internal queue.
  4. DB query -- Account[rodauth.session_value] runs a database query on every authenticated request.

The problem isn't Ruby objects -- GC collects those fine (old_objects drops from 549k to 50k). The problem is glibc malloc fragmentation from C-level allocations. OpenSSL cipher contexts, Sentry internals, and database buffers are allocated via malloc(). When freed, they leave holes in the heap that glibc can't return to the OS. Falcon's fiber concurrency makes this worse -- fibers interleave allocations across memory pages, so no page is ever fully free.

GC.compact only helps Ruby heap pages. MALLOC_ARENA_MAX=2 limits arenas but doesn't prevent fragmentation within them.

Next steps for memory

malloc_trim gem -- Calls malloc_trim() after each major GC cycle to return freed glibc pages to the OS. ~1% CPU overhead, Linux only (which Render uses). This is the lowest-effort next step. Typical RSS reduction: 10-30%.

jemalloc -- A drop-in malloc replacement that returns memory to the OS far more aggressively. Used by GitLab, Discourse, and Mastodon. However, it requires a Docker deploy on Render (apt-get install libjemalloc2 + LD_PRELOAD), which is overkill unless malloc_trim proves insufficient. Typical RSS reduction: 25-40%.

Health-check-based restart -- Write a custom /health that returns 500 when RSS > 450MB. Render restarts after 60s of failed checks. This is a fallback, not a fix.
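
The health check idea could be reduced to a function that turns an RSS reading into a Rack-style response. `health_response` is a hypothetical helper; wiring it into a /health route and obtaining the RSS value are left to the app.

```ruby
HEALTH_RSS_LIMIT_MB = 450

# Returns a Rack-style [status, headers, body] triple based on current RSS in MB.
def health_response(rss_mb, limit_mb: HEALTH_RSS_LIMIT_MB)
  if rss_mb && rss_mb > limit_mb
    [500, { 'content-type' => 'text/plain' }, ["rss #{rss_mb}MB exceeds #{limit_mb}MB"]]
  else
    [200, { 'content-type' => 'text/plain' }, ['ok']]
  end
end

health_response(312.4).first # => 200
health_response(470.0).first # => 500
```

An unknown reading (nil) is treated as healthy here so that a failed measurement does not trigger a restart loop.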

What doesn't help

  • Sentry / error_handler plugin -- SIGKILL terminates the process before any Ruby code can execute. These only catch Ruby exceptions.
  • Reducing TupleSpace TTL -- Cached entries are ~1-2KB each, negligible at this scale.
  • Wrapping OAuth calls in Sync do -- Inside Falcon, Sync do is a no-op (already in an async task). Net::HTTP calls are automatically non-blocking via Ruby's fiber scheduler.
  • Removing --verbose from Falcon -- Falcon's verbose middleware writes to stdout and doesn't buffer in memory.

Investigation log (May 2026)

Observed GET / requests every ~1 minute growing RSS by 0.2-0.4MB with major_gc=0 minor_gc=0 on most requests. Key data points:

02:58 rss=500.7MB  (WARNING threshold)
03:09 rss=506.3MB  old_objects drops 549201 -> 50934 (major GC ran, but RSS didn't shrink)
03:10 rss=507.4MB  old_objects=183943 (climbing back up)
03:13 rss=510.0MB  -> OOM kill, Render restarts process
03:14 rss=100.3MB  (fresh start, first request)

The fact that RSS didn't decrease after major GC -- even with MALLOC_ARENA_MAX=2 already set -- pointed to the Sentry middleware as the primary culprit. Sentry::Rack::CaptureExceptions clones the hub, creates scopes, and stores the Rack env on every request. Under Falcon's fiber/thread model, these allocations aren't properly reclaimed. Fix: remove Sentry middleware (keep manual error capture), move homepage route before session/analytics middleware, add GC.compact safety net.

Deployment (Render)

Deployed on Render with a persistent disk for SQLite at /var/data/production.db.

Render does not use the Procfile — commands are set in the dashboard under Settings:

Build command:

bundle install && bundle exec rake precompile

Start command:

bundle exec rake db:migrate && bundle exec falcon --verbose serve --threaded -n 2 -b http://0.0.0.0:${PORT}

Important notes

  • Render's persistent disk (/var/data) is only mounted at runtime, not during builds. Migrations must run in the start command.
  • Rake tasks in lib/tasks/ must not require database.rb at the top level — it calls FileUtils.mkdir_p('/var/data') which fails during builds. Require it lazily inside task bodies that need it.
  • The precompile task uses a bare Roda class (not the full App) to avoid loading all app dependencies during the build. app.rb also calls compile_assets at startup.
  • tailwindcss-ruby must stay in the top-level Gemfile group (not :development) because it's needed by the build step.

Routing

This app uses the roda-route-list plugin. This makes all the routes available in a /routes.json file.

Creating a self-signed certificate

openssl req -x509 -out localhost.crt -keyout localhost.key \
  -newkey rsa:2048 -nodes -sha256 \
  -subj '/CN=localhost' -extensions EXT -config <( \
   printf "[dn]\nCN=localhost\n[req]\ndistinguished_name = dn\n[EXT]\nsubjectAltName=DNS:localhost\nkeyUsage=digitalSignature\nextendedKeyUsage=serverAuth")

About

A Ruby/Rack app built on the Roda framework, integrating literary APIs: BookMooch, Goodreads, and OverDrive.
