
warn if max_claims is greater than or equal to connection pool size#109

Open
bKP451 wants to merge 7 commits into Betterment:main from bKP451:bikash/improve-worker-observability


@bKP451 bKP451 commented May 28, 2026


The Problem

Every job thread needs a database connection to do its work. But the worker process itself also needs one connection just to keep the lights on — polling for new jobs, locking them, and cleaning up when done.

If max_claims is set as high as (or higher than) the connection pool size, there aren't enough connections to go around:

MAX_CLAIMS = 5, POOL_SIZE = 5

main worker housekeeping ─🔑          ← needs 1
Job 1 ─🔑                        \
Job 2 ─🔑                         \
Job 3 ─🔑                          ├─ needs 5
Job 4 ─🔑                         /
Job 5 ─❌ (no connection left!)  /

6 requests  ▶  5 connections  ▶  somebody loses

Job 5 waits for a free connection, times out, and raises ActiveRecord::ConnectionTimeoutError. Nothing flags the misconfiguration until the error appears under load in production, and the error itself gives no obvious hint that the root cause is max_claims being too high for the pool size.
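The arithmetic in the diagram can be reproduced with a toy model. This is only a sketch: `ToyPool` and `simulate` are hypothetical names, and the pool here returns nil when empty instead of blocking for a checkout timeout and raising the way ActiveRecord's pool does.

```ruby
# Toy model of a fixed-size connection pool; not ActiveRecord's implementation.
class ToyPool
  def initialize(size)
    @slots = Queue.new
    size.times { |i| @slots << "conn-#{i + 1}" }
  end

  # Returns a connection, or nil when the pool is exhausted.
  # (A real pool would wait up to its checkout timeout and then raise.)
  def checkout
    @slots.pop(true) # non-blocking pop raises ThreadError when empty
  rescue ThreadError
    nil
  end
end

# Returns the names of holders that could not get a connection.
def simulate(pool_size:, max_claims:)
  pool = ToyPool.new(pool_size)
  holders = ["worker housekeeping"] + (1..max_claims).map { |i| "job #{i}" }
  # Each holder takes one connection and keeps it; reject drops the ones that succeeded.
  holders.reject { |_name| pool.checkout }
end

puts simulate(pool_size: 5, max_claims: 5).inspect # ["job 5"]
puts simulate(pool_size: 6, max_claims: 5).inspect # []
```

With one extra connection in the pool, every holder is served, which is the condition the startup check is meant to enforce.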

What This PR Does

Adds a one-time check at worker startup. If max_claims >= connection_pool.size, the worker logs a clear warn-level message telling you exactly what to fix before any jobs run.

The fix is simple — reserve one connection for the worker itself:

# config/initializers/delayed_job.rb
# Reserve one connection for the worker's own polling and locking; never drop below 1.
Delayed::Worker.max_claims = [ActiveRecord::Base.connection_pool.size - 1, 1].max

More generally, if your jobs each need N connections and you want M concurrent threads, size your pool to at least N * (M + 1). For most apps N=1, so pool_size >= max_claims + 1 is the rule of thumb.
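As a worked example of the N * (M + 1) rule, the following sketch computes the minimum pool size for a given max_claims, and the inverse. Both helper names are hypothetical and are not part of the gem.

```ruby
# Minimum pool size for M concurrent job threads that each hold up to
# N connections, using the N * (M + 1) rule of thumb.
def min_pool_size(connections_per_job:, max_claims:)
  connections_per_job * (max_claims + 1)
end

# Largest max_claims a given pool supports under the same rule (never below 0).
def max_claims_for(pool_size:, connections_per_job: 1)
  [(pool_size / connections_per_job) - 1, 0].max
end

puts min_pool_size(connections_per_job: 1, max_claims: 5) # 6
puts min_pool_size(connections_per_job: 2, max_claims: 5) # 12
puts max_claims_for(pool_size: 5)                          # 4
```

For the N=1 case this reduces to pool_size >= max_claims + 1, so a pool of 5 supports at most 4 concurrent claims.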

What Changed

  • Worker#start now calls check_connection_pool_config! before the run loop
  • check_connection_pool_config! warns at startup when max_claims >= connection_pool.size, with the suggested value to set
  • README documents the constraint with the formula and an example initializer
  • If connection pool introspection raises for any reason, the check is silently skipped — no impact on correctly configured workers
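The behaviour listed above could look roughly like the sketch below. It is not the PR's actual method body: the function signature, the message wording, and `FakePool` are assumptions made so the example runs without ActiveRecord.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for the startup check described above.
# Returns true if a warning was logged, false otherwise.
def check_connection_pool_config(max_claims:, pool:, logger:)
  pool_size = pool.size
  return false unless pool_size && max_claims >= pool_size

  suggested = [pool_size - 1, 1].max
  logger.warn(
    "max_claims (#{max_claims}) >= connection pool size (#{pool_size}); " \
    "set max_claims to #{suggested} or raise the pool size to at least #{max_claims + 1}"
  )
  true
rescue StandardError
  # Introspection failed: skip the check rather than block worker startup.
  false
end

FakePool = Struct.new(:size) # assumption: stands in for a real connection pool
log = StringIO.new
logger = Logger.new(log)

check_connection_pool_config(max_claims: 5, pool: FakePool.new(5), logger: logger) # warns
check_connection_pool_config(max_claims: 4, pool: FakePool.new(5), logger: logger) # silent
puts log.string
```

Running the check once at startup, before the run loop, keeps it off the pickup path and assumes the pool size does not change after the app has loaded.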

@bKP451 bKP451 force-pushed the bikash/improve-worker-observability branch from 6b615b8 to 0101b39 Compare May 28, 2026 04:36

@smudge smudge left a comment

Member


Thanks for this PR!

I know you've marked it WIP/Temp, but I have also noticed that it's possible to trigger ActiveRecord::ConnectionTimeoutError if the pool size is lower than max_claims, so this fix definitely makes sense to me and I would be very open to addressing it. 👍

For the sake of reviewing changes independently, I think it would make sense to split the observability/logging changes out into a separate PR, and keep this PR focused on the thread pool / connection pool / max claims reconciliation. (LMK if that doesn't make sense!)

Comment thread lib/delayed/worker.rb Outdated
Comment on lines +257 to +266
def thread_pool_size(job_count)
  return job_count unless Delayed::Job.respond_to?(:connection_pool)

  pool_size = Delayed::Job.connection_pool.size
  return job_count unless pool_size

  [job_count, [pool_size - 1, 1].max].min
rescue StandardError
  job_count
end

Member


There may be a way to more proactively (e.g. during worker boot/initialization) establish if Delayed::Worker.max_claims > Delayed::Job.connection_pool.size rather than performing this thread_pool_size logic on every pickup loop. (My understanding is that Delayed::Job.connection_pool.size is informed by the pool size config in database.yml, and should not change once the app has loaded.)

The current pickup strategy is also intended to avoid picking up more work than the worker can immediately begin working off (to avoid holding unworked jobs in memory), so it may make sense to raise or warn up front (again, during boot / worker initialization) if a misconfiguration is detected.

Author


@smudge,

Yeah, raising or warning up front in an app initializer (config/initializers/delayed.rb) is a great way to catch
Delayed::Worker.max_claims > Delayed::Job.connection_pool.size

On our project, the actual problem was max_claims being equal to pool_size. On investigation we found that the main worker (i.e. the server) also needs a DB connection to keep track of threads, locking, unlocking, polling, etc. Therefore a worker thread raises ActiveRecord::ConnectionTimeoutError if it cannot get a DB connection within the checkout timeout (i.e. ActiveRecord::Base.connection_pool.checkout_timeout). With long-running jobs, no connection frees up for that worker thread in time, and the exception is raised:

Job thread crashed with ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use
Here, 5.000 seconds is the checkout_timeout.

Problem illustration

   MAX_CLAIMS = 5, POOL_SIZE = 5


   main worker(server) housekeeping ─🔑          ← needs 1
   Job 1 ─🔑                        \
   Job 2 ─🔑                         \
   Job 3 ─🔑                          ├─ needs 5
   Job 4 ─🔑                         /
   Job 5 ─❌ (no key left!)         /

   6 requests  ▶  5 keys  ▶  somebody loses

We solved it by reducing the number of concurrent worker threads to the pool size minus 1.

Override max_claims in config/initializers/delayed.rb:

# Reserve 1 DB connection for the worker's own housekeeping (polling + locking jobs).
db_pool_size = ActiveRecord::Base.connection_pool.size
Delayed::Worker.max_claims = [db_pool_size - 1, 1].max

WDYT?

@smudge smudge Jun 8, 2026

Member


Yes, that's a good callout -- we always include a buffer in our connection pool size (both for web and worker counts). We also sometimes need more than 1 connection per thread, but the same general rule applies there too -- basically, if your code generally needs N connections, and your max claims is M, you want N * (M + 1) connections. The most common case is N=1 though, and I think that would be easy enough to detect with >= (rather than the > I had originally proposed):

Delayed::Worker.max_claims >= Delayed::Job.connection_pool.size

@bKP451 bKP451 force-pushed the bikash/improve-worker-observability branch from cdf68fe to dea0914 Compare June 9, 2026 08:17
@bKP451 bKP451 changed the title from "Temp [ improve worker observability ]" to "warn if max_claims is greater than or equal to connection pool size" Jun 9, 2026
@bKP451 bKP451 force-pushed the bikash/improve-worker-observability branch from f0aa45c to bcc4212 Compare June 9, 2026 08:46
@bKP451 bKP451 requested a review from smudge June 9, 2026 08:52