warn if max_claims is greater than or equal to connection pool size#109
warn if max_claims is greater than or equal to connection pool size#109bKP451 wants to merge 7 commits into
Conversation
6b615b8 to
0101b39
Compare
smudge
left a comment
There was a problem hiding this comment.
Thanks for this PR!
I know you've marked it WIP/Temp, but I have also noticed that it's possible to trigger ActiveRecord::ConnectionTimeoutError if the pool size is lower than max_claims, so this fix definitely makes sense to me and I would be very open to addressing it. 👍
For the sake of reviewing changes independently, I think it would make sense to split the observability/logging changes out into a separate PR, and keep this PR focused on the thread pool / connection pool / max claims reconciliation. (LMK if that doesn't make sense!)
| def thread_pool_size(job_count) | ||
| return job_count unless Delayed::Job.respond_to?(:connection_pool) | ||
|
|
||
| pool_size = Delayed::Job.connection_pool.size | ||
| return job_count unless pool_size | ||
|
|
||
| [job_count, [pool_size - 1, 1].max].min | ||
| rescue StandardError | ||
| job_count | ||
| end |
There was a problem hiding this comment.
There may be a way to more proactively (e.g. during worker boot/initialization) establish if Delayed::Worker.max_claims > Delayed::Job.connection_pool.size rather than performing this thread_pool_size logic on every pickup loop. (My understanding is that Delayed::Job.connection_pool.size is informed by the pool size config in database.yml, and should not change once the app has loaded.)
The current pickup strategy is also intended to avoid picking up more work than the worker can immediately begin working off (to avoid holding unworked jobs in memory), so it may make sense to raise or warn up front (again, during boot / worker initialization) if a misconfiguration is detected.
There was a problem hiding this comment.
Yeah it is a great way to raise on warn upfront during app initializer ( config/initializers/delayed.rb ) for
Delayed::Worker.max_claims > Delayed::Job.connection_pool.size
On our project, our actual problem was max_claims being equal to pool_size. On investigation we found that main worker ( i.e server ) also needs to connect to DB to keep track of threads, locking, unlocking, polling etc. Therefor a worker thread will throw ActiveRecord::ConnectionTimeoutError if it cannot get DB connection after awaiting for a checkout_time ( i.e ActiveRecord::Base.connection_pool.checkout_timeout ). For long running jobs, DB connection won't get free for that worker thread and exception is raised
Job thread crashed with ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use
Here 5.000 seconds is checkout_time
Problem illustration
MAX_CLAIMS = 5, POOL_SIZE = 5
main worker(server) housekeeping ─🔑 ← needs 1
Job 1 ─🔑 \
Job 2 ─🔑 \
Job 3 ─🔑 ├─ needs 5
Job 4 ─🔑 /
Job 5 ─❌ (no key left!) /
6 requests ▶ 5 keys ▶ somebody loses
We solved it by reducing concurrent worker threads to pool minus 1
Override MAX_CLAIMS at config/initializers/delayed.rb
# Reserve 1 DB connection for the worker's own housekeeping (polling + locking jobs).
db_pool_size = ActiveRecord::Base.connection_pool.size
Delayed::Worker.max_claims = [db_pool_size - 1, 1].maxWDYT ?
There was a problem hiding this comment.
Yes, that's a good callout -- we always include a buffer in our connection pool size (both for web and worker counts). We also sometimes need more than 1 connection per thread, but the same general rule applies there too -- basically, if your code generally needs N connections, and your max claims is M, you want N * (M + 1) connections. The most common case is N=1 though, and I think that would be easy enough to detect with >= (rather than the > I had originally proposed):
Delayed::Worker.max_claims >= Delayed::Job.connection_pool.size
…ze, handle_thread_error)
cdf68fe to
dea0914
Compare
f0aa45c to
bcc4212
Compare
The Problem
Every job thread needs a database connection to do its work. But the worker process itself also needs one connection just to keep the lights on — polling for new jobs, locking them, and cleaning up when done.
If
max_claimsis set as high as (or higher than) the connection pool size, there aren't enough connections to go around:Job 5 waits for a free connection, times out, and raises
ActiveRecord::ConnectionTimeoutError. This is silent until it happens under load in production — with no obvious hint that the root cause is a misconfigured pool size.What This PR Does
Adds a one-time check at worker startup. If
max_claims >= connection_pool.size, the worker logs a clearwarn-level message telling you exactly what to fix before any jobs run.The fix is simple — reserve one connection for the worker itself:
More generally, if your jobs each need
Nconnections and you wantMconcurrent threads, size your pool to at leastN * (M + 1). For most appsN=1, sopool_size >= max_claims + 1is the rule of thumb.What Changed
Worker#startnow callscheck_connection_pool_config!before the run loopcheck_connection_pool_config!warns at startup whenmax_claims >= connection_pool.size, with the suggested value to set