Fix flaky integration test #379

Open
stepchowfun wants to merge 4 commits into main from fix-integration-test-flakiness

@stepchowfun (Owner)

Replace the unreliable process-state polling in wait_for_docuum() with log-based synchronization.

Root cause: The old approach polled /proc/$PID/stat for state 'S' (interruptible sleep). But Docuum enters state 'S' whenever it blocks on any syscall — including wait() while running docker image rm inside vacuum(). So the test could proceed while Docuum was still mid-vacuum, causing a race where the LRU ordering was wrong by the time the next image was used.
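
To make the root cause concrete, here is a minimal sketch of the kind of state check described above. It is not the test's original code: the helper name `proc_state` is hypothetical, and it assumes Linux, since it reads `/proc`.

```shell
# Hypothetical helper: print the one-letter state of process PID by
# reading /proc/PID/stat (Linux only).
proc_state() {
  # The state is the first field after the parenthesized command name.
  # Strip everything up to the last ") " because the name may contain spaces.
  sed 's/^.*) //' "/proc/$1/stat" | cut -d ' ' -f 1
}
```

A shell that is merely waiting on a child, for example `sh -c 'sleep 2; exit 0' &` followed shortly by `proc_state "$!"`, would be expected to report `S` even though its work is not finished. That is the same situation as Docuum blocked in `wait()` on `docker image rm`, so state `S` alone cannot distinguish "idle" from "mid-vacuum".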

Fix: Redirect Docuum's log output to a temporary file and synchronize on two specific messages that Docuum emits at well-defined points:

  • "Listening for Docker events" — emitted after the initial vacuum, just before the event loop starts
  • "Going back to sleep" — emitted at the very end of each event-loop iteration, after any vacuum has completed
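
Synchronizing on a log message could look roughly like the sketch below. The helper name `wait_for_log_line`, the polling interval, and the default timeout are assumptions made for illustration, not the code in this PR.

```shell
# Hypothetical helper: wait until LOG_FILE contains the literal string
# PATTERN, or give up after TIMEOUT seconds (default 30).
wait_for_log_line() {
  local log_file="$1" pattern="$2" timeout="${3:-30}"
  local deadline=$(( $(date +%s) + timeout ))

  # -F matches a fixed string; -q only reports whether a match exists.
  until grep -qF -- "$pattern" "$log_file" 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out waiting for: $pattern" >&2
      return 1
    fi
    sleep 0.1
  done
}
```

A one-shot check like this suits "Listening for Docker events", which appears once. For "Going back to sleep", earlier iterations have already written the message, so a test would need to count occurrences or look only at lines added since the previous step.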

A quiescence check (count stable for 2 seconds) ensures all events from a single docker container run (pull, create, start, die, destroy — typically 5 events) are fully processed before the test proceeds to the next step.
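
A quiescence check of this kind might be sketched as follows. The helper names are hypothetical, and the sketch assumes the count being watched is the number of log lines containing a given message.

```shell
# Hypothetical helper: count lines in LOG_FILE ($1) containing the
# literal string PATTERN ($2). Prints 0 if the file does not exist yet.
count_log_lines() {
  local count
  count=$(grep -cF -- "$2" "$1" 2>/dev/null) || true
  echo "${count:-0}"
}

# Hypothetical helper: return once the count has not changed for
# STABLE_FOR seconds (default 2).
wait_for_quiescence() {
  local log_file="$1" pattern="$2" stable_for="${3:-2}"
  local last current stable_since
  last=$(count_log_lines "$log_file" "$pattern")
  stable_since=$(date +%s)
  while :; do
    sleep 0.2
    current=$(count_log_lines "$log_file" "$pattern")
    if [ "$current" != "$last" ]; then
      # New activity: remember the new count and restart the timer.
      last="$current"
      stable_since=$(date +%s)
    elif [ $(( $(date +%s) - $stable_since )) -ge "$stable_for" ]; then
      return 0
    fi
  done
}
```

The window is a heuristic: it only works if consecutive events from one container run arrive less than `stable_for` seconds apart.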

Status: Ready

Fixes: N/A

…process state

The previous `wait_for_docuum()` polled `/proc/$PID/stat` for the 'S'
(interruptible sleep) state. This is unreliable: the process can be in
state 'S' while blocked on `docker image rm` (inside `vacuum()`), causing
the test to proceed before Docuum has finished processing the current event.

The new approach redirects Docuum's log output to a file and synchronizes
on specific log messages that Docuum emits at known-safe points:
- "Listening for Docker events" — emitted after the initial vacuum, before
  the event loop starts
- "Going back to sleep" — emitted at the end of each event-loop iteration,
  after any vacuum has completed

A quiescence check (count stable for 2 seconds) ensures all events from a
single container run (pull, create, start, die, destroy) are fully
processed before the test proceeds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stepchowfun force-pushed the fix-integration-test-flakiness branch from 18f2dc9 to 91e0261 on June 28, 2026 00:24
stepchowfun and others added 3 commits June 27, 2026 17:42
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The quiescence approach (wait 2s with no new log messages) was still racy
on loaded CI machines. Instead, wait_for_docuum now takes an image digest
prefix as an argument, waits for that string to appear in the log (proving
Docuum received the relevant event), then waits for "Going back to sleep"
to appear after it (proving Docuum finished processing that event, including
any vacuum).
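
The ordering requirement (marker first, then "Going back to sleep" on a later line) can be sketched as below. This is an illustration under assumptions, not the code in this commit: the name `wait_for_event_processed`, the timeout, and the use of the first occurrence of the marker are all hypothetical choices.

```shell
# Hypothetical helper: succeed once LOG_FILE contains MARKER followed, on a
# later line, by "Going back to sleep"; fail after TIMEOUT seconds (default 30).
wait_for_event_processed() {
  local log_file="$1" marker="$2" timeout="${3:-30}"
  local deadline=$(( $(date +%s) + timeout ))

  while :; do
    # index() does literal substring matching, so the marker is not
    # interpreted as a regular expression.
    if awk -v marker="$marker" -v done_msg="Going back to sleep" '
         !seen && index($0, marker) { seen = 1; next }
         seen && index($0, done_msg) { found = 1; exit }
         END { exit !found }
       ' "$log_file" 2>/dev/null; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out waiting for $marker to be processed" >&2
      return 1
    fi
    sleep 0.1
  done
}
```

Because the sketch matches the first occurrence of the marker, a test that uses the same image more than once would also need to ignore log lines written before the current step.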

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>