spock_apply: surface root cause when a discard is not attributable to a row#497
spock_apply: surface root cause when a discard is not attributable to a row#497rasifr wants to merge 2 commits into
Conversation
rasifr
commented
Jun 10, 2026
📝 WalkthroughWalkthroughA new helper centralizes construction of synthetic collateral discard messages; two exception logging call sites are updated to use it, commit-time failures are marked non-attributable, and a TAP test validates root-cause capture for row- and commit-time failures. ChangesException root-cause message centralization
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Infer (1.2.0)src/spock_apply.csrc/spock_apply.c:12:10: fatal error: 'postgres.h' file not found ... [truncated 683 characters] ... "/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/lib/clang/18/include" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/tap/t/020_exception_root_cause.pl (1)
37-39: 💤 Low valueConsider more robust wait for config propagation.
The
sleep(2)at line 39 is a fixed time-based wait that could be flaky on slow or heavily-loaded systems. Whilepg_reload_conf()is immediate, the apply worker may need time to pick up the new setting. Consider either:
- Increasing the sleep duration to reduce flakiness (e.g.,
sleep(5))- Polling
SHOW spock.exception_behaviouruntil it returnstransdiscardHowever, since this is a TAP test in a controlled environment and the current approach is simple and widely used in the codebase, the current implementation is acceptable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/tap/t/020_exception_root_cause.pl` around lines 37 - 39, The fixed sleep(2) after psql_or_bail calls can be flaky; replace it with a robust wait that polls the cluster until the new setting is visible: after psql_or_bail(2, "SELECT pg_reload_conf()"), loop calling psql_or_bail or similar to run "SHOW spock.exception_behaviour" and break only when it returns "transdiscard" (with a short sleep/backoff and an overall timeout), or alternatively increase sleep(2) to a larger value like sleep(5) if you prefer the simpler approach; update references in this test around psql_or_bail, pg_reload_conf and sleep to implement the polling and timeout behavior.src/spock_apply.c (1)
1213-1236: 💤 Low valueConsider defensive NULL check for
exception_log_ptr.The function directly accesses
exception_log_ptr[my_exception_log_index]at line 1226 without checking ifexception_log_ptris NULL. While the existing codebase has similar patterns (e.g., line 1267) and call sites are protected byshould_log_exception(), adding a defensive check would improve robustness:if (exception_log_ptr == NULL || my_exception_log_index < 0) return psprintf("%s: tuple discarded (exception log unavailable)", get_exception_behaviour_name());This guards against unexpected call paths or initialization order issues.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/spock_apply.c` around lines 1213 - 1236, The function discard_collateral_message reads exception_log_ptr[my_exception_log_index] without guarding that exception_log_ptr is non-NULL or that my_exception_log_index is valid; add a defensive check at the top of discard_collateral_message to detect exception_log_ptr == NULL or my_exception_log_index < 0 (or out of bounds) and return a sensible psprintf message like "<behaviour>: tuple discarded (exception log unavailable)" using get_exception_behaviour_name(), otherwise proceed to access SpockExceptionLog as before; reference symbols: discard_collateral_message, exception_log_ptr, my_exception_log_index, SpockExceptionLog, get_exception_behaviour_name().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/spock_apply.c`:
- Around line 1213-1236: The function discard_collateral_message reads
exception_log_ptr[my_exception_log_index] without guarding that
exception_log_ptr is non-NULL or that my_exception_log_index is valid; add a
defensive check at the top of discard_collateral_message to detect
exception_log_ptr == NULL or my_exception_log_index < 0 (or out of bounds) and
return a sensible psprintf message like "<behaviour>: tuple discarded (exception
log unavailable)" using get_exception_behaviour_name(), otherwise proceed to
access SpockExceptionLog as before; reference symbols:
discard_collateral_message, exception_log_ptr, my_exception_log_index,
SpockExceptionLog, get_exception_behaviour_name().
In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 37-39: The fixed sleep(2) after psql_or_bail calls can be flaky;
replace it with a robust wait that polls the cluster until the new setting is
visible: after psql_or_bail(2, "SELECT pg_reload_conf()"), loop calling
psql_or_bail or similar to run "SHOW spock.exception_behaviour" and break only
when it returns "transdiscard" (with a short sleep/backoff and an overall
timeout), or alternatively increase sleep(2) to a larger value like sleep(5) if
you prefer the simpler approach; update references in this test around
psql_or_bail, pg_reload_conf and sleep to implement the polling and timeout
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 600e23c1-813b-4b1f-ad3d-60492480d9a5
📒 Files selected for processing (2)
src/spock_apply.ctests/tap/t/020_exception_root_cause.pl
danolivo
left a comment
There was a problem hiding this comment.
- handle_commit increments this counter too. What if we fail somewhere in the middle of a commit, due to a deferred unique or FK constraint, as an example? Not sure if anything would go wrong (except incorrect reference in the exceptions table), but there should be a regression test to verify it is OK.
- discard_collateral_message() dereferences exception_log_ptr[my_exception_log_index] with no guard; the SQL caller asserts the index, but the DML caller doesn't. An Assert(exception_log_ptr != NULL && my_exception_log_index >= 0) in the helper would match the paranoia level of the surrounding code.
- "tuple discarded" - Does this make sense in the case of SQL path (DDL command)? What tuple has been discarded in such a case?
… a row
TRANSDISCARD/SUB_DISABLE replay records the real error on the row that
failed and points the other discarded rows at it via command_counter
(cheaper than repeating the message on every row). When the failure is not
attributable to a specific row, failed_action is 0 and that pointer dangles
-- no row's counter is 0 -- so the captured cause was lost, e.g. "tuple
discarded due to exception at command_counter 0".
Extract the collateral-discard message into discard_collateral_message(),
shared by the DML and SQL replay paths, and surface the captured
initial_error_message when failed_action is 0 instead of an empty pointer.
The helper asserts the exception-log index is valid and names the discarded
entry ("tuple" vs "statement") so the SQL/DDL path does not claim a tuple
was discarded.
A failure during COMMIT (e.g. a deferred constraint trigger) is likewise
not attributable to any replayed row: handle_commit has already bumped the
action counter, so the command_counter pointer dangled. Treat it as
non-attributable so the captured root cause is surfaced there too.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A two-row transaction fails on its second row (subscriber ALWAYS BEFORE INSERT trigger RAISEs). Asserts the failing row carries the real error and the non-failing row points at it via command_counter (TRANSDISCARD). Also covers the commit-time discard: a DEFERRABLE INITIALLY DEFERRED constraint trigger (ENABLE ALWAYS) that RAISEs at COMMIT exercises the non-row-attributable path and asserts the root cause is surfaced, the transaction is discarded, and apply recovers. Polls for the applied GUC instead of a fixed sleep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7ea8acd to
6c32275
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/tap/t/020_exception_root_cause.pl (2)
20-24: ⚡ Quick winUpdate stale scenario comments to match the actual assertions.
Line 23 says the non-failing row should contain the real
RAISEtext, but the test asserts acommand_counterpointer for that row. Also, the Scenario B header still says “deferred unique” while the setup uses a deferred constraint trigger.✏️ Suggested doc/comment fix
-# path. Its exception_log entry must contain the real RAISE text. +# path. Its exception_log entry should point to the failing row via +# command_counter, while the failing row carries the real RAISE text. @@ -# Scenario B: commit-time failure (deferred unique), not attributable to a row +# Scenario B: commit-time failure (deferred constraint trigger), not attributable to a rowAlso applies to: 146-147
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/tap/t/020_exception_root_cause.pl` around lines 20 - 24, Update the stale scenario comments in tests/tap/t/020_exception_root_cause.pl to match the actual assertions: change the Scenario A comment that currently says the non-failing row "should contain the real RAISE text" to instead say the test asserts a command_counter pointer for that row (errmsg==NULL path), and update the Scenario B header text from "deferred unique" to "deferred constraint trigger"; apply the same comment corrections where duplicated later (lines corresponding to the second occurrence around 146-147).
167-168: ⚡ Quick winHarden the commit-time assertion by explicitly rejecting dangling pointer text.
The current
likecheck validates root-cause presence, but won’t fail ifcommand_counter 0also appears. Add a negative assertion here.✅ Suggested assertion hardening
like($defcon_msg, qr/deferredboom/, 'commit-time failure surfaces the captured root cause (not a dangling command_counter)'); +unlike($defcon_msg, qr/command_counter\s+0/, + 'commit-time failure message does not contain a dangling command_counter 0 pointer');🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/tap/t/020_exception_root_cause.pl` around lines 167 - 168, The existing test asserts $defcon_msg matches /deferredboom/ but may still pass if it also contains a dangling pointer string like "command_counter 0"; update the test to harden it by adding an explicit negative assertion that $defcon_msg does NOT match /command_counter\s*0/ (use the test harness's unlike/unmatched assertion with a clear failure message) so the commit-time failure must show the captured root cause and not a dangling command_counter.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 20-24: Update the stale scenario comments in
tests/tap/t/020_exception_root_cause.pl to match the actual assertions: change
the Scenario A comment that currently says the non-failing row "should contain
the real RAISE text" to instead say the test asserts a command_counter pointer
for that row (errmsg==NULL path), and update the Scenario B header text from
"deferred unique" to "deferred constraint trigger"; apply the same comment
corrections where duplicated later (lines corresponding to the second occurrence
around 146-147).
- Around line 167-168: The existing test asserts $defcon_msg matches
/deferredboom/ but may still pass if it also contains a dangling pointer string
like "command_counter 0"; update the test to harden it by adding an explicit
negative assertion that $defcon_msg does NOT match /command_counter\s*0/ (use
the test harness's unlike/unmatched assertion with a clear failure message) so
the commit-time failure must show the captured root cause and not a dangling
command_counter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b34791cb-dcc4-4629-9c92-b4c9870ab02d
📒 Files selected for processing (2)
src/spock_apply.ctests/tap/t/020_exception_root_cause.pl