spock_apply: surface root cause when a discard is not attributable to a row by rasifr · Pull Request #497 · pgEdge/spock

rasifr · 2026-06-10T10:24:54Z

TRANSDISCARD/SUB_DISABLE replay records the real error on the row that
failed and points the other discarded rows at it via command_counter
(cheaper than repeating the message on every row). When the failure is not
attributable to a specific row, failed_action is 0 and that pointer dangles
-- no row's counter is 0 -- so the captured cause was lost, e.g. "tuple
discarded due to exception at command_counter 0".

Extract the collateral-discard message into discard_collateral_message(),
shared by the DML and SQL replay paths, and surface the captured
initial_error_message when failed_action is 0 instead of an empty pointer.

coderabbitai · 2026-06-10T10:25:07Z

📝 Walkthrough

Walkthrough

A new helper centralizes construction of synthetic collateral discard messages; two exception logging call sites are updated to use it, commit-time failures are marked non-attributable, and a TAP test validates root-cause capture for row- and commit-time failures.

Changes

Exception root-cause message centralization

Layer / File(s)	Summary
Collateral message helper definition `src/spock_apply.c`	New static helper `discard_collateral_message()` constructs collateral discard text using exception behaviour name and the active exception log's `failed_action` and/or `initial_error_message`. Comment updated to document delegated construction.
Exception logging call sites refactored `src/spock_apply.c`	`log_insert_exception()` and the queued-SQL collateral discard path in `handle_sql_or_exception()` now call `discard_collateral_message()` instead of formatting the failed-action-only string inline.
Commit-time attribution fix `src/spock_apply.c`	`apply_work` exception handler now clears `failed_action` for COMMIT-context failures so later collateral messages are not attributed to a non-existent tuple/command_counter.
Root-cause logging test `tests/tap/t/020_exception_root_cause.pl`	New TAP test provisions a 2-node cluster, sets `spock.exception_behaviour = transdiscard`, installs triggers to fail one row and to fail at COMMIT, and asserts `spock.exception_log` entries capture either the real root-cause message or a command_counter reference; verifies apply resumes after commit-time discard.

Poem

🐰 I stitched the messages tidy and bright,
Collateral errors now land just right.
One helper hums, two callers sing,
Tests confirm the root-cause ring,
The rabbit nods—replication's light. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: surfacing the root cause error message when a discard happens and cannot be attributed to a specific row.
Description check	✅ Passed	The description accurately explains the problem being solved and the solution approach, relating directly to the changeset's modifications.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/exception-log-surface-root-cause

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)

src/spock_apply.c

src/spock_apply.c:12:10: fatal error: 'postgres.h' file not found
12 | #include "postgres.h"
| ^~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/6c322752efb4c28d6a100735962da53749ee1f77-f72196100ba30b8c/tmp/clang_command_.tmp.19ecbe.txt
++Contents of '/tmp/coderabbit-infer/6c322752efb4c28d6a100735962da53749ee1f77-f72196100ba30b8c/tmp/clang_command_.tmp.19ecbe.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all" "-disable-free"
"-clear-ast

... [truncated 683 characters] ...

"/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/f72196100ba30b8c/file.o" "-x" "c"
"src/spock_apply.c" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codacy-production · 2026-06-10T10:26:02Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results

Duplication 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

coderabbitai

🧹 Nitpick comments (2)

tests/tap/t/020_exception_root_cause.pl (1)
37-39: 💤 Low value

Consider more robust wait for config propagation.

The sleep(2) at line 39 is a fixed time-based wait that could be flaky on slow or heavily-loaded systems. While pg_reload_conf() is immediate, the apply worker may need time to pick up the new setting. Consider either:

Increasing the sleep duration to reduce flakiness (e.g., sleep(5))

Polling SHOW spock.exception_behaviour until it returns transdiscard

However, since this is a TAP test in a controlled environment and the current approach is simple and widely used in the codebase, the current implementation is acceptable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 37 - 39, The fixed
sleep(2) after psql_or_bail calls can be flaky; replace it with a robust wait
that polls the cluster until the new setting is visible: after psql_or_bail(2,
"SELECT pg_reload_conf()"), loop calling psql_or_bail or similar to run "SHOW
spock.exception_behaviour" and break only when it returns "transdiscard" (with a
short sleep/backoff and an overall timeout), or alternatively increase sleep(2)
to a larger value like sleep(5) if you prefer the simpler approach; update
references in this test around psql_or_bail, pg_reload_conf and sleep to
implement the polling and timeout behavior.
src/spock_apply.c (1)
1213-1236: 💤 Low value

Consider defensive NULL check for exception_log_ptr.

The function directly accesses exception_log_ptr[my_exception_log_index] at line 1226 without checking if exception_log_ptr is NULL. While the existing codebase has similar patterns (e.g., line 1267) and call sites are protected by should_log_exception(), adding a defensive check would improve robustness:
if (exception_log_ptr == NULL || my_exception_log_index < 0)
    return psprintf("%s: tuple discarded (exception log unavailable)",
                    get_exception_behaviour_name());
This guards against unexpected call paths or initialization order issues.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/spock_apply.c` around lines 1213 - 1236, The function
discard_collateral_message reads exception_log_ptr[my_exception_log_index]
without guarding that exception_log_ptr is non-NULL or that
my_exception_log_index is valid; add a defensive check at the top of
discard_collateral_message to detect exception_log_ptr == NULL or
my_exception_log_index < 0 (or out of bounds) and return a sensible psprintf
message like "<behaviour>: tuple discarded (exception log unavailable)" using
get_exception_behaviour_name(), otherwise proceed to access SpockExceptionLog as
before; reference symbols: discard_collateral_message, exception_log_ptr,
my_exception_log_index, SpockExceptionLog, get_exception_behaviour_name().

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/spock_apply.c`:
- Around line 1213-1236: The function discard_collateral_message reads
exception_log_ptr[my_exception_log_index] without guarding that
exception_log_ptr is non-NULL or that my_exception_log_index is valid; add a
defensive check at the top of discard_collateral_message to detect
exception_log_ptr == NULL or my_exception_log_index < 0 (or out of bounds) and
return a sensible psprintf message like "<behaviour>: tuple discarded (exception
log unavailable)" using get_exception_behaviour_name(), otherwise proceed to
access SpockExceptionLog as before; reference symbols:
discard_collateral_message, exception_log_ptr, my_exception_log_index,
SpockExceptionLog, get_exception_behaviour_name().

In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 37-39: The fixed sleep(2) after psql_or_bail calls can be flaky;
replace it with a robust wait that polls the cluster until the new setting is
visible: after psql_or_bail(2, "SELECT pg_reload_conf()"), loop calling
psql_or_bail or similar to run "SHOW spock.exception_behaviour" and break only
when it returns "transdiscard" (with a short sleep/backoff and an overall
timeout), or alternatively increase sleep(2) to a larger value like sleep(5) if
you prefer the simpler approach; update references in this test around
psql_or_bail, pg_reload_conf and sleep to implement the polling and timeout
behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 600e23c1-813b-4b1f-ad3d-60492480d9a5

📥 Commits

Reviewing files that changed from the base of the PR and between 94f4ad6 and 7ea8acd.

📒 Files selected for processing (2)

src/spock_apply.c
tests/tap/t/020_exception_root_cause.pl

danolivo

handle_commit increments this counter too. What if we fail somewhere in the middle of a commit, due to a deferred unique or FK constraint, as an example? Not sure if anything would go wrong (except incorrect reference in the exceptions table), but there should be a regression test to verify it is OK.
discard_collateral_message() dereferences exception_log_ptr[my_exception_log_index] with no guard; the SQL caller asserts the index, but the DML caller doesn't. An Assert(exception_log_ptr != NULL && my_exception_log_index >= 0) in the helper would match the paranoia level of the surrounding code.
"tuple discarded" - Does this make sense in the case of SQL path (DDL command)? What tuple has been discarded in such a case?

… a row TRANSDISCARD/SUB_DISABLE replay records the real error on the row that failed and points the other discarded rows at it via command_counter (cheaper than repeating the message on every row). When the failure is not attributable to a specific row, failed_action is 0 and that pointer dangles -- no row's counter is 0 -- so the captured cause was lost, e.g. "tuple discarded due to exception at command_counter 0". Extract the collateral-discard message into discard_collateral_message(), shared by the DML and SQL replay paths, and surface the captured initial_error_message when failed_action is 0 instead of an empty pointer. The helper asserts the exception-log index is valid and names the discarded entry ("tuple" vs "statement") so the SQL/DDL path does not claim a tuple was discarded. A failure during COMMIT (e.g. a deferred constraint trigger) is likewise not attributable to any replayed row: handle_commit has already bumped the action counter, so the command_counter pointer dangled. Treat it as non-attributable so the captured root cause is surfaced there too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A two-row transaction fails on its second row (subscriber ALWAYS BEFORE INSERT trigger RAISEs). Asserts the failing row carries the real error and the non-failing row points at it via command_counter (TRANSDISCARD). Also covers the commit-time discard: a DEFERRABLE INITIALLY DEFERRED constraint trigger (ENABLE ALWAYS) that RAISEs at COMMIT exercises the non-row-attributable path and asserts the root cause is surfaced, the transaction is discarded, and apply recovers. Polls for the applied GUC instead of a fixed sleep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (2)

tests/tap/t/020_exception_root_cause.pl (2)

20-24: ⚡ Quick win

Update stale scenario comments to match the actual assertions.

Line 23 says the non-failing row should contain the real RAISE text, but the test asserts a command_counter pointer for that row. Also, the Scenario B header still says “deferred unique” while the setup uses a deferred constraint trigger.

✏️ Suggested doc/comment fix

-# path. Its exception_log entry must contain the real RAISE text.
+# path. Its exception_log entry should point to the failing row via
+# command_counter, while the failing row carries the real RAISE text.
@@
-# Scenario B: commit-time failure (deferred unique), not attributable to a row
+# Scenario B: commit-time failure (deferred constraint trigger), not attributable to a row

Also applies to: 146-147

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 20 - 24, Update the
stale scenario comments in tests/tap/t/020_exception_root_cause.pl to match the
actual assertions: change the Scenario A comment that currently says the
non-failing row "should contain the real RAISE text" to instead say the test
asserts a command_counter pointer for that row (errmsg==NULL path), and update
the Scenario B header text from "deferred unique" to "deferred constraint
trigger"; apply the same comment corrections where duplicated later (lines
corresponding to the second occurrence around 146-147).

167-168: ⚡ Quick win

Harden the commit-time assertion by explicitly rejecting dangling pointer text.

The current like check validates root-cause presence, but won’t fail if command_counter 0 also appears. Add a negative assertion here.

✅ Suggested assertion hardening

 like($defcon_msg, qr/deferredboom/,
     'commit-time failure surfaces the captured root cause (not a dangling command_counter)');
+unlike($defcon_msg, qr/command_counter\s+0/,
+    'commit-time failure message does not contain a dangling command_counter 0 pointer');

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 167 - 168, The existing
test asserts $defcon_msg matches /deferredboom/ but may still pass if it also
contains a dangling pointer string like "command_counter 0"; update the test to
harden it by adding an explicit negative assertion that $defcon_msg does NOT
match /command_counter\s*0/ (use the test harness's unlike/unmatched assertion
with a clear failure message) so the commit-time failure must show the captured
root cause and not a dangling command_counter.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 20-24: Update the stale scenario comments in
tests/tap/t/020_exception_root_cause.pl to match the actual assertions: change
the Scenario A comment that currently says the non-failing row "should contain
the real RAISE text" to instead say the test asserts a command_counter pointer
for that row (errmsg==NULL path), and update the Scenario B header text from
"deferred unique" to "deferred constraint trigger"; apply the same comment
corrections where duplicated later (lines corresponding to the second occurrence
around 146-147).
- Around line 167-168: The existing test asserts $defcon_msg matches
/deferredboom/ but may still pass if it also contains a dangling pointer string
like "command_counter 0"; update the test to harden it by adding an explicit
negative assertion that $defcon_msg does NOT match /command_counter\s*0/ (use
the test harness's unlike/unmatched assertion with a clear failure message) so
the commit-time failure must show the captured root cause and not a dangling
command_counter.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b34791cb-dcc4-4629-9c92-b4c9870ab02d

📥 Commits

Reviewing files that changed from the base of the PR and between 7ea8acd and 6c32275.

📒 Files selected for processing (2)

src/spock_apply.c
tests/tap/t/020_exception_root_cause.pl

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

rasifr requested a review from danolivo June 10, 2026 11:35

danolivo requested changes Jun 10, 2026

View reviewed changes

rasifr and others added 2 commits June 10, 2026 21:02

rasifr force-pushed the fix/exception-log-surface-root-cause branch from 7ea8acd to 6c32275 Compare June 10, 2026 16:03

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spock_apply: surface root cause when a discard is not attributable to a row#497

spock_apply: surface root cause when a discard is not attributable to a row#497
rasifr wants to merge 2 commits into
mainfrom
fix/exception-log-surface-root-cause

rasifr commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Walkthrough

Changes

Poem

Uh oh!

codacy-production Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

danolivo left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rasifr commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

Uh oh!

codacy-production Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

danolivo left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

codacy-production Bot commented Jun 10, 2026 •

edited

Loading