Skip to content

spock_apply: surface root cause when a discard is not attributable to a row#497

Open
rasifr wants to merge 2 commits into
mainfrom
fix/exception-log-surface-root-cause
Open

spock_apply: surface root cause when a discard is not attributable to a row#497
rasifr wants to merge 2 commits into
mainfrom
fix/exception-log-surface-root-cause

Conversation

@rasifr

@rasifr rasifr commented Jun 10, 2026

Copy link
Copy Markdown
Member
TRANSDISCARD/SUB_DISABLE replay records the real error on the row that
failed and points the other discarded rows at it via command_counter
(cheaper than repeating the message on every row). When the failure is not
attributable to a specific row, failed_action is 0 and that pointer dangles
-- no row's counter is 0 -- so the captured cause was lost, e.g. "tuple
discarded due to exception at command_counter 0".

Extract the collateral-discard message into discard_collateral_message(),
shared by the DML and SQL replay paths, and surface the captured
initial_error_message when failed_action is 0 instead of an empty pointer.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

A new helper centralizes construction of synthetic collateral discard messages; two exception logging call sites are updated to use it, commit-time failures are marked non-attributable, and a TAP test validates root-cause capture for row- and commit-time failures.

Changes

Exception root-cause message centralization

Layer / File(s) Summary
Collateral message helper definition
src/spock_apply.c
New static helper discard_collateral_message() constructs collateral discard text using exception behaviour name and the active exception log's failed_action and/or initial_error_message. Comment updated to document delegated construction.
Exception logging call sites refactored
src/spock_apply.c
log_insert_exception() and the queued-SQL collateral discard path in handle_sql_or_exception() now call discard_collateral_message() instead of formatting the failed-action-only string inline.
Commit-time attribution fix
src/spock_apply.c
apply_work exception handler now clears failed_action for COMMIT-context failures so later collateral messages are not attributed to a non-existent tuple/command_counter.
Root-cause logging test
tests/tap/t/020_exception_root_cause.pl
New TAP test provisions a 2-node cluster, sets spock.exception_behaviour = transdiscard, installs triggers to fail one row and to fail at COMMIT, and asserts spock.exception_log entries capture either the real root-cause message or a command_counter reference; verifies apply resumes after commit-time discard.

Poem

🐰 I stitched the messages tidy and bright,
Collateral errors now land just right.
One helper hums, two callers sing,
Tests confirm the root-cause ring,
The rabbit nods—replication's light. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: surfacing the root cause error message when a discard happens and cannot be attributed to a specific row.
Description check ✅ Passed The description accurately explains the problem being solved and the solution approach, relating directly to the changeset's modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/exception-log-surface-root-cause

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
src/spock_apply.c

src/spock_apply.c:12:10: fatal error: 'postgres.h' file not found
12 | #include "postgres.h"
| ^~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/6c322752efb4c28d6a100735962da53749ee1f77-f72196100ba30b8c/tmp/clang_command_.tmp.19ecbe.txt
++Contents of '/tmp/coderabbit-infer/6c322752efb4c28d6a100735962da53749ee1f77-f72196100ba30b8c/tmp/clang_command_.tmp.19ecbe.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all" "-disable-free"
"-clear-ast

... [truncated 683 characters] ...

"/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/f72196100ba30b8c/file.o" "-x" "c"
"src/spock_apply.c" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production

codacy-production Bot commented Jun 10, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/tap/t/020_exception_root_cause.pl (1)

37-39: 💤 Low value

Consider more robust wait for config propagation.

The sleep(2) at line 39 is a fixed time-based wait that could be flaky on slow or heavily-loaded systems. While pg_reload_conf() is immediate, the apply worker may need time to pick up the new setting. Consider either:

  1. Increasing the sleep duration to reduce flakiness (e.g., sleep(5))
  2. Polling SHOW spock.exception_behaviour until it returns transdiscard

However, since this is a TAP test in a controlled environment and the current approach is simple and widely used in the codebase, the current implementation is acceptable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 37 - 39, The fixed
sleep(2) after psql_or_bail calls can be flaky; replace it with a robust wait
that polls the cluster until the new setting is visible: after psql_or_bail(2,
"SELECT pg_reload_conf()"), loop calling psql_or_bail or similar to run "SHOW
spock.exception_behaviour" and break only when it returns "transdiscard" (with a
short sleep/backoff and an overall timeout), or alternatively increase sleep(2)
to a larger value like sleep(5) if you prefer the simpler approach; update
references in this test around psql_or_bail, pg_reload_conf and sleep to
implement the polling and timeout behavior.
src/spock_apply.c (1)

1213-1236: 💤 Low value

Consider defensive NULL check for exception_log_ptr.

The function directly accesses exception_log_ptr[my_exception_log_index] at line 1226 without checking if exception_log_ptr is NULL. While the existing codebase has similar patterns (e.g., line 1267) and call sites are protected by should_log_exception(), adding a defensive check would improve robustness:

if (exception_log_ptr == NULL || my_exception_log_index < 0)
    return psprintf("%s: tuple discarded (exception log unavailable)",
                    get_exception_behaviour_name());

This guards against unexpected call paths or initialization order issues.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/spock_apply.c` around lines 1213 - 1236, The function
discard_collateral_message reads exception_log_ptr[my_exception_log_index]
without guarding that exception_log_ptr is non-NULL or that
my_exception_log_index is valid; add a defensive check at the top of
discard_collateral_message to detect exception_log_ptr == NULL or
my_exception_log_index < 0 (or out of bounds) and return a sensible psprintf
message like "<behaviour>: tuple discarded (exception log unavailable)" using
get_exception_behaviour_name(), otherwise proceed to access SpockExceptionLog as
before; reference symbols: discard_collateral_message, exception_log_ptr,
my_exception_log_index, SpockExceptionLog, get_exception_behaviour_name().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/spock_apply.c`:
- Around line 1213-1236: The function discard_collateral_message reads
exception_log_ptr[my_exception_log_index] without guarding that
exception_log_ptr is non-NULL or that my_exception_log_index is valid; add a
defensive check at the top of discard_collateral_message to detect
exception_log_ptr == NULL or my_exception_log_index < 0 (or out of bounds) and
return a sensible psprintf message like "<behaviour>: tuple discarded (exception
log unavailable)" using get_exception_behaviour_name(), otherwise proceed to
access SpockExceptionLog as before; reference symbols:
discard_collateral_message, exception_log_ptr, my_exception_log_index,
SpockExceptionLog, get_exception_behaviour_name().

In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 37-39: The fixed sleep(2) after psql_or_bail calls can be flaky;
replace it with a robust wait that polls the cluster until the new setting is
visible: after psql_or_bail(2, "SELECT pg_reload_conf()"), loop calling
psql_or_bail or similar to run "SHOW spock.exception_behaviour" and break only
when it returns "transdiscard" (with a short sleep/backoff and an overall
timeout), or alternatively increase sleep(2) to a larger value like sleep(5) if
you prefer the simpler approach; update references in this test around
psql_or_bail, pg_reload_conf and sleep to implement the polling and timeout
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 600e23c1-813b-4b1f-ad3d-60492480d9a5

📥 Commits

Reviewing files that changed from the base of the PR and between 94f4ad6 and 7ea8acd.

📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/tap/t/020_exception_root_cause.pl

@rasifr rasifr requested a review from danolivo June 10, 2026 11:35

@danolivo danolivo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. handle_commit increments this counter too. What if we fail somewhere in the middle of a commit, due to a deferred unique or FK constraint, as an example? Not sure if anything would go wrong (except incorrect reference in the exceptions table), but there should be a regression test to verify it is OK.
  2. discard_collateral_message() dereferences exception_log_ptr[my_exception_log_index] with no guard; the SQL caller asserts the index, but the DML caller doesn't. An Assert(exception_log_ptr != NULL && my_exception_log_index >= 0) in the helper would match the paranoia level of the surrounding code.
  3. "tuple discarded" - Does this make sense in the case of SQL path (DDL command)? What tuple has been discarded in such a case?

rasifr and others added 2 commits June 10, 2026 21:02
… a row

TRANSDISCARD/SUB_DISABLE replay records the real error on the row that
failed and points the other discarded rows at it via command_counter
(cheaper than repeating the message on every row). When the failure is not
attributable to a specific row, failed_action is 0 and that pointer dangles
-- no row's counter is 0 -- so the captured cause was lost, e.g. "tuple
discarded due to exception at command_counter 0".

Extract the collateral-discard message into discard_collateral_message(),
shared by the DML and SQL replay paths, and surface the captured
initial_error_message when failed_action is 0 instead of an empty pointer.
The helper asserts the exception-log index is valid and names the discarded
entry ("tuple" vs "statement") so the SQL/DDL path does not claim a tuple
was discarded.

A failure during COMMIT (e.g. a deferred constraint trigger) is likewise
not attributable to any replayed row: handle_commit has already bumped the
action counter, so the command_counter pointer dangled. Treat it as
non-attributable so the captured root cause is surfaced there too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A two-row transaction fails on its second row (subscriber ALWAYS BEFORE
INSERT trigger RAISEs). Asserts the failing row carries the real error and
the non-failing row points at it via command_counter (TRANSDISCARD).

Also covers the commit-time discard: a DEFERRABLE INITIALLY DEFERRED
constraint trigger (ENABLE ALWAYS) that RAISEs at COMMIT exercises the
non-row-attributable path and asserts the root cause is surfaced, the
transaction is discarded, and apply recovers. Polls for the applied GUC
instead of a fixed sleep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rasifr rasifr force-pushed the fix/exception-log-surface-root-cause branch from 7ea8acd to 6c32275 Compare June 10, 2026 16:03

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/tap/t/020_exception_root_cause.pl (2)

20-24: ⚡ Quick win

Update stale scenario comments to match the actual assertions.

Line 23 says the non-failing row should contain the real RAISE text, but the test asserts a command_counter pointer for that row. Also, the Scenario B header still says “deferred unique” while the setup uses a deferred constraint trigger.

✏️ Suggested doc/comment fix
-# path. Its exception_log entry must contain the real RAISE text.
+# path. Its exception_log entry should point to the failing row via
+# command_counter, while the failing row carries the real RAISE text.
@@
-# Scenario B: commit-time failure (deferred unique), not attributable to a row
+# Scenario B: commit-time failure (deferred constraint trigger), not attributable to a row

Also applies to: 146-147

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 20 - 24, Update the
stale scenario comments in tests/tap/t/020_exception_root_cause.pl to match the
actual assertions: change the Scenario A comment that currently says the
non-failing row "should contain the real RAISE text" to instead say the test
asserts a command_counter pointer for that row (errmsg==NULL path), and update
the Scenario B header text from "deferred unique" to "deferred constraint
trigger"; apply the same comment corrections where duplicated later (lines
corresponding to the second occurrence around 146-147).

167-168: ⚡ Quick win

Harden the commit-time assertion by explicitly rejecting dangling pointer text.

The current like check validates root-cause presence, but won’t fail if command_counter 0 also appears. Add a negative assertion here.

✅ Suggested assertion hardening
 like($defcon_msg, qr/deferredboom/,
     'commit-time failure surfaces the captured root cause (not a dangling command_counter)');
+unlike($defcon_msg, qr/command_counter\s+0/,
+    'commit-time failure message does not contain a dangling command_counter 0 pointer');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/020_exception_root_cause.pl` around lines 167 - 168, The existing
test asserts $defcon_msg matches /deferredboom/ but may still pass if it also
contains a dangling pointer string like "command_counter 0"; update the test to
harden it by adding an explicit negative assertion that $defcon_msg does NOT
match /command_counter\s*0/ (use the test harness's unlike/unmatched assertion
with a clear failure message) so the commit-time failure must show the captured
root cause and not a dangling command_counter.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/tap/t/020_exception_root_cause.pl`:
- Around line 20-24: Update the stale scenario comments in
tests/tap/t/020_exception_root_cause.pl to match the actual assertions: change
the Scenario A comment that currently says the non-failing row "should contain
the real RAISE text" to instead say the test asserts a command_counter pointer
for that row (errmsg==NULL path), and update the Scenario B header text from
"deferred unique" to "deferred constraint trigger"; apply the same comment
corrections where duplicated later (lines corresponding to the second occurrence
around 146-147).
- Around line 167-168: The existing test asserts $defcon_msg matches
/deferredboom/ but may still pass if it also contains a dangling pointer string
like "command_counter 0"; update the test to harden it by adding an explicit
negative assertion that $defcon_msg does NOT match /command_counter\s*0/ (use
the test harness's unlike/unmatched assertion with a clear failure message) so
the commit-time failure must show the captured root cause and not a dangling
command_counter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b34791cb-dcc4-4629-9c92-b4c9870ab02d

📥 Commits

Reviewing files that changed from the base of the PR and between 7ea8acd and 6c32275.

📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/tap/t/020_exception_root_cause.pl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants