DAOS-19016 test: Stale event pointer dereference in autotest kv_put/kv_get spin loops#18489

Draft
knard38 wants to merge 2 commits into master from ckochhof/fix/master/daos-19016/patch-001

knard38 (Contributor) commented Jun 12, 2026

Description

TODO

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

knard38 added 2 commits June 12, 2026 09:33
…et loops

The kv_put() and kv_get() functions in src/utils/daos_autotest.c have a
latent bug: when daos_eq_poll() returns a negative error code, the event
pointer evp is not populated, yet the code unconditionally dereferences
evp->ev_error on the next line.  Because evp is then stale, this can cause
a SIGSEGV, event state corruption, or double submission.

Fix:
- Initialize evp = NULL before each spin loop so that the stale-pointer
  condition is always detectable.
- Break out of the loop when rc < 0 so evp is never dereferenced after a
  poll failure.
- Add D_ASSERT(evp != NULL) after each loop to catch future regressions.
- In the kv_put() drain loop, capture ev_error for completions that arrive
  during a concurrent poll failure.
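
The loop shape described by the first three bullets can be sketched as
follows.  This is a minimal illustration under stated assumptions, not the
code from the patch: sketch_event, sketch_eq_poll() and sketch_wait_one()
are hypothetical stand-ins for the DAOS event type, daos_eq_poll() and the
spin loop, and the placement of the assertion on the success path is a
guess at how D_ASSERT(evp != NULL) coexists with the early break.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the DAOS event type; only ev_error is modelled. */
struct sketch_event {
    int ev_error;
};

/* Mirrors the DER_HG(-1020) value quoted in this PR. */
#define SKETCH_DER_HG 1020

/*
 * Hypothetical stub for daos_eq_poll(): replays a scripted sequence of
 * return codes.  It writes *evp only when it reports one completion, so
 * *evp is left untouched on rc == 0 and on rc < 0.
 */
static int
sketch_eq_poll(const int *script, int *pos, struct sketch_event *done,
               struct sketch_event **evp)
{
    int rc = script[(*pos)++];

    if (rc == 1)
        *evp = done;
    return rc;
}

/*
 * Spin until one event completes or the poll fails.
 * The script must end with a non-zero entry so the loop terminates.
 */
static int
sketch_wait_one(const int *script, struct sketch_event *done)
{
    struct sketch_event *evp = NULL; /* reset so a stale pointer is detectable */
    int                  pos = 0;
    int                  rc;

    do {
        rc = sketch_eq_poll(script, &pos, done, &evp);
        if (rc < 0)
            break; /* poll failed: evp was not populated */
    } while (rc == 0);

    if (rc < 0)
        return rc; /* propagate the poll error without touching evp */

    assert(evp != NULL); /* a completion was reported, so evp must be set */
    return evp->ev_error;
}
```

With a script of {0, 0, 1} the sketch returns the completed event's
ev_error; with {0, -SKETCH_DER_HG} it returns the poll error and never
dereferences evp.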

To facilitate testing, add fault injection point DAOS_FAULT_EQ_POLL_FAIL
(DAOS_FAIL_SYS_TEST_GROUP_LOC | 0x1000, decimal 135168) in daos_eq_poll().
When triggered it returns -DER_HG, simulating a transient Mercury transport
error without needing a real network failure.
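
The arithmetic behind the decimal ID can be checked with a short sketch.
The SKETCH_* names below are hypothetical, and the group value 2 and shift
16 are assumptions chosen so that the result matches the 135168 stated
above; the real DAOS_FAIL_SYS_TEST_GROUP_LOC definition is not shown in
this PR.

```c
#include <assert.h>

/* Assumed values: group 2 shifted left by 16 bits gives 0x20000 (131072). */
#define SKETCH_FAIL_GROUP_SHIFT        16
#define SKETCH_FAIL_SYS_TEST_GROUP     2
#define SKETCH_FAIL_SYS_TEST_GROUP_LOC \
    (SKETCH_FAIL_SYS_TEST_GROUP << SKETCH_FAIL_GROUP_SHIFT)

/* Group location OR-ed with the 0x1000 offset from the commit message. */
#define SKETCH_FAULT_EQ_POLL_FAIL (SKETCH_FAIL_SYS_TEST_GROUP_LOC | 0x1000)

/* Returns the fault ID as the decimal value a test configuration would use. */
static int
sketch_fault_id(void)
{
    return SKETCH_FAULT_EQ_POLL_FAIL;
}
```

Under these assumptions the ID is 0x21000, that is 131072 + 4096 = 135168.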

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…ndling

Add a new pool functional test PoolAutotestEqPollFITest that verifies the
fix for the stale event pointer dereference in the kv_put() / kv_get()
spin loops of src/utils/daos_autotest.c (DAOS-19016).

The test enables fault injection point DAOS_FAULT_EQ_POLL_FAIL (ID 135168)
via the YAML fault_list section.  This causes daos_eq_poll() to return
-DER_HG, exercising the rc < 0 break added by the fix.

Verification:
  - daos pool autotest exits with rc == 1 (clean failure, no crash)
  - DER_HG(-1020) appears in the stderr output
  - the pool remains healthy after the expected autotest failure

Quick-Functional: true
Test-tag: test_pool_autotest_eq_poll_fi,PoolAutotestEqPollFITest
Test-repeat: 5
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@github-actions

Ticket title is 'Stale event pointer dereference in autotest kv_put/kv_get spin loops'
Status is 'In Progress'
Errors are Title of PR is too long
https://daosio.atlassian.net/browse/DAOS-19016
