New Failover: Port LDAP provider to the new failover by justin-stephenson · Pull Request #8751 · SSSD/sssd

justin-stephenson · 2026-05-29T16:38:24Z

Includes pre-requisite changes to refactor backend to use only one return code instead of 3
LDAP Servers are hard-coded which will be fixed later
Disables building AD/IPA provider and excludes AD/IPA system tests - will be handled in later PRs to port these providers

Assisted by: Claude code (Sonnet 4.6)

In low level ldap search functions

The data provider handler methods now return a single output argument. Remove 'dp_error/dp_err' and 'error_message' usage across provider code. The getAccountDomain method still needs to return 'domain_name' string. Assisted by: Claude code (Sonnet 4.6)

gemini-code-assist

Code Review

This pull request refactors SSSD's LDAP and AD providers by replacing the legacy sdap_id_op connection cache and retry logic with a new failover transaction mechanism (sss_failover_ctx). It removes the sdap_id_op files and the sssd-minimal backend, updating all dependent providers to propagate standard error codes directly. The review feedback highlights several critical issues: test LDAP servers are accidentally hardcoded in the production initialization path; returning EINVAL on connection failures in auth_connect_done and sdap_access_filter_connect_done masks actual errors and bypasses offline fallback logic; passing NULL instead of a valid failover context to sdap_ad_resolve_sids_send will cause a crash; and commenting out the assignment to state->search_bases leaves it uninitialized.

gemini-code-assist · 2026-05-29T16:42:54Z

+    server = sss_failover_server_new(fctx, "fake_1.ldap.test",
+                                     "ldap://fake_1.ldap.test", 389, 1, 1);


The LDAP servers ('fake_1.ldap.test', 'fake_2.ldap.test', 'master.ldap.test') are hardcoded in sssm_ldap_init_failover. This appears to be test/mock code that was accidentally left in the production LDAP provider initialization. In production, the failover servers must be dynamically discovered or initialized from the SSSD configuration (e.g., ldap_uri and ldap_backup_uri options in opts), rather than being hardcoded to these test domains.

WARNING Server names are hardcoded!

sssd-bot

Review done using Claude Code / claude-opus-4-6

Functional Issues

Missing return after tevent_req_error in users_get_done

src/providers/ldap/ldap_id.c:531-532 — After calling tevent_req_error(req, ERR_SERVER_FAILURE), execution falls through into the ENOENT/RFC2307 fallback logic and eventually hits a second tevent_req_error(req, ret) at line 576. Calling tevent_req_error on an already-finished request is undefined behavior in tevent (no-op in some versions, assertion failure in others). Add return; after line 532.

Missing return after tevent_req_error in groups_get_done

src/providers/ldap/ldap_id.c:899-902 — Same pattern. After tevent_req_error(req, ERR_SERVER_FAILURE) at line 901, execution falls through past the ENOENT/MPG checks and reaches tevent_req_done(req) at line 940, calling done on an already-errored request.

NULL fctx passed to sdap_ad_resolve_sids_send

src/providers/ldap/sdap_async_initgroups_ad.c:1073-1074 — sdap_ad_resolve_sids_send is called with NULL as the fctx parameter. This NULL is stored in state->fctx (line 290) and eventually passed to groups_get_send → sss_failover_transaction_send, causing a NULL dereference. This code path is triggered by AD tokengroups-based initgroups lookups (guarded by SDAP_SCHEMA_AD at the call site). Should pass state->id_ctx->fctx.

Uninitialized subreq used in sdap_ad_get_domain_local_groups_send

src/providers/ldap/sdap_async_initgroups_ad.c:1224-1237 — The sdap_id_op_connect_send call that initialized subreq is inside a /* ... */ comment block, but tevent_req_set_callback(subreq, ...) at line 1236 still uses the uninitialized pointer. Additionally, state->search_bases assignment is commented out at line 1212, leaving it NULL. This function is reachable via sdap_ad_check_domain_local_groups during initgroups for AD subdomain users (guard at sdap_async_initgroups.c:3161-3163). Both the uninitialized subreq and NULL search_bases will crash.

Pre-transaction access to fctx->active_server->uri

src/providers/ldap/ldap_auth.c:788 — ldap_is_ldapi_url(state->ctx->fctx->active_server->uri) is called before sss_failover_transaction_ex_send at line 793. On the first authentication attempt or after a server rotation where active_server hasn't been set yet, this dereferences NULL. The LDAPI check should be moved inside auth_connect_done where the connection is established, or the code should guard against active_server == NULL.

sdap_access_filter_done loses ERR_INVALID_FILTER handling

src/providers/ldap/sdap_access.c:1004-1005 — The old code detected specific LDAP errors from sdap_get_generic_recv (e.g., malformed filter) and mapped them to ERR_ACCESS_DENIED, which is the safe default (deny access on bad filter). The new code maps all errors to ERR_SERVER_FAILURE, which triggers retry across all servers. A malformed access filter will fail identically on every server, eventually putting the client offline instead of denying access — a security-relevant behavior change.

Silent error fallthrough in sdap_autofs_enumerate_done

src/providers/ldap/sdap_autofs.c:136-158 — After removing the retry logic, errors from sdap_autofs_setautomntent_recv that are neither EOK nor ENOENT fall through to tevent_req_done(req) at line 158, treating failures as success. The if (ret == ENOENT) block handles ENOENT specifically, but any other error (e.g., ERR_SERVER_FAILURE, ENOMEM) silently succeeds.

Inconsistent ERR_OFFLINE handling in enumeration

src/providers/ldap/sdap_async_enum.c:147 and sdap_async_enum.c:258 handle ERR_OFFLINE by calling tevent_req_done (graceful degradation) in users_done and services_done respectively. But sdap_dom_enum_ex_groups_done at line 206 does not check for ERR_OFFLINE — if the server goes offline between user and group enumeration, groups will fail the entire enum cycle while users succeeded gracefully.

Enumeration creates separate failover transactions per entity type

src/providers/ldap/sdap_async_enum.c:95,158,213 — Three independent sss_failover_transaction_send calls are made for users, groups, and services within a single enumeration cycle. While the default reuse_connection=true means the same connection is reused when the server is stable, if a server fails between phases, different entity types could be enumerated from different servers, producing inconsistent cache state. The old code shared a single connection for the entire enumeration.

Nits & Non-functional Issues

no_mpg_user_fallback in generic failover layer

src/providers/failover/failover.h:105 and src/providers/failover/ldap/failover_ldap.h:33 — This boolean controls LDAP-specific MPG (magic private groups) user fallback behavior. It appears in both the generic sss_failover_ctx and the LDAP-specific sss_failover_ldap_connection. The generic failover layer should not carry provider-specific flags. Consider moving this to the LDAP layer or passing it through sdap_options.

Blanket ERR_SERVER_FAILURE mapping without diagnostic logging

Multiple files (ldap_id.c:532,900, ldap_id_netgroup.c:162, ldap_id_services.c:202, sdap_hostid.c, sdap_iphost.c, sdap_ipnetwork.c, sdap_async_autofs.c) map all non-EOK/non-ENOENT errors to ERR_SERVER_FAILURE without logging the original error code. This makes production debugging difficult since ENOMEM, EINVAL, permission errors, and actual server failures become indistinguishable in logs. Consider adding DEBUG(SSSDBG_OP_FAILURE, "... failed: %d: %s\n", ret, sss_strerror(ret)) before the mapping.

Commented-out code left in sdap_async_groups.c

src/providers/ldap/sdap_async_groups.c:1730,1747-1749,1781-1820 — Multiple commented-out declarations, a FIXME comment, and a commented-out function (sdap_get_groups_ldap_connect_done). These are AD GC-to-LDAP connection fallback stubs. If AD porting is deferred to a later PR, these should either be removed with a TODO comment referencing the follow-up work, or left as proper #if 0 blocks with explanation. C-style // and /* */ comments for disabled code are harder to track.

Commented-out code in sdap_async_initgroups_ad.c

src/providers/ldap/sdap_async_initgroups_ad.c:1212,1224-1235 — Same issue as above: the search_bases assignment and sdap_id_op connect block are commented out rather than properly stubbed or removed. Unlike the groups case, this code is reachable (see functional issue above).

Removal of GC connection preference for AD initgroups

src/providers/ldap/sdap_async_initgroups.c and sdap_async_groups.c — The old get_ldap_conn_from_sdom_pvt logic provided LDAP-over-Global-Catalog for AD group lookups when available. This is commented out with a FIXME: AD use case note. When AD is ported in the follow-up PR, this optimization should be restored — without it, AD initgroups in multi-domain forests queries individual domain controllers instead of the GC, significantly increasing latency.

Intermediate commits do not compile independently

Commit 1fcaec049 ("Port LDAP provider code to new failover") references sdap_service in sdap_id_ctx_new (removed from the function signature in the same commit), causing a build failure. The fix is in commit a5dae92c0 ("Remove unused / old failover code"). For bisectability, the fix should be folded into the earlier commit.

Unrelated fix mixed into failover port

src/providers/ldap/ldap_id_services.c:220-221 — The strtouint16 error check removes the errno test, which is an unrelated bug fix. It should be a separate commit for clean history.

Review of Existing Review Comments

gemini-code-assist: Hardcoded LDAP servers in ldap_init.c:150-151

This is in commit f4c825416 ("DO NOT PUSH TO MASTER ldap: Initialization for new failover") which is explicitly a temporary scaffolding commit not intended for merge. The hardcoded servers are expected and not a concern for this review.

gemini-code-assist: EINVAL on NULL connection masks errors (ldap_auth.c:838-846)

Partially valid. The EINVAL on NULL connection is the standard pattern used across all handlers in this PR (every connected_recv NULL check returns EINVAL). The specific concern about auth offline detection is addressed differently in the new model — the failover transaction layer handles offline/retry, not the individual handlers. However, the pattern does lose the original error code. The sss_failover_transaction_connected_recv returning NULL should be treated as a fatal internal error (hence EINVAL is reasonable), but a DEBUG log of the underlying cause would help.

gemini-code-assist: Offline fallback bypassed in sdap_access.c:956-964

This is about sdap_access_ppolicy_connect_done. The concern about losing sdap_access_decide_offline is valid in the old architecture where handlers had to decide offline behavior themselves. In the new architecture, the failover transaction layer handles the offline case before the handler is called — if all servers fail, the transaction returns ERR_NO_MORE_SERVERS rather than calling the connected callback at all. So the offline fallback is handled at a different layer, not lost. However, the ERR_INVALID_FILTER concern I raised in functional issues above is a separate, valid problem.

gemini-code-assist: NULL fctx in sdap_ad_resolve_sids_send (sdap_async_initgroups_ad.c:1073-1074)

Valid. Confirmed as a crash bug — see functional issues above.

gemini-code-assist: Commented-out search_bases (sdap_async_initgroups_ad.c:1212)

Valid. Combined with the commented-out connect block below it, this creates an uninitialized subreq usage — see functional issues above.

CodeQL: Commented-out code in ldap_id.c and sdap_async_groups.c

Valid housekeeping issues. The commented-out code should be removed or converted to proper stubs with FIXME/TODO annotations referencing the AD porting PR.

This will be done differently inside the failover code

Done as part of phased approach to port providers to new Failover

test_logging__offline_errors_are_written_to_logs_and_syslog: asserted offline logs messages are only part of old failover test_autofs__propagate_offline_status_*: asserted offline log messages are only part of the old failover test_logging__dns_resolution_issue_in_logs: asserted log messages are part of the old failover code test_failover: Needs to be updated for new failover implementation test_ad__user_authentication_when_provider_is_set_to_ldap_with_gss_spnego: 'id_provider = ldap' is hardcoded to go to 'master.ldap.test' but AD server should be used here test_ipa__subids_configured[id_provider=ldap]: 'id_provider = ldap' is hardcoded to go to 'master.ldap.test' but IPA server should be used here test_multithreaded_pac_client: requires re-using cached connection, not yet implemented in new failover test_ldap_krb5__keytab_selects_correct_principal_with_multiple_realms: Hybrid configuration issues to be fixed after hardcoding is not needed.

Temp changes to allow building, only used in AD provider codepaths. This code will be fixed correctly when the AD provider is ported.

No longer needed now that SSSD providers are being ported

Temp changes to allow building, only used in IPA provider codepaths. This code will be fixed correctly when the IPA provider is ported.

justin-stephenson · 2026-06-09T15:53:39Z

Review done using Claude Code / claude-opus-4-6

Functional Issues

Missing return after tevent_req_error in users_get_done

src/providers/ldap/ldap_id.c:531-532 — After calling tevent_req_error(req, ERR_SERVER_FAILURE), execution falls through into the ENOENT/RFC2307 fallback logic and eventually hits a second tevent_req_error(req, ret) at line 576. Calling tevent_req_error on an already-finished request is undefined behavior in tevent (no-op in some versions, assertion failure in others). Add return; after line 532.

Fixed.

Missing return after tevent_req_error in groups_get_done

src/providers/ldap/ldap_id.c:899-902 — Same pattern. After tevent_req_error(req, ERR_SERVER_FAILURE) at line 901, execution falls through past the ENOENT/MPG checks and reaches tevent_req_done(req) at line 940, calling done on an already-errored request.

Fixed.

NULL fctx passed to sdap_ad_resolve_sids_send

src/providers/ldap/sdap_async_initgroups_ad.c:1073-1074 — sdap_ad_resolve_sids_send is called with NULL as the fctx parameter. This NULL is stored in state->fctx (line 290) and eventually passed to groups_get_send → sss_failover_transaction_send, causing a NULL dereference. This code path is triggered by AD tokengroups-based initgroups lookups (guarded by SDAP_SCHEMA_AD at the call site). Should pass state->id_ctx->fctx.

Correct, this will be fixed later with AD provider porting changes.

Uninitialized subreq used in sdap_ad_get_domain_local_groups_send

src/providers/ldap/sdap_async_initgroups_ad.c:1224-1237 — The sdap_id_op_connect_send call that initialized subreq is inside a /* ... */ comment block, but tevent_req_set_callback(subreq, ...) at line 1236 still uses the uninitialized pointer. Additionally, state->search_bases assignment is commented out at line 1212, leaving it NULL. This function is reachable via sdap_ad_check_domain_local_groups during initgroups for AD subdomain users (guard at sdap_async_initgroups.c:3161-3163). Both the uninitialized subreq and NULL search_bases will crash.

Correct, this will be fixed later with AD provider porting changes.

Pre-transaction access to fctx->active_server->uri

src/providers/ldap/ldap_auth.c:788 — ldap_is_ldapi_url(state->ctx->fctx->active_server->uri) is called before sss_failover_transaction_ex_send at line 793. On the first authentication attempt or after a server rotation where active_server hasn't been set yet, this dereferences NULL. The LDAPI check should be moved inside auth_connect_done where the connection is established, or the code should guard against active_server == NULL.

This was checking state->sdap_service->uri but now sdap_service struct is removed. Maybe this check can be removed? ldap_is_ldapi_url(state->conn->uri) is also being checked in auth_connect_done()

sdap_access_filter_done loses ERR_INVALID_FILTER handling

src/providers/ldap/sdap_access.c:1004-1005 — The old code detected specific LDAP errors from sdap_get_generic_recv (e.g., malformed filter) and mapped them to ERR_ACCESS_DENIED, which is the safe default (deny access on bad filter). The new code maps all errors to ERR_SERVER_FAILURE, which triggers retry across all servers. A malformed access filter will fail identically on every server, eventually putting the client offline instead of denying access — a security-relevant behavior change.

Fixed, added handling of ERR_INVALID_FILTER

Silent error fallthrough in sdap_autofs_enumerate_done

src/providers/ldap/sdap_autofs.c:136-158 — After removing the retry logic, errors from sdap_autofs_setautomntent_recv that are neither EOK nor ENOENT fall through to tevent_req_done(req) at line 158, treating failures as success. The if (ret == ENOENT) block handles ENOENT specifically, but any other error (e.g., ERR_SERVER_FAILURE, ENOMEM) silently succeeds.

Fixed, added check for other failures.

Inconsistent ERR_OFFLINE handling in enumeration

src/providers/ldap/sdap_async_enum.c:147 and sdap_async_enum.c:258 handle ERR_OFFLINE by calling tevent_req_done (graceful degradation) in users_done and services_done respectively. But sdap_dom_enum_ex_groups_done at line 206 does not check for ERR_OFFLINE — if the server goes offline between user and group enumeration, groups will fail the entire enum cycle while users succeeded gracefully.

Fixed, added check for if ret == ERR_OFFLINE

Enumeration creates separate failover transactions per entity type

src/providers/ldap/sdap_async_enum.c:95,158,213 — Three independent sss_failover_transaction_send calls are made for users, groups, and services within a single enumeration cycle. While the default reuse_connection=true means the same connection is reused when the server is stable, if a server fails between phases, different entity types could be enumerated from different servers, producing inconsistent cache state. The old code shared a single connection for the entire enumeration.

Old code uses conn_cache, this still needs to be implemented in the new failover

justin-stephenson added 3 commits May 29, 2026 09:41

dp: Remove sdap_ret and sdap_error

9ef9c1e

Assisted by: Claude code (Sonnet 4.6)

ldap: Replace EIO/ETIMEDOUT with ERR_SERVER_FAILURE

b50cd70

In low level ldap search functions

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

github-advanced-security AI found potential problems May 29, 2026

View reviewed changes

justin-stephenson force-pushed the ldap_port_new_fo_branch branch 4 times, most recently from d28311b to b2f0de6 Compare June 2, 2026 16:38

DO NOT PUSH TO MASTER ldap: Initialization for new failover

f4c8254

WARNING Server names are hardcoded!

justin-stephenson force-pushed the ldap_port_new_fo_branch branch 2 times, most recently from af4faaa to ededa84 Compare June 3, 2026 15:12

justin-stephenson marked this pull request as ready for review June 3, 2026 17:12

justin-stephenson added no-backport This should go to target branch only. Waiting for review labels Jun 3, 2026

alexey-tikhonov self-requested a review June 4, 2026 12:38

alexey-tikhonov self-assigned this Jun 4, 2026

alexey-tikhonov requested a review from pbrezina June 4, 2026 12:38

alexey-tikhonov assigned pbrezina Jun 4, 2026

sssd-bot reviewed Jun 9, 2026

View reviewed changes

justin-stephenson added 10 commits June 9, 2026 11:52

Port LDAP provider code to new failover

4cdcdd8

ldap: Remove sdap_online_check.c

9ffa0ac

This will be done differently inside the failover code

ldap: Remove unused / old failover code

b7f5dc3

ldap: Remove sdap_id_op.c/h files

10144ff

DO NOT PUSH TO MASTER ci: Exclude IPA + Samba + AD System tests

a639014

Done as part of phased approach to port providers to new Failover

DO NOT PUSH TO MASTER ipa: Disable building IPA provider

5438c8c

DO NOT PUSH TO MASTER ad: Disable building AD provider

48e2038

DO NOT PUSH TO MASTER ad: Placeholder changes of AD code

1529ab1

Temp changes to allow building, only used in AD provider codepaths. This code will be fixed correctly when the AD provider is ported.

DO NOT PUSH TO MASTER minimal: Remove minimal provider entirely

5df05d7

No longer needed now that SSSD providers are being ported

DO NOT PUSH TO MASTER ipa: Placeholder changes of IPA code

f6e981d

Temp changes to allow building, only used in IPA provider codepaths. This code will be fixed correctly when the IPA provider is ported.

justin-stephenson force-pushed the ldap_port_new_fo_branch branch from ededa84 to f6e981d Compare June 9, 2026 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New Failover: Port LDAP provider to the new failover#8751

New Failover: Port LDAP provider to the new failover#8751
justin-stephenson wants to merge 15 commits into
SSSD:failoverfrom
justin-stephenson:ldap_port_new_fo_branch

justin-stephenson commented May 29, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 29, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sssd-bot left a comment

Uh oh!

justin-stephenson commented Jun 9, 2026

Functional Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

		server = sss_failover_server_new(fctx, "fake_1.ldap.test",
		"ldap://fake_1.ldap.test", 389, 1, 1);

Conversation

justin-stephenson commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sssd-bot left a comment

Choose a reason for hiding this comment

Functional Issues

Nits & Non-functional Issues

Review of Existing Review Comments

Uh oh!

justin-stephenson commented Jun 9, 2026

Functional Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

justin-stephenson commented May 29, 2026 •

edited

Loading