New Failover: Port LDAP provider to the new failover#8751
New Failover: Port LDAP provider to the new failover#8751justin-stephenson wants to merge 15 commits into
Conversation
Assisted by: Claude code (Sonnet 4.6)
In low level ldap search functions
The data provider handler methods now return a single output argument. Remove 'dp_error/dp_err' and 'error_message' usage across provider code. The getAccountDomain method still needs to return 'domain_name' string. Assisted by: Claude code (Sonnet 4.6)
There was a problem hiding this comment.
Code Review
This pull request refactors SSSD's LDAP and AD providers by replacing the legacy sdap_id_op connection cache and retry logic with a new failover transaction mechanism (sss_failover_ctx). It removes the sdap_id_op files and the sssd-minimal backend, updating all dependent providers to propagate standard error codes directly. The review feedback highlights several critical issues: test LDAP servers are accidentally hardcoded in the production initialization path; returning EINVAL on connection failures in auth_connect_done and sdap_access_filter_connect_done masks actual errors and bypasses offline fallback logic; passing NULL instead of a valid failover context to sdap_ad_resolve_sids_send will cause a crash; and commenting out the assignment to state->search_bases leaves it uninitialized.
| server = sss_failover_server_new(fctx, "fake_1.ldap.test", | ||
| "ldap://fake_1.ldap.test", 389, 1, 1); |
There was a problem hiding this comment.
The LDAP servers ('fake_1.ldap.test', 'fake_2.ldap.test', 'master.ldap.test') are hardcoded in sssm_ldap_init_failover. This appears to be test/mock code that was accidentally left in the production LDAP provider initialization. In production, the failover servers must be dynamically discovered or initialized from the SSSD configuration (e.g., ldap_uri and ldap_backup_uri options in opts), rather than being hardcoded to these test domains.
d28311b to
b2f0de6
Compare
WARNING Server names are hardcoded!
af4faaa to
ededa84
Compare
sssd-bot
left a comment
There was a problem hiding this comment.
Review done using Claude Code / claude-opus-4-6
Functional Issues
Missing return after tevent_req_error in users_get_done
src/providers/ldap/ldap_id.c:531-532 — After calling tevent_req_error(req, ERR_SERVER_FAILURE), execution falls through into the ENOENT/RFC2307 fallback logic and eventually hits a second tevent_req_error(req, ret) at line 576. Calling tevent_req_error on an already-finished request is undefined behavior in tevent (no-op in some versions, assertion failure in others). Add return; after line 532.
Missing return after tevent_req_error in groups_get_done
src/providers/ldap/ldap_id.c:899-902 — Same pattern. After tevent_req_error(req, ERR_SERVER_FAILURE) at line 901, execution falls through past the ENOENT/MPG checks and reaches tevent_req_done(req) at line 940, calling done on an already-errored request.
NULL fctx passed to sdap_ad_resolve_sids_send
src/providers/ldap/sdap_async_initgroups_ad.c:1073-1074 — sdap_ad_resolve_sids_send is called with NULL as the fctx parameter. This NULL is stored in state->fctx (line 290) and eventually passed to groups_get_send → sss_failover_transaction_send, causing a NULL dereference. This code path is triggered by AD tokengroups-based initgroups lookups (guarded by SDAP_SCHEMA_AD at the call site). Should pass state->id_ctx->fctx.
Uninitialized subreq used in sdap_ad_get_domain_local_groups_send
src/providers/ldap/sdap_async_initgroups_ad.c:1224-1237 — The sdap_id_op_connect_send call that initialized subreq is inside a /* ... */ comment block, but tevent_req_set_callback(subreq, ...) at line 1236 still uses the uninitialized pointer. Additionally, state->search_bases assignment is commented out at line 1212, leaving it NULL. This function is reachable via sdap_ad_check_domain_local_groups during initgroups for AD subdomain users (guard at sdap_async_initgroups.c:3161-3163). Both the uninitialized subreq and NULL search_bases will crash.
Pre-transaction access to fctx->active_server->uri
src/providers/ldap/ldap_auth.c:788 — ldap_is_ldapi_url(state->ctx->fctx->active_server->uri) is called before sss_failover_transaction_ex_send at line 793. On the first authentication attempt or after a server rotation where active_server hasn't been set yet, this dereferences NULL. The LDAPI check should be moved inside auth_connect_done where the connection is established, or the code should guard against active_server == NULL.
sdap_access_filter_done loses ERR_INVALID_FILTER handling
src/providers/ldap/sdap_access.c:1004-1005 — The old code detected specific LDAP errors from sdap_get_generic_recv (e.g., malformed filter) and mapped them to ERR_ACCESS_DENIED, which is the safe default (deny access on bad filter). The new code maps all errors to ERR_SERVER_FAILURE, which triggers retry across all servers. A malformed access filter will fail identically on every server, eventually putting the client offline instead of denying access — a security-relevant behavior change.
Silent error fallthrough in sdap_autofs_enumerate_done
src/providers/ldap/sdap_autofs.c:136-158 — After removing the retry logic, errors from sdap_autofs_setautomntent_recv that are neither EOK nor ENOENT fall through to tevent_req_done(req) at line 158, treating failures as success. The if (ret == ENOENT) block handles ENOENT specifically, but any other error (e.g., ERR_SERVER_FAILURE, ENOMEM) silently succeeds.
Inconsistent ERR_OFFLINE handling in enumeration
src/providers/ldap/sdap_async_enum.c:147 and sdap_async_enum.c:258 handle ERR_OFFLINE by calling tevent_req_done (graceful degradation) in users_done and services_done respectively. But sdap_dom_enum_ex_groups_done at line 206 does not check for ERR_OFFLINE — if the server goes offline between user and group enumeration, groups will fail the entire enum cycle while users succeeded gracefully.
Enumeration creates separate failover transactions per entity type
src/providers/ldap/sdap_async_enum.c:95,158,213 — Three independent sss_failover_transaction_send calls are made for users, groups, and services within a single enumeration cycle. While the default reuse_connection=true means the same connection is reused when the server is stable, if a server fails between phases, different entity types could be enumerated from different servers, producing inconsistent cache state. The old code shared a single connection for the entire enumeration.
Nits & Non-functional Issues
no_mpg_user_fallback in generic failover layer
src/providers/failover/failover.h:105 and src/providers/failover/ldap/failover_ldap.h:33 — This boolean controls LDAP-specific MPG (magic private groups) user fallback behavior. It appears in both the generic sss_failover_ctx and the LDAP-specific sss_failover_ldap_connection. The generic failover layer should not carry provider-specific flags. Consider moving this to the LDAP layer or passing it through sdap_options.
Blanket ERR_SERVER_FAILURE mapping without diagnostic logging
Multiple files (ldap_id.c:532,900, ldap_id_netgroup.c:162, ldap_id_services.c:202, sdap_hostid.c, sdap_iphost.c, sdap_ipnetwork.c, sdap_async_autofs.c) map all non-EOK/non-ENOENT errors to ERR_SERVER_FAILURE without logging the original error code. This makes production debugging difficult since ENOMEM, EINVAL, permission errors, and actual server failures become indistinguishable in logs. Consider adding DEBUG(SSSDBG_OP_FAILURE, "... failed: %d: %s\n", ret, sss_strerror(ret)) before the mapping.
Commented-out code left in sdap_async_groups.c
src/providers/ldap/sdap_async_groups.c:1730,1747-1749,1781-1820 — Multiple commented-out declarations, a FIXME comment, and a commented-out function (sdap_get_groups_ldap_connect_done). These are AD GC-to-LDAP connection fallback stubs. If AD porting is deferred to a later PR, these should either be removed with a TODO comment referencing the follow-up work, or left as proper #if 0 blocks with explanation. C-style // and /* */ comments for disabled code are harder to track.
Commented-out code in sdap_async_initgroups_ad.c
src/providers/ldap/sdap_async_initgroups_ad.c:1212,1224-1235 — Same issue as above: the search_bases assignment and sdap_id_op connect block are commented out rather than properly stubbed or removed. Unlike the groups case, this code is reachable (see functional issue above).
Removal of GC connection preference for AD initgroups
src/providers/ldap/sdap_async_initgroups.c and sdap_async_groups.c — The old get_ldap_conn_from_sdom_pvt logic provided LDAP-over-Global-Catalog for AD group lookups when available. This is commented out with a FIXME: AD use case note. When AD is ported in the follow-up PR, this optimization should be restored — without it, AD initgroups in multi-domain forests queries individual domain controllers instead of the GC, significantly increasing latency.
Intermediate commits do not compile independently
Commit 1fcaec049 ("Port LDAP provider code to new failover") references sdap_service in sdap_id_ctx_new (removed from the function signature in the same commit), causing a build failure. The fix is in commit a5dae92c0 ("Remove unused / old failover code"). For bisectability, the fix should be folded into the earlier commit.
Unrelated fix mixed into failover port
src/providers/ldap/ldap_id_services.c:220-221 — The strtouint16 error check removes the errno test, which is an unrelated bug fix. It should be a separate commit for clean history.
Review of Existing Review Comments
gemini-code-assist: Hardcoded LDAP servers in ldap_init.c:150-151
This is in commit f4c825416 ("DO NOT PUSH TO MASTER ldap: Initialization for new failover") which is explicitly a temporary scaffolding commit not intended for merge. The hardcoded servers are expected and not a concern for this review.
gemini-code-assist: EINVAL on NULL connection masks errors (ldap_auth.c:838-846)
Partially valid. The EINVAL on NULL connection is the standard pattern used across all handlers in this PR (every connected_recv NULL check returns EINVAL). The specific concern about auth offline detection is addressed differently in the new model — the failover transaction layer handles offline/retry, not the individual handlers. However, the pattern does lose the original error code. The sss_failover_transaction_connected_recv returning NULL should be treated as a fatal internal error (hence EINVAL is reasonable), but a DEBUG log of the underlying cause would help.
gemini-code-assist: Offline fallback bypassed in sdap_access.c:956-964
This is about sdap_access_ppolicy_connect_done. The concern about losing sdap_access_decide_offline is valid in the old architecture where handlers had to decide offline behavior themselves. In the new architecture, the failover transaction layer handles the offline case before the handler is called — if all servers fail, the transaction returns ERR_NO_MORE_SERVERS rather than calling the connected callback at all. So the offline fallback is handled at a different layer, not lost. However, the ERR_INVALID_FILTER concern I raised in functional issues above is a separate, valid problem.
gemini-code-assist: NULL fctx in sdap_ad_resolve_sids_send (sdap_async_initgroups_ad.c:1073-1074)
Valid. Confirmed as a crash bug — see functional issues above.
gemini-code-assist: Commented-out search_bases (sdap_async_initgroups_ad.c:1212)
Valid. Combined with the commented-out connect block below it, this creates an uninitialized subreq usage — see functional issues above.
CodeQL: Commented-out code in ldap_id.c and sdap_async_groups.c
Valid housekeeping issues. The commented-out code should be removed or converted to proper stubs with FIXME/TODO annotations referencing the AD porting PR.
This will be done differently inside the failover code
Done as part of phased approach to port providers to new Failover
test_logging__offline_errors_are_written_to_logs_and_syslog: asserted offline logs messages are only part of old failover test_autofs__propagate_offline_status_*: asserted offline log messages are only part of the old failover test_logging__dns_resolution_issue_in_logs: asserted log messages are part of the old failover code test_failover: Needs to be updated for new failover implementation test_ad__user_authentication_when_provider_is_set_to_ldap_with_gss_spnego: 'id_provider = ldap' is hardcoded to go to 'master.ldap.test' but AD server should be used here test_ipa__subids_configured[id_provider=ldap]: 'id_provider = ldap' is hardcoded to go to 'master.ldap.test' but IPA server should be used here test_multithreaded_pac_client: requires re-using cached connection, not yet implemented in new failover test_ldap_krb5__keytab_selects_correct_principal_with_multiple_realms: Hybrid configuration issues to be fixed after hardcoding is not needed.
Temp changes to allow building, only used in AD provider codepaths. This code will be fixed correctly when the AD provider is ported.
No longer needed now that SSSD providers are being ported
Temp changes to allow building, only used in IPA provider codepaths. This code will be fixed correctly when the IPA provider is ported.
ededa84 to
f6e981d
Compare
Fixed.
Fixed.
Correct, this will be fixed later with AD provider porting changes.
Correct, this will be fixed later with AD provider porting changes.
This was checking
Fixed, added handling of
Fixed, added check for other failures.
Fixed, added check for if ret == ERR_OFFLINE
Old code uses conn_cache, this still needs to be implemented in the new failover |
Uh oh!
There was an error while loading. Please reload this page.