Parallelize per-dag auth checks in KeycloakAuthManager#69107
Parallelize per-dag auth checks in KeycloakAuthManager#69107Andrushika wants to merge 2 commits into
Conversation
SameerMesiah97
left a comment
There was a problem hiding this comment.
Looks fine to me. I have left 2 comments. Also, I believe you should expand test coverage (either by adding new unit tests or modifying exist ones) to include the new concurrency logic. In particular, it would be good to verify that ThreadPoolExecutor is created with the expected max_workers value (covering both len(dag_ids) and the configured pool size), while ensuring the existing authorization behavior is preserved in both the single- and multi-worker cases.
| details_kwargs: dict[str, Any] = {"id": dag_id} | ||
| if team_name is not None: | ||
| details_kwargs["team_name"] = team_name | ||
| return dag_id, self.is_authorized_dag( |
There was a problem hiding this comment.
Have you looked at whether is_authorized_dag() or the underlying HTTP client becomes the limiting factor here? It would be good to understand whether increasing the thread count beyond the HTTP connection pool size actually improves throughput.
| return set() | ||
|
|
||
| max_workers = min( | ||
| len(dag_ids), conf.getint(CONF_SECTION_NAME, CONF_REQUESTS_POOL_SIZE_KEY, fallback=10) |
There was a problem hiding this comment.
Using the number of DAGs as the upper bound for the number of threads makes sense. But could you clarify the rationale behind using CONF_REQUESTS_POOL_SIZE_KEY? Correct me if I am wrong but I believe you are using it because it is the limit for the size of the HTTP connection pool for Keycloak requests. It would not hurt to add a comment here to explain why you are using these 2 variables to determine max_workers.
/dagsscreen is very slow to load with multiple teams #69041As the issue reported, users who use Keycloak as an auth backend are facing over 10 seconds of UI loading speed.
I took a look and discovered that
KeycloakAuthManager.filter_authorized_dag_ids()wraps the result with cache/single-flight, but on cache miss it falls back to the base auth manager implementation. The base implementation checks each Dag ID individually, so Keycloak sends one authorization HTTP request per Dag ID during that filter call.To address this problem, I use
ThreadPoolExecutorto run the per-dag checks concurrently.I tried to simulate and wrote a script for a local benchmark. Here's the result:
(50ms latency per request, measured with
perf_kit.repeat_and_time.timing)The test script is as following:
Details
However, I found it hard to reproduce the production environment as described in #69041... I would really appreciate it if someone could help check if this really solved the problem.
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code Opus 4.8 following the guidelines, reviewed by @Andrushika.
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.