Kafka Connect: make sink cleanup robust #16843
// Normal close: if leader partition is lost, stop coordinator.
if (hasLeaderPartition(closedPartitions)) {
boolean stopCoordinator = false;
This local boolean stopCoordinator shares its name with the stopCoordinator() method called a few lines below, so the branch reads as if (stopCoordinator) { ... stopCoordinator(); }. Rename the flag to a predicate such as shouldStopCoordinator (or leaderPartitionLost) to separate the condition from the action it triggers.
Thank you for pointing this out. I renamed the local flag to shouldStopCoordinator in c852c17 so the predicate is clearly separated from the stopCoordinator() action.
}
}
private RuntimeException appendFailure(RuntimeException failure, RuntimeException next) {
This appendFailure (set-as-primary if none yet, else addSuppressed) is the same merge logic hand-inlined in the catch blocks of Channel.stop(), Worker.stop(), and CoordinatorThread.terminate() (two branches). All four classes live in org.apache.iceberg.connect.channel; consider promoting one package-private static helper and calling it from each so the suppressed-exception handling stays in one place.
Thank you, that makes sense. I kept the follow-up small and moved the merge logic into a package-private static helper on the existing Channel class in c852c17, then reused it from CommitterImpl, Channel, Worker, and CoordinatorThread. That keeps the suppression handling in one place without adding another helper class.
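For readers following this thread, the merge logic under discussion (the first failure becomes the primary exception, later failures are attached through addSuppressed) could look roughly like the sketch below. The holder class name FailureMerging is hypothetical and used only for illustration; per the reply above, the actual helper lives on the existing Channel class, and its exact signature and guards there may differ.

```java
// Hypothetical holder class for illustration only; the PR places the helper on Channel.
final class FailureMerging {
  private FailureMerging() {}

  // Keeps the first failure as the primary exception and attaches any later
  // failure to it as a suppressed exception.
  static RuntimeException appendFailure(RuntimeException failure, RuntimeException next) {
    if (failure == null) {
      // No failure recorded yet: the new one becomes primary.
      return next;
    }
    if (next != null && next != failure) {
      // Throwable.addSuppressed rejects null and self-suppression, so guard both.
      failure.addSuppressed(next);
    }
    return failure;
  }
}
```

Each catch block can then reduce to a single assignment of the form failure = appendFailure(failure, e), with one rethrow after the last cleanup step.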
Summary
This makes Iceberg Kafka Connect sink cleanup more robust when shutdown happens after an internal Kafka client or coordinator operation has already failed.
The change keeps the original failure visible to Kafka Connect, but it no longer lets that first failure skip the remaining cleanup steps. In particular:
CommitterImpl.close(...) fails while checking leader ownership
AdminClient from Channel, avoiding one extra internal Kafka client per worker/coordinator channel
Worker channel resources and SinkWriter independently
Report
We observed a connector deletion path where the REST delete succeeded, but task shutdown then failed in CommitterImpl.close(...).
After that, the connector was gone from the REST API, but internal Kafka clients using the connector-derived client id continued logging authentication/metadata errors. The practical failure mode is similar to the "zombie coordinator" class of bugs: a shutdown path hits an exception and leaves internal resources alive longer than intended.
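The general pattern for avoiding this failure mode can be sketched as follows: run every cleanup step even when an earlier one throws, then rethrow the first failure with the later ones attached as suppressed. This is a minimal illustration, not the connector's code; the class IndependentCleanup and the method runAll are hypothetical names, and the real cleanup steps in CommitterImpl, Worker, and Channel are structured differently.

```java
import java.util.List;

// Hypothetical sketch; names do not mirror the connector's real classes.
final class IndependentCleanup {
  private IndependentCleanup() {}

  // Runs every cleanup step, even if an earlier one throws. The first failure
  // is rethrown at the end, with later failures attached as suppressed.
  static void runAll(List<Runnable> steps) {
    RuntimeException failure = null;
    for (Runnable step : steps) {
      try {
        step.run();
      } catch (RuntimeException e) {
        if (failure == null) {
          failure = e;
        } else if (e != failure) {
          failure.addSuppressed(e);
        }
      }
    }
    if (failure != null) {
      throw failure;
    }
  }
}
```

With this shape, a failing leader-ownership check would still surface to the caller as the primary exception, but it would no longer prevent the remaining resources from being closed.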
Related Issues And PRs
This is related to, but not identical to, the existing Kafka Connect cleanup reports:
CommitterImpl.close(...).
finally block. This PR includes the same cleanup direction, and also continues cleanup across worker/channel close failures and the hasLeaderPartition(...) failure path.
I did not find an existing Apache Iceberg issue that specifically reports the connector-delete + GroupAuthorizationException / Cannot retrieve members for consumer group shutdown path.
Tests