Recover disk-space metrics when a cached FileStore's directory is removed during region migration #17931
CRZbulabula wants to merge 1 commit into
…emoved

A cached FileStore pins the exact path it was resolved from. When that path is deleted while IoTDB is running (e.g. an empty data region directory removed during region migration), every disk-space query against the stale FileStore throws NoSuchFileException, which was logged at ERROR on every heartbeat and flooded the DataNode log.

Store the configured disk dirs and, when a space query fails, re-resolve the FileStores once via FileStoreUtils#getFileStore (which walks up to an existing ancestor on the same device) so the metric recovers on the next sampling. Remaining failures are logged at WARN instead of ERROR.
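The ancestor walk mentioned in the commit message can be sketched roughly as follows. This is a minimal illustration and not the actual FileStoreUtils#getFileStore implementation: the class AncestorFileStoreResolver and its methods nearestExisting and resolve are hypothetical names, and the sketch simply assumes that the nearest existing ancestor lives on the same device as the removed directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not the real FileStoreUtils. */
final class AncestorFileStoreResolver {

  private AncestorFileStoreResolver() {}

  /** Returns the closest existing path at or above dir, or null if none exists. */
  static Path nearestExisting(Path dir) {
    Path current = dir.toAbsolutePath().normalize();
    while (current != null && !Files.exists(current)) {
      current = current.getParent(); // becomes null once we step past the root
    }
    return current;
  }

  /** Resolves a FileStore for dir even if dir itself has been deleted. */
  static FileStore resolve(Path dir) {
    Path existing = nearestExisting(dir);
    if (existing == null) {
      throw new UncheckedIOException(new NoSuchFileException(dir.toString()));
    }
    try {
      // Assumption: the ancestor is on the same device as the removed directory.
      return Files.getFileStore(existing);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Resolving against an ancestor means the returned FileStore is pinned to a path that still exists, so later space queries no longer depend on the deleted leaf directory. If the removed directory was itself a mount point, the ancestor may belong to a different device, which this sketch does not detect.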
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17931      +/-   ##
============================================
+ Coverage     41.07%   41.08%   +0.01%
- Complexity      318      333      +15
============================================
  Files          5257     5257
  Lines        365010   365010
  Branches      47180    47180
============================================
+ Hits         149918   149970      +52
+ Misses       215092   215040      -52
I found two issues that should be addressed before merging:
I verified the new test locally on the PR head with:
The test passed, including checkstyle and spotless in that Maven run. The remaining blocker I see is the SonarCloud failure above.




Description
This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"), addressing the second problem reported in the same scenario: a cluster that contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlot) being migrated during scale operations.

While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:

Root cause
SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux, every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.

When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the failure of the stale FileStore was logged at ERROR on every sampling, never recovered, and flooded the log.

Fix
- SystemMetrics now also stores the configured disk dirs.
- When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
- Remaining failures are logged at WARN instead of ERROR, so they can no longer flood the log.
- fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.

Behavior
… 0 and spamming ERROR logs.

PingCode: V2-974
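The recovery scheme from the Fix section (volatile fields, copy-on-write re-resolution after a failed query, WARN-level reporting) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in SystemMetrics: RecoveringDiskSampler, SpaceSource, and sampleTotalSpace are hypothetical names, SpaceSource stands in for java.nio.file.FileStore so the failure can be simulated, the resolver function stands in for FileStoreUtils#getFileStore, and System.err stands in for the real logger.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch; names do not mirror the real SystemMetrics class. */
final class RecoveringDiskSampler {

  /** Minimal stand-in for the space queries of java.nio.file.FileStore. */
  interface SpaceSource {
    long totalSpace() throws IOException;
  }

  private final Function<String, SpaceSource> resolver;

  // Both lists are replaced wholesale and never mutated in place, so a
  // concurrent reader always iterates over one complete snapshot.
  private volatile List<String> diskDirs = Collections.emptyList();
  private volatile List<SpaceSource> sources = Collections.emptyList();

  RecoveringDiskSampler(Function<String, SpaceSource> resolver) {
    this.resolver = resolver;
  }

  void setDiskDirs(List<String> dirs) {
    List<String> copy = Collections.unmodifiableList(new ArrayList<>(dirs));
    diskDirs = copy; // remember the dirs so they can be re-resolved later
    sources = resolveAll(copy);
  }

  /** Sums total space; after any failure, publishes freshly resolved sources. */
  long sampleTotalSpace() {
    long sum = 0;
    boolean failed = false;
    for (SpaceSource source : sources) { // single volatile read, one snapshot
      try {
        sum += source.totalSpace();
      } catch (IOException e) {
        failed = true;
        // Stand-in for a WARN-level log statement.
        System.err.println("WARN: disk space query failed: " + e);
      }
    }
    if (failed) {
      // Copy-on-write: build a new list and swap the reference, so the
      // next sampling sees sources that are no longer stale.
      sources = resolveAll(diskDirs);
    }
    return sum;
  }

  private List<SpaceSource> resolveAll(List<String> dirs) {
    List<SpaceSource> fresh = new ArrayList<>();
    for (String dir : dirs) {
      fresh.add(resolver.apply(dir));
    }
    return Collections.unmodifiableList(fresh);
  }
}
```

In this sketch the failing sample still reports a reduced value, and the re-resolution happens at most once per sampling, which matches the description that the metric recovers on the next sampling. Two threads that fail at the same moment may both re-resolve; because each simply publishes a complete new list, that duplicate work is harmless.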
This PR has:
Key changed/added classes (or packages if there are too many classes) in this PR
org.apache.iotdb.metrics.metricsets.system.SystemMetrics
org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)