Recover disk-space metrics when a cached FileStore's directory is removed during region migration by CRZbulabula · Pull Request #17931 · apache/iotdb

CRZbulabula · 2026-06-12T08:34:27Z

Description

This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"), addressing the second problem reported in the same scenario: a cluster that contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlot) being migrated during scale operations.

While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:

ERROR o.a.i.m.m.s.SystemMetrics:366 - Failed to statistic the size of /data (/dev/sdb1), because
java.nio.file.NoSuchFileException: /data/.../data/datanode/system
	at ...
	at org.apache.iotdb.metrics.metricsets.system.SystemMetrics.getSystemDiskAvailableSpace(SystemMetrics.java:364)
	at org.apache.iotdb.metrics.core.type.IoTDBAutoGauge.getValue(IoTDBAutoGauge.java:43)
	at ...DataNodeInternalRPCServiceImpl.sampleDiskLoad(...)
	at ...getDataNodeHeartBeat(...)

Root cause

SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.

When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the stale FileStore was logged at ERROR on every sampling, never recovered, and flooded the log.

Fix

SystemMetrics now also stores the configured disk dirs.
When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
A failure that persists even after re-resolving (practically impossible, since the lookup ultimately falls back to an existing directory) is logged at WARN instead of ERROR, so it can no longer flood the log.
fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.

Behavior

No behavioral change on the happy path: when all directories exist, the reported total/free/available disk space is identical to before.
When a backing directory disappears, the metric self-heals on the next sample (re-binding to a still-existing ancestor on the same device) rather than returning 0 and spamming ERROR logs.

PingCode: V2-974

This PR has:

been self-reviewed.
- concurrent read
- concurrent write
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths.

Key changed/added classes (or packages if there are too many classes) in this PR

org.apache.iotdb.metrics.metricsets.system.SystemMetrics
org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)

…emoved A cached FileStore pins the exact path it was resolved from. When that path is deleted while IoTDB is running (e.g. an empty data region directory removed during region migration), every disk-space query against the stale FileStore throws NoSuchFileException, which was logged at ERROR on every heartbeat and flooded the DataNode log. Store the configured disk dirs and, when a space query fails, re-resolve the FileStores once via FileStoreUtils#getFileStore (which walks up to an existing ancestor on the same device) so the metric recovers on the next sampling. Remaining failures are logged at WARN instead of ERROR.

sonarqubecloud · 2026-06-12T08:41:45Z

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

codecov · 2026-06-12T09:19:01Z

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 41.08%. Comparing base (abb9ef9) to head (2e3563e).

Files with missing lines	Patch %	Lines
...iotdb/metrics/metricsets/system/SystemMetrics.java	95.83%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17931      +/-   ##
============================================
+ Coverage     41.07%   41.08%   +0.01%     
- Complexity      318      333      +15     
============================================
  Files          5257     5257              
  Lines        365010   365010              
  Branches      47180    47180              
============================================
+ Hits         149918   149970      +52     
+ Misses       215092   215040      -52

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Caideyipi · 2026-06-12T10:09:09Z

I found two issues that should be addressed before merging:

SystemMetrics.java:65-66,337: volatile does not make the mutable HashSet/List thread-safe. collectDiskSpace() iterates the current fileStores snapshot, but removeSystemDiskInfo() still mutates that same HashSet in place via fileStores.clear(). If an auto-gauge scrape or heartbeat is iterating while metrics are unbound, this can throw ConcurrentModificationException. This also matches the Sonar java:S3077 findings on the new volatile collection fields. A safer copy-on-write shape would be to publish immutable snapshots, for example initialize with Collections.emptySet()/Collections.emptyList(), assign new unmodifiable copies in setDiskDirs() and refreshFileStores(), and replace fileStores.clear() with this.fileStores = Collections.emptySet().
SystemMetrics.java:374: Sonar reports if (refreshed) as an always-false condition, and the current SonarCloud Code Analysis check is failing because of it. The runtime intent looks like two attempts: read once, refresh on failure, then read once more and warn for any remaining failures. I think it would be clearer and more analysis-friendly to express that directly, rather than using while (true) plus the refreshed flag.

I verified the new test locally on the PR head with:

mvn -nsu -pl iotdb-core/metrics/interface -Dtest=SystemMetricsTest test

The test passed, including checkstyle and spotless in that Maven run. The remaining blocker I see is the SonarCloud failure above.

CRZbulabula mentioned this pull request Jun 12, 2026

Recover disk-space metrics when a cached FileStore's directory is removed during region migration #17930

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recover disk-space metrics when a cached FileStore's directory is removed during region migration#17931

Recover disk-space metrics when a cached FileStore's directory is removed during region migration#17931
CRZbulabula wants to merge 1 commit into
masterfrom
fix_v2_974_system_metrics_filestore

CRZbulabula commented Jun 12, 2026

Uh oh!

sonarqubecloud Bot commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026

Uh oh!

Caideyipi commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CRZbulabula commented Jun 12, 2026

Description

Root cause

Fix

Behavior

Key changed/added classes (or packages if there are too many classes) in this PR

Uh oh!

sonarqubecloud Bot commented Jun 12, 2026

Quality Gate failed

Uh oh!

codecov Bot commented Jun 12, 2026

Codecov Report

Uh oh!

Caideyipi commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants