Skip to content

Recover disk-space metrics when a cached FileStore's directory is removed during region migration#17931

Open
CRZbulabula wants to merge 1 commit into
masterfrom
fix_v2_974_system_metrics_filestore
Open

Recover disk-space metrics when a cached FileStore's directory is removed during region migration#17931
CRZbulabula wants to merge 1 commit into
masterfrom
fix_v2_974_system_metrics_filestore

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Description

This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"), addressing the second problem reported in the same scenario: a cluster that contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlot) being migrated during scale operations.

While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:

ERROR o.a.i.m.m.s.SystemMetrics:366 - Failed to statistic the size of /data (/dev/sdb1), because
java.nio.file.NoSuchFileException: /data/.../data/datanode/system
	at ...
	at org.apache.iotdb.metrics.metricsets.system.SystemMetrics.getSystemDiskAvailableSpace(SystemMetrics.java:364)
	at org.apache.iotdb.metrics.core.type.IoTDBAutoGauge.getValue(IoTDBAutoGauge.java:43)
	at ...DataNodeInternalRPCServiceImpl.sampleDiskLoad(...)
	at ...getDataNodeHeartBeat(...)

Root cause

SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.

When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the stale FileStore was logged at ERROR on every sampling, never recovered, and flooded the log.

Fix

  • SystemMetrics now also stores the configured disk dirs.
  • When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
  • A failure that persists even after re-resolving (practically impossible, since the lookup ultimately falls back to an existing directory) is logged at WARN instead of ERROR, so it can no longer flood the log.
  • fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.

Behavior

  • No behavioral change on the happy path: when all directories exist, the reported total/free/available disk space is identical to before.
  • When a backing directory disappears, the metric self-heals on the next sample (re-binding to a still-existing ancestor on the same device) rather than returning 0 and spamming ERROR logs.

PingCode: V2-974


This PR has:

  • been self-reviewed.
    • concurrent read
    • concurrent write
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.

Key changed/added classes (or packages if there are too many classes) in this PR
  • org.apache.iotdb.metrics.metricsets.system.SystemMetrics
  • org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)

…emoved

A cached FileStore pins the exact path it was resolved from. When that
path is deleted while IoTDB is running (e.g. an empty data region
directory removed during region migration), every disk-space query
against the stale FileStore throws NoSuchFileException, which was logged
at ERROR on every heartbeat and flooded the DataNode log.

Store the configured disk dirs and, when a space query fails, re-resolve
the FileStores once via FileStoreUtils#getFileStore (which walks up to an
existing ancestor on the same device) so the metric recovers on the next
sampling. Remaining failures are logged at WARN instead of ERROR.
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 41.08%. Comparing base (abb9ef9) to head (2e3563e).

Files with missing lines Patch % Lines
...iotdb/metrics/metricsets/system/SystemMetrics.java 95.83% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17931      +/-   ##
============================================
+ Coverage     41.07%   41.08%   +0.01%     
- Complexity      318      333      +15     
============================================
  Files          5257     5257              
  Lines        365010   365010              
  Branches      47180    47180              
============================================
+ Hits         149918   149970      +52     
+ Misses       215092   215040      -52     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Caideyipi

Copy link
Copy Markdown
Collaborator

I found two issues that should be addressed before merging:

  1. SystemMetrics.java:65-66,337: volatile does not make the mutable HashSet/List thread-safe. collectDiskSpace() iterates the current fileStores snapshot, but removeSystemDiskInfo() still mutates that same HashSet in place via fileStores.clear(). If an auto-gauge scrape or heartbeat is iterating while metrics are unbound, this can throw ConcurrentModificationException. This also matches the Sonar java:S3077 findings on the new volatile collection fields. A safer copy-on-write shape would be to publish immutable snapshots, for example initialize with Collections.emptySet()/Collections.emptyList(), assign new unmodifiable copies in setDiskDirs() and refreshFileStores(), and replace fileStores.clear() with this.fileStores = Collections.emptySet().

  2. SystemMetrics.java:374: Sonar reports if (refreshed) as an always-false condition, and the current SonarCloud Code Analysis check is failing because of it. The runtime intent looks like two attempts: read once, refresh on failure, then read once more and warn for any remaining failures. I think it would be clearer and more analysis-friendly to express that directly, rather than using while (true) plus the refreshed flag.

I verified the new test locally on the PR head with:

mvn -nsu -pl iotdb-core/metrics/interface -Dtest=SystemMetricsTest test

The test passed, including checkstyle and spotless in that Maven run. The remaining blocker I see is the SonarCloud failure above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants