Summary
osdtrace can emit a garbage near-2⁶⁴ latency value for a BlueStore sub-stage (and for phantom peer entries) when the underlying timestamps are missing or arrive out of order. The latency fields are computed as an unsigned (end_stamp - start_stamp) / 1000, with no guard for end < start or start == 0, so an underflow wraps to ~1.8e16.
This reaches real user output, and it intermittently fails the cephadm functional test's sub-latency sanity check (bluestore_lat must not exceed op_lat).
Evidence
Captured from CI (cephadm + radosgw, squid, ubuntu-22.04) — the single offending row out of 857 op_w/subop_w rows in that run (≈0.1%):
osd 1 pg 7.0 op_w size 65 client 14228 tid 1778 throttle_lat 0 recv_lat 6 dispatch_lat 24 queue_lat 101 osd_lat 28370 peers [(2, 1746), (-1, 18446743718958803)] bluestore_lat 18446744073695038 (prepare 25 aio_wait 0 (aio_size 0) seq_wait 9 kv_commit 18446744073695003) op_lat 29397
The verifier caught it as:
ERROR: Sub-latency bluestore_lat=18446744073695038 µs > op_lat=29397 µs (op=op_w osd=1 pool=7 tid=1778)
Two symptoms of the same defect in that one row:
kv_commit 18446744073695003 — bluestore_lat then inherits it (18446744073695038).
- a phantom peer
(-1, 18446743718958803) — an unused replica slot (osd id -1) with an underflowed latency.
18446744073695003 ≈ 2⁶⁴/1000 − 14548, i.e. an unsigned uint64 ns underflow (kv_committed_stamp was ~14.5 ms before kv_submit_stamp), then divided by 1000.
Root cause
The latency math in src/osdtrace.cc subtracts two bpf_ktime_get_boot_ns() stamps without checking ordering or that both were set:
// src/osdtrace.cc:662-665
op.bs_pg_seq_lat = (val->kv_submit_stamp - val->aio_done_stamp) / 1000;
op.bs_kv_commit_lat = (val->kv_committed_stamp - val->kv_submit_stamp) / 1000; // <-- underflows
...
op.bs_lat = (val->kv_committed_stamp - val->queue_transaction_stamp)/ 1000;
The stamps are written from _txc_state_proc per BlueStore txc state (src/osdtrace.bpf.c:577-589): state 2 → kv_submit_stamp, state 4 → kv_committed_stamp. For some transactions these states fire out of order, are skipped (e.g. deferred/simple writes), or a reused map entry carries a stamp from a previous op — leaving kv_submit_stamp == 0 or kv_committed_stamp < kv_submit_stamp. With unsigned subtraction the result wraps to ≈2⁶⁴.
The peer latencies have the same shape (a -1 peer slot that was never populated).
Impact
- Users: a nonsensical
~1.8e16 µs (≈584,000 years) latency occasionally appears in osdtrace output, breaking any downstream parsing/aggregation (e.g. the analyzers, percentile math).
- CI: intermittent failures of
functional-test-cephadm-rgw.sh Step "verify osdtrace output" (the bluestore_lat > op_lat check), load/timing dependent — it failed on 1 of 8 cephadm matrix cells in a recent run while the rest passed.
Suggested fix
Guard each (end - start)/1000 so an unset/out-of-order pair yields 0 (or is dropped) instead of underflowing — e.g. a small helper:
static inline __u64 lat_us(__u64 end, __u64 start) {
return (start == 0 || end < start) ? 0 : (end - start) / 1000;
}
and apply it to the BlueStore sub-stage latencies (bs_kv_commit_lat, bs_pg_seq_lat, bs_lat, etc.) and the peer latencies, plus skip peer slots whose osd id is < 0.
How to reproduce
Runs intermittently under load; surfaced in CI: functional-test-cephadm-rgw.sh squid on ubuntu-22.04. Full trace log attached to that run as the trace-logs-ubuntu-22.04-squid artifact.
Summary
osdtracecan emit a garbage near-2⁶⁴ latency value for a BlueStore sub-stage (and for phantom peer entries) when the underlying timestamps are missing or arrive out of order. The latency fields are computed as an unsigned(end_stamp - start_stamp) / 1000, with no guard forend < startorstart == 0, so an underflow wraps to~1.8e16.This reaches real user output, and it intermittently fails the cephadm functional test's sub-latency sanity check (
bluestore_latmust not exceedop_lat).Evidence
Captured from CI (cephadm + radosgw, squid, ubuntu-22.04) — the single offending row out of 857 op_w/subop_w rows in that run (≈0.1%):
The verifier caught it as:
Two symptoms of the same defect in that one row:
kv_commit 18446744073695003—bluestore_latthen inherits it (18446744073695038).(-1, 18446743718958803)— an unused replica slot (osd id-1) with an underflowed latency.18446744073695003 ≈ 2⁶⁴/1000 − 14548, i.e. an unsigneduint64ns underflow (kv_committed_stampwas ~14.5 ms beforekv_submit_stamp), then divided by 1000.Root cause
The latency math in
src/osdtrace.ccsubtracts twobpf_ktime_get_boot_ns()stamps without checking ordering or that both were set:The stamps are written from
_txc_state_procper BlueStore txc state (src/osdtrace.bpf.c:577-589): state 2 →kv_submit_stamp, state 4 →kv_committed_stamp. For some transactions these states fire out of order, are skipped (e.g. deferred/simple writes), or a reused map entry carries a stamp from a previous op — leavingkv_submit_stamp == 0orkv_committed_stamp < kv_submit_stamp. With unsigned subtraction the result wraps to ≈2⁶⁴.The peer latencies have the same shape (a
-1peer slot that was never populated).Impact
~1.8e16 µs(≈584,000 years) latency occasionally appears in osdtrace output, breaking any downstream parsing/aggregation (e.g. the analyzers, percentile math).functional-test-cephadm-rgw.shStep "verify osdtrace output" (thebluestore_lat > op_latcheck), load/timing dependent — it failed on 1 of 8 cephadm matrix cells in a recent run while the rest passed.Suggested fix
Guard each
(end - start)/1000so an unset/out-of-order pair yields0(or is dropped) instead of underflowing — e.g. a small helper:and apply it to the BlueStore sub-stage latencies (
bs_kv_commit_lat,bs_pg_seq_lat,bs_lat, etc.) and the peer latencies, plus skip peer slots whose osd id is< 0.How to reproduce
Runs intermittently under load; surfaced in CI:
functional-test-cephadm-rgw.sh squidon ubuntu-22.04. Full trace log attached to that run as thetrace-logs-ubuntu-22.04-squidartifact.