Skip to content

osdtrace: unsigned underflow yields garbage ~2^64 latency (kv_commit / bluestore_lat / phantom peers) #147

Description

@taodd

Summary

osdtrace can emit a garbage near-2⁶⁴ latency value for a BlueStore sub-stage (and for phantom peer entries) when the underlying timestamps are missing or arrive out of order. The latency fields are computed as an unsigned (end_stamp - start_stamp) / 1000, with no guard for end < start or start == 0, so an underflow wraps to ~1.8e16.

This reaches real user output, and it intermittently fails the cephadm functional test's sub-latency sanity check (bluestore_lat must not exceed op_lat).

Evidence

Captured from CI (cephadm + radosgw, squid, ubuntu-22.04) — the single offending row out of 857 op_w/subop_w rows in that run (≈0.1%):

osd 1 pg 7.0 op_w size 65 client 14228 tid 1778 throttle_lat 0 recv_lat 6 dispatch_lat 24 queue_lat 101 osd_lat 28370 peers [(2, 1746), (-1, 18446743718958803)] bluestore_lat 18446744073695038 (prepare 25 aio_wait 0 (aio_size 0) seq_wait 9 kv_commit 18446744073695003) op_lat 29397

The verifier caught it as:

ERROR: Sub-latency bluestore_lat=18446744073695038 µs > op_lat=29397 µs (op=op_w osd=1 pool=7 tid=1778)

Two symptoms of the same defect in that one row:

  • kv_commit 18446744073695003bluestore_lat then inherits it (18446744073695038).
  • a phantom peer (-1, 18446743718958803) — an unused replica slot (osd id -1) with an underflowed latency.

18446744073695003 ≈ 2⁶⁴/1000 − 14548, i.e. an unsigned uint64 ns underflow (kv_committed_stamp was ~14.5 ms before kv_submit_stamp), then divided by 1000.

Root cause

The latency math in src/osdtrace.cc subtracts two bpf_ktime_get_boot_ns() stamps without checking ordering or that both were set:

// src/osdtrace.cc:662-665
op.bs_pg_seq_lat   = (val->kv_submit_stamp    - val->aio_done_stamp)        / 1000;
op.bs_kv_commit_lat = (val->kv_committed_stamp - val->kv_submit_stamp)       / 1000;  // <-- underflows
...
op.bs_lat          = (val->kv_committed_stamp - val->queue_transaction_stamp)/ 1000;

The stamps are written from _txc_state_proc per BlueStore txc state (src/osdtrace.bpf.c:577-589): state 2 → kv_submit_stamp, state 4 → kv_committed_stamp. For some transactions these states fire out of order, are skipped (e.g. deferred/simple writes), or a reused map entry carries a stamp from a previous op — leaving kv_submit_stamp == 0 or kv_committed_stamp < kv_submit_stamp. With unsigned subtraction the result wraps to ≈2⁶⁴.

The peer latencies have the same shape (a -1 peer slot that was never populated).

Impact

  • Users: a nonsensical ~1.8e16 µs (≈584,000 years) latency occasionally appears in osdtrace output, breaking any downstream parsing/aggregation (e.g. the analyzers, percentile math).
  • CI: intermittent failures of functional-test-cephadm-rgw.sh Step "verify osdtrace output" (the bluestore_lat > op_lat check), load/timing dependent — it failed on 1 of 8 cephadm matrix cells in a recent run while the rest passed.

Suggested fix

Guard each (end - start)/1000 so an unset/out-of-order pair yields 0 (or is dropped) instead of underflowing — e.g. a small helper:

static inline __u64 lat_us(__u64 end, __u64 start) {
    return (start == 0 || end < start) ? 0 : (end - start) / 1000;
}

and apply it to the BlueStore sub-stage latencies (bs_kv_commit_lat, bs_pg_seq_lat, bs_lat, etc.) and the peer latencies, plus skip peer slots whose osd id is < 0.

How to reproduce

Runs intermittently under load; surfaced in CI: functional-test-cephadm-rgw.sh squid on ubuntu-22.04. Full trace log attached to that run as the trace-logs-ubuntu-22.04-squid artifact.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions