Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,7 @@ See cephtrace in action troubleshooting real performance issues.
| **DWARF Files** | Managing debug information for tracing | [Guide](doc/dwarf-json-files.md) |

### 📊 Analysis & Tools
- **[Visualizing Trace Output](doc/visualizing-output.md)** - One-screen terminal summary (latency histogram, where the time goes), or a self-contained interactive HTML report, from any capture
- **[Analyzing Radostrace Logs](doc/analyze-radostrace.md)** - Extract insights from client traces
- **[Analyzing Osdtrace Logs](doc/analyze-osdtrace.md)** - Deep-dive into OSD performance data

Expand Down
4 changes: 4 additions & 0 deletions doc/analyze-osdtrace.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# Osdtrace Analysis Tool

> For a quick visual overview (where latency is spent, distribution shape,
> per-OSD outliers) before diving into the numbers, see
> [Visualizing trace output](visualizing-output.md).

The `analyze_osdtrace_output.py` script provides automated statistical analysis of osdtrace log files with fio-style latency distribution output.

## Overview
Expand Down
4 changes: 4 additions & 0 deletions doc/analyze-radostrace.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# Radostrace Analysis Tool

> For a quick visual overview (latency histogram, read/write split, slow-OSD
> hint) before diving into the numbers, see
> [Visualizing trace output](visualizing-output.md).

The `analyze_radostrace_output.py` script provides automated analysis of radostrace log files to identify problematic OSDs involved in high-latency operations.

## Overview
Expand Down
Binary file added doc/images/report-osdtrace.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/images/report-radostrace.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
136 changes: 136 additions & 0 deletions doc/visualizing-output.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# Visualizing trace output

`radostrace` and `osdtrace` stream one line per IO - thousands of rows in a
typical capture. `tools/cephtrace_report.py` turns a capture into a
**one-screen visual summary** that answers the three questions you usually have
at a glance: *is it slow, what's slow, and where is the time going.*

It has no flags to remember - point it at a log (or pipe one in) and it
auto-detects which tool produced it.

```bash
# osdtrace
sudo ./osdtrace --id 0 -t 30 > osd.log
./tools/cephtrace_report.py osd.log

# radostrace (also reads the CSV that `radostrace -o` writes)
sudo ./radostrace -p <pid> -t 30 | ./tools/cephtrace_report.py
```

## osdtrace example

Real output from a ~116k-op capture (a single OSD under a 4 KiB write+read
workload):

```
osdtrace summary - 115829 OSD ops
op_r 86926 op_w 28903

where the time goes (share of total latency, avg 4.1ms/op)
messenger ▏ 0%
queue ███▋ 9%
osd ▉ 2%
bluestore ████████████████▌ 41%
+kv_commit ██████████████▊ 37% (rocksdb commit)

op latency distribution p50 4.0ms p95 9.5ms p99 11.3ms max 30.1ms
<250us ████████████ 11896
<500us ██████████▌ 10420
<1ms ███████████████▊ 15594
<2ms █████████████▋ 13440
<4ms ██████▊ 6647
<8ms ████████████████████████████████████████ 39612
<16ms ██████████████████▍ 18164
<32ms 56

slowest ops
30.1ms op_w osd.0 pg 2.12 4.0K
28.0ms op_w osd.0 pg 2.e 4.0K

-> deeper: ./tools/analyze_osdtrace_output.py <log> [-i | -f kv_commit | -o <osd>]
```

The **where the time goes** block is the headline: it turns osdtrace's dense
per-stage numbers into an instant "your latency is RocksDB commit, not network
or queueing." Because the stages overlap (`op_lat` is not their sum), the
shares are of total latency and need not add up to 100% - the point is which
stage dominates.

When a capture spans **more than one OSD**, an extra **per-OSD p95 latency**
block is drawn (highest p95 first), so an outlier OSD stands out immediately -
the single-OSD capture above omits it.

## radostrace example

```
radostrace summary - 353 client ops
reads 152 ██████████████████████████████▎ 43%
writes 201 ████████████████████████████████████████ 56%

latency distribution p50 3.5ms p95 15.6ms p99 19.4ms max 19.7ms
<1ms ███████████████▋ 36
<8ms ████████████████████████████████████████ 92
<16ms █████████████████████████████▋ 68

OSDs in the acting set of the slowest 17 ops
osd.0 ████████████████████████ 17

-> deeper: ./tools/analyze_radostrace_output.py <log> <osd_tree> [threshold_us]
```

The **OSDs in the acting set of the slowest ops** block is a quick culprit
pointer; for the full iterative ranking mapped to physical hosts, use the
analyzer it points to.

## Interactive HTML report

For a richer, shareable view, add `--html <file>` to write a **self-contained**
HTML report instead of terminal text:

```bash
./tools/cephtrace_report.py --html report.html osd.log
# wrote report.html (osdtrace, 115829 ops)
```

Open it in any browser, or attach it to a ticket - it is a single file with no
external dependencies (no CDN, works offline). It shows everything the terminal
view does plus things a terminal can't:

![osdtrace HTML report](images/report-osdtrace.png)

- **Latency over the capture** - p95 per slice in arrival order, so a workload
that degrades over time (or runs in bursts) is visible as a shape.
- **Interactive**: hover any bar for exact counts/latencies; click a column
header to sort the per-OSD / per-pool and slowest-ops tables; click a row in
the per-group table to **filter the histogram to that OSD or pool** (and
"show all" to reset).

The radostrace report is the same layout with a read/write split, per-pool
breakdown, and the object name on each slowest op:

![radostrace HTML report](images/report-radostrace.png)

The report embeds only **precomputed aggregates** (histograms, per-slice
percentiles, per-group stats, the 50 slowest ops), so even a multi-hundred-
thousand-op capture produces a ~20 KB file.

## Relationship to the analyzers

This tool is the visual *glance*. It deliberately reuses the existing parsers
(`analyze_osdtrace_output.parse_line` and `analyze_radostrace_output`'s format
detection) rather than maintaining a second one. When you need the numbers,
not the shape, use the detailed analyzers:

- [Analyzing osdtrace logs](analyze-osdtrace.md) - fio-style percentile tables,
per-stage contribution (`-i`), per-OSD / per-field filtering
- [Analyzing radostrace logs](analyze-radostrace.md) - iterative culprit-OSD
ranking mapped to hosts via `ceph osd tree`

## Notes

- Pure Python 3 standard library - no dependencies to install.
- The bars use Unicode block characters; any modern terminal renders them.
Section headings are bold only when writing to a terminal, so piping or
redirecting to a file stays clean.
- Reads either the default space-separated output or the CSV produced by
`radostrace -o`.
169 changes: 169 additions & 0 deletions tests/test_cephtrace_report.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
"""
Smoke tests for cephtrace_report.py.

These run against the small committed sample logs in tools/sample-logs/ so
they are offline and fast. The module is imported directly (tox sets
PYTHONPATH=tools); cephtrace_report has a hyphen-free name precisely so it is
importable here.
"""
import json
import os
import re
import runpy
import sys

import pytest

import cephtrace_report # pylint: disable=E0401

SAMPLES = os.path.join(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
"tools", "sample-logs",
)
OSD_LOG = os.path.join(SAMPLES, "osdtrace_data.log")
RADOS_LOG = os.path.join(SAMPLES, "radostrace_data.log")
RADOS_CSV = os.path.join(SAMPLES, "radostrace_data.csv")


def _run(path, capsys):
"""Run the tool's main() against a file path and return stdout."""
argv = sys.argv
sys.argv = ["cephtrace_report.py", path]
try:
rc = cephtrace_report.main()
finally:
sys.argv = argv
return rc, capsys.readouterr().out


def test_osdtrace_report(capsys):
"""osdtrace sample renders every headline section and the op types."""
rc, out = _run(OSD_LOG, capsys)
assert rc == 0
assert "osdtrace summary - 11 OSD ops" in out
for token in ("op_r", "op_w", "subop_w"):
assert token in out
assert "where the time goes" in out
assert "op latency distribution" in out
# bluestore dominates this write-heavy sample; kv_commit must surface
assert "kv_commit" in out
# two distinct OSDs in the sample -> per-OSD section is drawn
assert "per-OSD p95 latency" in out
assert "slowest ops" in out
assert "analyze_osdtrace_output.py" in out


def test_radostrace_space_report(capsys):
"""Space-formatted radostrace sample renders the client-side sections."""
rc, out = _run(RADOS_LOG, capsys)
assert rc == 0
assert "radostrace summary" in out
assert "reads" in out and "writes" in out
assert "latency distribution" in out
assert "slowest ops" in out
assert "analyze_radostrace_output.py" in out


def test_radostrace_csv_report(capsys):
"""CSV-formatted (radostrace -o) sample is detected and rendered."""
rc, out = _run(RADOS_CSV, capsys)
assert rc == 0
assert "radostrace summary" in out
assert "latency distribution" in out


def _embedded_json(html):
"""Pull the embedded payload back out of a generated report."""
match = re.search(r"const DATA = (\{.*?\});", html, re.S)
return json.loads(match.group(1))


def test_html_osdtrace(tmp_path, capsys):
"""--html on osdtrace writes a self-contained report with valid payload."""
out = tmp_path / "osd.html"
argv = sys.argv
sys.argv = ["cephtrace_report.py", "--html", str(out), OSD_LOG]
try:
rc = cephtrace_report.main()
finally:
sys.argv = argv
assert rc == 0
assert f"wrote {out}" in capsys.readouterr().out
html = out.read_text(encoding="utf-8")
# self-contained: placeholder substituted, no external script/style refs
assert "__CEPHTRACE_DATA__" not in html
assert "<script src" not in html and "http" not in html.split("foot")[0]
payload = _embedded_json(html)
assert payload["kind"] == "osdtrace"
assert payload["summary"]["ops"] == 11
# two OSDs in the sample -> per-group histograms for cross-filtering
assert payload["groupLabel"] == "OSD"
assert sorted(payload["histGroups"]) == ["0", "2"]
assert len(payload["histOverall"]) == len(payload["histLabels"])
assert payload["stages"] and payload["kvCommit"] > 0
assert payload["timeline"] and payload["slowest"]


def test_html_radostrace(tmp_path):
"""--html on radostrace (CSV input) writes a valid pool-keyed report."""
out = tmp_path / "rados.html"
argv = sys.argv
sys.argv = ["cephtrace_report.py", "--html", str(out), RADOS_CSV]
try:
rc = cephtrace_report.main()
finally:
sys.argv = argv
assert rc == 0
payload = _embedded_json(out.read_text(encoding="utf-8"))
assert payload["kind"] == "radostrace"
assert payload["groupLabel"] == "Pool"
assert "reads" in payload["summary"] and "writes" in payload["summary"]
assert payload["slowest"][0]["lat"] >= payload["slowest"][-1]["lat"]


def test_html_requires_filename(capsys):
"""--html with no following filename fails with rc=2."""
argv = sys.argv
sys.argv = ["cephtrace_report.py", "--html"]
try:
rc = cephtrace_report.main()
finally:
sys.argv = argv
assert rc == 2
assert "requires an output filename" in capsys.readouterr().err


def test_no_rows_returns_error(tmp_path, capsys):
"""A file with no recognizable trace rows fails cleanly (rc=1)."""
junk = tmp_path / "junk.log"
junk.write_text("hello\nworld\n", encoding="utf-8")
rc, _ = _run(str(junk), capsys)
assert rc == 1


def test_help_runs(capsys):
"""--help prints the module docstring and exits 0."""
argv = sys.argv
sys.argv = ["cephtrace_report.py", "--help"]
try:
rc = cephtrace_report.main()
finally:
sys.argv = argv
assert rc == 0
assert "one-screen visual summary" in capsys.readouterr().out


def test_runs_as_script():
"""The committed osdtrace sample runs end-to-end as a script (no import
side-effects, exit code 0)."""
argv = sys.argv
sys.argv = ["cephtrace_report.py", OSD_LOG]
try:
with pytest.raises(SystemExit) as exc:
runpy.run_path(
os.path.join(SAMPLES, "..", "cephtrace_report.py"),
run_name="__main__",
)
assert exc.value.code == 0
finally:
sys.argv = argv
Loading
Loading