T A P I O
Opinionated eBPF observer for Linux and Kubernetes systems
Tapio watches selected kernel facts and emits structured anomaly evidence.
It is not a generic eBPF framework. It is not a dashboard. It is not a root-cause engine. Tapio keeps the node agent small, fast, and boringly reliable.
Tapio emits evidence, not exhaust.
Tapio observes kernel-level failure facts that are easy to miss from application logs alone:
- a TCP connection was refused;
- a process was killed by OOM;
- a block I/O completed with an error or abnormal latency;
- CPU performance counters show stalls or IPC degradation.
Tapio turns those facts into named kernel.* anomaly events. Downstream systems can store, correlate, explain, alert, or remediate. Tapio does not guess.
Tapio is a Rust workspace with six crates:
| Crate | Role |
|---|---|
| `tapio-agent` | Linux node agent. Loads eBPF, drains ring buffers, classifies facts, emits to sinks. |
| `tapio-controller` | Minimal cluster coordination skeleton for the agent/controller boundary. |
| `tapio-profile` | Pure Evidence Profile validation and compilation into wire config. |
| `tapio-wire` | Shared JSON protocol structs for the agent/controller boundary. |
| `tapio-cli` | Platform-independent CLI and MCP server for local event files. |
| `tapio-common` | Shared ABI structs, occurrence schema, anomaly constants, and sink traits. |
The product split is intentional:
- tapio-profile validates and compiles;
- tapio-controller coordinates;
- tapio-agent observes;
- downstream systems explain.
The node agent must not grow Kubernetes watches, controller HTTP server dependencies, dashboards, policy engines, or reasoning fields.
| Observer | Kernel sources | Emitted anomalies |
|---|---|---|
| Network | `inet_sock_set_state`, `tcp_receive_reset`, `tcp_retransmit_skb` | `kernel.network.connection_refused`, `kernel.network.connection_timeout`, `kernel.network.retransmit_spike`, `kernel.network.rtt_degradation` |
| Container | `sched_process_exit`, `oom/mark_victim` | `kernel.container.oom_kill`, `kernel.container.abnormal_exit` |
| Storage | `block_rq_issue`, `block_rq_complete` | `kernel.storage.io_error`, `kernel.storage.latency_spike` |
| Node PMC | `perf_event` counters | `kernel.node.cpu_stall`, `kernel.node.memory_pressure`, `kernel.node.ipc_degradation` |
Anomaly names are stable product concepts. They are not arbitrary metric labels.

    kernel tracepoints / perf events
      -> eBPF programs
      -> ring buffers
      -> Rust userspace parser
      -> anomaly classification
      -> sinks

Tapio filters close to the source:
- BPF-side filtering: cheap, obvious drops before crossing into userspace.
- Rust-side classification: threshold and state checks where userspace context is safer.
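
As an illustration of the Rust-side half of that split, here is a minimal sketch of a threshold check for `kernel.node.ipc_degradation`. The `PmcSample` and `PmcThresholds` types, the `classify_ipc` function, and any threshold value passed to it are assumptions made for this example; they are not taken from the Tapio source.

```rust
/// Raw counter deltas for one sampling window (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct PmcSample {
    pub instructions: u64,
    pub cycles: u64,
}

/// Thresholds applied in userspace (illustrative, not Tapio's defaults).
#[derive(Debug, Clone, Copy)]
pub struct PmcThresholds {
    pub min_ipc: f64,
}

/// Returns the anomaly name if the sample falls below the IPC threshold.
pub fn classify_ipc(sample: PmcSample, t: PmcThresholds) -> Option<&'static str> {
    // Zero cycles gives no meaningful ratio, so nothing is emitted.
    if sample.cycles == 0 {
        return None;
    }
    let ipc = sample.instructions as f64 / sample.cycles as f64;
    if ipc < t.min_ipc {
        Some("kernel.node.ipc_degradation")
    } else {
        None
    }
}
```

With a hypothetical `min_ipc` of 0.5, a window of 1,000,000 instructions over 4,000,000 cycles (IPC 0.25) is classified as an anomaly, while a window with IPC 2.0 produces nothing.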
The BPF/Rust boundary is fixed by C structs mirrored in tapio-common/src/ebpf.rs. Tests assert size and field offsets so layout drift is caught before runtime.
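
A layout test of that kind can be sketched as follows. `ExampleEvent` is a made-up struct used only to show the technique; the real mirrored structs in tapio-common/src/ebpf.rs have their own fields and sizes. The sketch assumes Rust 1.77 or newer for `offset_of!`.

```rust
use std::mem::{align_of, offset_of, size_of};

/// Hypothetical mirror of a C event struct, for illustration only.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct ExampleEvent {
    pub timestamp_ns: u64,
    pub pid: u32,
    pub tid: u32,
    pub cgroup_id: u64,
    pub signal: u32,
    pub exit_code: u32,
}

/// Returns (size, alignment, field offsets) so a test can pin the layout.
pub fn example_event_layout() -> (usize, usize, [usize; 6]) {
    (
        size_of::<ExampleEvent>(),
        align_of::<ExampleEvent>(),
        [
            offset_of!(ExampleEvent, timestamp_ns),
            offset_of!(ExampleEvent, pid),
            offset_of!(ExampleEvent, tid),
            offset_of!(ExampleEvent, cgroup_id),
            offset_of!(ExampleEvent, signal),
            offset_of!(ExampleEvent, exit_code),
        ],
    )
}
```

A unit test would compare the returned tuple against the values the C side expects, so reordering or resizing a field fails the build rather than corrupting events at runtime.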
Storage has a strict correctness rule: if block_rq_issue and block_rq_complete cannot uniquely correlate an inflight request, Tapio drops that completion and increments tapio_correlation_drops_total{observer="storage",reason="ambiguous_inflight_io"}. Missing evidence can be counted. Wrong evidence is not emitted.
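
The rule can be sketched as a small inflight table. The correlation key (`dev`, `sector`), the `InflightTable` type, and the choice to discard every candidate once a key is ambiguous are assumptions of this sketch, not a description of the agent's actual data structures.

```rust
use std::collections::HashMap;

/// Hypothetical correlation key; the real agent may key on other fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct IoKey {
    pub dev: u32,
    pub sector: u64,
}

#[derive(Debug, Default)]
pub struct InflightTable {
    // Key -> issue timestamps seen for that key and not yet completed.
    issued: HashMap<IoKey, Vec<u64>>,
    /// Stands in for tapio_correlation_drops_total with
    /// observer="storage", reason="ambiguous_inflight_io".
    pub ambiguous_drops: u64,
}

impl InflightTable {
    pub fn on_issue(&mut self, key: IoKey, issue_ns: u64) {
        self.issued.entry(key).or_default().push(issue_ns);
    }

    /// Returns a latency only when exactly one inflight request matches.
    /// An ambiguous key is counted once and all its candidates discarded;
    /// a completion with no recorded issue returns None without counting.
    pub fn on_complete(&mut self, key: IoKey, complete_ns: u64) -> Option<u64> {
        let candidates = self.issued.remove(&key)?;
        if candidates.len() != 1 {
            self.ambiguous_drops += 1;
            return None;
        }
        Some(complete_ns.saturating_sub(candidates[0]))
    }
}
```

The point of the design is that the only path producing a latency is the unambiguous one; every other path yields `None`, and the ambiguous case is visible in a counter.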
Tapio emits occurrence JSON. Fields are factual: timestamp, anomaly type, severity, outcome, error, and kernel data. Reasoning fields are intentionally absent.

    {
      "id": "01JA1B2C3D4E5F6G7H8J9K0L1M",
      "timestamp": "2026-04-03T14:23:01.042Z",
      "source": "tapio",
      "type": "kernel.container.oom_kill",
      "protocol_version": "1.0",
      "severity": "critical",
      "outcome": "failure",
      "error": {
        "code": "OOM_KILL",
        "message": "OOM kill pid=1234 (usage=512MB, limit=0MB)"
      },
      "data": {
        "pid": 1234,
        "tid": 1234,
        "exit_code": 137,
        "signal": 9,
        "memory_usage_bytes": 536870912,
        "memory_limit_bytes": 0,
        "cgroup_id": 8429,
        "config_generation": 1,
        "timestamp_ns": 1743691381042000000
      }
    }

`memory_limit_bytes` can be 0 for OOM kills because the kernel tracepoint does not expose the cgroup limit. Cluster context belongs in tapio-controller or downstream systems, not in the node hot path.
config_generation records which runtime config judged the event, so fleet config convergence is observable in the event stream itself.
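
One way a downstream consumer might use that field is to count events per generation over a recent window. The sketch below assumes the consumer has already extracted `config_generation` values from the occurrence JSON; the function names and the "one generation in the window" convergence test are illustrative choices, not part of Tapio.

```rust
use std::collections::BTreeMap;

/// Counts occurrences per `config_generation`, ordered by generation.
pub fn generation_counts(generations: &[u64]) -> BTreeMap<u64, usize> {
    let mut counts = BTreeMap::new();
    for g in generations {
        *counts.entry(*g).or_insert(0) += 1;
    }
    counts
}

/// True when every event in the window was judged by the same generation.
/// An empty window is treated as not converged, since it proves nothing.
pub fn window_converged(generations: &[u64]) -> bool {
    generation_counts(generations).len() == 1
}
```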

    cargo test --workspace
    cargo clippy --workspace --all-targets -- -D warnings
    scripts/verify-lean.sh

On Linux or inside Lima, also run:

    scripts/smoke-ebpf-network.sh
    scripts/smoke-agent-controller.sh

The eBPF smoke test builds the agent, loads real eBPF programs, triggers a TCP connection to a closed localhost port, and checks that Tapio records a network occurrence with the exact destination port. The agent/controller smoke test builds the agent and controller, then verifies hello, heartbeat, event payload delivery through /v1/status and trace logs, controller outage behavior, recovery, and agent restart.
tapio-agent is Linux-only. It requires kernel 5.8+ with BTF and the capabilities CAP_BPF, CAP_PERFMON, and CAP_NET_ADMIN.

    tapio-agent --sink=stdout
    tapio-agent --sink=file
    tapio-agent --sink=stdout --sink=file
    tapio-agent --controller-endpoint=http://tapio-controller:8080 --sink=controller
    tapio-agent --ebpf-dir /opt/tapio/ebpf

Important flags:

- `--config <path>`: TOML config file, default `/etc/tapio/tapio.toml`.
- `--sink <name>`: `stdout`, `file`, `http`, `controller`, or `otlp` when built with `--features otlp`; repeatable.
- `--ebpf-dir <path>`: directory containing compiled `.o` files.
- `--data-dir <path>`: file sink output directory, default `.tapio/occurrences`.
- `--http-endpoint <url>`: HTTP sink endpoint.
- `--controller-endpoint <url>`: controller base URL for config, hello, heartbeat, and controller event sink traffic.
- `--heartbeat-interval <seconds>`: controller heartbeat interval, minimum 5 seconds.

The agent runs standalone with stdout or file sinks. It does not need Kubernetes, a controller, or a network destination.
The tapio CLI reads local occurrence files. It does not talk to the running agent.

    tapio status
    tapio watch
    tapio watch --network
    tapio watch --json
    tapio recent
    tapio recent --since 5m
    tapio health
    tapio health network

By default, the CLI reads `.tapio/occurrences/*.json`. Override that with `--data-dir` or `TAPIO_DATA_DIR`.
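
The lookup of the occurrence directory can be sketched as a pure function. The precedence used here (flag first, then environment variable, then default) and the handling of empty values are assumptions; the README only states that both overrides exist.

```rust
use std::path::PathBuf;

/// Default taken from the README text above.
pub const DEFAULT_DATA_DIR: &str = ".tapio/occurrences";

/// Resolves the occurrence directory.
/// Assumed precedence: explicit flag, then environment, then default.
/// Empty strings are treated as "not set" in this sketch.
pub fn resolve_data_dir(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.filter(|s| !s.is_empty())
        .or(env.filter(|s| !s.is_empty()))
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(DEFAULT_DATA_DIR))
}
```

A caller would pass the parsed `--data-dir` value and `std::env::var("TAPIO_DATA_DIR").ok().as_deref()`; keeping the environment read outside the function makes the precedence easy to test.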
The CLI also provides tapio mcp, a read-only stdio JSON-RPC MCP server for querying recent anomalies and node health.
| Sink | Output |
|---|---|
| `stdout` | JSON lines to stdout |
| `file` | one occurrence JSON file per event |
| `http` | batched JSON POST to an HTTP endpoint |
| `controller` | bounded batched `tapio-wire/v1` event POSTs to `POST /v1/events` |
| `otlp` | OTLP/HTTP logs export when built with `--features otlp` |
Sink guarantees:
- Local `stdout` and `file` sinks are the default zero-service path.
- Controller sink overflow and send failures are counted and never block ring-buffer consumption.
- HTTP/OTLP export failures are surfaced as sink errors and counters.
- `otlp` rejects `https://` endpoints before opening a TCP connection or sending auth. Use a local collector, proxy, sidecar, or service mesh for TLS termination.
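
The "counted and never block" guarantee for the controller sink can be illustrated with a bounded standard-library channel and `try_send`. `BoundedSink`, its counters, and the use of `std::sync::mpsc` are assumptions of this sketch; the agent's real queue may be built differently.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Producer side of a bounded queue between the drain loop and a sender task.
pub struct BoundedSink {
    tx: SyncSender<String>,
    /// Events dropped because the queue was full.
    pub dropped_overflow: u64,
    /// Events dropped because the consumer side has gone away.
    pub dropped_closed: u64,
}

impl BoundedSink {
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped_overflow: 0, dropped_closed: 0 }, rx)
    }

    /// Never blocks: a full queue or a missing consumer becomes a counted drop.
    pub fn offer(&mut self, event: String) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.dropped_overflow += 1;
                false
            }
            Err(TrySendError::Disconnected(_)) => {
                self.dropped_closed += 1;
                false
            }
        }
    }
}
```

Because `offer` returns immediately in every case, a slow or unreachable controller costs dropped events and a higher counter, not a stalled ring-buffer drain.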
Prometheus metrics are optional and disabled by default. Enable them in TOML with [metrics] enabled = true.
Key metrics:
- `tapio_events_total`: ring-buffer records drained by userspace.
- `tapio_anomalies_total`: emitted anomalies by observer and type.
- `tapio_lost_events_total`: eBPF ring-buffer reserve failures.
- `tapio_malformed_events_total`: malformed/truncated records dropped by userspace.
- `tapio_correlation_drops_total`: intentionally dropped ambiguous evidence.
- `tapio_drain_cap_total`: drain loops that hit the per-tick cap.
- `tapio_sink_writes_total`: sink write attempts by sink and result.
- `tapio_sink_drops_total`: sink events dropped by sink and reason.
- `tapio_controller_send_failures_total`: failed hello, heartbeat, or event sends to the controller.
- `tapio_config_fetch_total`: controller config poll outcomes.
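
To make the relationship between `tapio_events_total` and `tapio_drain_cap_total` concrete, here is a minimal sketch of a capped drain tick. A `VecDeque` stands in for the eBPF ring buffer, and the sketch assumes the cap counts as "hit" only when records are left behind; the agent may define it differently.

```rust
use std::collections::VecDeque;

#[derive(Debug, Default)]
pub struct DrainStats {
    /// Stands in for tapio_events_total.
    pub events_total: u64,
    /// Stands in for tapio_drain_cap_total.
    pub drain_cap_total: u64,
}

/// Drains at most `cap` records in one tick and updates the counters.
pub fn drain_tick(
    ring: &mut VecDeque<Vec<u8>>,
    cap: usize,
    stats: &mut DrainStats,
) -> Vec<Vec<u8>> {
    let n = cap.min(ring.len());
    let out: Vec<Vec<u8>> = ring.drain(..n).collect();
    stats.events_total += out.len() as u64;
    // Assumption: the cap was hit only if work remains for the next tick.
    if !ring.is_empty() {
        stats.drain_cap_total += 1;
    }
    out
}
```

A steadily rising cap counter suggests that userspace is falling behind the kernel producers, which is worth watching alongside `tapio_lost_events_total`.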
There is no build script for production deployment. Compile the four eBPF C programs with clang and place them in the agent --ebpf-dir.

    for prog in network_monitor container_monitor storage_monitor node_pmc_monitor; do
      clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -I ebpf/headers \
        -c ebpf/${prog}.c -o /opt/tapio/ebpf/${prog}.o
    done

Use `-D__TARGET_ARCH_arm64` for arm64 nodes.
scripts/verify-lean.sh checks:
- formatting;
- clippy;
- tests;
- release binary budgets;
- dependency boundaries;
- eBPF object budgets when Linux headers are available;
- eBPF map count and `max_entries` budgets.
Current budget model:
| Binary | Budget |
|---|---|
| `tapio-agent` | target 1.25 MB, hard 1.5 MB |
| `tapio` | hard 900 KB |
| `tapio-controller` | reported, no hard budget yet |
The script writes dependency snapshots under /tmp/tapio-lean. Budget increases should be explicit and justified.
The current agent/controller boundary is agent-initiated HTTP/1.1 plus JSON using tapio-wire/v1.
Controller endpoints:
- `POST /v1/agents/hello`
- `GET /v1/agents/config`
- `POST /v1/agents/heartbeat`
- `POST /v1/events`
The controller is the HTTP server. The agent does not expose an inbound controller API and does not use gRPC on this path.
Observer behavior is runtime-configurable through a versioned agent-to-kernel ABI, without eBPF reloads:

    EvidenceProfile YAML
      -> tapio-profile validates and compiles
      -> CompiledConfig (tapio-wire)
      -> tapio-controller distributes
      -> tapio-agent writes tapio_config map carriers
      -> eBPF programs read primitive flags and thresholds
      -> emitted events carry config_generation
      -> agent heartbeats report the applied config hash

The kernel side is deliberately dumb: a fixed-layout, version-checked struct tapio_config per observer object. All-zeros (the kernel's cold-start map state) is inert — nothing emits until real config lands. On ABI version mismatch, observers stay silent instead of misreading fields.
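
The inertness rule is enforced in the eBPF C programs, but the logic can be shown in Rust for clarity. Every field of `TapioConfig` below, the `TAPIO_CONFIG_ABI_VERSION` value, and the `should_emit` helper are hypothetical; see docs/agent-kernel-config-abi.md for the real layout.

```rust
/// Hypothetical ABI version expected by the observer.
pub const TAPIO_CONFIG_ABI_VERSION: u32 = 1;

/// Illustrative fixed-layout config; not the real `struct tapio_config`.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TapioConfig {
    pub abi_version: u32,
    pub enabled: u32,
    pub generation: u64,
    pub latency_threshold_ns: u64,
}

/// An observer may emit only with a matching version and an enabled flag.
/// The all-zeros value fails both checks, so a cold-start map is inert.
pub fn should_emit(cfg: &TapioConfig) -> bool {
    cfg.abi_version == TAPIO_CONFIG_ABI_VERSION && cfg.enabled != 0
}
```

Choosing a nonzero version number is what makes the zeroed map safe: an unwritten config can never be mistaken for a valid one.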
The operator side is deliberately strict: profiles are versioned YAML documents validated against a closed schema. Unknown fields, unknown observers, and out-of-range values are rejected, not ignored. compile is infallible — every failure happens during validation.
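
The "validate strictly, compile infallibly" split can be sketched with a type that only validation can construct. All names here (`RawObserver`, `ValidatedProfile`, `CompiledEntry`, the observer identifiers, and the millisecond range) are assumptions for illustration and do not reflect the tapio-profile API or schema. Unknown-field rejection happens at parse time and is outside this sketch.

```rust
/// One observer entry as written by an operator (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct RawObserver {
    pub name: String,
    pub latency_threshold_ms: u64,
}

#[derive(Debug, PartialEq)]
pub enum ValidationError {
    UnknownObserver(String),
    OutOfRange { observer: String, value: u64 },
}

/// Only `validate` constructs this, so `compile` never sees unchecked input.
#[derive(Debug)]
pub struct ValidatedProfile {
    observers: Vec<RawObserver>,
}

#[derive(Debug, PartialEq)]
pub struct CompiledEntry {
    pub observer: String,
    pub latency_threshold_ns: u64,
}

// Illustrative closed sets; the real schema defines its own names and ranges.
const KNOWN_OBSERVERS: [&str; 4] = ["network", "container", "storage", "node_pmc"];
const MAX_THRESHOLD_MS: u64 = 60_000;

/// Collects every problem instead of stopping at the first one.
pub fn validate(raw: Vec<RawObserver>) -> Result<ValidatedProfile, Vec<ValidationError>> {
    let mut errors = Vec::new();
    for o in &raw {
        if !KNOWN_OBSERVERS.contains(&o.name.as_str()) {
            errors.push(ValidationError::UnknownObserver(o.name.clone()));
        } else if o.latency_threshold_ms == 0 || o.latency_threshold_ms > MAX_THRESHOLD_MS {
            errors.push(ValidationError::OutOfRange {
                observer: o.name.clone(),
                value: o.latency_threshold_ms,
            });
        }
    }
    if errors.is_empty() {
        Ok(ValidatedProfile { observers: raw })
    } else {
        Err(errors)
    }
}

/// Infallible by construction: the range check above rules out overflow.
pub fn compile(profile: &ValidatedProfile) -> Vec<CompiledEntry> {
    profile
        .observers
        .iter()
        .map(|o| CompiledEntry {
            observer: o.name.clone(),
            latency_threshold_ns: o.latency_threshold_ms * 1_000_000,
        })
        .collect()
}
```

Because `compile` takes a `ValidatedProfile` and returns a plain value rather than a `Result`, the type system documents that there is no second place where a profile can fail.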
The controller convergence signal is the compiled config hash. Agents report the hash they have actually applied in heartbeats; an empty hash means a controller-mode agent is still unconfigured.
See docs/agent-kernel-config-abi.md.

    tapio-agent/       node-local eBPF observer
    tapio-controller/  cluster coordination skeleton
    tapio-profile/     Evidence Profile validation and compilation
    tapio-wire/        agent/controller protocol structs
    tapio-cli/         CLI and MCP server for local occurrence files
    tapio-common/      shared ABI structs, occurrences, events, sinks
    ebpf/              eBPF C programs and headers
    scripts/           lean checks, dependency checks, runtime smoke tests
    docs/              architecture, agent/controller, and config ABI notes

Tapio intentionally does not:
- forward every kernel event;
- infer root cause;
- fill reasoning, explanation, remediation, or suggested-fix fields;
- store/index events long-term;
- replace Prometheus, Grafana, or OpenTelemetry;
- put Kubernetes watches in the node agent;
- expose an inbound controller API from the agent;
- use gRPC for the v0 agent/controller path;
- become a generic observability platform.
Tapio owns node-level kernel evidence. Everything else can consume that evidence later.
Apache 2.0