kars-sre demo + agent — Slices 0-4: autonomous incident triage + typed apply-fix + Telegram #397
Draft
pallakatos wants to merge 43 commits into
PR #1 in the kars-sre/demo-and-agent series: Slice 0 of the SRE
proposal. The demo can now be walked end-to-end by hand before any SRE
plugin code lands. Each subsequent slice (S1 read-only tools, S2 K8s
diag toolset, S3 typed apply-fix, S4 proactive watcher) replaces one
hand-walked step with an autonomous one.
Scenario:
'platform team's GitOps refactor lands a tight ResourceQuota across
every workload namespace; the quota's requests.memory ceiling (50Mi)
is lower than what the research sandbox actually requests. The pod
stays Running until anything triggers a reschedule, then it goes
Pending forever because the quota blocks pod admission.'
Why infrastructure, not image-tag: image tags don't change on a running
pod for random reasons. ResourceQuota mis-configuration is a real
GitOps-collision incident that operators hit regularly.
Files:
- agent-a-research.yaml: KarsSandbox 'research' (Hermes runtime,
mirrors exec-brief-hermes-single shape, simplified to two CRs so the
demo focuses on the runtime)
- platform-hardening-quota.yaml: the bad ResourceQuota the break
script applies; deliberately NOT labeled kars.azure.com/managed-by
so the SRE's DeleteResourceQuota typed action is permitted
- break.sh: applies the quota, force-deletes the running pod,
confirms the FailedCreate event surfaces
- reset.sh: deletes the quota and waits for Running 2/2 (manual
recovery path)
- runbook.md: presenter script for walking Act II by hand until S2
ships; once S2 ships, the runbook becomes the expected-behaviour
spec for the autonomous agent walk
Proposal update:
- §7.7.1: adds DeleteResourceQuota as a typed action
(namespace-scope, requires the ResourceQuota NOT carry the
kars.azure.com/managed-by=controller label so kars-owned governance
quotas stay protected and only operator-applied platform quotas are
deletable)
- §7.7.1: removes the PatchSandboxRuntimeImage carve-out from the
previous draft; the demo no longer requires writes to
kars.azure.com/* CRs, so the no-governance-mutation rule stays
absolute
Validation:
python3 -c yaml.safe_load_all on both YAMLs → parses OK
bash -n break.sh / reset.sh → syntax OK
ci/check-copyright-headers.sh → all 499 OK
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
…in containment
Slice 1 of the kars-sre demo+agent series. The agent is now installable
on any kars cluster via 'kars sre install' and reachable via 'kars sre
talk'. It reads kars CRs cluster-wide, walks the diagnostic checklist,
matches errors against the OOTB-blocker corpus, and proposes typed
fixes (apply is Slice 3).
What ships:
deploy/helm/kars/templates/sre.yaml — Gated on .Values.sre.enabled.
Creates 5 K8s objects when enabled:
- InferencePolicy 'sre-inference' (kars-system)
- KarsSandbox 'sre' (kars-system) with runtime: Hermes,
extraEnv KARS_SRE_ENABLED=true, networkPolicy.defaultDeny=true
+ allowlist contains ONLY kubernetes.default.svc (NOT
agentmesh — §7.8.6 network layer)
- ToolPolicy 'sre-tools' (kars-sre) gating the sre_* surface
- ClusterRole 'kars-sre-reader' — read on kars CRs + apiextensions
+ core workloads (RBAC per proposal §7.2.1 minus what S2/S3 add)
- ClusterRoleBinding pinned to ServiceAccount kars-sre/sandbox
(explicit subject — no group binding, no wildcard, §7.8.3)
deploy/helm/kars/values.yaml — new 'sre:' block (enabled=false default,
model=gpt-4.1, provider=azure-openai, tokenBudget=32000,
extraAllowedEndpoints commented out for Slice 4 channel wiring).
cli/src/commands/sre.ts — 'kars sre {install,uninstall,status,talk}'
subcommands. 'install' wraps 'helm upgrade --reuse-values --set
sre.enabled=true' then waits for the sandbox to reach Available.
cli/src/cli.ts — wires sreCommand() into the Operations command group.
runtimes/hermes/.../plugin/sre.py — 5 tools, all read-only:
- sre_describe_state structured snapshot of all 11 kars-owned CRs
- sre_logs apiserver-side pod log tail (cap 500 lines)
- sre_diagnose kars-CR health checklist + summary string
- sre_explain_error OOTB-blocker corpus matcher (6 known patterns
including ImagePullBackOff, exceeded quota,
OOMKilled, CrashLoopBackOff, FailedScheduling,
ContainerCreating)
- sre_propose_fix typed-action proposal envelope; Slice 1
codifies DeleteResourceQuota (the demo Act II
target) — rest of typed-action set lands in S3
runtimes/hermes/.../plugin/sre_kube.py — minimal in-cluster apiserver
client built on httpx (no new dep added to the shared Hermes image).
Reads projected SA token + ca.crt + namespace from the standard paths;
detects token rotation by content compare on each request.
runtimes/hermes/.../plugin/__init__.py — adds the KARS_SRE_ENABLED
gate. When set:
- kars_spawn family is SKIPPED at registration (§7.8.5 — SRE agent
cannot spawn sub-agents)
- kars_mesh_* family is SKIPPED at registration (§7.8.6 — SRE agent
is not on the mesh; combined with the NetworkPolicy block above
this is two of three §7.8.6 enforcement layers — the third
'separate image' layer is the §7.8.1 follow-up slice)
- kars_discover is skipped (no peers to discover)
- eager-mesh-init thread is skipped (would log noisy connection
failures otherwise)
- sre.register(ctx) runs AFTER everything else
runtimes/hermes/tests/test_sre.py — 15 tests covering:
- env-gate truthy/falsy mapping
- all 5 tools register with the correct schema
- explain_error matches against the corpus, handles no-match,
handles empty input
- propose_fix codifies DeleteResourceQuota for ResourceQuota target;
returns rationale-only envelope for other kinds
- KARS_CR_KINDS lists all 11 proposal §3.5 CRDs
- describe_state walks every kind + surfaces per-kind errors
without raising
docs/sre.md — operator-facing readme: install, talk, tool surface,
containment summary, what S1 cannot do yet, links to proposal +
Act II runbook.
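The token-rotation check described for sre_kube.py above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the TokenSource class and its method names are hypothetical, and only the standard projected ServiceAccount mount path is taken as given.

```python
from pathlib import Path

# Standard projected ServiceAccount mount point inside a pod.
SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")


class TokenSource:
    """Hypothetical helper: re-reads the token file on every request."""

    def __init__(self, sa_dir: Path = SA_DIR) -> None:
        self._token_path = sa_dir / "token"
        self._cached = ""

    def current(self) -> tuple[str, bool]:
        """Return (token, rotated).

        rotated is True only when a previously cached token has been
        replaced by different file content (a content compare, not an
        mtime or expiry check).
        """
        token = self._token_path.read_text(encoding="utf-8").strip()
        rotated = bool(self._cached) and token != self._cached
        self._cached = token
        return token, rotated

    def auth_header(self) -> dict[str, str]:
        """Build the bearer header from whatever is on disk right now."""
        token, _ = self.current()
        return {"Authorization": f"Bearer {token}"}
```

Reading the file on each request keeps the client stateless with respect to kubelet's rotation schedule; the file is small, so the read itself is cheap.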
Validation:
pytest tests/test_sre.py → 15/15 pass
pytest tests/test_governance.py → unchanged, pass
pytest tests/test_package_shape.py → unchanged, pass
npm run typecheck (cli) → no errors
npm run build (cli) → builds
helm lint --set sre.enabled=true → 0 fails
helm template ... --show-only sre.yaml → renders 5 objects clean
helm template ... (sre.enabled=false) → sre.yaml correctly omitted
ci/check-copyright-headers.sh → all 501 files OK
What this slice does NOT ship (per §7.1 ladder):
- K8s diag toolset (sre_image_probe, sre_endpoints_inspect,
sre_what_changed, sre_top, sre_describe_resource) — Slice 2
- Fix execution (sre_apply_fix + TokenRequest + admission VAPs) — S3
- Proactive watcher + Telegram/Slack notifications — Slice 4
- Separate kars/sre-sandbox image (§7.8.1 packaging containment) —
deferred; Slice 1 ships SRE in the shared Hermes image behind
the KARS_SRE_ENABLED env gate as a tactical bridge. The env gate
is the interim containment: tools aren't registered in any other
pod, so a request for sre_* in a standard sandbox hits 'tool not
found' at the runtime.
Next: Slice 2 (K8s diag toolset), then Slice 3 (typed apply-fix + AGT
approval flow + admission VAPs).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dpoints, image_probe, top
Slice 2 of the kars-sre series. Extends the read-only diagnostic
surface from kars-CR-centric (Slice 1) to arbitrary Kubernetes
workloads — everything the agent needs to diagnose the Act II
ResourceQuota incident end-to-end.
What ships (5 new tools, all read-only):
sre_describe_resource — structured-describe for any K8s kind. For
workload kinds (Deployment / StatefulSet /
DaemonSet) walks the OWNER GRAPH:
workload → ReplicaSet → matching Pods →
events on every level. One tool call returns
the whole incident picture.
sre_what_changed — events of failure-relevant reasons in last
N minutes across BOTH core/v1 and
events.k8s.io/v1. Surfaces FailedCreate,
BackOff, OOMKilling, Evicted, etc. — the
incident-framing tool.
sre_endpoints_inspect — Service → selector → matching pods →
EndpointSlice readiness. Synthesises a
finding the agent can quote (no pods match,
pods NotReady, targetPort mismatch, OK).
sre_image_probe — given an image, enumerate Pod images
cluster-wide and suggest the closest in-use
tag by Levenshtein edit-distance. Doesn't
reach out to the registry (per-registry auth
plumbing is Slice 4+); instead answers the
question that's actually most useful:
'what's the closest in-use tag on THIS
cluster right now?'
sre_top — metrics.k8s.io wrapper for CPU+memory per
pod or per node. Gracefully degrades to
{unavailable: 'metrics-server not installed'}
if the metrics API isn't registered
(proposal §7.5 Q4).
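The closest-in-use-tag suggestion that sre_image_probe makes can be illustrated with a plain Levenshtein distance over full image references. The function names and the example images in the usage note are assumptions for illustration; the real tool may normalise registries or compare tags only.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def closest_in_use(wanted: str, in_use: list[str]) -> Optional[str]:
    """Return the in-use image reference nearest to `wanted`, or None.

    Candidates are de-duplicated and sorted first so that ties resolve
    deterministically.
    """
    if not in_use:
        return None
    return min(sorted(set(in_use)), key=lambda ref: levenshtein(wanted, ref))
```

For a hypothetical wanted image example.test/app:v1.2.4 and in-use images example.test/app:v1.2.3 and example.test/web:v9, closest_in_use picks the app:v1.2.3 reference because it is a single substitution away.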
Also extends sre_propose_fix to codify two more typed actions from
proposal §7.7.1: PatchDeploymentImage and ScaleDeployment (in
addition to Slice 1's DeleteResourceQuota). Slice 3 will widen the
typed-action set further AND add the execution path.
RBAC widened in deploy/helm/kars/templates/sre.yaml:
+ discovery.k8s.io/endpointslices (for sre_endpoints_inspect)
+ metrics.k8s.io/pods, nodes (for sre_top)
+ core/nodes, endpoints, resourcequotas (cluster-wide read)
ToolPolicy extended to allow the 5 new tool names.
Containment unchanged: still gated by KARS_SRE_ENABLED env on the
SRE sandbox pod only; standard Hermes sandboxes don't see the env,
don't load the tools, can't call them.
Validation:
pytest tests/test_sre.py tests/test_sre_k8s.py → 31/31 pass
ci/check-copyright-headers.sh → all 502 OK
helm lint --set sre.enabled=true → 0 fails
python -m py_compile (sre.py, sre_k8s.py) → OK
Next: Slice 3 (typed apply-fix + admission VAPs + TokenRequest path
+ kars sre approve CLI).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`kars sre install` was passing the relative path 'deploy/helm/kars'
to helm, which helm parses as a chart repo name when the user's CWD
is anywhere other than the kars repo root. Result:
Error: repo deploy not found
Fixed by resolving the kars repo root the same way `kars up` does:
first walk up from the CLI file's own location (works for npm link),
then fall back to walking up from CWD looking for deploy/helm/kars.
Also: replaced the broken `.option('--wait', ..., true)` with the
commander-idiomatic `.option('--no-wait', ...)` so the wait flag
actually defaults to on.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A plain --reuse-values carries the stored release values forward
verbatim. If the stored values are older than the chart on disk
(e.g. operator ran 'kars dev' before runtimes.hermes was added to
values.yaml), the template fails with:
nil pointer evaluating interface {}.image
at controller-deployment.yaml line 89.
--reset-then-reuse-values (helm 3.14+ / helm 4) re-loads the chart's
values.yaml defaults first, then overlays the previously --set values
on top. So new chart fields get their defaults populated, while user
overrides for older fields are preserved.
Applied to both install and uninstall sub-actions.
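A simplified model of the difference between the two flags, as this commit describes them, is sketched below. It is not helm's implementation: deep_merge and the sample values are hypothetical, and the model ignores any --set overrides passed on the current command line.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return base with overlay applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
chart_defaults = {
    "controller": {"replicas": 1},
    "runtimes": {"hermes": {"image": "hermes:dev"}},
}
stored_release = {"controller": {"replicas": 3}}  # predates runtimes.hermes

# --reuse-values: stored values carried forward as-is, so runtimes is absent.
reuse_values = deepcopy(stored_release)

# --reset-then-reuse-values: chart defaults first, stored overrides on top.
reset_then_reuse = deep_merge(chart_defaults, stored_release)
```

In this model the template would hit the missing runtimes.hermes key under reuse_values, while reset_then_reuse has both the new default and the operator's replicas override.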
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ToolPolicy 'sre-tools' lives in namespace kars-sre by design
(kars's cross-namespace ToolPolicy refs are deliberately not
supported — principles.md §3). But the controller-created
kars-sre namespace only exists AFTER the KarsSandbox 'sre' is
reconciled, which is AFTER helm tries to apply the ToolPolicy.
Error: UPGRADE FAILED: failed to create resource:
namespaces "kars-sre" not found
Fix: add the Namespace as a chart-managed resource at the top of
sre.yaml. The controller's namespace-reconcile path uses server-side
apply, so it will harmlessly co-own this namespace (adding its
own labels + annotations) when it reaches reconciler/mod.rs step 1.
No conflict.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Helm 4 uses server-side apply by default. When prior
`kubectl set image` / `kars push --apply` runs took ownership of
fields that the chart now also wants to manage, the SSA call fails
with:
conflict with "kubectl-set" using apps/v1:
.spec.template.spec.containers[name="controller"].image
--force-conflicts (helm 4) instructs server-side apply to take
ownership on conflict. Matches operator intent: the helm-managed
chart is the source of truth, and chart-driven upgrades should
override transient field-manager pollution from ad-hoc
`kubectl set` calls.
Confirmed via `helm upgrade --help`:
--force-conflicts if set server-side apply will force changes against conflicts
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…m), not kars-sre
Controller rejected the KarsSandbox sre with:
Degraded: ToolPolicyNotFound — 'sre-tools' not found in 'kars-system'
(cross-namespace refs not supported)
I had ToolPolicy in 'kars-sre' under the misunderstanding that it
should be co-located with the runtime pod's namespace. The actual kars
convention is the opposite: governance refs are namespace-local to the
KarsSandbox CR's OWN namespace (kars-system in our case), per
principles.md §3 cross-namespace-refs-deliberately-unsupported rule.
The runtime namespace kars-sre is for the pod + RBAC, not for
governance.
Confirmed against the existing exec-brief-hermes-single scenario, which
co-locates KarsSandbox + ToolPolicy in kars-system.
Net: still safe wrt §7.7.1 protected-resource denylist (kars-system is
denylisted, so SRE agent can't delete this ToolPolicy even though it's
not labeled kars.azure.com/managed-by=controller).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related bugs uncovered during live test:
1) The controller silently strips user-supplied extraEnv keys with
reserved prefixes (mod.rs:1583 — AGT_, AZURE_, FOUNDRY_AGENT_,
IMDS_, KARS_). KARS_SRE_ENABLED was being dropped, so the plugin
never registered.
Fix: rename to SRE_ENABLED across:
- runtimes/hermes/.../plugin/sre.py (is_enabled)
- runtimes/hermes/.../plugin/sre_k8s.py (module docstring)
- runtimes/hermes/.../plugin/__init__.py (log line + docstring)
- runtimes/hermes/tests/test_sre.py (3 env patches)
- deploy/helm/kars/templates/sre.yaml (extraEnv key + comment)
2) During the rename edit, the `extraEnv:` block ended up under
`runtime:` instead of `runtime.hermes:` (4-space vs 6-space indent),
producing:
UPGRADE FAILED: .spec.runtime.extraEnv: field not declared in schema
Fix: restore correct 6-space indent so extraEnv nests inside hermes.
Long-term fix (deferred): controller should detect
kars.azure.com/role=sre label on the KarsSandbox and inject
KARS_SRE_ENABLED itself (controller-side injection bypasses the
prefix filter). Noted inline at sre.is_enabled() docstring and in
the sre.yaml extraEnv block as a follow-up.
Tests: 31/31 pass (test_sre.py + test_sre_k8s.py).
Live verification: SRE_ENABLED env appears on agent container's env;
helm upgrade succeeds; chart re-applies cleanly.
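The reserved-prefix stripping behind bug 1 can be modelled in a few lines. The controller itself is Rust (mod.rs); this Python sketch only mirrors the prefix list quoted above, and the strip_reserved name and the returned list of dropped keys are illustrative additions (the real filter drops keys silently).

```python
# Prefix list as quoted in the commit message.
RESERVED_PREFIXES = ("AGT_", "AZURE_", "FOUNDRY_AGENT_", "IMDS_", "KARS_")


def strip_reserved(extra_env: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split user-supplied extraEnv into (kept, dropped_keys).

    str.startswith accepts a tuple, so one call checks every prefix.
    """
    kept = {k: v for k, v in extra_env.items()
            if not k.startswith(RESERVED_PREFIXES)}
    dropped = sorted(k for k in extra_env if k.startswith(RESERVED_PREFIXES))
    return kept, dropped
```

With both keys supplied, KARS_SRE_ENABLED lands in the dropped list while SRE_ENABLED survives, which is why the rename makes the plugin register.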
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Slice 1 template hardcoded requirePromptShields: true on the
SRE InferencePolicy. Azure OpenAI deployments only carry
'prompt_filter_results' in responses when an explicit Content
Filter policy is attached to the deployment. Bare local-dev
deployments (Foundry quickstart, gpt-4.1 without explicit filter)
don't emit those annotations — so the router blocks every response
with:
Response blocked: InferencePolicy requires Prompt Shields but
the upstream response carried no prompt_filter_results annotations
Diagnosed live during a 'kars sre talk' session: the first prompt ('hi
there') returned a cached greeting that happened to bypass the
check; the second prompt was blocked.
Fix: default false in values.yaml + chart; operators wiring
Content Safety in production can set:
--set sre.requirePromptShields=true
(or values.yaml override).
The SRE agent's threat surface is operator-driven Kubernetes
diagnosis, not user-facing chat, so prompt-shield enforcement is
less critical than for an internet-facing assistant. Operators who
need it can opt back in.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch default model so the SRE agent ships with the current frontier model out of the box. Operator can still override per-install with `kars sre install --model <name>`. The model name must match an Azure OpenAI deployment in the operator's Foundry project. InferencePolicy routes to that deployment via the router; if the deployment doesn't exist the router returns a clear 404 and the sandbox surfaces Degraded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hermes uses plugin.yaml's provides_tools list as the gate for
ctx.register_tool() calls — tools not declared in the manifest are
silently rejected at registration time. So even though sre.register()
called register_tool() for all 10 sre_* tools, none of them became
callable.
Diagnosed via live test:
hermes tools list → showed foundry_*, http_fetch, kars_handoff_status
(the manifest-declared ones)
→ NO sre_* (registered at runtime, manifest-rejected)
Same pattern as the OpenClaw plugin's contracts.tools requirement
(see memory: 'OpenClaw 2026.5.x requires plugin manifest to declare
contracts.tools listing every tool the plugin will register').
Fix: add all 10 sre_* tools (5 Slice 1 + 5 Slice 2) to provides_tools.
The tools remain conditionally registered at runtime — standard Hermes
sandboxes don't set SRE_ENABLED → sre.register(ctx) is skipped → the
tools are declared-but-not-callable (still matches the manifest
contract; Hermes treats them as 'present but inactive').
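The manifest-gate behaviour described above can be modelled as follows. This is a toy stand-in, not Hermes' plugin loader: the ManifestGatedRegistry class and its methods are hypothetical and exist only to show why a runtime register_tool() call for an undeclared name has no visible effect.

```python
from typing import Callable


class ManifestGatedRegistry:
    """Toy model: only names declared in provides_tools can be registered."""

    def __init__(self, provides_tools: list[str]) -> None:
        self._declared = set(provides_tools)
        self._tools: dict[str, Callable[..., object]] = {}

    def register_tool(self, name: str, handler: Callable[..., object]) -> None:
        if name not in self._declared:
            return  # silently rejected: no error, no log, tool never callable
        self._tools[name] = handler

    def callable_tools(self) -> list[str]:
        """Tools that were both declared and registered."""
        return sorted(self._tools)

    def inactive_tools(self) -> list[str]:
        """Declared in the manifest but never registered at runtime."""
        return sorted(self._declared - set(self._tools))
```

In this model a standard sandbox that never calls register_tool for the sre_* names would list them under inactive_tools(), matching the declared-but-not-callable state the commit describes.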
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three correctness fixes landed during the live test pass:
1) Hermes register_tool kwargs were wrong
sre.py + sre_k8s.py used parameters=... but Hermes' contract expects
schema=... AND toolset="<name>". Without these the manifest's
provides_tools entries still showed up but the tools were silently
non-callable. Fixed all 10 sre_* register_tool calls.
2) plugin.yaml provides_tools missing the sre_* entries
Hermes' plugin loader requires every tool the plugin will register
to be declared in provides_tools (same shape as OpenClaw's
contracts.tools). Added all 10. Conditionally registered at
runtime via SRE_ENABLED — standard sandboxes don't trip them.
3) New: kars-sre persona / system prompt
Following the OpenClaw pattern (sandbox-images/openclaw/entrypoint.sh
:1214 writes SOUL.md on every boot), the Hermes entrypoint now
writes a 110-line SRE-specific SOUL.md to $HERMES_HOME/SOUL.md
when SRE_ENABLED=true. Content:
- Identity + mission statement
- Tone constraints (concise, evidence-based, direct, honest)
- Catalog of all 10 sre_* tools with WHEN to use each
- Catalog of tools the agent does NOT have (spawn, mesh, shell,
external net) with rationale
- Standard incident reasoning loop (5 steps)
- Output structure for fix proposals (Symptom/Evidence/Root cause/
Proposed fix/Why safe/Rollback)
- Boundaries (protected-resource denylist enforced at proposal
layer; agent should not even try)
- Audit info (where the kars audit JSONL captures every call)
- First-message greeting template (one line, no editorialising)
The model name interpolates from KARS_MODEL → AZURE_OPENAI_DEPLOYMENT
→ 'gpt-5.4' default, so the prompt always names the live model.
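The model-name fallback chain can be expressed as a small resolver. The real logic lives in entrypoint.sh; this Python version is a sketch, the resolve_model name is hypothetical, and treating an empty variable the same as an unset one is an assumption.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "gpt-5.4"


def resolve_model(env: Mapping[str, str] = os.environ) -> str:
    """Return the first non-empty value of KARS_MODEL, then
    AZURE_OPENAI_DEPLOYMENT, falling back to the default."""
    for key in ("KARS_MODEL", "AZURE_OPENAI_DEPLOYMENT"):
        value = env.get(key, "").strip()
        if value:  # assumption: empty counts as unset
            return value
    return DEFAULT_MODEL
```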
Validation:
pytest tests/test_sre.py tests/test_sre_k8s.py → 31/31 pass
bash -n entrypoint.sh → clean
live verify: SOUL.md written 110 lines, model = gpt-5.4
live verify: hermes tools list → '✓ enabled sre' toolset now shows
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds two iptables rules to the egress-guard init container, gated on
the kars.azure.com/role=sre label being present on the KarsSandbox:
1. Filter chain: ACCEPT for UID 1000 -> KUBERNETES_SERVICE_HOST:443
(BEFORE the existing catch-all DROP).
2. NAT chain: RETURN for UID 1000 -> KUBERNETES_SERVICE_HOST:443
(BEFORE the existing :443 REDIRECT to :8444 transparent proxy).
Both are required. The NAT-bypass alone is not sufficient because
the filter chain runs AFTER NAT - the NAT-RETURN says 'don't redirect'
but the filter-chain DROP that follows would still drop the packet. Discovered
live during testing: the curl-to-apiserver hung until both rules
landed.
Why this is needed: the SRE plugin's K8s API client (sre_kube.py in
the Hermes runtime) needs DIRECT apiserver access with its projected
ServiceAccount token to read kars CRs / pods / events. Without the
bypass, every apiserver call gets NAT-redirected to the router's :8444
transparent proxy, which has no idea how to forward TLS to the
apiserver -- connections hang then time out.
Why only role=sre sandboxes: every other sandbox kind goes through
the router unchanged -- that's the whole point of the transparent
proxy + L7 audit. Direct apiserver access is the deliberate
exception, uniquely held by the nominated SRE sandbox per the
proposal section 7.8 containment design.
K8s audit log is the audit surface for these apiserver calls (the
router's L7 audit doesn't apply, but K8s audit is stronger -- every
call carries the SA identity, verb, and resource).
Implementation:
- new build_egress_guard_command(is_sre_sandbox: bool) helper
in reconciler/mod.rs that emits the right rule sequence per mode
- 3 unit tests: standard has no bypass; SRE has NAT bypass before
REDIRECT AND filter ACCEPT before DROP; both modes keep DROP
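The rule ordering that the helper and its unit tests guard can be sketched like this. The controller's helper is Rust and emits a fuller rule set; the Python function below, its name, and the exact iptables arguments are a simplified assumption that keeps only the four rules discussed in this message.

```python
def build_egress_guard_rules(is_sre_sandbox: bool,
                             apiserver_host: str = "10.96.0.1") -> list[str]:
    """Return iptables commands in the order they would be appended.

    `-A` appends, so list order is evaluation order within each chain:
    the SRE bypass rules must come before the REDIRECT and the DROP.
    """
    nat: list[str] = []
    flt: list[str] = []
    if is_sre_sandbox:
        match = (f"-m owner --uid-owner 1000 -d {apiserver_host} "
                 "-p tcp --dport 443")
        # NAT chain: skip the transparent-proxy redirect for apiserver traffic.
        nat.append(f"iptables -t nat -A OUTPUT {match} -j RETURN")
        # Filter chain: let the same traffic past the catch-all DROP.
        flt.append(f"iptables -A OUTPUT {match} -j ACCEPT")
    nat.append("iptables -t nat -A OUTPUT -p tcp --dport 443 "
               "-j REDIRECT --to-ports 8444")
    flt.append("iptables -A OUTPUT -j DROP")
    return nat + flt
```

Both modes end with the DROP, so a standard sandbox keeps the default-deny posture and only the role=sre variant gains the two extra rules.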
Validated end-to-end:
- HTTP 200 in 17ms from agent container -> 10.96.0.1:443
- sre_describe_state() returns 10 KarsSandboxes + all 11 CR kinds
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icies)
The Slice 1 inline AGT profile used the wrong schema — version: 1
with rules[].match.tool — which produced:
ToolPolicy sre-tools: invalid YAML: missing field agent
at compile time, then 'router has not yet loaded AgtProfile' at the
sre pod's policy loader. The sre KarsSandbox showed Degraded with
ToolPolicyNotCompiled.
Found by the SRE agent itself during the first cluster-health-overview
test (a beautifully on-point sre_diagnose result that flagged its
own ToolPolicy as the only Degraded thing in the cluster).
Right schema (from deploy/helm/kars/files/kars-default-agt-profile.yaml):
version: '1.0'
agent: <name>
policies:
- name: ...
type: capability
allowed_actions: [...]
denied_actions: [...]
priority: N
Action prefix convention used by the router:
tool:<tool_name> for tool calls
inference:<api>:<model> for model dispatch
spawn:* / mesh:* for sub-agent + mesh
The new sre-tools profile has three policies:
- sre-diagnostic-tools-allow (priority 100): all 10 sre_* tools
- sre-inference-allow (priority 90): chat_completions / responses /
content_safety
- sre-spawn-and-mesh-deny (priority 110): defense in depth for the
§7.8.5/§7.8.6 containment (already enforced by plugin not even
registering these tools)
After re-apply + sre pod restart:
ToolPolicy sre-tools status: Ready True:RouterEnforcing
KarsSandbox sre status: Running
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…shape
The Slice 1 allow rules used literal 'tool:sre_<name>' strings but the
Hermes plugin governance hook actually emits 'tool:<name>:<first-arg>'
— with a trailing colon even when no significant arg is present (see
runtimes/hermes/.../plugin/governance.py _action_verb tail returns
f'tool:{tool_name}:'). So:
literal allow: 'tool:sre_describe_state'
router emit: 'tool:sre_describe_state:' <-- no match → denied
The agent helpfully diagnosed itself via:
sre_describe_state -> blocked by policy 'sre-diagnostic-tools-allow'
(visible because the WebUI surfaced the matched_rule name). Confirmed
the action shape in inference-router/src/routes/governance.rs:66
('if let Some(tool_name) = action.strip_prefix("tool:")...').
Fix: add a '*' wildcard to every allowed_action for the sre_* tools.
This matches both the trailing-colon shape (tools with no args) and
the suffix-args shape (sre_describe_resource:<name>, sre_logs:<pod>,
etc.) in a single entry.
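The mismatch and the wildcard fix can be reproduced with glob matching. Two assumptions are baked into this sketch: that the router's matcher behaves like a shell-style glob, and that the wildcard is written as a ':*' suffix; the is_allowed helper is hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(action: str, allowed_actions: list[str]) -> bool:
    """True when any allow pattern glob-matches the emitted action string."""
    return any(fnmatchcase(action, pattern) for pattern in allowed_actions)


# Shape emitted by the plugin governance hook when a tool has no arguments.
emitted = "tool:sre_describe_state:"

literal_rule = ["tool:sre_describe_state"]     # Slice 1 shape: no match
wildcard_rule = ["tool:sre_describe_state:*"]  # assumed fixed shape

literal_ok = is_allowed(emitted, literal_rule)
wildcard_ok = is_allowed(emitted, wildcard_rule)
```

Because '*' also matches the empty string, one pattern covers both the bare trailing-colon form and the form that carries a first argument.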
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The egress-guard iptables bypass (b25f41b) lets UID 1000 reach the
apiserver at the iptables layer, but the pod-level NetworkPolicy was
still denying it. The blanket :443 egress rule explicitly excludes
RFC1918 ranges to prevent lateral movement to in-cluster Services, but
every cluster's apiserver ClusterIP IS in one of those ranges
(kind: 10.96.0.1, AKS: 10.0.0.1, EKS: 172.20.0.1).
Fix: when role=sre, add a NetworkPolicy egress rule for the apiserver
Service ClusterIP. The IP + port are read at reconcile time from the
controller's own KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT_HTTPS
env vars (kubelet-injected on every pod). This is cluster-portable:
kind, AKS, EKS, custom service-CIDRs all get the right value
automatically. No hardcoded IPs.
Implementation:
- Top of reconcile(): compute is_sre_sandbox once + read apiserver
IP/port from env. Threaded through both the egress-guard helper and
the NetworkPolicy egress vec.
- egress_rules.push(...) added after the static block, gated on
is_sre_sandbox, with IP/port substituted from env.
- Removed the duplicate is_sre_sandbox compute lower in reconcile()
that was added in b25f41b. Single source of truth now.
Validated live:
- kubectl get netpol -n kars-sre shows the 10.96.0.1/32 :443 rule
- sre_describe_state() returns in 0.10s: 11 CR kinds, 10 KarsSandboxes
enumerated, NO timeouts.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
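The apiserver egress rule described above could be built along these lines. The controller does this in Rust inside reconcile(); the Python function below is a hypothetical sketch that assumes an IPv4 ClusterIP (hence the /32) and defaults the port to 443 when the env var is absent.

```python
import os
from typing import Any, Mapping, Optional


def apiserver_egress_rule(
    is_sre_sandbox: bool,
    env: Mapping[str, str] = os.environ,
) -> Optional[dict[str, Any]]:
    """Return a NetworkPolicy egress rule for the apiserver, or None.

    Host and port come from the kubelet-injected service env vars, so
    nothing cluster-specific is hardcoded.
    """
    if not is_sre_sandbox:
        return None
    host = env["KUBERNETES_SERVICE_HOST"]
    port = int(env.get("KUBERNETES_SERVICE_PORT_HTTPS", "443"))
    return {
        "to": [{"ipBlock": {"cidr": f"{host}/32"}}],  # assumes IPv4
        "ports": [{"protocol": "TCP", "port": port}],
    }
```

The returned dict follows the networking.k8s.io/v1 NetworkPolicy egress shape (to/ipBlock/cidr plus ports), so it can be appended to an existing egress list unchanged.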
Two admission rejections:
1) spec.governance.toolPolicyRef.name required when governance.enabled=true
Added a research-tools ToolPolicy with allow rules for:
- inference:chat_completions:* / responses:* / content_safety:*
- tool:http_fetch:* (the agent does web research)
- tool:foundry_* family (memory + web_search + code_execute etc.)
2) spec.runtime.hermes must be set iff kind=Hermes (CEL guard rejects
missing key, accepts empty object). The previous manifest had a
commented placeholder, which passes yamllint, but admission saw the key
as missing. Changed to 'hermes: {}' — empty object honours image
defaults without drift.
Also: aligned the demo with the SRE sandbox defaults shipped earlier:
- deployment: gpt-5.4 (was gpt-4.1)
- requirePromptShields: false (was true — bare local Foundry deployments
don't emit prompt_filter_results, blocking every response)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Controller stamps pods with kars.azure.com/component=sandbox, not the app.kubernetes.io/component=sandbox label the script was looking for. Result: 'no sandbox pod found to evict; quota will only manifest on next natural restart'. The script kept going, but the break never surfaced. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…legram)
Slice 3 — typed apply-fix path (operator-approved remediation)
Adds the KarsSREAction CRD and reconciler that drives an SRE-agent
fix proposal Proposed → Approved → Applied → Recovered. The agent
emits a CR via sre_propose_fix; the operator approves via kars sre
approve <id> (or kubectl edit); the controller mints a one-shot
ClusterRoleBinding scoped to the right writer ClusterRole
(kars-sre-writer-quotas | kars-sre-writer-workloads), executes the
typed action via SSA, tears the binding down, and observes recovery
by polling the target namespace for failure-class events. Terminal
CRs (Recovered / Failed / Expired / Rejected) auto-GC after 1h.
Closed set of typed actions per proposal §7.7.1:
- DeleteResourceQuota (refuses kars.azure.com/managed-by=controller)
- PatchDeploymentImage, ScaleDeployment (clamp 0..50),
RolloutRestart (Deployment/StatefulSet/DaemonSet), DeletePod
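The action lifecycle and the ScaleDeployment clamp can be sketched as a small state machine. Only the Proposed, Approved, Applied, Recovered walk, the four terminal phases, the 0..50 clamp and the 1h GC are stated above; the failure edges in TRANSITIONS and all function names here are illustrative assumptions, and the real reconciler is Rust.

```python
from enum import Enum


class Phase(str, Enum):
    PROPOSED = "Proposed"
    APPROVED = "Approved"
    APPLIED = "Applied"
    RECOVERED = "Recovered"
    FAILED = "Failed"
    EXPIRED = "Expired"
    REJECTED = "Rejected"


TERMINAL = {Phase.RECOVERED, Phase.FAILED, Phase.EXPIRED, Phase.REJECTED}

# Assumed edges: the happy path is from the commit message, the
# rejection / expiry / failure edges are guesses for illustration.
TRANSITIONS = {
    Phase.PROPOSED: {Phase.APPROVED, Phase.REJECTED, Phase.EXPIRED},
    Phase.APPROVED: {Phase.APPLIED, Phase.FAILED},
    Phase.APPLIED: {Phase.RECOVERED, Phase.FAILED},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move to target, or raise if the edge is not in TRANSITIONS."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def clamp_replicas(requested: int) -> int:
    """ScaleDeployment clamp: keep the replica count within 0..50."""
    return max(0, min(50, requested))


def gc_due(phase: Phase, seconds_since_terminal: float) -> bool:
    """Terminal CRs become eligible for garbage collection after 1h."""
    return phase in TERMINAL and seconds_since_terminal >= 3600
```

Terminal phases have no entry in TRANSITIONS, so any attempt to move a Recovered or Rejected action forward raises instead of silently re-running a fix.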
New files:
- controller/src/kars_sre_action.rs (CRD types)
- controller/src/kars_sre_action_reconciler.rs (state machine)
- deploy/helm/kars/templates/crd-karssreaction.yaml
Hermes plugin (sre_propose_fix is now a CR-creator):
- Tolerant arg parsing: target.kind / action_type / inferred kind
- schema marks target.kind required + enum-validated
- Returns action_id + ready-to-paste 'kars sre approve' command
- Clear cr_error when no typed fix could be inferred
CLI:
- kars sre approve <id> / reject <id> / actions / show <id>
- kars sre show renders diagnosis + rationale + condition stamps
RBAC additions (controller-side):
- karssreactions (full r/w)
- resourcequotas: delete (the §7.8.4 escalation check requires the
controller to hold the verbs it grants in the one-shot CRB)
- apps/statefulsets,daemonsets: patch (RolloutRestart targets)
- events: list/watch/get (recovery observer)
- serviceaccounts/token: create (lands the §7.8.4 TokenRequest path)
- clusterrolebindings: create/delete kars-sre-write-*
Slice 4 — proactive watcher + Telegram
sre_watcher.py runs alongside the Hermes gateway when SRE_ENABLED=true
and a channel is configured. Polls K8s events every 10s for failure-
class reasons in kars-* namespaces (excluding kars-sre / kars-system
/ kube-* / agentmesh / default), maps each into a typed-fix target,
and on incident:
1. Reuses any open KarsSREAction with the same (action_type, ns,
name) target — no duplicate CRs.
2. Otherwise creates a new KarsSREAction with ttl_minutes=30.
3. Coalesces a per-iteration burst into ONE detailed Telegram
message (highest-priority candidate) plus an optional summary
tail ('+N other incidents: 2 FailedScheduling, 1 BackOff').
4. Sliding-window rate limit: max 4 messages/min cluster-wide.
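The sliding-window limit in step 4 can be sketched with a deque of send timestamps. The class name and its interface are hypothetical; the caller passes the current time explicitly (for example time.monotonic()) so the logic stays testable.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_events sends within any window_seconds span."""

    def __init__(self, max_events: int = 4, window_seconds: float = 60.0) -> None:
        self._max = max_events
        self._window = window_seconds
        self._sent: deque = deque()  # timestamps of accepted sends, oldest first

    def allow(self, now: float) -> bool:
        """Record and permit a send at `now`, or refuse if the window is full."""
        # Evict timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) >= self._max:
            return False
        self._sent.append(now)
        return True
```

Unlike a fixed per-minute counter, the window slides with each call, so a burst straddling a minute boundary still cannot exceed four messages in any 60-second span.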
Dedupe is bootstrapped from existing KarsSREActions on boot (survives
pod restart). First iteration is silently absorbed (priming) so a
pod re-roll doesn't replay the warm-cache flood as alerts. Periodic
60s CR resync REPLACES the dedupe state so operator-side delete
clears the in-memory map naturally.
ReplicaSet/Pod hash suffixes are normalised in the dedupe key so a
flapping Deployment's rollout sequence collapses to one alert
instead of one alert per pod-template-hash.
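One way to normalise generated suffixes in the dedupe key is a pair of regular expressions over the object name. This is a heuristic sketch, not the watcher's code: the function names are hypothetical, and the patterns assume Kubernetes' generated-suffix alphabet with an 8 to 10 character pod-template-hash and a 5 character pod suffix.

```python
import re

# Alphabet assumed for Kubernetes-generated name suffixes.
_SUFFIX = "[bcdfghjklmnpqrstvwxz2456789]"
# <workload>-<pod-template-hash>-<pod suffix>
_POD_OF_RS = re.compile(rf"^(?P<base>.+)-{_SUFFIX}{{8,10}}-{_SUFFIX}{{5}}$")
# <workload>-<pod-template-hash>
_RS = re.compile(rf"^(?P<base>.+)-{_SUFFIX}{{8,10}}$")


def normalise_name(name: str) -> str:
    """Strip ReplicaSet / Pod hash suffixes; return other names unchanged."""
    for pattern in (_POD_OF_RS, _RS):
        match = pattern.match(name)
        if match:
            return match.group("base")
    return name


def dedupe_key(action_type: str, namespace: str, name: str) -> tuple[str, str, str]:
    """Key under which incidents for the same workload collapse."""
    return (action_type, namespace, normalise_name(name))
```

Being a pattern match, this can misfire on a hand-named object whose last segment happens to fit the alphabet and length; resolving ownerReferences would be the stricter alternative.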
Telegram wiring:
- Channel adapter libraries (python-telegram-bot 21, slack-sdk 3,
discord.py 2) pre-installed in the runtime image so credentials
in the sandbox-credentials secret 'just work'.
- entrypoint.sh exports HTTPS_PROXY=http://127.0.0.1:8444 and
NO_PROXY=$KUBERNETES_SERVICE_HOST,127.0.0.1,localhost,.svc.cluster.local
so the gateway's outbound HTTPS reaches the inference-router's
forward proxy (egress-guard iptables redirect doesn't fire in
kind clusters without CAP_NET_ADMIN — explicit env covers both).
- HOME=/sandbox export so gateway-locks dir under ~/.local/state
is writable on the distroless base.
- TELEGRAM_ALLOWED_USERS exported (not just config-set) so the
gateway's per-platform allowlist skips pairing for known users.
- TELEGRAM_HOME_CHANNEL set to first TELEGRAM_ALLOW_FROM id so
'hermes send --to telegram' resolves without explicit chat id.
Operator install path (unchanged — uses existing kars credentials):
kars credentials update sre --telegram-token <T> --telegram-allow-from <ID>
Tests: 31 hermes tests + 847 rust tests + cli typecheck/lint pass.
The phase taxonomy guard now passes after refactoring the reconciler
to use named constants for all condition types / reasons / event
reasons rather than 'Failed' / 'Pending' / 'Degraded' literals.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the SRE engineer's dedicated console as a top-level sidebar
branch in the kars Headlamp plugin. Replaces the prior workflow of
'kubectl get karssreactions + paste action_id into kars sre approve
in a terminal' with one click in the dashboard.
New routes:
/kars/sre — SRE Console (live cards, primary landing)
/kars/sre/chat — embedded Hermes WebUI iframe
/kars/karssreactions — full CRD list (under existing CRD section)
SRE Console layout (top → bottom):
🔴 Pending Approval — KarsSREActions awaiting operator. Inline
Approve / Reject buttons PATCH .spec.approval.state directly
via Headlamp's KubeObject.patch(), with optional rejection-
reason prompt. No terminal hop needed.
🔄 In-flight — actions the controller is currently executing
(Applied + waiting for recovery). Shows phase + age.
📊 Cluster Health — sandbox phase counts + degraded count.
🚨 Active Incidents — failure-class events (FailedCreate,
BackOff, FailedScheduling, Failed, ImagePullBackOff,
CrashLoopBackOff, OOMKilling, Evicted, FailedMount) from
kars-* namespaces in the last 15 min. Same filter the
proactive watcher uses, so what the operator sees here is
what the watcher would alert on.
✅ Recent — Recovered / Failed / Expired / Rejected actions
from the last hour for post-incident review.
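The Active Incidents filter described above can be sketched as a single predicate. This is an illustration in Python rather than the plugin's own code; the reason set, namespace prefix and 15-minute window come from the text, while `is_active_incident` and its parameters are assumed names.

```python
from datetime import datetime, timedelta, timezone

# Failure-class reasons listed for the Active Incidents card.
INCIDENT_REASONS = frozenset({
    "FailedCreate", "BackOff", "FailedScheduling", "Failed", "ImagePullBackOff",
    "CrashLoopBackOff", "OOMKilling", "Evicted", "FailedMount",
})
WINDOW = timedelta(minutes=15)


def is_active_incident(reason: str, namespace: str,
                       last_seen: datetime, now: datetime) -> bool:
    """True for a failure-class event in a kars-* namespace seen within the window."""
    return (
        reason in INCIDENT_REASONS
        and namespace.startswith("kars-")
        and now - last_seen <= WINDOW
    )
```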
All cards live-update via Headlamp's useList() (watch + long-poll),
so the Proposed → Approved → Applied → Recovered walk is visible
without F5. The KarsSREAction CRD is added to the existing CRD
registration table so the standard list / detail pages 'just work'
under /kars/karssreactions/:ns/:name.
SRE Chat is an iframe of the Hermes WebUI:
- tab 1: http://localhost:18789 (requires 'kars connect sre --web'
in another terminal — populates the iframe via port-forward)
- tab 2: apiserver service-proxy fallback for in-cluster operators
- 'Open in new tab' button if iframe sandboxing breaks the embed
Helm chart: SRE sandbox's allowedEndpoints now includes
api.telegram.org / core.telegram.org cluster-side so the Slice 4
watcher's outbound Telegram alerts don't need an out-of-band
NetworkPolicy patch. Dormant when Telegram isn't configured — the
gateway only opens the channel when the token is present.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +48 to +59
import {
  Button,
  Chip,
  Stack,
  Tab,
  Tabs,
  TextField,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions,
} from "@mui/material";
…d' CTA
Two fixes:
1. ReferenceError: require is not defined
The Active Incidents card lazily resolved the Event class via
require("@kinvolk/headlamp-plugin/lib/K8s/event"). Headlamp ships
plugin bundles as pure browser ESM modules — require() doesn't
exist in that context, so the page crashed at first render. Switch
to the documented public re-export via the K8s namespace
(`import { K8s } from "@kinvolk/headlamp-plugin/lib"` →
`K8s.event`), which is safe at both build time and run time.
2. Empty-state CTA when kars-sre isn't deployed
Both SREConsole and SREChat now check for the existence of the
sre KarsSandbox in kars-system. If absent (or the list is still
loading), they render an actionable install card with:
- `kars sre install` (the one-liner that enables the chart)
- `kars credentials update sre --telegram-token ...` (optional)
So a fresh kars dev cluster that hasn't run `kars sre install`
yet doesn't show 'No items' or a spinning iframe — it tells the
operator exactly what to type. The cards rehydrate live once the
sandbox lands (no refresh needed).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…exposed
The Headlamp 0.41 host runtime exposes `pluginLib.K8s` as a flat
namespace of class-kind classes but does NOT expose the v1 `event`
sub-namespace. Importing it via either explicit submodule path
(`@kinvolk/headlamp-plugin/lib/K8s/event`) or the top-level barrel
(`K8s.event.default`) trips Vite's UMD wrapper into its CJS-fallback
branch on first execution, which crashes the browser with:
ReferenceError: require is not defined
at ct (//plugins/kars/dist/main.js:3:52537)
(`ct` was the INCIDENT_REASONS set at top-level — top-level
execution failed before any component mounted.)
The KarsSREAction CR cards above already surface every incident
the proactive watcher catches (same dedupe key, same target shape),
so for Slice 4 the operator doesn't need the raw events feed
duplicated in the dashboard.
Slice 4.1 (future) can resurrect this via direct fetch() to
/api/v1/events through the headlamp apiserver proxy, bypassing
the K8s.event class entirely.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Headlamp keys plugins by package.json version. A pure dist/main.js
swap (with same version) leaves the host's plugin loader cache holding
the previous bundle. Bumping minor → operator's browser re-fetches
main.js on next mount even without Cmd+Shift+R.
v0.5.1 → v0.6.0 covers the prior session's additions:
- KarsSREAction CRD list / detail
- SRE Console + Chat sidebar branch
- browser-ESM safety pass (no require() in source)
- SRE-not-installed empty-state CTA
…ev / fresh)
The 'no deployed releases' error happened because 'kars dev --target
local-k8s' deploys the chart via 'helm template | kubectl apply'
(see cli/src/commands/dev/local-k8s.ts:794), so no helm release
record exists. The sre install path assumed a helm release and
failed on a fresh kars dev cluster.
Now detects three shapes:
A. helm release present
→ helm upgrade --reset-then-reuse-values --force-conflicts
(preserves operator's prior --set choices)
B. no helm release BUT controller deployed (= kars dev path)
→ helm template … | kubectl apply --server-side --force-conflicts
(mirrors how the chart got there in the first place)
C. neither (= fresh cluster)
→ helm install --create-namespace --take-ownership
(--take-ownership: adopt any pre-existing namespace or
NetworkPolicy from prior partial installs; helm >= 3.17)
The template path uses --include-crds so KarsSREAction is installed
on first sre install even when the cluster predates Slice 3. All
three paths set azure.workloadIdentity.clientId=dummy for local-k8s
brand-new installs (real AKS installs go through kars up).
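The three-shape detection above reduces to a small decision function. A sketch, assuming two boolean probes (helm release present, controller deployed); `plan_sre_install` is a hypothetical name, and the returned strings carry only the flags quoted in the text, with release and chart arguments left out just as the source elides them.

```python
def plan_sre_install(has_helm_release: bool, controller_deployed: bool) -> tuple:
    """Pick the install path (A, B or C) for 'kars sre install'."""
    if has_helm_release:
        # A: keep the operator's prior --set choices.
        return ("A", "helm upgrade --reset-then-reuse-values --force-conflicts")
    if controller_deployed:
        # B: kars dev path, chart was applied without a helm release record.
        return ("B", "helm template --include-crds | kubectl apply --server-side --force-conflicts")
    # C: fresh cluster; adopt leftovers from prior partial installs.
    return ("C", "helm install --create-namespace --take-ownership")
```

Checking for the helm release first matters: a cluster with a release also has the controller deployed, so testing in the other order would send shape A down path B.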
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The proxy URL hardcoded 'kind-kars-dev' as the cluster name, which only worked for the local-k8s demo path. Real operators have any context name (AKS managed-cluster names, in-cluster Headlamp, multi-cluster setups). The 'key not found' error was Headlamp's backend rejecting the request because the cluster path component didn't match any of the operator's configured contexts. Fix: parse the cluster name from window.location.pathname (Headlamp routes every cluster-scoped view under /c/<cluster>/...). When the parse fails (e.g. the Chat page is loaded outside a cluster context), the proxy tab is disabled and the operator is steered to the local port-forward tab. Reads location directly instead of useCluster() because importing the K8s namespace (where useCluster lives) trips the host's UMD require() fallback — the same crash the v0.6.0 plugin fixed. v0.6.0 → v0.6.1 to bust Headlamp's plugin cache. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
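The pathname parse can be sketched as below. The plugin itself runs in the browser; this Python version only illustrates the rule that cluster-scoped views live under /c/<cluster>/... and that a failed parse yields no cluster. `cluster_from_pathname` is an assumed name.

```python
import re
from typing import Optional

# Cluster-scoped Headlamp views are routed under /c/<cluster>/...
_CLUSTER_PATH = re.compile(r"^/c/([^/]+)(?:/|$)")


def cluster_from_pathname(pathname: str) -> Optional[str]:
    """Return the cluster segment, or None when the page is outside a cluster context."""
    match = _CLUSTER_PATH.match(pathname)
    return match.group(1) if match else None
```

A None result is the signal to disable the proxy tab and steer the operator to the local port-forward tab.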
The per-sandbox Service exposed only :8443 (inference-router). For Hermes runtimes the gateway WebUI / inbound channel adapter lives on the agent container at :18789, but operators had no way to reach it without setting up a per-sandbox port-forward. Now: when runtime.kind == Hermes, the controller appends a 'gateway' port (18789) to the same Service. Result: 'kubectl port-forward svc/<name> 18789' works directly, AND the Headlamp SRE → Chat tab can route via the apiserver service proxy: /clusters/<cluster>/api/v1/namespaces/kars-sre/services/sre:18789/proxy/ OpenClaw runtimes are unaffected (no gateway port added). The NetworkPolicy ingress rule for governance-enabled sandboxes already allows port 18789 from peer sandbox namespaces, so this purely widens what the cluster apiserver / Headlamp backend can reach — no extra exposure to other sandboxes.
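A sketch of the two pieces described above: the conditional port list and the service-proxy path the Chat tab routes through. The function names and the 'router' port name are hypothetical; the port numbers, the 'gateway' port name and the URL shape come from the text.

```python
def sandbox_service_ports(runtime_kind: str) -> list:
    """Ports on the per-sandbox Service; Hermes runtimes also get the gateway port."""
    ports = [{"name": "router", "port": 8443}]  # port name is an assumption
    if runtime_kind == "Hermes":
        ports.append({"name": "gateway", "port": 18789})
    return ports


def service_proxy_path(cluster: str, namespace: str, service: str, port: int) -> str:
    """Headlamp-prefixed apiserver service-proxy path for a Service port."""
    return (
        f"/clusters/{cluster}/api/v1/namespaces/{namespace}"
        f"/services/{service}:{port}/proxy/"
    )
```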
Hermes is a CLI/TUI agent — there's no embedded WebUI to iframe.
Earlier commits attempted an apiserver-proxy iframe pointing at
:18789 (Hermes admin port) — which only listens when the gateway
runs in channel mode, and even then doesn't serve a browser UI.
The SRE Chat page now shows three explicit operator paths in
copy-pasteable code blocks:
1. kars sre talk → kubectl exec REPL (live triage)
2. kars credentials update sre --telegram-token …
→ wire Telegram for proactive alerts
3. kars sre status / actions / show <id>
→ terminal-friendly snapshot
Plus a link back to /kars/sre (the Console) for the live approval
queue + cluster health cards. The 'iframe with connection refused'
error is gone; v0.6.1 → v0.6.2 to bust the host's plugin cache.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The internal <Link routeName="kars-sre-console"> in SREChat may have been the source of the React error 310 — Headlamp's Link implementation uses hooks internally and a routeName resolution miss can fire conditional hook paths. Using a plain <a> anchor with the canonical Headlamp URL avoids that branch entirely. The bundle was also showing as stale (browser cached old dist) — v0.6.3 bumps the version to force a re-fetch on the host's plugin loader. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the 'no embedded WebUI' instruction page with a real iframe
into the Hermes dashboard — an in-browser xterm.js PTY chat. The
operator can now talk to the SRE agent without leaving Headlamp.
How it works:
1. sandbox image: pip-installs FastAPI + uvicorn + websockets +
ptyprocess (the soft-optional deps hermes dashboard needs).
Upgrades hermes-agent from 0.15.2 → 0.16.0 to pick up the
dashboard_auth submodule that 0.15.2 was missing.
2. entrypoint.sh: launches 'hermes dashboard --host 0.0.0.0
--port 9119 --no-open --insecure --skip-build' alongside the
gateway when SRE_ENABLED=true. HERMES_DASHBOARD_TUI=1 enables
the embedded PTY tab. Opt-out via HERMES_DASHBOARD_ENABLED=false.
3. controller: adds containerPort 9119 ('dashboard') to Hermes
agent containers, and exposes it on the per-sandbox Service so
the cluster apiserver proxy can reach it.
4. Headlamp plugin: SREChat replaces the instruction page with an
iframe pointing at
/clusters/<cluster>/api/v1/namespaces/kars-sre/services/sre:9119/proxy/.
Includes 'Open in new tab' fallback for cases where the sub-
path proxy strips Hermes web bundle asset paths. v0.6.3 → v0.7.0
to bust the host's plugin cache.
The '--insecure' flag is required when binding off-loopback inside
the pod — Hermes refuses non-127.0.0.1 binds without it. In our pod
the only reachers are the apiserver proxy + peer sandboxes (both
gated by RBAC + NetworkPolicy), so 'insecure' here doesn't mean
externally exposed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two changes work together to make the in-browser Hermes dashboard
load inside the Headlamp SRE Console iframe:
1. New runtimes/hermes/src/kars_runtime_hermes/dashboard_proxy.py
Tiny FastAPI middleware wrapper around hermes_cli.web_server.app.
Installs X-Forwarded-Prefix on every request from the
HERMES_DASHBOARD_PREFIX env var. Hermes' dashboard reads that
header to rewrite absolute asset URLs (/assets/...) for sub-path
reverse proxies. K8s apiserver service proxy doesn't inject that
header, so without this wrapper the SPA blank-loads in the iframe.
2. entrypoint.sh now boots that wrapper instead of 'hermes dashboard',
with HERMES_DASHBOARD_PREFIX set to the K8s apiserver suffix:
/api/v1/namespaces/<ns>/services/<svc>:9119/proxy
3. Headlamp SREChat fetches the dashboard HTML up front via the
Headlamp proxy, rewrites asset paths to include /clusters/<cluster>
(the Headlamp-added prefix that the in-pod wrapper can't know
about), and injects via iframe srcDoc. Also injects <base href>
so the SPA's relative fetch() calls resolve under the proxy.
v0.7.0 → v0.7.1 to bust the host's plugin cache.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ollision
The K8s apiserver-proxy URL prefix
/api/v1/namespaces/<ns>/services/<svc>:<port>/proxy starts with /api/v1,
which collides with Hermes' own /api/* route namespace. So when the
browser fetched a SPA asset like
/api/v1/namespaces/kars-sre/services/sre:9119/proxy/assets/index.js,
FastAPI matched it to its API router (401 Unauthorized) instead of the
static-file mount.
Fix: extend dashboard_proxy.py middleware to STRIP the prefix from
scope["path"] before FastAPI sees the request, while still injecting
X-Forwarded-Prefix so the SPA's index.html bootstrap rewrites asset
URLs with the absolute prefix.
Result: browser fetches .../proxy/assets/foo.js, middleware strips →
FastAPI sees /assets/foo.js → static-file mount serves it → 200 OK.
Smoke test verified end-to-end:
  asset via prefix: HTTP 200
  index via prefix: HTTP 200
Headlamp SREChat still uses srcDoc + double-prefix rewrite because
Headlamp's apiserver proxy adds /clusters/<cluster> ON TOP of the K8s
suffix. The in-pod wrapper can't know <cluster>, so the browser-side
rewrite adds it.
v0.7.1 → v0.7.2 to bust the host's plugin cache.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
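The strip-and-inject behaviour can be sketched as a pure-ASGI middleware. This is not the contents of dashboard_proxy.py; it is a dependency-free illustration of the technique, with `PrefixStripMiddleware` and `demo` as assumed names. It rewrites scope["path"] before the wrapped app sees it and adds X-Forwarded-Prefix to the request headers.

```python
import asyncio


class PrefixStripMiddleware:
    """Strip a reverse-proxy prefix from the path and advertise it to the app."""

    def __init__(self, app, prefix: str):
        self.app = app
        self.prefix = prefix.rstrip("/")

    async def __call__(self, scope, receive, send):
        if self.prefix and scope["type"] in ("http", "websocket"):
            scope = dict(scope)  # copy so the caller's scope is not mutated
            path = scope.get("path", "")
            if path == self.prefix or path.startswith(self.prefix + "/"):
                scope["path"] = path[len(self.prefix):] or "/"
            # Replace any client-supplied value with the configured prefix.
            headers = [(k, v) for k, v in scope.get("headers", [])
                       if k.lower() != b"x-forwarded-prefix"]
            headers.append((b"x-forwarded-prefix", self.prefix.encode("latin-1")))
            scope["headers"] = headers
        await self.app(scope, receive, send)


def demo(path: str) -> dict:
    """Run one request through the middleware and report what the app saw."""
    seen = {}

    async def app(scope, receive, send):
        seen["path"] = scope["path"]
        seen["prefix"] = dict(scope["headers"]).get(b"x-forwarded-prefix")

    prefix = "/api/v1/namespaces/kars-sre/services/sre:9119/proxy"
    wrapped = PrefixStripMiddleware(app, prefix)
    asyncio.run(wrapped({"type": "http", "path": path, "headers": []}, None, None))
    return seen
```

Paths that do not start with the prefix pass through unchanged, so direct in-pod requests to /api/* keep reaching the API router.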
Three stacked bugs blocked the SRE Console's Chat tab from working
end-to-end. All fixed:
1. Headlamp's apiserver proxy demands Authorization: Bearer on every
   /clusters/<c>/api/v1/.../proxy/* call. Headlamp's SPA fetch wrapper
   attaches it; iframe asset loads bypass the wrapper and 403 as
   system:anonymous. Plugin v0.7.4 drops the apiserver-proxy approach
   entirely and iframes http://localhost:19119/ via a user-run
   port-forward. Cross-port = different origin so parent/child JS is
   isolated, but iframe document loads aren't same-origin-gated.
2. The dashboard_proxy wrapper bypasses Hermes' start_server() (to
   install X-Forwarded-Prefix middleware first), which is where Hermes
   sets app.state.bound_host/port. Without those, _build_gateway_ws_url
   returned None and the PTY-spawned hermes --tui child got no
   HERMES_TUI_GATEWAY_URL env var, accepting keystrokes but with nowhere
   to send them. _set_bind_state() mirrors what start_server does.
3. Azure Linux 3 ships Node 24; Hermes' ui-tui esbuild bundle was built
   against Node 22 and SIGSEGVs immediately on Node 24 (380MB core
   dumps). Dockerfile now pins Node 22.20.0 at /opt/node22/, entrypoint
   exports HERMES_NODE=/opt/node22/bin/node so Hermes' _node_bin() picks
   it up.
Plus:
- model.context_length: 200000 pinned so cold-start skips the slow
  /v1/models probe.
- GATEWAY_ALLOW_ALL_USERS=true on the SRE sandbox so the
  single-operator loopback deploy doesn't drop our own messages.
- entrypoint passes HOME/HERMES_HOME/HERMES_NODE through runuser's env
  reset via explicit env VAR=$VAR invocation.
Plugin bumped to 0.7.4. Verified end-to-end: chat opens, accepts
keystrokes, agent responds.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes that surfaced during demo dry-run:
1. ACR typo (introduced in 5c87e9de during the azureclaw→kars rename)
- 'kars.azurecr.io' was a search/replace artifact from 'azureclaw.azurecr.io';
the actual ACR is 'karsjpdyyv.azurecr.io' (azd-suffixed). The canonical
name we use in chart values + controller defaults is 'karsacr.azurecr.io'
so operators have ONE name to re-publish to.
- Symptom: when an existing sandbox spawned a sub-agent via kars_spawn, the
controller minted the new Deployment with 'kars.azurecr.io/openclaw-sandbox:latest'.
Kubelet did DNS on 'kars.azurecr.io' → NXDOMAIN → ImagePullBackOff loop.
- Fixed in:
- deploy/helm/kars/values.yaml (4 sites: controller, inference-router,
sandbox, a2a-gateway)
- cli/src/commands/dev/local-k8s.ts (inverted target/aliases shape:
'target' is now the canonical name the controller expects, 'aliases'
are the local build tags we look up + retag from)
- tools/demo/scenarios/01-sandbox.yaml
2. Workload-aware Cluster Health (Headlamp plugin 0.7.5)
- KarsSandbox CR's 'phase=Running' fires the moment the controller
successfully reconciles the Deployment spec; it knows nothing about
whether the pods inside actually pulled their image, passed readiness,
or got OOM-killed. The old SREClusterHealthCard read phase only →
'all green' even when break.sh had killed every pod.
- SREClusterHealthCard now cross-checks each sandbox against its
underlying Deployment (kars-<name>/<name>) and surfaces three buckets:
Healthy: CR Running AND availableReplicas ≥ desired
Workload down: CR Running BUT availableReplicas < desired (the
false-green case)
CR-Degraded: CR-level Degraded=True
- Bonus: per-sandbox breakdown panel lists which ones are unhealthy
and points the operator at 'kars-<name>' namespace for pod-level
diagnosis. Matches the SRE agent's own diagnosis output.
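The three buckets can be sketched as one classification function. This is an illustration, not the card's code: the precedence (Degraded checked first) and the fall-through for non-Running phases are assumptions, and `classify_sandbox` is a hypothetical name. Treating a missing availableReplicas as 0 reflects that the Deployment status omits the field when no replicas are available.

```python
from typing import Optional


def classify_sandbox(cr_phase: str, cr_degraded: bool,
                     available_replicas: Optional[int], desired_replicas: int) -> str:
    """Cross-check the CR phase against the underlying Deployment's availability."""
    if cr_degraded:
        return "CR-Degraded"  # assumption: the CR-level condition wins
    if cr_phase == "Running":
        available = available_replicas or 0  # field is absent when zero
        return "Healthy" if available >= desired_replicas else "Workload down"
    return cr_phase  # assumption: other phases are shown unchanged
```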
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +49 to +60
import {
  Button,
  Chip,
  Stack,
  Tab,
  Tabs,
  TextField,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions,
} from "@mui/material";
The grafana-dashboard-configmap.yaml only wrapped grafana-dashboard-kars-fleet.json but not grafana-dashboard-kars-ops.json — even though both JSON files have lived in deploy/monitoring/ since May 27. Result: the Headlamp plugin's SandboxMetricsCard iframes a 'Dashboard not found' page (it targets uid=kars-ops). Regenerated the configmap YAML from both .json files so the grafana-dashboard sidecar picks up both on next kars dev run. No JSON content changed; just plumbing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`hermes gateway run --accept-hooks` in idle-daemon mode (no
Telegram/Slack/Discord channels configured) runs only the cron ticker.
It never imports the kars Hermes plugin, so the Phase A2.1 eager
MeshClient init at plugin load never fires. Result: a Hermes sandbox is
invisible on `kars_mesh_directory` listings until something else
triggers a plugin load (e.g. an interactive `hermes chat` invocation,
which spins up a short-lived process that registers + exits).
Adds a 5-line pre-warm in entrypoint.sh that runs
`_get_or_init_client()` in a short-lived background Python process at
boot. register_self is idempotent + restart-safe so re-runs are cheap.
Guarded on:
- SRE_ENABLED != true (SRE agents are intentionally off-mesh)
- KARS_MESH_PROVIDER == agt (only run when the mesh is actually wired)
Verified on kind: research sandbox now logs '[kars-hermes] mesh
pre-warm: registered' within ~2s of pod boot, and shows up on the AGT
registry's live-agents endpoint before any chat invocation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up to fcce016: the short-lived pre-warm Python process
registered on the relay then EXITED, taking the MeshClient socket with
it. Without a live connection there's no relay heartbeat, so the AGT
registry marks the agent stale after ~90s and discovery tools hide it
('Stale/offline filtered out').
Replaces the pre-warm with a long-lived 'kars-mesh-keepalive' process
that:
1. Calls _get_or_init_client() to register + connect (same eager path
   the plugin would take if loaded by the gateway)
2. Calls mesh_worker.start_worker() so the sandbox can REPLY to inbound
   mesh messages (not just appear in directory listings). This is the
   same auto-responder the controller wires into kars_spawn'd
   sub-agents via KARS_MESH_AUTO_RESPONDER=1
3. Parks on threading.Event().wait() forever so the MeshClient stays
   alive and keeps heartbeating
Verified on kind: research's keepalive log shows registered + connected
+ worker started; dev-agent's mesh discover can now see research.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
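The register, start worker, park sequence can be sketched as below. `register` and `start_worker` are hypothetical stand-ins for `_get_or_init_client()` and `mesh_worker.start_worker()`; the optional `stop` event is an addition for testability and is not described in the commit.

```python
import threading
from typing import Callable, Optional


def run_keepalive(register: Callable[[], object],
                  start_worker: Callable[[], None],
                  stop: Optional[threading.Event] = None) -> object:
    """Register on the mesh, start the reply worker, then park to keep the client alive."""
    stop = stop if stop is not None else threading.Event()
    client = register()   # holds the connection that carries the heartbeat
    start_worker()        # lets the sandbox answer inbound mesh messages
    stop.wait()           # blocks until set; never set in the long-lived case
    return client
```

Parking on an Event rather than a sleep loop keeps the process idle without periodic wake-ups, while the client object stays referenced for the lifetime of the process.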
Follow-up to 163e1de — the keepalive's mesh_worker.start_worker() was draining inbound messages and silently dropping them because the worker gates LLM replies behind KARS_MESH_AUTO_RESPONDER (mesh_worker.py:259). Couldn't set the env var via the KarsSandbox CR's extraEnv because the controller's reserved-prefix guard (reconciler/mod.rs:1820) strips any user-supplied KARS_* env. Set it inline on the keepalive's exec env instead — that's the only process that runs the worker, so a process-local env var is sufficient. After this fix: dev-agent → research mesh send now triggers an actual Hermes-generated reply via the auto-responder. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 500K cap (the InferencePolicy default when dailyTokens is unset)
exhausts trivially in a live demo: one 175K-context conversation
through a couple of turns already crosses it, after which the
inference-router throttles and the agent can't reply. The Headlamp
plugin's token-budget panel renders this as '100% used', looking like a
misconfiguration when it's actually intentional governance.
Sets explicit 2M for research (demo scenario) and sre (Helm template
default with a value-override path). Operators in production with
strict cost controls can override via:
  --set sre.dailyTokens=N
  edit the research scenario yaml inline
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
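A rough worked example of the budget arithmetic, under loudly simplified assumptions: every turn bills the full 175K context as prompt tokens, completion tokens and caching are ignored, and throttling starts once cumulative usage strictly exceeds the cap. `first_throttled_turn` is a hypothetical helper, not part of the inference-router.

```python
def first_throttled_turn(daily_cap: int, tokens_per_turn: int) -> int:
    """Number of the first turn whose cumulative usage exceeds the daily cap."""
    # After n turns usage is n * tokens_per_turn; find the smallest n above the cap.
    return daily_cap // tokens_per_turn + 1
```

Under these assumptions the 500K default is crossed on the third turn of a 175K-context conversation, while a 2M cap lasts until the twelfth.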
e4b4949 to 94cab91
Same false-Running problem as the SRE Cluster Health card (fixed in
5f1c2ee) affected the Overview's 'Ready' headline stat and the
Sandboxes list's Phase column. Both read KarsSandbox.status.phase,
which the controller sets to 'Running' the moment the Deployment spec
is reconciled, independent of whether the pods inside actually pulled
their image / passed readiness / etc.
Two visible bugs:
- Overview's 'Ready' stat counted 'phase === "Ready"' but the
  controller never sets that; it uses 'Running'. So 'Ready' always
  showed 0 even with all sandboxes healthy.
- Sandboxes Phase column showed 'Running' for a sandbox whose
  Deployment was at 0/1 available (ImagePullBackOff, OOMKilled, etc.),
  directly contradicting reality.
Fixes both by pulling Deployments alongside KarsSandbox and
cross-checking availableReplicas >= spec.replicas before declaring a
sandbox 'Healthy'. Overview headline stats are now:
  Healthy: CR Running AND workload available
  Workload down: CR Running BUT workload unavailable
  CR-Degraded: CR-level Degraded=True condition
Sandboxes list shows 'Workload down' (red StatusLabel) in the Phase
column when the underlying Deployment can't meet its replica count.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Slice 3 recovery observer declared an action 'Recovered' as soon
as there were no FailedCreate / BackOff / FailedScheduling events on
the target namespace in the last 30s. False positive on the canonical
DeleteResourceQuota path: deleting the quota silences new
FailedCreate events (no more ReplicaSet attempts), but the Deployment
can still sit at 0/1 because the ReplicaSet was scaled to 0 during
the failure cascade and no controller is going to scale it back up.
Result before this fix: action.phase=Recovered while the workload
was still down, directly contradicting what the operator sees in
Headlamp's plugin (the Sandboxes / Overview / Cluster Health cards
all show 'Workload down' for the same sandbox post-fix).
Tightens observe_recovery to require BOTH:
(1) absence of recent failure events on the target namespace
(existing gate), AND
(2) every Deployment in the target namespace at
availableReplicas >= spec.replicas
(the gate the doc comment promised for Slice 4)
The Deployments gate runs first because it's the more authoritative
signal — if pods aren't available, recovery hasn't happened
regardless of what the event log shows.
Verified live on kind: created a test KarsSREAction targeting a
broken research deployment; the action stayed at phase=Applied
through 3 reconcile passes (workload still down), then flipped to
Recovered on the next pass after the deployment came back to 1/1.
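The two-gate check can be sketched as follows. The actual observer is in the controller; this Python version only illustrates the ordering and the AND of both gates. `is_recovered` and the flat dict shape for Deployments are assumptions made for the example.

```python
def is_recovered(deployments: list, recent_failure_events: list) -> bool:
    """Recovered only if every Deployment is available AND no recent failure events exist."""
    # Gate 2 from the list above runs first: availability is the more authoritative signal.
    for dep in deployments:
        desired = dep.get("replicas", 1)           # spec.replicas defaults to 1 when unset
        available = dep.get("availableReplicas") or 0  # absent when zero
        if available < desired:
            return False
    # Gate 1: no FailedCreate / BackOff / FailedScheduling events in the window.
    return not recent_failure_events
```

In the DeleteResourceQuota case described above, the event list goes quiet as soon as the quota is gone, so only the availability gate keeps the action at Applied until the Deployment is back to 1/1.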
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oncile
The McpServer reconciler emitted a Warning event with
reason=LimitedSupport on every successful reconcile (~15s cycle),
repeating the same static 'singular spec.mcp binding today, plural
lands in Slice 4' text. Result: Headlamp's event view was permanently
polluted with the same advisory message for every McpServer CR,
drowning out actually-actionable events.
The information belongs in CRD descriptions and design docs, not in the
per-incident K8s Event stream. Removed the call site; kept a breadcrumb
comment pointing future readers at the right places to publish the
roadmap (mcpserver.spec CRD description + crd-well-oiled-machine
blueprint).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The full kars-sre/demo-and-agent series — slices 0 through 4 — landed in 20 commits on this branch. End-to-end demo loop now works through the WebUI and Telegram: incident detected → CR auto-created → operator approves → controller executes → recovery confirmed.
Slice ladder (per docs/blueprints/07-kars-sre-proposal.md §7.1)
- S0: tools/demo/act2/ (Agent A KarsSandbox + ResourceQuota break + reset + presenter runbook)
- S1: sre_describe_state, sre_diagnose, sre_explain_error, sre_propose_fix, sre_logs
- S2: sre_describe_resource, sre_what_changed, sre_endpoints_inspect, sre_image_probe, sre_top
- S3: kars sre approve/reject/show/actions
- S4: sre_watcher.py informer + Telegram channel adapter wiring + burst collapse + rate limit + terminal-CR reaper
What's in S3 (typed apply-fix)
State machine:
Proposed → Approved → Applied → Recovered (terminal), with Rejected / Expired / Failed terminal lanes.
On Approval, the controller:
- … ClusterRoleBinding kars-sre-write-<uid> scoped to the right writer ClusterRole (kars-sre-writer-quotas | kars-sre-writer-workloads).
- … FailedCreate / BackOff / FailedScheduling / Failed events (recovery observer) for up to 5 min.
Typed actions (closed set per §7.7.1):
- DeleteResourceQuota {namespace, name}: refuses quotas labelled kars.azure.com/managed-by=controller
- PatchDeploymentImage {namespace, name, container, image}
- ScaleDeployment {namespace, name, replicas}: clamped to [0, 50]
- RolloutRestart {namespace, kind, name}: Deployment / StatefulSet / DaemonSet
- DeletePod {namespace, name}
CLI:
Terminal CR reaper: any Recovered/Failed/Expired/Rejected CR older than 1h is GC'd by the reconciler.
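A sketch of the phase state machine and the reaper rule. The happy path and the set of terminal phases come from the text; which non-terminal phase may enter which terminal lane is an assumption here, as is measuring the 1h age from a `since` timestamp. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = frozenset({"Recovered", "Rejected", "Expired", "Failed"})

# Happy path per the text; the terminal-lane edges are assumptions.
TRANSITIONS = {
    "Proposed": {"Approved", "Rejected", "Expired"},
    "Approved": {"Applied", "Failed"},
    "Applied": {"Recovered", "Failed"},
}
REAP_AFTER = timedelta(hours=1)


def can_transition(current: str, target: str) -> bool:
    """Terminal phases have no outgoing edges."""
    return target in TRANSITIONS.get(current, set())


def reapable(phase: str, since: datetime, now: datetime) -> bool:
    """A terminal CR older than 1h is eligible for garbage collection."""
    return phase in TERMINAL and now - since > REAP_AFTER
```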
What's in S4 (proactive watcher + Telegram)
sre_watcher.py runs alongside the Hermes gateway when SRE_ENABLED=true and a channel is configured. Watches K8s events every 10s for failure-class reasons in kars-* namespaces, builds a typed-action target, and on each new incident:
- … (action_type, namespace, target_name) is already open (Proposed/Approved/Applied), reuse it instead of creating a duplicate. The previous demo showed 40+ identical CRs accumulating without this.
- … SRE_WATCHER_MAX_MSGS_PER_MIN).
- … kubectl delete karssreactions --all clears the dedupe naturally.
Telegram wiring uses the existing kars credentials mechanism, no new commands:
Plumbed via Secret kars-sre/sre-credentials (envFrom optional:true), exported as TELEGRAM_BOT_TOKEN + TELEGRAM_ALLOW_FROM env, then translated in entrypoint.sh to:
- channels.telegram.token, channels.telegram.enabled=true, channels.telegram.allowed_users (Hermes config)
- TELEGRAM_ALLOWED_USERS env (Hermes gateway pairing-skip)
- TELEGRAM_HOME_CHANNEL (default for hermes send --to telegram)
Channel adapter libraries (python-telegram-bot 21.x, slack-sdk 3.x, discord.py 2.x) are now pre-installed in the runtime image so credentials in the secret "just work", with no per-sandbox pip install.
Sandbox HTTPS proxy: entrypoint.sh now exports HTTPS_PROXY=http://127.0.0.1:8444 + NO_PROXY=$KUBERNETES_SERVICE_HOST,127.0.0.1,localhost,.svc.cluster.local so any standard-env-honouring HTTP client (httpx, python-telegram-bot, slack-sdk, requests, openai) routes outbound HTTPS through the inference-router's forward proxy, even on kind clusters where the egress-guard iptables transparent-redirect doesn't fire.
Demo loop (end-to-end)
RBAC additions (controller-side)
- karssreactions (full r/w)
- resourcequotas: delete (the §7.8.4 K8s privilege-escalation check requires the controller to hold the verbs it grants in the one-shot CRB)
- apps/statefulsets, daemonsets: patch (RolloutRestart targets)
- events: list/watch/get (recovery observer)
- serviceaccounts/token: create (lands the §7.8.4 TokenRequest path; currently uses controller SA for execution, structure ready for the hardening pass)
- clusterrolebindings: create/delete with resourceNames: ["kars-sre-write-*"]
RBAC additions (chart-shipped)
- sre-writer ServiceAccount in kars-sre (no token automount)
- kars-sre-writer-quotas / kars-sre-writer-workloads ClusterRoles
- kars-sre-action-author ClusterRole bound to the SRE sandbox SA (create karssreactions only; operator owns approval)
- kars:sre-approver ClusterRole (operator-facing; not pre-bound)
CI gates
- … Failed/Pending/Degraded literals (both phases and condition reasons)
- … crd-karssreaction.yaml
- … KarsSREActionSpec (action.type closed-set + approval.state enum + ttlMinutes range + rationale length + control-byte denylist)