From 49fe07dca719debface8963a70f87e402a7ebba5 Mon Sep 17 00:00:00 2001
From: Justin McLean
Date: Mon, 29 Jun 2026 16:50:07 +1000
Subject: [PATCH] update specs and implementation plan

---
 tools/spec-loop/IMPLEMENTATION_PLAN.md        | 479 ++++++++++++++----
 tools/spec-loop/specs/adapters.md             |  20 +
 tools/spec-loop/specs/adoption-and-setup.md   |  13 +
 .../specs/meta-and-quality-tooling.md         |  48 +-
 .../spec-loop/specs/organization-adapters.md  |  10 +
 tools/spec-loop/specs/pr-management-family.md |  10 +-
 tools/spec-loop/specs/privacy-llm-gate.md     |  12 +
 tools/spec-loop/specs/project-agnosticism.md  |  13 +
 .../specs/release-management-lifecycle.md     |  12 +
 9 files changed, 501 insertions(+), 116 deletions(-)

diff --git a/tools/spec-loop/IMPLEMENTATION_PLAN.md b/tools/spec-loop/IMPLEMENTATION_PLAN.md
index afad7032..77929679 100644
--- a/tools/spec-loop/IMPLEMENTATION_PLAN.md
+++ b/tools/spec-loop/IMPLEMENTATION_PLAN.md
@@ -30,9 +30,10 @@ one PR** (the branch-per-feature constraint).
   `pairing-multi-agent-review` (three independent axis passes; eval suites
   present); `docs/modes.md` Agentic Pairing row reflects 2 skills /
   `experimental`. Spec: [`specs/pairing-mode.md`](specs/pairing-mode.md).
-- **Agentic Mentoring — both skills shipped** — `pr-management-mentor` and
-  `good-first-issue-author` (eval suites present); `docs/modes.md`
-  Agentic Mentoring row reflects 2 skills / `experimental`.
+- **Agentic Mentoring — four skills shipped** — `pr-management-mentor`,
+  `good-first-issue-author`, `mentoring-welcome`, and
+  `contributor-to-committer` (eval suites present); `docs/modes.md`
+  Agentic Mentoring row reflects 4 skills / `experimental`.
   Spec: [`specs/mentoring-mode.md`](specs/mentoring-mode.md).
 - **Contributor skills** — `contributor-nomination`,
   `contributor-activity-sweep`, and `committer-onboarding` shipped with
@@ -53,43 +54,58 @@ one PR** (the branch-per-feature constraint).
   has `pyproject.toml`, `src/`, and a `tests/` directory with pytest
   coverage for the sandbox profiles and clean-env wrapper. Spec:
   [`specs/agent-isolation-sandbox.md`](specs/agent-isolation-sandbox.md).
-- **Eval coverage — complete** — 60 skill eval suites exist in
-  `tools/skill-evals/evals/`, covering all skills including the full
-  setup-family (setup, setup-isolated-setup-doctor,
+- **Eval coverage — complete** — every current `skills/*/SKILL.md` has a
+  matching eval suite in `tools/skill-evals/evals/`; the eval catalogue also
+  includes non-skill smoke suites such as `non-asf-profile-smoke`. Coverage
+  includes the full setup-family (setup, setup-isolated-setup-doctor,
   setup-isolated-setup-install, setup-isolated-setup-update,
   setup-isolated-setup-verify, setup-override-upstream,
   setup-shared-config-sync).
-- **Release-management — first four skills shipped** —
-  `release-vote-draft`, `release-announce-draft`, `release-vote-tally`,
-  and `release-verify-rc` landed with eval suites (formerly planned work
-  items 1–2 plus two follow-ups). Six `release-*` skills remain; see
+- **Release-management family complete** — all ten `release-*` skills landed
+  with eval suites; no release-management skill remains proposed. See
   [`specs/release-management-lifecycle.md`](specs/release-management-lifecycle.md).
 - **Agentic Triage — general-issue family filled out** — `issue-stale-sweep`,
   `issue-deduplicate`, and `issue-backlog-stats` shipped with eval suites
-  (formerly planned work item 3 plus its deferred siblings).
+  (formerly planned general-issue triage work plus its deferred siblings).
   Spec: [`specs/triage-mode.md`](specs/triage-mode.md).
-- **Agentic Mentoring — first-contribution welcome shipped** — `mentoring-welcome`
-  landed with an eval suite (formerly planned work item 4).
-  Spec: [`specs/mentoring-mode.md`](specs/mentoring-mode.md).
+- **Contributor-to-committer readiness shipped** — the mentoring-family
+  `contributor-to-committer` readiness tracker landed with an eval suite and
+  is documented in the contributor-growth and mentoring family docs.
+  Spec: [`specs/contributor-growth.md`](specs/contributor-growth.md).
 - **Project-agnosticism — ASF-coupling advisory lint shipped** — the SOFT
   ASF-coupling category landed in `tools/skill-and-tool-validator`
   (formerly planned work item 5), and `drafting-mode.md` Known Gaps is
   synced to the shipped drafting skills (formerly planned work item 6).
-- **Project-agnosticism — capability-flag vocabulary enumerated** — the
+- **Project-agnosticism — capability-flag vocabulary and wiring advanced** — the
   contributor/committer-intake (ICLA vs DCO), security-intake, and
   CVE-allocation option sets and defaults are enumerated as
   `projects/_template/committer-onboarding-config.md`,
   `security-intake-config.md`, and `cve-allocation-config.md`, following
-  the backend-flag precedent in `release-management-lifecycle.md`. Wiring
-  the skills to read these flags is tracked as work item 3.
+  the backend-flag precedent in `release-management-lifecycle.md`.
+  `security-issue-import`, `security-issue-sync`, and `committer-onboarding`
+  have begun reading those flags; remaining adopter-pilot feedback is tracked
+  in the specs, not as an immediate build item here.
   Spec: [`specs/project-agnosticism.md`](specs/project-agnosticism.md).
-- **Repo-health — three-skill family shipped** — `ci-runner-audit`,
-  `workflow-security-audit`, and `dependency-audit` landed (read-only,
-  `experimental`). Spec: [`specs/repo-health-family.md`](specs/repo-health-family.md).
-- **New proposed specs awaiting their first build item** —
-  [`specs/reviewer-routing.md`](specs/reviewer-routing.md) (Agentic Triage) and
-  [`specs/skill-reconciler.md`](specs/skill-reconciler.md) (infra) are
-  documented spec-first; their build items are below.
+- **Repo-health family complete** — `ci-runner-audit`,
+  `workflow-security-audit`, `dependency-audit`, `license-compliance-audit`,
+  and `flaky-test-triage` landed (read-only, `experimental`).
+  Spec: [`specs/repo-health-family.md`](specs/repo-health-family.md).
+- **Reviewer routing shipped** — `reviewer-routing` landed with an eval suite,
+  filling the first reviewer-routing spec build item. Remaining work is spec /
+  docs cleanup for the shipped state and later adopter-pilot feedback.
+  Spec: [`specs/reviewer-routing.md`](specs/reviewer-routing.md).
+- **Skill reconciler shipped** — `skill-reconciler` landed with an eval suite,
+  implementing the cross-project comparison workflow. Follow-on gaps are the
+  optional deterministic structural-diff helper and source-tag auto-pairing,
+  both deferred. Spec: [`specs/skill-reconciler.md`](specs/skill-reconciler.md).
+- **Project-agnosticism cleanup shipped** — high-confidence ASF-coupling
+  advisories, criteria-source advisories, and action-inventory advisories were
+  cleared from the relevant skills; organization metadata, governance
+  vocabulary, disclosure-governance flags, and source-control abstraction work
+  also landed. Spec: [`specs/project-agnosticism.md`](specs/project-agnosticism.md).
+- **Good-first-issue sweep implemented off main** — `origin/good-first-issue-sweep`
+  carries the `good-first-issue-sweep` skill and eval suite. It is tracked as
+  in-flight below until that PR lands on `main`.
 
 ---
 
@@ -100,15 +116,14 @@ Do not duplicate them.
 
 | Branch slug | PR | Description |
 |---|---|---|
-| `non-asf-profile-fixture` | open | Non-ASF adopter profile under `projects/_template/` + `non-asf-profile-smoke` eval (former work item 3) |
+| `good-first-issue-sweep` | open | `good-first-issue-sweep` skill + eval suite; keep out of the build queue until the PR lands or is explicitly abandoned. |
 
-The previous in-flight batch (spec-validator SPDX / path-existence /
-Known-gaps checks, the `spec-validate` pre-commit hook, the SOFT
-eval-coverage check, `pr-management-quick-merge`, the security-tracker
-dashboard pytest suite, the loop incremental-sync and CLI-UX changes, the
-markdownlint Node bump, the AGENTS.md slim, and the modes / mentoring /
-setup-status doc syncs) has all merged and is reflected in the code and
-in **What's been built** above.
+The previous in-flight batch (non-ASF adopter profile fixture,
+reviewer-routing, skill-reconciler, release-management completion,
+repo-health completion, high-confidence ASF-coupling cleanup, criteria-source /
+action-inventory advisory cleanup, organization metadata, governance vocabulary,
+and adapter-discovery docs) has merged to `main` and is reflected in
+**What's been built** above.
 
 ---
 
@@ -117,72 +132,337 @@ in **What's been built** above.
 Priority order. Each maps to one branch and one PR. Branch names are
 slugs, not numbers (numbering implies an order the specs don't carry).
 
-1. **First reviewer-routing skill: reviewer-routing.**
-   `specs/reviewer-routing.md` is `proposed` with zero implemented
-   skills, and review-cycle latency is one of the two priorities MISSION
-   names. Add an Agentic Triage-family skill `reviewer-routing` that takes an open
-   issue or PR and proposes a primary reviewer (and optional backup) from
-   the project's configured roster, scored on roster eligibility for the
-   touched area, git-history familiarity with the changed paths, and the
-   reviewer's current open-review load. Read-only / propose-then-confirm:
-   it never assigns or requests review. An unresolved roster yields an
-   explicit `NO ELIGIBLE REVIEWER` signal, never a fabricated handle.
-   Include an eval suite with an adversarial case asserting an injected
-   "assign to X" line in a PR body is ignored.
+1. **Sync shipped-state specs after the recent merge train.**
+   Several specs still carry pre-merge language even though the code has
+   shipped. Update `specs/reviewer-routing.md` and
+   `specs/skill-reconciler.md` so their **Where it lives** and **Known gaps**
+   sections describe the shipped skills instead of saying "proposed, not
+   implemented"; update `specs/overview.md` so reviewer routing and the
+   reconciler are listed as `experimental`; refresh
+   `specs/meta-and-quality-tooling.md`'s shipped-skill/eval count; and verify
+   `specs/project-agnosticism.md` / `specs/issue-management-family.md` no longer
+   advertise already-cleared gaps (high-confidence ASF-coupling backlog,
+   unwired governance-member terminology, missing issue-management rows in
+   `docs/modes.md`).
+   Validation:
+   ```bash
+   uv run --project tools/spec-status-index spec-status --ready
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   ```
+   Branch `spec-shipped-state-sync`.
+
+2. **Post-merge sync for good-first-issue-sweep.**
+   Once the `good-first-issue-sweep` PR lands on `main`, remove it from the
+   in-flight table and sync every shipped-state surface: flip
+   `specs/good-first-issue-sweep.md` from `proposed` to `experimental`,
+   update `specs/overview.md`, add the skill to `docs/modes.md` and the
+   mentoring / contributor-growth family docs, and update the eval-coverage
+   counts if they are still numeric. This item is intentionally blocked until
+   the PR lands; do not duplicate the branch implementation.
+   Validation:
+   ```bash
+   test -f skills/good-first-issue-sweep/SKILL.md
+   test -d tools/skill-evals/evals/good-first-issue-sweep
+   uv run --project tools/spec-status-index spec-status --ready
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   ```
+   Spec: [`specs/good-first-issue-sweep.md`](specs/good-first-issue-sweep.md).
+   Branch `good-first-issue-sweep-post-merge-sync`.
+
+3. **Clear the mechanical SOFT validator warnings.**
+   Handle the current non-judgement soft warnings that have obvious local
+   remedies: add the missing Privacy-LLM gate preflight to
+   `reviewer-routing`, add an explicit bounded `--limit` to the
+   `security-issue-import` `gh issue list` call, and replace the
+   `release-prepare` inline `--body "..."` usage with a `--body-file` flow.
+   Leave ASF-coupling warnings out of this item; those require human
+   judgement and are tracked separately below.
    Validation:
    ```bash
    uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
    uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/reviewer-routing/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/security-issue-import/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-prepare/
+   ```
+   Branch `mechanical-soft-warning-cleanup`.
+
+4. **Low-confidence ASF-coupling judgement pass.**
+   The high-confidence coupling backlog is clear, but the validator still
+   reports low-confidence `asf-coupling` warnings such as bare governance
+   terms (`PMC`) and contributor-intake terms (`ICLA`). Review each warning in
+   context and classify it as one of three outcomes: convert to a placeholder,
+   route through an existing capability flag, or explicitly keep as an
+   ASF-default example. The output should be a narrow set of skill/doc edits
+   plus a short note in `specs/project-agnosticism.md` explaining which
+   residual warnings are intentionally advisory.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/committer-onboarding/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/contributor-nomination/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-promote/
+   ```
+   Spec: [`specs/project-agnosticism.md`](specs/project-agnosticism.md).
+   Branch `low-confidence-asf-coupling-pass`.
+
+5. **Add an adopter-pilot feedback harness.**
+   Many experimental family specs now share the same real gap: no adopter has
+   run the full skill family end-to-end. Add a lightweight pilot-report
+   template and helper (or a documented `tools/` command if that better matches
+   existing tooling) that records the skill run, target repo/profile, blocked
+   preflights, false positives, confirmation points, privacy/adapter notes, and
+   proposed spec updates. Wire the template into the relevant experimental
+   family docs so pilot evidence is captured consistently without turning it
+   into a continuous monitor.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/spec-validator --group dev pytest
    ```
-   Spec: [`specs/reviewer-routing.md`](specs/reviewer-routing.md).
-   Branch `reviewer-routing`.
+   Spec: [`specs/meta-and-quality-tooling.md`](specs/meta-and-quality-tooling.md).
+   Branch `adopter-pilot-feedback-harness`.
 
-2. **Cross-project skill reconciler: skill-reconciler.**
-   `specs/skill-reconciler.md` is `proposed` with no implementation. Add
-   a meta/infra-family skill `skill-reconciler` that compares two
-   near-duplicate skills (two `source`-tagged copies, e.g. an ASF and a
-   non-ASF variant) and emits a structured diff plus a reconciliation
-   proposal, labelling every difference `ALLOWED`, `DRIFT`, or
-   `SAFETY-BASELINE`. Read-only: it proposes, it never rewrites either
-   skill (convergence is a separate confirmed `write-skill` /
-   `optimize-skill` edit). A safety-baseline divergence is always a
-   must-fix and never folded into allowed-divergence noise. First cut may
-   take two explicit paths rather than auto-pairing by `source` tag.
-   Include an eval suite with a case where the two copies diverge only on
-   the safety baseline and the reconciler must flag it.
+6. **Expand organization-adapter smoke coverage.**
+   The non-ASF profile smoke test proves one issue-management path. Extend
+   smoke coverage across at least three organization-sensitive surfaces:
+   security intake (`security-intake-config.md` / disclosure-governance
+   flags), release backend selection (`release-management-config.md`), and
+   contributor governance (`committer-onboarding-config.md`). The goal is not
+   new product behaviour; it is executable confidence that organization
+   defaults and project overrides work outside an ASF-shaped profile.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/non-asf-profile-smoke/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/security-issue-import/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-prepare/
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/committer-onboarding/
+   ```
+   Spec: [`specs/organization-adapters.md`](specs/organization-adapters.md).
+   Branch `organization-adapter-smoke-expansion`.
+
+7. **Add a dedicated pr-management-code-review eval suite.**
+   `specs/pr-management-family.md` still calls out that
+   `pr-management-code-review` lacks a dedicated eval suite. Add
+   `tools/skill-evals/evals/pr-management-code-review/` with focused cases for
+   selector resolution, review-risk classification, AI-generated-code signal
+   handling, prompt-injection-in-PR-content handling, and the final review
+   handoff. Keep the suite read-only: it should assert the review findings and
+   handoff shape, not require live GitHub writes.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/pr-management-code-review/
+   ```
+   Spec: [`specs/pr-management-family.md`](specs/pr-management-family.md).
+   Branch `pr-management-code-review-evals`.
+
+8. **Extract the skill-reconciler safety-baseline checklist.**
+   The shipped `skill-reconciler` recognizes safety-baseline divergence from
+   prose patterns. Extract the baseline clauses into one canonical checklist
+   file that both humans and tooling can reference: untrusted content is never
+   instructions, collaborator / identity-resolution caveats are preserved, and
+   confidentiality posture is not weakened. Update `skill-reconciler` to cite
+   that checklist and add eval coverage proving a divergence in any checklist
+   item is classified as `SAFETY-BASELINE`.
    Validation:
    ```bash
    uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
    uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/skill-reconciler/
    ```
    Spec: [`specs/skill-reconciler.md`](specs/skill-reconciler.md).
-   Branch `skill-reconciler`.
-
-3. **Clear the high-confidence ASF-coupling advisory backlog.**
-   The SOFT ASF-coupling lint (shipped, see **What's been built**) still
-   flags ~62 high-confidence couplings, almost all in the
-   release-management skills: hardcoded ASF dist-tree paths (`dist/dev/`,
-   `dist/release/`), `svn mv` / `svn commit` / `svn checkout`
-   distribution commands, and the literal `announce@apache.org` list. The
-   capability-flag vocabulary and `release-management-config.md`'s backend
-   flags (`release-dist-backend`, `release-announce-backend`) already
-   exist; this item wires the release skills (`release-rc-cut`,
-   `release-promote`, `release-archive-sweep`, `release-keys-sync`,
-   `release-prepare`, `release-verify-rc`, `release-vote-draft`,
-   `release-vote-tally`, `release-announce-draft`) plus
-   `security-issue-sync` to read those flags / use the ``
-   placeholder instead of hardcoding ASF specifics, regressing no
-   behaviour for the ASF default profile. Low-confidence advisories (bare
-   `PMC`, `ICLA`, `incubator`) are out of scope: the SOFT lint leaves
-   those to contributor self-judgement. Done when the validator reports
-   zero high-confidence asf-coupling warnings.
+   Branch `skill-reconciler-safety-baseline-checklist`.
+
+9. **Add adapter authoring smoke validation.**
+   Adapter discovery and authoring docs have landed; add a validator or smoke
+   fixture that checks each tool / adapter README declares the required authoring
+   fields: capability, prerequisites, privacy / credential handling, operations,
+   and config keys. Keep this as an advisory or narrowly scoped hard check based
+   on existing docs so legacy adapters can be brought into compliance
+   deliberately rather than through unrelated churn.
    Validation:
    ```bash
    uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
-   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-promote/
+   uv run --project tools/spec-validator --group dev pytest
+   ```
+   Spec: [`specs/adapters.md`](specs/adapters.md).
+   Branch `adapter-authoring-smoke-validation`.
+
+10. **Add docs/modes.md generated consistency checks.**
+   `docs/modes.md` is a high-traffic index, and recent work has repeatedly
+   needed manual count / skill-list syncs after new skills landed. Add a
+   validator check (or a small generated-consistency helper invoked by the
+   validator) that compares the mode tables against live `skills/*/SKILL.md`
+   frontmatter: each shipped skill appears in the expected mode section, status
+   counts match the frontmatter, and no removed skill remains listed. Keep the
+   first version focused on detection; rewriting the doc can remain a separate
+   human-confirmed update.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   ```
+   Spec: [`specs/meta-and-quality-tooling.md`](specs/meta-and-quality-tooling.md).
+   Branch `modes-doc-consistency-check`.
+
+11. **Normalize tool README prerequisites consistency.**
+   Tool README prerequisites are now part of the authoring contract, but older
+   tool docs may still vary in section shape and required credential / runtime
+   detail. Sweep `tools/*/README.md` for the Prerequisites section, normalize
+   the expected headings and wording where the existing tool behaviour is
+   clear, and tighten the validator only after the tree is brought into
+   compliance. Keep adapter-specific privacy / credential checks in the
+   adapter-authoring smoke item above; this item is the general README
+   prerequisite contract.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   ```
+   Spec: [`specs/meta-and-quality-tooling.md`](specs/meta-and-quality-tooling.md).
+   Branch `tool-readme-prerequisites-consistency`.
+
+12. **Tighten skill frontmatter schema validation.**
+   Strengthen the validator's frontmatter contract for `mode`, `status`,
+   `capability`, `organization`, and `source`: modes and statuses must be from
+   the documented vocabulary; organizations must exist under `organizations/`;
+   multi-capability skills must use a YAML list consistently; and every shipped
+   experimental skill must have a matching eval suite unless it is explicitly
+   exempted with a documented reason. Keep the first pass focused on fields the
+   current tree can satisfy after local cleanup.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   ```
+   Spec: [`specs/meta-and-quality-tooling.md`](specs/meta-and-quality-tooling.md).
+   Branch `skill-frontmatter-schema-tightening`.
+
+13. **Add project-template drift checks.**
+   Add a validator or smoke tool that compares `projects/_template/` with
+   `projects/non-asf-example/` for structural drift: required config files are
+   present, documented keys exist in both profiles when applicable, template-only
+   keys are either copied or intentionally explained, and organization-inherited
+   defaults do not hide missing adopter-required values. The check should catch
+   stale template docs without forcing the example to mirror ASF-specific values.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/non-asf-profile-smoke/
    ```
    Spec: [`specs/project-agnosticism.md`](specs/project-agnosticism.md).
-   Branch `asf-coupling-cleanup`.
+   Branch `project-template-drift-check`.
+
+14. **Add override-file contract tests.**
+   Document and test the `.apache-magpie-overrides/.md` contract: override
+   files are additive project guidance, agent-readable Markdown, and never a
+   replacement for the framework safety / confidentiality baseline. Add a
+   validator or smoke fixture that flags override text attempting to weaken the
+   baseline and confirms a clean override can be discovered and surfaced to a
+   skill without editing the upstream skill body.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   ```
+   Spec: [`specs/adoption-and-setup.md`](specs/adoption-and-setup.md).
+   Branch `override-file-contract-tests`.
+
+15. **Add capability taxonomy coverage checks.**
+   Validate that every `capability` declared in skill frontmatter and tool
+   READMEs is documented in `docs/labels-and-capabilities.md`, and that every
+   capability in the taxonomy maps to at least one skill/tool or is explicitly
+   marked reserved / future. The check should catch misspellings and stale
+   taxonomy rows without requiring every capability to have both a skill and a
+   tool implementation.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   ```
+   Spec: [`specs/meta-and-quality-tooling.md`](specs/meta-and-quality-tooling.md).
+   Branch `capability-taxonomy-coverage-check`.
+
+16. **Define the release audit report schema.**
+   `release-audit-report` exists, but downstream review would benefit from a
+   structured audit-record schema. Add a template/schema for the required audit
+   fields (release version, RC artefacts, vote thread, tally outcome, promotion
+   revision, announcement URL, archive state, and any follow-up notes), update
+   the skill to reference it, and add eval fixtures that reject incomplete audit
+   records while preserving the human-reviewed nature of the report.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-audit-report/
+   ```
+   Spec: [`specs/release-management-lifecycle.md`](specs/release-management-lifecycle.md).
+   Branch `release-audit-report-schema`.
+
+17. **Add mail-adapter privacy-boundary tests.**
+   Add smoke tests or validator fixtures for Gmail, PonyMail, `mail-archive`,
+   and any `mail-source` adapter path proving private mail content is redacted,
+   summarized, or routed through the Privacy-LLM gate before it enters
+   model-facing skill context. The test should treat fetched mail as external
+   data and include at least one prompt-injection-in-email fixture to preserve
+   the repository's data-not-instructions rule.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/spec-validator --group dev pytest
+   ```
+   Spec: [`specs/adapters.md`](specs/adapters.md).
+   Branch `mail-adapter-privacy-boundary-tests`.
+
+18. **Add branch-name confidentiality validation.**
+   Add a validator check or deterministic helper that scans generated branch
+   name examples in skills/docs and rejects embargo-breaking terms: CVE IDs,
+   `security`, `vulnerability`, `advisory`, and tracker-private title fragments.
+   Align the check with the existing security-fix workflow guidance so public
+   branch names stay neutral before disclosure.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-and-tool-validator --group dev pytest
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/security-issue-fix/
+   ```
+   Spec: [`specs/privacy-llm-gate.md`](specs/privacy-llm-gate.md).
+   Branch `branch-name-confidentiality-validation`.
+
+19. **Add the deterministic structural-diff helper for skill-reconciler.**
+   The shipped `skill-reconciler` reasons over a prose comparison report.
+   Add the optional `tools/` helper sketched in
+   `specs/skill-reconciler.md`: parse two skill trees into a normalized
+   structural diff (frontmatter, section headings, step inventory,
+   placeholder inventory, and linked support files) so the skill can ground
+   `ALLOWED` / `DRIFT` / `SAFETY-BASELINE` decisions in a deterministic
+   object. Keep the reconciler read-only; the helper emits data only.
+   Include unit tests for frontmatter-only, section-order, placeholder, and
+   support-file divergences, plus one safety-baseline fixture that proves the
+   helper preserves the clauses the skill must classify.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/skill-reconciler/
+   uv run --project tools/skill-reconciler-diff --group dev pytest
+   ```
+   Spec: [`specs/skill-reconciler.md`](specs/skill-reconciler.md).
+   Branch `skill-reconciler-structural-diff`.
+
+20. **Add source-tag auto-pairing to skill-reconciler.**
+   The first implementation takes two explicit paths. Extend the skill so
+   a maintainer can ask it to discover near-duplicate skills by `source`
+   tag / capability metadata and present a bounded candidate pair list
+   before running the comparison. Preserve explicit-path mode as the
+   default and require confirmation before comparing any discovered pair,
+   so the skill remains read-only and predictable.
+   Validation:
+   ```bash
+   uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-validate
+   uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/skill-reconciler/
+   ```
+   Spec: [`specs/skill-reconciler.md`](specs/skill-reconciler.md).
+   Branch `skill-reconciler-source-pairing`.
 
 ---
 
@@ -197,36 +477,23 @@ slugs, not numbers (numbering implies an order the specs don't carry).
   it would skip the proof MISSION requires.
 - When a build iteration creates a new skill, its eval suite is part of
   that same work item — not a separate one.
-- **Release-management family:** the first four skills (`release-vote-draft`,
-  `release-announce-draft`, `release-vote-tally`, `release-verify-rc`)
-  have shipped and are recorded in **What's been built**. The remaining
-  six (`release-prepare`, `release-keys-sync`, `release-rc-cut`,
-  `release-promote`, `release-archive-sweep`, `release-audit-report`)
-  should be planned in subsequent passes now that the first four have
-  established the skill-authoring patterns for this family.
+- **Release-management family:** all ten skills have shipped and are recorded
+  in **What's been built**. Further release-management work should come from
+  adopter-pilot evidence or newly accepted specs, not from the old "remaining
+  six skills" queue.
 - **Agentic Triage contributor-growth gaps** (PMC-member nomination,
   emeritus-committer handling, contributor offboarding) noted in
   `triage-mode.md` Known Gaps are intentionally deferred: they are vague
   enough that a spec-RFC conversation is more appropriate than a direct
   build item.
-- **Project-agnosticism:** the ASF-coupling advisory lint has shipped
-  (recorded in **What's been built**); the non-ASF adopter profile
-  fixture is in flight (PR open, see **In-flight**). The capability-flag
-  vocabulary for contributor/committer intake (ICLA vs DCO), security
-  intake, and CVE allocation has been enumerated and shipped to main
-  (recorded in **What's been built**), following the backend-flag
-  precedent set by `release-management-lifecycle.md` (distribution /
-  approval / announcement backends). The remaining follow-on is wiring
-  the skills to read those flags (work item 3) — an engineering task, not
-  a spec-authoring one. The SOFT ASF-coupling lint still reports ~62
-  high-confidence couplings (hardcoded `dist/` paths, `svn` commands,
-  `announce@apache.org`), almost all in the release-management skills,
-  which is the measurable backlog work item 3 clears.
+- **Project-agnosticism:** the ASF-coupling advisory lint, the non-ASF adopter
+  profile fixture, the capability-flag vocabulary, and the high-confidence
+  coupling cleanup have shipped. Remaining low-confidence advisories (for
+  example bare governance terms that may be legitimate ASF defaults) stay
+  human-judgement items unless a future spec turns them into a hard rule.
 - **General-issue dedupe and backlog dashboard** (`triage-mode.md` Known
   Gaps) have shipped (`issue-deduplicate`, `issue-backlog-stats`) alongside
   `issue-stale-sweep`; see **What's been built**. No longer planned items.
-- **Repo-health family** has shipped its first three members
-  (`ci-runner-audit`, `workflow-security-audit`, `dependency-audit`) under
-  its own [`specs/repo-health-family.md`](specs/repo-health-family.md);
-  remaining candidates (license / NOTICE compliance, flaky-test detection)
-  are deferred to a subsequent pass.
+- **Repo-health family** has shipped all five designed members under
+  [`specs/repo-health-family.md`](specs/repo-health-family.md). No additional
+  repo-health skill is planned until adopter-pilot runs produce a concrete gap.
diff --git a/tools/spec-loop/specs/adapters.md b/tools/spec-loop/specs/adapters.md index 744e5e41..36e589ae 100644 --- a/tools/spec-loop/specs/adapters.md +++ b/tools/spec-loop/specs/adapters.md @@ -57,6 +57,15 @@ by swapping the adapter, not the skill. ([privacy-llm-gate.md](privacy-llm-gate.md)) before any LLM read. - **Write-back is confirm-before-apply** and routed through the sandbox's `ask` gate ([agent-isolation-sandbox.md](agent-isolation-sandbox.md)). +- **Adapter READMEs are contracts.** Every adapter README declares the + capability it provides, prerequisites, credential/privacy handling, + supported operations, and adopter config keys. These fields let a + validator distinguish an intentional adapter surface from undocumented + shell prose. +- **Private mail is hostile input.** Gmail, PonyMail, `mail-archive`, and + `mail-source` content is external data, never instructions. Tests for + mail adapters should include prompt-injection text in fetched mail and + prove it is carried as report data only after redaction/gating. ## Out of scope @@ -69,6 +78,10 @@ by swapping the adapter, not the skill. adapter + placeholder. 2. Mail adapters draft only and redact before LLM read. 3. Each adapter ships with its own tests. +4. Adapter READMEs declare capability, prerequisites, + privacy/credential handling, operations, and config keys. +5. Mail-adapter tests prove private fetched content crosses the + Privacy-LLM/redaction boundary before model-facing skill context. ## Validation @@ -86,3 +99,10 @@ done ASF coupling across the catalogue, and the capability-flag mechanism for workflow branches that no adapter resolves, live in [project-agnosticism.md](project-agnosticism.md). +- **Adapter authoring smoke validation is missing.** The docs define the + expected README contract, but no validator currently checks that each + adapter declares capability, prerequisites, privacy/credential handling, + operations, and config keys. 
+- **Mail-adapter privacy tests are thin.** The redaction contract exists, + but adapter-level fixtures should prove that private mail and embedded + prompt-injection attempts do not enter model-facing context untreated. diff --git a/tools/spec-loop/specs/adoption-and-setup.md b/tools/spec-loop/specs/adoption-and-setup.md index 2a27ad9e..615ee649 100644 --- a/tools/spec-loop/specs/adoption-and-setup.md +++ b/tools/spec-loop/specs/adoption-and-setup.md @@ -61,6 +61,12 @@ gitignored skill symlinks, and committed agent-readable override files. - **Overrides are agent-readable Markdown** under `.apache-magpie-overrides/`, consulted at runtime and merged before default behaviour ([pairing/correctability is the model]). +- **Overrides are additive, never authority inversion.** An override may + supply adopter-specific process details, paths, labels, or wording, but + it must not replace or weaken the framework's safety, confidentiality, + privacy, or external-content-as-data baseline. If an override conflicts + with those baseline rules, the framework rule wins and the conflict is + surfaced. ## Out of scope @@ -73,6 +79,9 @@ gitignored skill symlinks, and committed agent-readable override files. 1. Adoption commits only the bootstrap skill + lock/override scaffold. 2. The committed lock re-installs the same version on a fresh clone. 3. Drift between local and committed locks is surfaced with an upgrade. +4. Override files can be discovered and surfaced to skills without + editing upstream skill bodies, and override text cannot weaken the + safety/confidentiality baseline. ## Validation @@ -86,3 +95,7 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid - `stable`; gaps appear as new agent targets to add to the registry ([`agents.md`](../../../skills/setup/agents.md)) or new override surfaces — recorded by the plan pass. 
+- **Override-file contract tests are missing.** The docs describe + agentic overrides, but no smoke fixture proves that clean overrides are + additive or that an override attempting to relax safety/confidentiality + rules is flagged rather than applied. diff --git a/tools/spec-loop/specs/meta-and-quality-tooling.md b/tools/spec-loop/specs/meta-and-quality-tooling.md index 5d1fb604..e18a9fb7 100644 --- a/tools/spec-loop/specs/meta-and-quality-tooling.md +++ b/tools/spec-loop/specs/meta-and-quality-tooling.md @@ -64,6 +64,18 @@ trustworthy as it grows. heuristic/text tools with no model calls — reproducible in CI. - **Hard vs soft rules.** The validator fails on missing frontmatter or broken links; advisories are warnings unless `--strict`. +- **Schema-backed metadata.** Skill frontmatter, tool README capability + declarations, and family/index docs are treated as machine-checkable + contracts. New checks should prefer clear enum/list validation over + prose inference when the repository already declares the vocabulary. +- **Generated-index consistency.** Human-facing catalogue pages such as + `docs/modes.md` may stay hand-written, but validator checks should + compare their skill lists and counts against live `skills/*/SKILL.md` + frontmatter so documentation drift is visible before review. +- **Pilot evidence is structured.** Experimental-family pilot reports + should capture the same minimal fields every time: skill/family, + target repo/profile, blocked preflights, false positives, confirmation + points, privacy/adapter notes, and proposed spec changes. ## Out of scope @@ -76,6 +88,15 @@ trustworthy as it grows. 1. `skill-and-tool-validate` enforces required frontmatter + link integrity. 2. `list-skills` generates its index from live frontmatter. 3. Each meta tool ships with its own tests. +4. 
Frontmatter values for `mode`, `status`, `capability`, + `organization`, and `source` are validated against documented + vocabularies; unknown organizations fail unless the organization + exists under `organizations/`. +5. Capabilities declared in skill frontmatter and tool READMEs are + present in `docs/labels-and-capabilities.md`; taxonomy entries with no + implementation are explicitly marked reserved or future. +6. `docs/modes.md` skill lists and shipped counts are checked against + live skill frontmatter. ## Validation @@ -86,10 +107,23 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid ## Known gaps -- **Eval coverage is complete.** All 44 shipped skills have a matching - suite in `tools/skill-evals/evals/`; the soft eval-coverage check in - `skill-and-tool-validator` (check #8) warns when a newly added skill - has no suite, keeping coverage complete going forward. -- Other gaps appear as new quality checks worth adding — recorded by the - plan pass. The spec validator (analogous to the skill validator) and - the ASF-coupling advisory lint are two recent additions to this surface. +- **Eval coverage is complete.** Every shipped skill has a matching suite + in `tools/skill-evals/evals/`; the soft eval-coverage check in + `skill-and-tool-validator` warns when a newly added skill has no suite, + keeping coverage complete going forward. +- **Frontmatter validation is still shallow.** Current validation covers + required fields, but the next pass should make `mode`, `status`, + `capability`, `organization`, and `source` combinations explicit and + test-backed. +- **Capability taxonomy drift is not yet checked.** The validator should + catch misspelled or undocumented capability values, and should surface + taxonomy rows that no skill/tool implements unless they are marked + reserved. 
+- **`docs/modes.md` is manually synced.** The plan tracks a generated + consistency check so mode tables and shipped counts cannot silently + drift from skill frontmatter. +- **Tool README prerequisites vary.** A prerequisites consistency pass + should normalize older tool READMEs before tightening the validator. +- **Pilot evidence has no common shape.** Experimental-family specs all + need adopter evidence, but there is no standard pilot-report template + or helper yet. diff --git a/tools/spec-loop/specs/organization-adapters.md b/tools/spec-loop/specs/organization-adapters.md index 4b6db859..841b8acb 100644 --- a/tools/spec-loop/specs/organization-adapters.md +++ b/tools/spec-loop/specs/organization-adapters.md @@ -73,6 +73,10 @@ identical "ASF default" values in its own `project.md`. - The `organizations/ASF/organization.md` keys mirror the namespaces of the project manifest's *Security workflow configuration* section so resolution is mechanical. +- Organization smoke coverage should exercise more than one family. At + minimum, a non-ASF profile must be able to drive security intake + backend selection, release backend selection, and contributor-governance + defaults without editing skill bodies. ## Out of scope @@ -94,6 +98,8 @@ identical "ASF default" values in its own `project.md`. first hit wins; no skill branches on the organization. - A new organization can be authored from `organizations/_template/` with no skill edits. +- Smoke fixtures cover security intake, release backend, and contributor + governance defaults for at least one non-ASF profile. ## Validation @@ -121,3 +127,7 @@ resolves to the baseline. convention). - The family-level `organization:` scope (replacing `asf: true/false`) and the external-adapter discovery index are separate follow-ups. +- **Smoke coverage is narrow.** The non-ASF profile smoke currently proves + one path. 
The next coverage pass should exercise security intake, + release backend selection, and contributor governance so organization + defaults are tested across the surfaces most likely to drift. diff --git a/tools/spec-loop/specs/pr-management-family.md b/tools/spec-loop/specs/pr-management-family.md index 4a4f6d3d..a6278909 100644 --- a/tools/spec-loop/specs/pr-management-family.md +++ b/tools/spec-loop/specs/pr-management-family.md @@ -137,6 +137,9 @@ is listed here for navigability since its domain is PR threads. 4. `pr-management-stats` emits read-only tables without mutating any tracker or PR state. 5. All family skills pass `skill-and-tool-validate` with no errors. +6. `pr-management-code-review` has a dedicated eval suite covering + selector resolution, review-risk classification, AI-generated-code + signals, prompt injection in PR content, and the final review handoff. ## Validation @@ -161,9 +164,10 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid code-review skill ships without a matching suite in `tools/skill-evals/evals/pr-management-code-review/`; the SOFT eval-coverage check in `skill-and-tool-validator` flags this. Adding a - step-level fixture set (at minimum an adversarial prompt-injection case - and a typical APPROVE output) is the next concrete quality improvement - for this family. + step-level fixture set is the next concrete quality improvement for + this family. Minimum coverage: selector resolution, review-risk + classification, AI-generated-code signal handling, prompt injection in + PR body/comments, and a typical APPROVE / REQUEST_CHANGES handoff. 
- **Stale-PR handling is built into `pr-management-triage`.** Dedicated stale sweeps (`stale-draft`, `inactive-open`, `stale-review-ping`) run as Step 5 of the triage flow and can be invoked standalone via diff --git a/tools/spec-loop/specs/privacy-llm-gate.md b/tools/spec-loop/specs/privacy-llm-gate.md index 35a9436c..be984bf8 100644 --- a/tools/spec-loop/specs/privacy-llm-gate.md +++ b/tools/spec-loop/specs/privacy-llm-gate.md @@ -57,6 +57,12 @@ artefact for leakage before emission. and any project-declared private string; failures stop the flow. - **Audit log is privacy-aware** — references hashed identifiers, never raw PII. +- **Public branch names are public artefacts.** Generated branch names, + commit-message examples, PR-body templates, changelog snippets, and + release-note text must avoid embargo-breaking security terms before + disclosure. In particular, pre-disclosure public branch names must not + contain CVE IDs, `security`, `vulnerability`, `advisory`, or + tracker-private title fragments. ## Out of scope @@ -71,6 +77,8 @@ artefact for leakage before emission. map local (0600, gitignored). 3. The scrub catches CVE IDs / reporter names / list addresses before any public write. +4. Generated public branch-name examples are scrubbed for CVE IDs and + embargoed security framing before use. ## Validation @@ -82,3 +90,7 @@ uv run --project tools/privacy-llm --group dev pytest - `stable`; gaps surface as new PII patterns or new public-emission surfaces not yet covered by the scrub — caught as drift by the plan pass. +- **Branch-name confidentiality validation is missing.** Security-fix + workflows already require neutral branch names, but no deterministic + check scans skill/docs examples for CVE IDs or embargoed terms in + generated branch names. 
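The branch-name confidentiality gap noted in `privacy-llm-gate.md` could be closed by a deterministic check along the following lines. This is a minimal sketch, not the shipped scrub: the function name `branch_name_findings` and the constants are hypothetical, the term list is taken from the spec text (CVE IDs, `security`, `vulnerability`, `advisory`), and tracker-private title fragments are deliberately left out because they need project-declared input.

```python
import re

# Hypothetical pattern set: the spec names CVE IDs and these three
# terms as forbidden in pre-disclosure public branch names.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
EMBARGOED_TERMS = ("security", "vulnerability", "advisory")


def branch_name_findings(branch: str) -> list[str]:
    """Return the reasons a branch name would break embargo; empty means clean."""
    findings = []
    if CVE_ID.search(branch):
        findings.append("contains a CVE ID")
    lowered = branch.lower()
    for term in EMBARGOED_TERMS:
        # Plain substring match: deliberately strict, so it may over-report.
        if term in lowered:
            findings.append(f"contains embargoed term '{term}'")
    return findings
```

A validator could run this over every branch-name example found in skill and docs text and treat any non-empty result as a failure. Substring matching is the conservative choice here: a false positive costs a rename, while a false negative can break an embargo.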
diff --git a/tools/spec-loop/specs/project-agnosticism.md b/tools/spec-loop/specs/project-agnosticism.md index 95b6e914..269dd2bb 100644 --- a/tools/spec-loop/specs/project-agnosticism.md +++ b/tools/spec-loop/specs/project-agnosticism.md @@ -108,6 +108,11 @@ The three mechanisms, in order of preference: - **Advisory, not paternalistic.** The audit surfaces candidate coupling for a maintainer to judge; some ASF strings are legitimate (examples, the ASF default profile, ASF-specific docs). It does not auto-rewrite. +- **Template and example profiles stay comparable.** `projects/_template/` + is the adopter contract; `projects/non-asf-example/` is the proof that + a non-ASF adopter can satisfy that contract. Required files and config + keys should be structurally comparable, with omissions explained rather + than silently drifting. ## Out of scope @@ -129,6 +134,9 @@ The three mechanisms, in order of preference: `` flag, not on skill edits. 3. The ASF profile runs the catalogue unchanged (default-valued flags), and a non-ASF profile can be declared without editing any skill body. +4. The template profile and non-ASF example expose the same required + config surfaces, except where the example documents an intentional + omission or an organization-inherited default. ## Validation @@ -173,3 +181,8 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid catalogue (bare `PMC`, `ICLA`, `announce@apache.org`) is surfaced by the advisory lint (check #10 in `skill-and-tool-validator`) for human judgement. +- **Template/profile drift is not mechanically checked.** The non-ASF + example is now a real smoke fixture, but no validator compares its file + and key surface against `projects/_template/`. A drift check should + catch missing required files, stale documented keys, and hidden + organization-default assumptions. 
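The template/profile drift check described in `project-agnosticism.md` might start from a file-surface comparison like the one below. It is a sketch under stated assumptions: `relative_files`, `profile_drift`, and the `allowed_omissions` parameter are hypothetical names, and only the file surface is compared. Comparing config keys would additionally need a parser for the profile format, which the spec does not define.

```python
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Collect every file under root as a POSIX-style relative path."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def profile_drift(
    template: Path,
    example: Path,
    allowed_omissions: frozenset[str] = frozenset(),
) -> dict[str, list[str]]:
    """Compare the example profile's file surface against the template's.

    allowed_omissions is a hypothetical escape hatch for files the example
    intentionally leaves out (for instance an organization-inherited default).
    """
    template_files = relative_files(template)
    example_files = relative_files(example)
    return {
        "missing_in_example": sorted(template_files - example_files - allowed_omissions),
        "extra_in_example": sorted(example_files - template_files),
    }
```

Called as `profile_drift(Path("projects/_template"), Path("projects/non-asf-example"))`, a non-empty `missing_in_example` list would be the drift signal. Keeping intentional omissions in an explicit allow-list matches the spec's requirement that omissions be explained rather than silently drifting.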
diff --git a/tools/spec-loop/specs/release-management-lifecycle.md b/tools/spec-loop/specs/release-management-lifecycle.md index 4d6b1002..ce18efdd 100644 --- a/tools/spec-loop/specs/release-management-lifecycle.md +++ b/tools/spec-loop/specs/release-management-lifecycle.md @@ -119,6 +119,11 @@ code lands. state-changing lane, requires evidence from Release Managers and binding voters that the process is healthier (fewer stalled RCs, shorter time-to-`[ANNOUNCE]`, fewer reverted promotions). +- **Audit records are structured.** `release-audit-report` output should + follow a schema/template with required fields for release version, RC + artefacts, vote thread, tally outcome, promotion revision, announcement + URL, archive state, and follow-up notes. Missing required fields are a + report finding, not silently omitted prose. ## Out of scope @@ -139,6 +144,9 @@ code lands. 3. No skill in the family signs, imports, promotes, sends, or merges on autopilot; the key-holding and publishing steps emit paste-ready recipes only. +4. `release-audit-report` validates the audit record against the + required-field schema and flags incomplete lifecycle evidence before + proposing an audit-log PR. ## Validation @@ -171,3 +179,7 @@ uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-an cut a full release through the family yet, so the RM/binding-voter evidence window that would justify default-on or a state-changing lane has no data behind it. +- **Release audit record schema is prose-only.** The audit-report skill + exists, but there is no structured schema/template that downstream + review can validate. The plan tracks a schema and eval fixtures for + incomplete records.
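The required-field check proposed for `release-audit-report` could look roughly like this. The field keys below are hypothetical snake_case renderings of the fields the spec lists (release version, RC artefacts, vote thread, tally outcome, promotion revision, announcement URL, archive state, follow-up notes); the actual schema is not yet defined, so this is an illustration of the "missing fields are a finding" rule rather than the skill's implementation.

```python
# Hypothetical key names derived from the field list in the spec text.
REQUIRED_FIELDS = (
    "release_version",
    "rc_artefacts",
    "vote_thread",
    "tally_outcome",
    "promotion_revision",
    "announcement_url",
    "archive_state",
    "follow_up_notes",
)


def audit_record_findings(record: dict[str, object]) -> list[str]:
    """Return one finding per required field that is absent or empty."""
    findings = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Treat None, empty string, and empty list alike as "not provided".
        if value is None or value == "" or value == []:
            findings.append(f"missing required field: {field}")
    return findings
```

Returning findings instead of raising keeps the check consistent with the spec's wording that an incomplete record is reported, not silently dropped or aborted. An eval fixture for incomplete records could assert on the exact list returned.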