apache · justinmclean · Jun 29, 2026 · Jun 29, 2026
diff --git a/tools/spec-loop/IMPLEMENTATION_PLAN.md b/tools/spec-loop/IMPLEMENTATION_PLAN.md
diff --git a/tools/spec-loop/specs/adapters.md b/tools/spec-loop/specs/adapters.md
@@ -57,6 +57,15 @@ by swapping the adapter, not the skill.
   ([privacy-llm-gate.md](privacy-llm-gate.md)) before any LLM read.
 - **Write-back is confirm-before-apply** and routed through the sandbox's
   `ask` gate ([agent-isolation-sandbox.md](agent-isolation-sandbox.md)).
+- **Adapter READMEs are contracts.** Every adapter README declares the
+  capability it provides, prerequisites, credential/privacy handling,
+  supported operations, and adopter config keys. These fields let a
+  validator distinguish an intentional adapter surface from undocumented
+  shell prose.
+- **Private mail is hostile input.** Gmail, PonyMail, `mail-archive`, and
+  `mail-source` content is external data, never instructions. Tests for
+  mail adapters should include prompt-injection text in fetched mail and
+  prove it is carried only as report data, and only after redaction/gating.
 
 ## Out of scope
 
@@ -69,6 +78,10 @@ by swapping the adapter, not the skill.
    adapter + placeholder.
 2. Mail adapters draft only and redact before LLM read.
 3. Each adapter ships with its own tests.
+4. Adapter READMEs declare capability, prerequisites,
+   privacy/credential handling, operations, and config keys.
+5. Mail-adapter tests prove private fetched content crosses the
+   Privacy-LLM/redaction boundary before reaching model-facing skill context.
 
 ## Validation
 
@@ -86,3 +99,10 @@ done
   ASF coupling across the catalogue, and the capability-flag mechanism for
   workflow branches that no adapter resolves, live in
   [project-agnosticism.md](project-agnosticism.md).
+- **Adapter authoring smoke validation is missing.** The docs define the
+  expected README contract, but no validator currently checks that each
+  adapter declares capability, prerequisites, privacy/credential handling,
+  operations, and config keys.
+- **Mail-adapter privacy tests are thin.** The redaction contract exists,
+  but adapter-level fixtures should prove that private mail and embedded
+  prompt-injection attempts do not enter model-facing context untreated.
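
Review note: the missing README contract check could look roughly like the sketch below. It is only an illustration under stated assumptions: the spec names the five contract fields but not how a README spells them, so the heading keywords in `REQUIRED_FIELDS` and the function name `missing_readme_fields` are hypothetical.

```python
import re

# Hypothetical keywords: the spec lists the contract fields, not the exact
# heading text an adapter README must use for them.
REQUIRED_FIELDS = (
    "capability",
    "prerequisites",
    "privacy",      # credential/privacy handling
    "operations",
    "config keys",
)

# Matches ATX Markdown headings such as "## Prerequisites".
HEADING = re.compile(r"^#{1,6}[ \t]+(.*)$", re.MULTILINE)


def missing_readme_fields(readme_text: str) -> list[str]:
    """Return the contract fields that no README heading mentions."""
    headings = [h.strip().lower() for h in HEADING.findall(readme_text)]
    return [
        field
        for field in REQUIRED_FIELDS
        if not any(field in heading for heading in headings)
    ]
```

An empty result means every field has a heading. A non-empty result names the fields a validator would report, as a warning or as a failure under `--strict`.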
diff --git a/tools/spec-loop/specs/adoption-and-setup.md b/tools/spec-loop/specs/adoption-and-setup.md
@@ -61,6 +61,12 @@ gitignored skill symlinks, and committed agent-readable override files.
 - **Overrides are agent-readable Markdown** under
   `.apache-magpie-overrides/`, consulted at runtime and merged before
   default behaviour ([pairing/correctability is the model]).
+- **Overrides are additive, never an authority inversion.** An override may
+  supply adopter-specific process details, paths, labels, or wording, but
+  it must not replace or weaken the framework's safety, confidentiality,
+  privacy, or external-content-as-data baseline. If an override conflicts
+  with those baseline rules, the framework rule wins and the conflict is
+  surfaced.
 
 ## Out of scope
 
@@ -73,6 +79,9 @@ gitignored skill symlinks, and committed agent-readable override files.
 1. Adoption commits only the bootstrap skill + lock/override scaffold.
 2. The committed lock re-installs the same version on a fresh clone.
 3. Drift between local and committed locks is surfaced with an upgrade.
+4. Override files can be discovered and surfaced to skills without
+   editing upstream skill bodies, and override text cannot weaken the
+   safety/confidentiality baseline.
 
 ## Validation
 
@@ -86,3 +95,7 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
 - `stable`; gaps appear as new agent targets to add to the registry
   ([`agents.md`](../../../skills/setup/agents.md)) or new override
   surfaces — recorded by the plan pass.
+- **Override-file contract tests are missing.** The docs describe
+  agent-readable overrides, but no smoke fixture proves that clean
+  overrides are additive or that an override attempting to relax
+  safety/confidentiality rules is flagged rather than applied.
diff --git a/tools/spec-loop/specs/meta-and-quality-tooling.md b/tools/spec-loop/specs/meta-and-quality-tooling.md
@@ -64,6 +64,18 @@ trustworthy as it grows.
   heuristic/text tools with no model calls — reproducible in CI.
 - **Hard vs soft rules.** The validator fails on missing frontmatter or
   broken links; advisories are warnings unless `--strict`.
+- **Schema-backed metadata.** Skill frontmatter, tool README capability
+  declarations, and family/index docs are treated as machine-checkable
+  contracts. New checks should prefer clear enum/list validation over
+  prose inference when the repository already declares the vocabulary.
+- **Generated-index consistency.** Human-facing catalogue pages such as
+  `docs/modes.md` may stay hand-written, but validator checks should
+  compare their skill lists and counts against live `skills/*/SKILL.md`
+  frontmatter so documentation drift is visible before review.
+- **Pilot evidence is structured.** Experimental-family pilot reports
+  should capture the same minimal fields every time: skill/family,
+  target repo/profile, blocked preflights, false positives, confirmation
+  points, privacy/adapter notes, and proposed spec changes.
 
 ## Out of scope
 
@@ -76,6 +88,15 @@ trustworthy as it grows.
 1. `skill-and-tool-validate` enforces required frontmatter + link integrity.
 2. `list-skills` generates its index from live frontmatter.
 3. Each meta tool ships with its own tests.
+4. Frontmatter values for `mode`, `status`, `capability`,
+   `organization`, and `source` are validated against documented
+   vocabularies; unknown organizations fail unless the organization
+   exists under `organizations/`.
+5. Capabilities declared in skill frontmatter and tool READMEs are
+   present in `docs/labels-and-capabilities.md`; taxonomy entries with no
+   implementation are explicitly marked reserved or future.
+6. `docs/modes.md` skill lists and shipped counts are checked against
+   live skill frontmatter.
 
 ## Validation
 
@@ -86,10 +107,23 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
 
 ## Known gaps
 
-- **Eval coverage is complete.** All 44 shipped skills have a matching
-  suite in `tools/skill-evals/evals/`; the soft eval-coverage check in
-  `skill-and-tool-validator` (check #8) warns when a newly added skill
-  has no suite, keeping coverage complete going forward.
-- Other gaps appear as new quality checks worth adding — recorded by the
-  plan pass. The spec validator (analogous to the skill validator) and
-  the ASF-coupling advisory lint are two recent additions to this surface.
+- **Eval coverage is nearly complete.** Shipped skills have a matching
+  suite in `tools/skill-evals/evals/`, except `pr-management-code-review`
+  (see pr-management-family.md); the soft eval-coverage check in
+  `skill-and-tool-validator` warns when a skill has no suite.
+- **Frontmatter validation is still shallow.** Current validation covers
+  required fields, but the next pass should make `mode`, `status`,
+  `capability`, `organization`, and `source` combinations explicit and
+  test-backed.
+- **Capability taxonomy drift is not yet checked.** The validator should
+  catch misspelled or undocumented capability values, and should surface
+  taxonomy rows that no skill/tool implements unless they are marked
+  reserved.
+- **`docs/modes.md` is manually synced.** The plan tracks an automated
+  consistency check so mode tables and shipped counts cannot silently
+  drift from skill frontmatter.
+- **Tool README prerequisites vary.** A prerequisites consistency pass
+  should normalize older tool READMEs before tightening the validator.
+- **Pilot evidence has no common shape.** Experimental-family specs all
+  need adopter evidence, but there is no standard pilot-report template
+  or helper yet.
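
Review note: the enum-style frontmatter validation described in this file might be sketched as follows. The vocabularies in `VOCAB` are placeholders (the real ones are whatever the repository documents), the parser handles only flat `key: value` frontmatter, and both function names are hypothetical.

```python
# Placeholder vocabularies; the documented ones in the repository are
# authoritative and may differ from these values.
VOCAB = {
    "mode": {"triage", "drafting"},
    "status": {"stable", "experimental"},
}


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def vocabulary_errors(text: str, vocab: dict[str, set[str]] = VOCAB) -> list[str]:
    """Report declared values that fall outside the known vocabulary."""
    fields = parse_frontmatter(text)
    return [
        f"{key}: unknown value {fields[key]!r}"
        for key in sorted(vocab)
        if key in fields and fields[key] not in vocab[key]
    ]
```

Checking against a declared set rather than inferring intent from prose keeps the result deterministic, which matches the no-model-calls rule for meta tools.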
diff --git a/tools/spec-loop/specs/organization-adapters.md b/tools/spec-loop/specs/organization-adapters.md
@@ -73,6 +73,10 @@ identical "ASF default" values in its own `project.md`.
 - The `organizations/ASF/organization.md` keys mirror the namespaces of
   the project manifest's *Security workflow configuration* section so
   resolution is mechanical.
+- Organization smoke coverage should exercise more than one family. At
+  minimum, a non-ASF profile must be able to drive security intake
+  backend selection, release backend selection, and contributor-governance
+  defaults without editing skill bodies.
 
 ## Out of scope
 
@@ -94,6 +98,8 @@ identical "ASF default" values in its own `project.md`.
   first hit wins; no skill branches on the organization.
 - A new organization can be authored from `organizations/_template/` with
   no skill edits.
+- Smoke fixtures cover security intake, release backend, and contributor
+  governance defaults for at least one non-ASF profile.
 
 ## Validation
 
@@ -121,3 +127,7 @@ resolves to the baseline.
   convention).
 - The family-level `organization:` scope (replacing `asf: true/false`)
   and the external-adapter discovery index are separate follow-ups.
+- **Smoke coverage is narrow.** The non-ASF profile smoke test currently
+  proves one path. The next coverage pass should exercise security
+  intake, release backend selection, and contributor governance so
+  organization defaults are tested across the surfaces most likely to drift.
diff --git a/tools/spec-loop/specs/pr-management-family.md b/tools/spec-loop/specs/pr-management-family.md
@@ -137,6 +137,9 @@ is listed here for navigability since its domain is PR threads.
 4. `pr-management-stats` emits read-only tables without mutating any
    tracker or PR state.
 5. All family skills pass `skill-and-tool-validate` with no errors.
+6. `pr-management-code-review` has a dedicated eval suite covering
+   selector resolution, review-risk classification, AI-generated-code
+   signals, prompt injection in PR content, and the final review handoff.
 
 ## Validation
 
@@ -161,9 +164,10 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
   code-review skill ships without a matching suite in
   `tools/skill-evals/evals/pr-management-code-review/`; the SOFT
   eval-coverage check in `skill-and-tool-validator` flags this. Adding a
-  step-level fixture set (at minimum an adversarial prompt-injection case
-  and a typical APPROVE output) is the next concrete quality improvement
-  for this family.
+  step-level fixture set is the next concrete quality improvement for
+  this family. Minimum coverage: selector resolution, review-risk
+  classification, AI-generated-code signal handling, prompt injection in
+  PR body/comments, and a typical APPROVE / REQUEST_CHANGES handoff.
 - **Stale-PR handling is built into `pr-management-triage`.** Dedicated
   stale sweeps (`stale-draft`, `inactive-open`, `stale-review-ping`) run
   as Step 5 of the triage flow and can be invoked standalone via

diff --git a/tools/spec-loop/specs/privacy-llm-gate.md b/tools/spec-loop/specs/privacy-llm-gate.md
@@ -57,6 +57,12 @@ artefact for leakage before emission.
   and any project-declared private string; failures stop the flow.
 - **Audit log is privacy-aware** — references hashed identifiers, never
   raw PII.
+- **Public branch names are public artefacts.** Generated branch names,
+  commit-message examples, PR-body templates, changelog snippets, and
+  release-note text must avoid embargo-breaking security terms before
+  disclosure. In particular, pre-disclosure public branch names must not
+  contain CVE IDs, `security`, `vulnerability`, `advisory`, or
+  tracker-private title fragments.
 
 ## Out of scope
 
@@ -71,6 +77,8 @@ artefact for leakage before emission.
    map local (0600, gitignored).
 3. The scrub catches CVE IDs / reporter names / list addresses before any
    public write.
+4. Generated public branch-name examples are scrubbed for CVE IDs and
+   embargoed security framing before use.
 
 ## Validation
 
@@ -82,3 +90,7 @@ uv run --project tools/privacy-llm --group dev pytest
 
 - `stable`; gaps surface as new PII patterns or new public-emission
   surfaces not yet covered by the scrub — caught as drift by the plan pass.
+- **Branch-name confidentiality validation is missing.** Security-fix
+  workflows already require neutral branch names, but no deterministic
+  check scans skill/docs examples for CVE IDs or embargoed terms in
+  generated branch names.
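
Review note: a deterministic branch-name check for the rule above could be as small as the sketch below. The embargoed terms come from the spec text. The slug normalisation, the `private_fragments` parameter, and the function names are assumptions made for illustration.

```python
import re

# CVE identifiers after slug normalisation, e.g. "cve-2024-12345".
CVE_SLUG = re.compile(r"cve-\d{4}-\d{4,}")
# Terms the spec says must not appear in pre-disclosure branch names.
EMBARGOED_TERMS = ("security", "vulnerability", "advisory")


def slugify(text: str) -> str:
    """Lower-case the text and collapse non-alphanumeric runs to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def branch_name_findings(branch: str, private_fragments=()) -> list[str]:
    """Return the reasons a branch name is unsafe before disclosure."""
    slug = slugify(branch)
    findings = []
    if CVE_SLUG.search(slug):
        findings.append("cve-id")
    findings.extend(term for term in EMBARGOED_TERMS if term in slug)
    for fragment in private_fragments:
        fragment_slug = slugify(fragment)
        # Skip fragments that normalise to nothing; they would match everything.
        if fragment_slug and fragment_slug in slug:
            findings.append(f"private-fragment:{fragment_slug}")
    return findings
```

Normalising first means `CVE_2024_12345` and `cve-2024-12345` are caught by the same pattern, and a tracker title such as `Token Leak` matches a branch named `fix/token-leak`.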
diff --git a/tools/spec-loop/specs/project-agnosticism.md b/tools/spec-loop/specs/project-agnosticism.md
@@ -108,6 +108,11 @@ The three mechanisms, in order of preference:
 - **Advisory, not paternalistic.** The audit surfaces candidate coupling
   for a maintainer to judge; some ASF strings are legitimate (examples,
   the ASF default profile, ASF-specific docs). It does not auto-rewrite.
+- **Template and example profiles stay comparable.** `projects/_template/`
+  is the adopter contract; `projects/non-asf-example/` is the proof that
+  a non-ASF adopter can satisfy that contract. Required files and config
+  keys should be structurally comparable, with omissions explained
+  rather than left to drift silently.
 
 ## Out of scope
 
@@ -129,6 +134,9 @@ The three mechanisms, in order of preference:
    `<project-config>` flag, not on skill edits.
 3. The ASF profile runs the catalogue unchanged (default-valued flags),
    and a non-ASF profile can be declared without editing any skill body.
+4. The template profile and non-ASF example expose the same required
+   config surfaces, except where the example documents an intentional
+   omission or an organization-inherited default.
 
 ## Validation
 
@@ -173,3 +181,8 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
   catalogue (bare `PMC`, `ICLA`, `announce@apache.org`) is surfaced by the
   advisory lint (check #10 in `skill-and-tool-validator`) for human
   judgement.
+- **Template/profile drift is not mechanically checked.** The non-ASF
+  example is now a real smoke fixture, but no validator compares its file
+  and key surface against `projects/_template/`. A drift check should
+  catch missing required files, stale documented keys, and hidden
+  organization-default assumptions.
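
Review note: the file-surface half of the drift check could start from a plain directory comparison like the sketch below. It compares file names only; comparing config keys would need a parser for the profile files. The `allowed_omissions` parameter is a hypothetical way to record intentional, documented omissions.

```python
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Collect every file under root as a POSIX-style relative path."""
    return {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }


def profile_drift(template: Path, example: Path, allowed_omissions=frozenset()) -> dict:
    """Compare the example profile's file surface against the template."""
    template_files = relative_files(template)
    example_files = relative_files(example)
    return {
        # Template files the example lacks and has not documented as omitted.
        "missing": sorted(template_files - example_files - set(allowed_omissions)),
        # Example files the template does not declare.
        "extra": sorted(example_files - template_files),
    }
```

Called with `projects/_template/` and `projects/non-asf-example/`, any `missing` entry is either a gap to fix or an omission to document explicitly.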
diff --git a/tools/spec-loop/specs/release-management-lifecycle.md b/tools/spec-loop/specs/release-management-lifecycle.md
@@ -119,6 +119,11 @@ code lands.
   state-changing lane, requires evidence from Release Managers and binding
   voters that the process is healthier (fewer stalled RCs, shorter
   time-to-`[ANNOUNCE]`, fewer reverted promotions).
+- **Audit records are structured.** `release-audit-report` output should
+  follow a schema/template with required fields for release version, RC
+  artefacts, vote thread, tally outcome, promotion revision, announcement
+  URL, archive state, and follow-up notes. A missing required field is a
+  report finding, never a silent omission.
 
 ## Out of scope
 
@@ -139,6 +144,9 @@ code lands.
 3. No skill in the family signs, imports, promotes, sends, or merges on
    autopilot; the key-holding and publishing steps emit paste-ready
    recipes only.
+4. `release-audit-report` validates the audit record against the
+   required-field schema and flags incomplete lifecycle evidence before
+   proposing an audit-log PR.
 
 ## Validation
 
@@ -171,3 +179,7 @@ uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-an
   cut a full release through the family yet, so the RM/binding-voter
   evidence window that would justify default-on or a state-changing lane
   has no data behind it.
+- **Release audit record schema is prose-only.** The audit-report skill
+  exists, but there is no structured schema/template that downstream
+  review can validate against. The plan tracks a schema and eval
+  fixtures for incomplete records.
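
Review note: the required-field schema for audit records might begin as the sketch below. The field list mirrors the design bullet above, but the snake_case key names and the `audit_findings` helper are hypothetical; the spec does not fix a record format.

```python
# Hypothetical key names derived from the required fields listed in the spec.
REQUIRED_FIELDS = (
    "release_version",
    "rc_artefacts",
    "vote_thread",
    "tally_outcome",
    "promotion_revision",
    "announcement_url",
    "archive_state",
    "follow_up_notes",
)


def is_blank(value) -> bool:
    """Treat None and empty strings or containers as missing evidence."""
    return value is None or (isinstance(value, (str, list, dict)) and not value)


def audit_findings(record: dict) -> list[str]:
    """Return one finding per required field that is absent or blank."""
    return [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if is_blank(record.get(name))
    ]
```

Because blank values count as missing, a record with no follow-ups would need to say so explicitly (for example `"none"`), which keeps an omission distinguishable from an oversight.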