Merged
479 changes: 373 additions & 106 deletions tools/spec-loop/IMPLEMENTATION_PLAN.md

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions tools/spec-loop/specs/adapters.md
@@ -57,6 +57,15 @@ by swapping the adapter, not the skill.
([privacy-llm-gate.md](privacy-llm-gate.md)) before any LLM read.
- **Write-back is confirm-before-apply** and routed through the sandbox's
`ask` gate ([agent-isolation-sandbox.md](agent-isolation-sandbox.md)).
- **Adapter READMEs are contracts.** Every adapter README declares the
capability it provides, prerequisites, credential/privacy handling,
supported operations, and adopter config keys. These fields let a
validator distinguish an intentional adapter surface from undocumented
shell prose.
- **Private mail is hostile input.** Gmail, PonyMail, `mail-archive`, and
`mail-source` content is external data, never instructions. Tests for
mail adapters should include prompt-injection text in fetched mail and
prove it is carried as report data only after redaction/gating.
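
A minimal sketch of the README-contract check this decision implies. The heading names in `REQUIRED_SECTIONS` and the function name are assumptions for illustration; the spec lists the fields but does not fix the heading text or the validator API.

```python
import re

# Hypothetical heading names: the spec names the fields, not the exact headings.
REQUIRED_SECTIONS = (
    "Capability",
    "Prerequisites",
    "Credential and privacy handling",
    "Supported operations",
    "Config keys",
)

# Matches ATX-style Markdown headings ("# Title" through "###### Title").
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_readme_sections(readme_text: str) -> list[str]:
    """Return the required section names that have no heading in the README."""
    headings = {m.group(1).lower() for m in HEADING.finditer(readme_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

A validator built along these lines could treat a non-empty result as a hard failure, or as a warning until older READMEs are normalized.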

## Out of scope

@@ -69,6 +78,10 @@ by swapping the adapter, not the skill.
adapter + placeholder.
2. Mail adapters draft only and redact before LLM read.
3. Each adapter ships with its own tests.
4. Adapter READMEs declare capability, prerequisites,
privacy/credential handling, operations, and config keys.
5. Mail-adapter tests prove private fetched content crosses the
Privacy-LLM/redaction boundary before model-facing skill context.

## Validation

@@ -86,3 +99,10 @@ done
ASF coupling across the catalogue, and the capability-flag mechanism for
workflow branches that no adapter resolves, live in
[project-agnosticism.md](project-agnosticism.md).
- **Adapter authoring smoke validation is missing.** The docs define the
expected README contract, but no validator currently checks that each
adapter declares capability, prerequisites, privacy/credential handling,
operations, and config keys.
- **Mail-adapter privacy tests are thin.** The redaction contract exists,
but adapter-level fixtures should prove that private mail and embedded
prompt-injection attempts do not enter model-facing context untreated.
13 changes: 13 additions & 0 deletions tools/spec-loop/specs/adoption-and-setup.md
@@ -61,6 +61,12 @@ gitignored skill symlinks, and committed agent-readable override files.
- **Overrides are agent-readable Markdown** under
`.apache-magpie-overrides/`, consulted at runtime and merged before
default behaviour ([pairing/correctability is the model]).
- **Overrides are additive, never authority inversion.** An override may
supply adopter-specific process details, paths, labels, or wording, but
it must not replace or weaken the framework's safety, confidentiality,
privacy, or external-content-as-data baseline. If an override conflicts
with those baseline rules, the framework rule wins and the conflict is
surfaced.
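
The "framework rule wins and the conflict is surfaced" behaviour can be sketched as a merge with a protected key set. This is an illustration only: real overrides are Markdown files, so the flat dictionaries, the key names in `PROTECTED_KEYS`, and the function name are all assumptions.

```python
# Hypothetical key names for the baseline that overrides may not weaken.
PROTECTED_KEYS = frozenset(
    {"safety", "confidentiality", "privacy", "external_content_as_data"}
)


def apply_overrides(
    baseline: dict[str, str], overrides: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Merge adopter overrides into the baseline.

    Protected keys always keep the framework value; any override that tries
    to change one is skipped and reported in the returned conflict list.
    """
    merged = dict(baseline)
    conflicts: list[str] = []
    for key, value in overrides.items():
        if key in PROTECTED_KEYS and baseline.get(key) != value:
            conflicts.append(key)  # framework rule wins; surface the conflict
            continue
        merged[key] = value
    return merged, conflicts
```

Returning the conflicts rather than raising keeps the additive parts of an override usable while still making the rejected part visible to the maintainer.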

## Out of scope

@@ -73,6 +79,9 @@ gitignored skill symlinks, and committed agent-readable override files.
1. Adoption commits only the bootstrap skill + lock/override scaffold.
2. The committed lock re-installs the same version on a fresh clone.
3. Drift between local and committed locks is surfaced with an upgrade.
4. Override files can be discovered and surfaced to skills without
editing upstream skill bodies, and override text cannot weaken the
safety/confidentiality baseline.

## Validation

@@ -86,3 +95,7 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
- `stable`; gaps appear as new agent targets to add to the registry
([`agents.md`](../../../skills/setup/agents.md)) or new override
surfaces — recorded by the plan pass.
- **Override-file contract tests are missing.** The docs describe
agentic overrides, but no smoke fixture proves that clean overrides are
additive or that an override attempting to relax safety/confidentiality
rules is flagged rather than applied.
48 changes: 41 additions & 7 deletions tools/spec-loop/specs/meta-and-quality-tooling.md
@@ -64,6 +64,18 @@ trustworthy as it grows.
heuristic/text tools with no model calls — reproducible in CI.
- **Hard vs soft rules.** The validator fails on missing frontmatter or
broken links; advisories are warnings unless `--strict`.
- **Schema-backed metadata.** Skill frontmatter, tool README capability
declarations, and family/index docs are treated as machine-checkable
contracts. New checks should prefer clear enum/list validation over
prose inference when the repository already declares the vocabulary.
- **Generated-index consistency.** Human-facing catalogue pages such as
`docs/modes.md` may stay hand-written, but validator checks should
compare their skill lists and counts against live `skills/*/SKILL.md`
frontmatter so documentation drift is visible before review.
- **Pilot evidence is structured.** Experimental-family pilot reports
should capture the same minimal fields every time: skill/family,
target repo/profile, blocked preflights, false positives, confirmation
points, privacy/adapter notes, and proposed spec changes.

## Out of scope

@@ -76,6 +88,15 @@ trustworthy as it grows.
1. `skill-and-tool-validate` enforces required frontmatter + link integrity.
2. `list-skills` generates its index from live frontmatter.
3. Each meta tool ships with its own tests.
4. Frontmatter values for `mode`, `status`, `capability`,
`organization`, and `source` are validated against documented
vocabularies; unknown organizations fail unless the organization
exists under `organizations/`.
5. Capabilities declared in skill frontmatter and tool READMEs are
present in `docs/labels-and-capabilities.md`; taxonomy entries with no
implementation are explicitly marked reserved or future.
6. `docs/modes.md` skill lists and shipped counts are checked against
live skill frontmatter.
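
Criterion 4 could be checked roughly as below. This is a sketch under stated assumptions: the vocabularies shown are placeholders (only `stable` and `experimental` appear in these specs), `known_orgs` stands in for the directory names found under `organizations/`, and the function name is hypothetical.

```python
# Placeholder vocabularies; the real lists are whatever the repository documents.
VOCABULARIES = {
    "mode": {"triage", "review"},
    "status": {"stable", "experimental"},
}


def frontmatter_errors(frontmatter: dict[str, str], known_orgs: set[str]) -> list[str]:
    """Return one error string per frontmatter value outside its vocabulary."""
    errors = []
    for field, allowed in VOCABULARIES.items():
        value = frontmatter.get(field)
        if value not in allowed:
            errors.append(f"{field}: unknown value {value!r}")
    # An organization is valid only if a matching directory exists.
    org = frontmatter.get("organization")
    if org is not None and org not in known_orgs:
        errors.append(f"organization: {org!r} not found under organizations/")
    return errors
```

Validating against explicit sets rather than inferring from prose matches the "schema-backed metadata" decision: a misspelled value fails by name instead of passing silently.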

## Validation

@@ -86,10 +107,23 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid

## Known gaps

-- **Eval coverage is complete.** All 44 shipped skills have a matching
-suite in `tools/skill-evals/evals/`; the soft eval-coverage check in
-`skill-and-tool-validator` (check #8) warns when a newly added skill
-has no suite, keeping coverage complete going forward.
-- Other gaps appear as new quality checks worth adding — recorded by the
-plan pass. The spec validator (analogous to the skill validator) and
-the ASF-coupling advisory lint are two recent additions to this surface.
+- **Eval coverage is complete.** Every shipped skill has a matching suite
+in `tools/skill-evals/evals/`; the soft eval-coverage check in
+`skill-and-tool-validator` warns when a newly added skill has no suite,
+keeping coverage complete going forward.
+- **Frontmatter validation is still shallow.** Current validation covers
+required fields, but the next pass should make `mode`, `status`,
+`capability`, `organization`, and `source` combinations explicit and
+test-backed.
+- **Capability taxonomy drift is not yet checked.** The validator should
+catch misspelled or undocumented capability values, and should surface
+taxonomy rows that no skill/tool implements unless they are marked
+reserved.
+- **`docs/modes.md` is manually synced.** The plan tracks a generated
+consistency check so mode tables and shipped counts cannot silently
+drift from skill frontmatter.
+- **Tool README prerequisites vary.** A prerequisites consistency pass
+should normalize older tool READMEs before tightening the validator.
+- **Pilot evidence has no common shape.** Experimental-family specs all
+need adopter evidence, but there is no standard pilot-report template
+or helper yet.
10 changes: 10 additions & 0 deletions tools/spec-loop/specs/organization-adapters.md
@@ -73,6 +73,10 @@ identical "ASF default" values in its own `project.md`.
- The `organizations/ASF/organization.md` keys mirror the namespaces of
the project manifest's *Security workflow configuration* section so
resolution is mechanical.
- Organization smoke coverage should exercise more than one family. At
minimum, a non-ASF profile must be able to drive security intake
backend selection, release backend selection, and contributor-governance
defaults without editing skill bodies.

## Out of scope

@@ -94,6 +98,8 @@ identical "ASF default" values in its own `project.md`.
first hit wins; no skill branches on the organization.
- A new organization can be authored from `organizations/_template/` with
no skill edits.
- Smoke fixtures cover security intake, release backend, and contributor
governance defaults for at least one non-ASF profile.

## Validation

@@ -121,3 +127,7 @@ resolves to the baseline.
convention).
- The family-level `organization:` scope (replacing `asf: true/false`)
and the external-adapter discovery index are separate follow-ups.
- **Smoke coverage is narrow.** The non-ASF profile smoke currently proves
one path. The next coverage pass should exercise security intake,
release backend selection, and contributor governance so organization
defaults are tested across the surfaces most likely to drift.
10 changes: 7 additions & 3 deletions tools/spec-loop/specs/pr-management-family.md
@@ -137,6 +137,9 @@ is listed here for navigability since its domain is PR threads.
4. `pr-management-stats` emits read-only tables without mutating any
tracker or PR state.
5. All family skills pass `skill-and-tool-validate` with no errors.
6. `pr-management-code-review` has a dedicated eval suite covering
selector resolution, review-risk classification, AI-generated-code
signals, prompt injection in PR content, and the final review handoff.

## Validation

@@ -161,9 +164,10 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
code-review skill ships without a matching suite in
`tools/skill-evals/evals/pr-management-code-review/`; the SOFT
eval-coverage check in `skill-and-tool-validator` flags this. Adding a
-step-level fixture set (at minimum an adversarial prompt-injection case
-and a typical APPROVE output) is the next concrete quality improvement
-for this family.
+step-level fixture set is the next concrete quality improvement for
+this family. Minimum coverage: selector resolution, review-risk
+classification, AI-generated-code signal handling, prompt injection in
+PR body/comments, and a typical APPROVE / REQUEST_CHANGES handoff.
- **Stale-PR handling is built into `pr-management-triage`.** Dedicated
stale sweeps (`stale-draft`, `inactive-open`, `stale-review-ping`) run
as Step 5 of the triage flow and can be invoked standalone via
12 changes: 12 additions & 0 deletions tools/spec-loop/specs/privacy-llm-gate.md
@@ -57,6 +57,12 @@ artefact for leakage before emission.
and any project-declared private string; failures stop the flow.
- **Audit log is privacy-aware** — references hashed identifiers, never
raw PII.
- **Public branch names are public artefacts.** Generated branch names,
commit-message examples, PR-body templates, changelog snippets, and
release-note text must avoid embargo-breaking security terms before
disclosure. In particular, pre-disclosure public branch names must not
contain CVE IDs, `security`, `vulnerability`, `advisory`, or
tracker-private title fragments.
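
The branch-name rule lends itself to a deterministic check. The sketch below assumes plain substring matching and a caller-supplied list of tracker-private fragments; the function name and return shape are hypothetical, not the `privacy-llm` tool's API.

```python
import re

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
# The terms the spec forbids in pre-disclosure public branch names.
EMBARGOED_TERMS = ("security", "vulnerability", "advisory")


def branch_name_findings(
    branch: str, private_fragments: tuple[str, ...] = ()
) -> list[str]:
    """Return the reasons a branch name would break the embargo, if any."""
    lowered = branch.lower()
    findings = []
    if CVE_ID.search(branch):
        findings.append("CVE ID")
    findings.extend(term for term in EMBARGOED_TERMS if term in lowered)
    findings.extend(
        f"private fragment: {fragment}"
        for fragment in private_fragments
        if fragment and fragment.lower() in lowered
    )
    return findings
```

For example, `branch_name_findings("fix/header-parsing")` returns an empty list, while a name containing a CVE ID or the word `security` is reported. Substring matching is deliberately strict: it will also flag names such as `security-headers-docs`, which a maintainer would then judge by hand.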

## Out of scope

@@ -71,6 +77,8 @@ artefact for leakage before emission.
map local (0600, gitignored).
3. The scrub catches CVE IDs / reporter names / list addresses before any
public write.
4. Generated public branch-name examples are scrubbed for CVE IDs and
embargoed security framing before use.

## Validation

@@ -82,3 +90,7 @@ uv run --project tools/privacy-llm --group dev pytest

- `stable`; gaps surface as new PII patterns or new public-emission
surfaces not yet covered by the scrub — caught as drift by the plan pass.
- **Branch-name confidentiality validation is missing.** Security-fix
workflows already require neutral branch names, but no deterministic
check scans skill/docs examples for CVE IDs or embargoed terms in
generated branch names.
13 changes: 13 additions & 0 deletions tools/spec-loop/specs/project-agnosticism.md
@@ -108,6 +108,11 @@ The three mechanisms, in order of preference:
- **Advisory, not paternalistic.** The audit surfaces candidate coupling
for a maintainer to judge; some ASF strings are legitimate (examples,
the ASF default profile, ASF-specific docs). It does not auto-rewrite.
- **Template and example profiles stay comparable.** `projects/_template/`
is the adopter contract; `projects/non-asf-example/` is the proof that
a non-ASF adopter can satisfy that contract. Required files and config
keys should be structurally comparable, with omissions explained rather
than silently drifting.

## Out of scope

@@ -129,6 +134,9 @@ The three mechanisms, in order of preference:
`<project-config>` flag, not on skill edits.
3. The ASF profile runs the catalogue unchanged (default-valued flags),
and a non-ASF profile can be declared without editing any skill body.
4. The template profile and non-ASF example expose the same required
config surfaces, except where the example documents an intentional
omission or an organization-inherited default.
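
One way to read criterion 4 is as a set comparison between the template's required surface and the example's, with documented omissions subtracted. The sketch below assumes the file and key names have already been collected into sets; how they are extracted from `projects/_template/` and `projects/non-asf-example/` is left open, and the function and result keys are hypothetical.

```python
def profile_drift(
    template_keys: set[str],
    example_keys: set[str],
    documented_omissions: set[str],
) -> dict[str, list[str]]:
    """Compare an example profile's surface against the template's.

    missing:          required by the template, absent, and not explained
    stale_omissions:  documented as omitted but actually present
    extra:            present in the example but unknown to the template
    """
    return {
        "missing": sorted(template_keys - example_keys - documented_omissions),
        "stale_omissions": sorted(documented_omissions & example_keys),
        "extra": sorted(example_keys - template_keys),
    }
```

An all-empty result means the two profiles are structurally comparable; anything else is drift for a maintainer to explain or fix.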

## Validation

@@ -173,3 +181,8 @@ uv run --project tools/skill-and-tool-validator --group dev skill-and-tool-valid
catalogue (bare `PMC`, `ICLA`, `announce@apache.org`) is surfaced by the
advisory lint (check #10 in `skill-and-tool-validator`) for human
judgement.
- **Template/profile drift is not mechanically checked.** The non-ASF
example is now a real smoke fixture, but no validator compares its file
and key surface against `projects/_template/`. A drift check should
catch missing required files, stale documented keys, and hidden
organization-default assumptions.
12 changes: 12 additions & 0 deletions tools/spec-loop/specs/release-management-lifecycle.md
@@ -119,6 +119,11 @@ code lands.
state-changing lane, requires evidence from Release Managers and binding
voters that the process is healthier (fewer stalled RCs, shorter
time-to-`[ANNOUNCE]`, fewer reverted promotions).
- **Audit records are structured.** `release-audit-report` output should
follow a schema/template with required fields for release version, RC
artefacts, vote thread, tally outcome, promotion revision, announcement
URL, archive state, and follow-up notes. Missing required fields are a
report finding, not silently omitted prose.
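
A required-field check for the audit record might look like the following. The field identifiers are assumed spellings of the fields listed above, and the record is modelled as a plain dictionary; the spec does not yet define the schema format.

```python
# Assumed identifiers for the required fields named in the spec.
REQUIRED_FIELDS = (
    "release_version",
    "rc_artefacts",
    "vote_thread",
    "tally_outcome",
    "promotion_revision",
    "announcement_url",
    "archive_state",
    "follow_up_notes",
)


def audit_findings(record: dict[str, object]) -> list[str]:
    """Return one finding per required field that is absent or empty."""
    findings = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or value == "":
            findings.append(f"missing required field: {name}")
    return findings
```

Reporting every gap as a finding, rather than stopping at the first, keeps an incomplete record visible in the report instead of letting it disappear into prose.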

## Out of scope

@@ -139,6 +144,9 @@ code lands.
3. No skill in the family signs, imports, promotes, sends, or merges on
autopilot; the key-holding and publishing steps emit paste-ready
recipes only.
4. `release-audit-report` validates the audit record against the
required-field schema and flags incomplete lifecycle evidence before
proposing an audit-log PR.

## Validation

@@ -171,3 +179,7 @@ uv run --project tools/skill-evals skill-eval tools/skill-evals/evals/release-an
cut a full release through the family yet, so the RM/binding-voter
evidence window that would justify default-on or a state-changing lane
has no data behind it.
- **Release audit record schema is prose-only.** The audit-report skill
exists, but there is no structured schema/template that downstream
review can validate. The plan tracks a schema and eval fixtures for
incomplete records.