Claude bot enhancements: observability, reliability, and prompt improvements #1535

## Overview

Having reviewed the current Claude integration (sync bot + review bot), I propose the enhancements below; a suggested priority order follows at the end. The current setup is solid, so these are incremental improvements to reliability, observability, and security.

---

### 1. Add outcome tracking to sync bot runs

`usage-summary.py` tracks cost/tokens but not *outcomes*. Without this data we can't answer: how often does the bot succeed? Which steps consume the most turns? Is the prompt getting better or worse over time?

**Proposed:** Extend `usage-summary.py` (or add a new step) to log structured data:
- Success/failure and exit reason
- Update type attempted (relax/bump/skip)
- Highest step reached (e.g., "Step 5: Validate")
- Turns consumed per step (if derivable from execution log)
- Whether a WIP PR was created vs final PR

Write the output to the job summary and, optionally, to a tracking issue or artifact for trend analysis.
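
To make the list above concrete, here is a minimal sketch of what one structured record could look like. The `RunOutcome` class, its field names, and the helper functions are hypothetical and are not taken from the existing `usage-summary.py`; the example values are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunOutcome:
    """One sync bot run (hypothetical schema, mirroring the bullets above)."""
    success: bool
    exit_reason: str          # e.g. "completed", "max_turns", "error"
    update_type: str          # "relax", "bump", or "skip"
    highest_step: str         # e.g. "Step 5: Validate"
    pr_kind: str              # "wip", "final", or "none"
    turns_per_step: dict[str, int] = field(default_factory=dict)


def to_json_line(outcome: RunOutcome) -> str:
    """Serialize to one JSON line, suitable for appending to an artifact."""
    return json.dumps(asdict(outcome), sort_keys=True)


def to_summary_markdown(outcome: RunOutcome) -> str:
    """Render the outcome as a small markdown table for the job summary."""
    rows = [
        ("Result", "success" if outcome.success else "failure"),
        ("Exit reason", outcome.exit_reason),
        ("Update type", outcome.update_type),
        ("Highest step", outcome.highest_step),
        ("PR kind", outcome.pr_kind),
    ]
    rows += [(f"Turns: {step}", str(n)) for step, n in outcome.turns_per_step.items()]
    lines = ["| Field | Value |", "|---|---|"]
    lines += [f"| {key} | {value} |" for key, value in rows]
    return "\n".join(lines) + "\n"


outcome = RunOutcome(
    success=False,
    exit_reason="max_turns",
    update_type="bump",
    highest_step="Step 5: Validate",
    pr_kind="wip",
    turns_per_step={"Step 4": 12, "Step 5": 8},
)
print(to_summary_markdown(outcome))
```

In a GitHub Actions step, the markdown can be appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable so that it shows up in the job summary. The JSON line form is what would feed trend analysis.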

---

### 2. Add circuit breaker for repeated sync bot failures

If the sync bot fails on a given botocore version, it retries every 3 days indefinitely, burning API budget without ever alerting a human.

**Proposed:** Track consecutive failures (e.g., via a label or issue body). After N failures (suggest 3) on the same target version:
- Auto-create or update a feedback issue with failure context
- Skip subsequent runs until a human responds or the target version changes
- Include failure summaries to help diagnose the root cause
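
The skip logic above can be sketched as a small state machine. This is a sketch under assumptions: `BreakerState`, `should_run`, and `record_result` are hypothetical names, and how the state is persisted (label, issue body, or artifact) is left open.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # the "N" suggested above


@dataclass(frozen=True)
class BreakerState:
    """Persisted between runs, e.g. in an issue body (storage is an assumption)."""
    target_version: str
    consecutive_failures: int = 0
    human_responded: bool = False


def should_run(state: BreakerState, latest_version: str,
               max_failures: int = MAX_CONSECUTIVE_FAILURES) -> bool:
    """Return True unless the breaker is open for this target version."""
    if latest_version != state.target_version:
        return True   # a new target version resets the breaker
    if state.human_responded:
        return True   # a human asked for another attempt
    return state.consecutive_failures < max_failures


def record_result(state: BreakerState, latest_version: str,
                  succeeded: bool) -> BreakerState:
    """Return the state to persist after a run has finished."""
    if latest_version != state.target_version:
        state = BreakerState(target_version=latest_version)
    if succeeded:
        return BreakerState(target_version=state.target_version)
    # A failure consumes any pending human response, so each response
    # buys exactly one more attempt.
    return BreakerState(
        target_version=state.target_version,
        consecutive_failures=state.consecutive_failures + 1,
        human_responded=False,
    )
```

The workflow would call `should_run` before invoking Claude and `record_result` afterwards. When `should_run` returns False, the run ends early and the feedback issue is created or updated instead.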

---

### 3. Add prompt integration test (dry-run validation)

Prompt edits can silently degrade the bot. Example: the recent `envsubst` bug erased `$VERSION` from git commands in the prompt, and this wasn't caught until a live run.

**Proposed:** Add a CI check (on PRs touching `.github/botocore-sync-prompt.md` or `botocore-sync.yml`) that:
- Runs the `envsubst` substitution with mock values and validates no unintended variables are erased
- Checks that all expected template variables (`$LATEST_BOTOCORE`, etc.) are present pre-substitution
- Optionally runs a dry-run against a known botocore version diff to validate the bot reaches the expected decision (relax vs bump)
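
The first two checks could look roughly like the sketch below. It emulates the substitution in Python rather than calling `envsubst` itself, so `render_all` and `render_only` are stand-ins for the real step, and every function name here is hypothetical. Plain `envsubst` replaces every `$NAME` it sees and turns unset variables into empty strings, whereas passing it an explicit variable list restricts substitution to those names; the two renderers mimic those two modes.

```python
import re

# Matches $NAME and ${NAME}.
_VAR = re.compile(r"\$(?:\{([A-Za-z_]\w*)\}|([A-Za-z_]\w*))")


def referenced_variables(text: str) -> set[str]:
    """Return the names of all $VAR / ${VAR} references in the text."""
    return {m.group(1) or m.group(2) for m in _VAR.finditer(text)}


def render_all(template: str, values: dict[str, str]) -> str:
    """Mimic unrestricted substitution: unknown variables become ''."""
    return _VAR.sub(lambda m: values.get(m.group(1) or m.group(2), ""), template)


def render_only(template: str, values: dict[str, str], allowed: set[str]) -> str:
    """Mimic restricted substitution: only names in `allowed` are replaced."""
    def repl(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        return values.get(name, "") if name in allowed else m.group(0)
    return _VAR.sub(repl, template)


def check_rendering(template: str, rendered: str, expected: set[str]) -> list[str]:
    """Compare the prompt before and after substitution; return problems found."""
    before = referenced_variables(template)
    after = referenced_variables(rendered)
    problems = [f"expected variable ${n} is missing from the template"
                for n in sorted(expected - before)]
    problems += [f"expected variable ${n} was not substituted"
                 for n in sorted(expected & after)]
    problems += [f"${n} was erased but is not a template variable"
                 for n in sorted((before - expected) - after)]
    return problems


template = "Target: $LATEST_BOTOCORE\nThen run: git tag $VERSION\n"
values = {"LATEST_BOTOCORE": "0.0.0"}  # mock value
expected = {"LATEST_BOTOCORE"}

print(check_rendering(template, render_all(template, values), expected))
print(check_rendering(template, render_only(template, values, expected), expected))
```

The first call reports that `$VERSION` was erased, which is the class of bug described above; the second reports nothing. In CI, `rendered` would come from the workflow's actual substitution step run with mock values.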

---

### 4. Split sync prompt into composable modules

`botocore-sync-prompt.md` is 561 lines — essentially an entire program in English. Risks:
- Subtle instructions get lost in long context
- A single edit can break unrelated behavior
- Hard to test individual steps in isolation

**Proposed:** Split into a main orchestrator + per-step modules:
- `sync-main.md` — orchestrator with step routing logic
- `sync-step-4a-relax.md`, `sync-step-4b-bump.md` — detailed per-path instructions
- `sync-common.md` — shared rules (security, git operations, pre-commit)

The workflow would concatenate the relevant modules before passing to Claude. This also enables step-level prompt testing.

**Note:** This is the largest change and should be validated carefully — splitting context across files could hurt coherence if done poorly. Worth prototyping with a single step first.
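
A minimal sketch of the concatenation step, assuming the modules live together in one directory. `assemble_prompt` is a hypothetical helper, and the HTML comment markers between modules are an illustrative choice, not an existing convention in the repo.

```python
from pathlib import Path


def assemble_prompt(module_dir: Path, module_names: list[str]) -> str:
    """Concatenate prompt modules in the given order.

    Fails loudly on a missing module, so a renamed or deleted file cannot
    silently produce a truncated prompt.
    """
    parts = []
    for name in module_names:
        path = module_dir / name
        if not path.is_file():
            raise FileNotFoundError(f"prompt module not found: {path}")
        body = path.read_text(encoding="utf-8").strip()
        # The marker makes it easy to see where each module starts
        # when reading the assembled prompt in a log.
        parts.append(f"<!-- module: {name} -->\n{body}\n")
    return "\n".join(parts)
```

For the relax path, the workflow might call `assemble_prompt(Path(".github"), ["sync-main.md", "sync-common.md", "sync-step-4a-relax.md"])`. The same function supports step-level testing, since a test can assemble only `sync-common.md` plus the one step under test.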

---

### 5. Enhance review bot with domain-specific checks

The review bot is generic. Given aiobotocore's specialized async override patterns, it should explicitly check for:
- Missing `await` on I/O operations in `Aio*` classes
- Sync methods that should be async (overriding botocore methods that do I/O)
- Incorrect class naming (missing `Aio` prefix)
- Missing `resolve_awaitable()` on mixed sync/async callbacks
- Resource cleanup — async context managers for clients/sessions
- `test_patches.py` hash updates when patched code changes

This could be added to the review prompt or as a dedicated section in `CLAUDE.md` that the review bot reads.
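
As an illustration of the first check, the sketch below uses Python's `ast` module to flag calls to coroutine functions that are not directly awaited. It is a heuristic, not part of the current review bot: `find_unawaited_calls` is a hypothetical name, matching is by method name only within a single module, and legitimate patterns such as passing a coroutine to `asyncio.gather` would be flagged as false positives.

```python
import ast


def find_unawaited_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for calls to module-local `async def` functions
    that are not the direct operand of an `await`."""
    tree = ast.parse(source)
    async_names = {
        node.name for node in ast.walk(tree)
        if isinstance(node, ast.AsyncFunctionDef)
    }
    # Remember which call nodes sit directly under an `await`.
    awaited = {
        id(node.value) for node in ast.walk(tree)
        if isinstance(node, ast.Await)
    }
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call) or id(node) in awaited:
            continue
        func = node.func
        if isinstance(func, ast.Attribute):
            name = func.attr          # e.g. self._send(...)
        elif isinstance(func, ast.Name):
            name = func.id            # e.g. _send(...)
        else:
            continue
        if name in async_names:
            findings.append((node.lineno, name))
    return sorted(findings)


SAMPLE = '''
class AioThing:
    async def _send(self, request):
        return request

    async def good(self, request):
        return await self._send(request)

    async def bad(self, request):
        return self._send(request)
'''

print(find_unawaited_calls(SAMPLE))
```

Here the call inside `bad` is reported and the one inside `good` is not. A review prompt could describe the same pattern in prose; the code form mainly pins down what "missing `await`" means.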

---

### 6. Replace permission blocklist with tool allowlist

The sync bot uses `--dangerously-skip-permissions` with a `PreToolUse` hook blocking `git commit`. This is a blocklist — anything not blocked is allowed.

**Proposed:** Switch to an allowlist approach:
- Define explicit list of allowed tools/patterns (Bash commands, MCP operations, file operations)
- Block everything else by default
- This is a stronger security posture, especially as the bot's capabilities grow

**Caveat:** May require changes to `claude-code-action` or more granular hook logic. Worth evaluating feasibility before committing.
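
To give a feel for the "more granular hook logic" option, here is a sketch of a default-deny decision function that a `PreToolUse` hook script could call. The specific tools and command prefixes listed are assumptions, not the bot's real requirements, and how the hook receives the pending call and signals a denial should be confirmed against the Claude Code hooks documentation.

```python
import shlex

# Hypothetical allowlist: tools permitted with any input.
ALLOWED_TOOLS = {"Read", "Grep", "Glob", "Edit", "Write"}

# Hypothetical allowlist: Bash commands must start with one of these prefixes.
ALLOWED_BASH_PREFIXES = [
    ("git", "status"),
    ("git", "diff"),
    ("git", "add"),
    ("pre-commit", "run"),
]

# Reject anything that could chain, redirect, or substitute commands.
SHELL_METACHARACTERS = set(";|&`$<>\n")


def is_allowed(tool_name: str, tool_input: dict) -> bool:
    """Default-deny: return True only for explicitly allowed calls."""
    if tool_name in ALLOWED_TOOLS:
        return True
    if tool_name != "Bash":
        return False
    command = tool_input.get("command", "")
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar: refuse to guess
    return any(tuple(tokens[:len(prefix)]) == prefix
               for prefix in ALLOWED_BASH_PREFIXES)
```

With this shape, `git commit` is blocked because it is not listed, rather than because a rule names it, and so is any tool added to the bot later. The metacharacter filter is deliberately strict and would also reject legitimate commands that use pipes or variables, so the allowlist would need tuning against real execution logs.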

---

### Priority suggestion

| # | Enhancement | Effort | Impact |
|---|---|---|---|
| 1 | Outcome tracking | Low | High — enables all other optimization |
| 2 | Circuit breaker | Low | Medium — prevents waste on stuck versions |
| 3 | Prompt integration test | Medium | High — catches silent regressions |
| 4 | Split prompt modules | High | Medium — better maintainability long-term |
| 5 | Domain-specific review | Medium | Medium — catches real bugs in PRs |
| 6 | Tool allowlist | Medium | Low-Medium — security hardening |
