Claude bot enhancements: observability, reliability, and prompt improvements #1535

@thehesiod

Overview

After reviewing the current Claude integration (sync bot + review bot), here are six proposed enhancements; a suggested priority order is given at the end. The current setup is solid, so these are incremental improvements to reliability, observability, and security.


1. Add outcome tracking to sync bot runs

usage-summary.py tracks cost and tokens but not outcomes. Without outcome data we can't answer: How often does the bot succeed? Which steps consume the most turns? Is the prompt getting better or worse over time?

Proposed: Extend usage-summary.py (or add a new step) to log structured data:

  • Success/failure and exit reason
  • Update type attempted (relax/bump/skip)
  • Highest step reached (e.g., "Step 5: Validate")
  • Turns consumed per step (if derivable from execution log)
  • Whether a WIP PR was created vs final PR

Output to job summary and optionally to a tracking issue or artifact for trend analysis.
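
As a rough illustration of the record proposed above, here is a minimal sketch. The `RunOutcome` class, its field names, and the `write_outcome` helper are hypothetical and are not part of the existing usage-summary.py; the only external assumption is that GitHub Actions exposes the job summary file through the `GITHUB_STEP_SUMMARY` environment variable.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class RunOutcome:
    # Hypothetical field names mirroring the bullet list above.
    success: bool
    exit_reason: str
    update_type: str      # "relax", "bump", or "skip"
    highest_step: str     # e.g. "Step 5: Validate"
    wip_pr: bool          # True if a WIP PR was created instead of a final PR
    turns_per_step: Dict[str, int] = field(default_factory=dict)  # empty if not derivable


def render_summary(outcome: RunOutcome) -> str:
    """Render the outcome as a short Markdown section for the job summary."""
    result = "success" if outcome.success else "failure"
    lines = [
        "## Sync bot outcome",
        f"- Result: {result} ({outcome.exit_reason})",
        f"- Update type: {outcome.update_type}",
        f"- Highest step reached: {outcome.highest_step}",
        f"- PR kind: {'WIP' if outcome.wip_pr else 'final'}",
    ]
    for step, turns in outcome.turns_per_step.items():
        lines.append(f"- Turns in {step}: {turns}")
    return "\n".join(lines) + "\n"


def write_outcome(outcome: RunOutcome, json_path: str,
                  summary_path: Optional[str] = None) -> None:
    """Write a JSON artifact for trend analysis and append to the job summary."""
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(asdict(outcome), fh, indent=2)
    # Fall back to the GitHub Actions job summary file when no path is given.
    summary_path = summary_path or os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(render_summary(outcome))
```

The JSON file could be uploaded as an artifact per run, which would keep the trend data out of the job summary itself.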


2. Add circuit breaker for repeated sync bot failures

If the sync bot fails on a given botocore version, it retries every 3 days indefinitely, burning API budget without ever alerting a human.

Proposed: Track consecutive failures (e.g., via a label or issue body). After N failures (suggest 3) on the same target version:

  • Auto-create or update a feedback issue with failure context
  • Skip subsequent runs until a human responds or the target version changes
  • Include failure summaries to help diagnose the root cause
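
The skip decision described above could look roughly like the following sketch. `FailureState`, `should_skip`, and `record_result` are hypothetical names, and how the state is persisted (label, issue body, or something else) is deliberately left out.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # the "N" suggested above


@dataclass(frozen=True)
class FailureState:
    # Hypothetical state; it could be stored in a label or an issue body.
    target_version: str = ""
    consecutive_failures: int = 0
    human_responded: bool = False


def should_skip(state: FailureState, target_version: str,
                limit: int = MAX_CONSECUTIVE_FAILURES) -> bool:
    """Return True when the breaker is open for this target version."""
    if state.target_version != target_version:
        return False  # a new target version always gets a fresh attempt
    if state.human_responded:
        return False  # a human asked for another try
    return state.consecutive_failures >= limit


def record_result(state: FailureState, target_version: str,
                  succeeded: bool) -> FailureState:
    """Return the updated state after a run on target_version."""
    if succeeded:
        return FailureState(target_version=target_version)
    if state.target_version != target_version:
        return FailureState(target_version=target_version, consecutive_failures=1)
    # Same version failed again: count it and require a new human response.
    return FailureState(target_version=target_version,
                        consecutive_failures=state.consecutive_failures + 1)
```

In this sketch a failure after a human response reopens the breaker immediately, so each retry past the limit needs an explicit human signal.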

3. Add prompt integration test (dry-run validation)

Prompt edits can silently degrade the bot. Example: the recent envsubst bug erased $VERSION from git commands in the prompt, and this wasn't caught until a live run.

Proposed: Add a CI check (on PRs touching .github/botocore-sync-prompt.md or botocore-sync.yml) that:

  • Runs the envsubst substitution with mock values and validates no unintended variables are erased
  • Checks that all expected template variables ($LATEST_BOTOCORE, etc.) are present pre-substitution
  • Optionally runs a dry-run against a known botocore version diff to validate the bot reaches the expected decision (relax vs bump)
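
The first two checks above can be sketched as a pure function that compares the prompt before and after substitution. This is only an illustration: `check_substitution` is a hypothetical helper, the rendered text is assumed to come from the workflow's real envsubst step run with mock values, and the mock values are assumed not to contain `$`.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches $NAME and ${NAME} references.
VAR_PATTERN = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def count_vars(text: str) -> Counter:
    """Count occurrences of each variable reference in text."""
    return Counter(m.group(1) or m.group(2) for m in VAR_PATTERN.finditer(text))


def check_substitution(template: str, rendered: str,
                       template_vars: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    template_vars = set(template_vars)
    before = count_vars(template)
    after = count_vars(rendered)
    errors = []
    for name in sorted(template_vars):
        if before[name] == 0:
            errors.append(
                f"expected template variable ${name} is missing from the prompt")
        if after[name] > 0:
            errors.append(f"template variable ${name} was not substituted")
    # Any other $NAME is meant for the bot (e.g. shell commands) and must survive.
    for name in sorted(set(before) - template_vars):
        if after[name] < before[name]:
            errors.append(f"${name} was erased by substitution")
    return errors
```

A CI step could fail the job whenever the returned list is non-empty; a run like the `$VERSION` incident would surface as `$VERSION was erased by substitution`.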

4. Split sync prompt into composable modules

botocore-sync-prompt.md is 561 lines — essentially an entire program in English. Risks:

  • Subtle instructions get lost in long context
  • A single edit can break unrelated behavior
  • Hard to test individual steps in isolation

Proposed: Split into a main orchestrator + per-step modules:

  • sync-main.md — orchestrator with step routing logic
  • sync-step-4a-relax.md, sync-step-4b-bump.md — detailed per-path instructions
  • sync-common.md — shared rules (security, git operations, pre-commit)

The workflow would concatenate the relevant modules before passing to Claude. This also enables step-level prompt testing.

Note: This is the largest change and should be validated carefully — splitting context across files could hurt coherence if done poorly. Worth prototyping with a single step first.
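
For the prototype, the concatenation step might be as small as the sketch below. The module file names are the ones proposed above; the `build_prompt` helper, the module ordering, and the HTML comment markers between modules are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable

MAIN_MODULE = "sync-main.md"
COMMON_MODULE = "sync-common.md"
# Per-path modules, keyed by update type.
STEP_MODULES = {
    "relax": "sync-step-4a-relax.md",
    "bump": "sync-step-4b-bump.md",
}


def build_prompt(prompt_dir: str,
                 update_types: Iterable[str] = ("relax", "bump")) -> str:
    """Concatenate orchestrator, shared rules, and the selected step modules."""
    names = [MAIN_MODULE, COMMON_MODULE]
    for update_type in update_types:
        if update_type not in STEP_MODULES:
            raise ValueError(f"unknown update type: {update_type!r}")
        names.append(STEP_MODULES[update_type])
    parts = []
    for name in names:
        text = (Path(prompt_dir) / name).read_text(encoding="utf-8").strip()
        # The marker makes it easy to see module boundaries in the final prompt.
        parts.append(f"<!-- module: {name} -->\n{text}")
    return "\n\n".join(parts) + "\n"
```

Because the bot decides between relax and bump during the run, the default here includes both step modules; passing a single update type is what would enable step-level prompt testing.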


5. Enhance review bot with domain-specific checks

The review bot is generic. Given aiobotocore's specialized async override patterns, it should explicitly check for:

  • Missing await on I/O operations in Aio* classes
  • Sync methods that should be async (overriding botocore methods that do I/O)
  • Incorrect class naming (missing Aio prefix)
  • Missing resolve_awaitable() on mixed sync/async callbacks
  • Resource cleanup — async context managers for clients/sessions
  • test_patches.py hash updates when patched code changes

This could be added to the review prompt or as a dedicated section in CLAUDE.md that the review bot reads.
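
Some of these checks could also be approximated mechanically and handed to the review bot as hints. The sketch below covers only the naming check, using a deliberately crude heuristic: any class that defines an `async def` method directly in its body is expected to carry the `Aio` prefix. The `find_unprefixed_async_classes` helper is hypothetical, and the heuristic will produce false positives for helper classes that are legitimately async but do not override a botocore class.

```python
import ast
from typing import List, Tuple


def find_unprefixed_async_classes(source: str,
                                  prefix: str = "Aio") -> List[Tuple[str, int]]:
    """Return (class name, line number) for async classes lacking the prefix."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Only methods defined directly in the class body are considered.
        has_async_method = any(
            isinstance(item, ast.AsyncFunctionDef) for item in node.body
        )
        # Leading underscores are ignored so private classes such as _AioFoo pass.
        if has_async_method and not node.name.lstrip("_").startswith(prefix):
            findings.append((node.name, node.lineno))
    return sorted(findings, key=lambda finding: finding[1])
```

The other items in the list (missing await, sync methods that should be async) need knowledge of the overridden botocore method, so they are probably better left to the prompt.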


6. Replace permission blocklist with tool allowlist

The sync bot uses --dangerously-skip-permissions with a PreToolUse hook blocking git commit. This is a blocklist — anything not blocked is allowed.

Proposed: Switch to an allowlist approach:

  • Define explicit list of allowed tools/patterns (Bash commands, MCP operations, file operations)
  • Block everything else by default
  • This is a stronger security posture, especially as the bot's capabilities grow

Caveat: May require changes to claude-code-action or more granular hook logic. Worth evaluating feasibility before committing.
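
To make the default-deny idea concrete, here is a minimal sketch of the decision function such a hook could call. The allowed tool names and command prefixes are illustrative assumptions rather than the bot's actual requirements, and how the function would be wired into a PreToolUse hook or claude-code-action is not shown.

```python
import shlex

# Illustrative allowlists; the real lists would need to be derived from the prompt.
ALLOWED_TOOLS = frozenset({"Read", "Edit", "Grep", "Glob"})
ALLOWED_BASH_PREFIXES = (
    ("git", "status"),
    ("git", "diff"),
    ("git", "add"),
    ("pre-commit", "run"),
)
# Reject anything that could chain, substitute, or redirect commands.
SHELL_METACHARACTERS = frozenset(";|&$`<>()\n")


def is_allowed(tool_name: str, command: str = "") -> bool:
    """Default-deny: return True only for explicitly allowed tools and commands."""
    if tool_name != "Bash":
        return tool_name in ALLOWED_TOOLS
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar parse errors
    return any(
        tuple(tokens[:len(prefix)]) == prefix for prefix in ALLOWED_BASH_PREFIXES
    )
```

With this shape, `git commit` is denied simply because it is not listed, so the existing blocklist rule becomes a special case of the allowlist.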


Priority suggestion

| # | Enhancement | Effort | Impact |
|---|---|---|---|
| 1 | Outcome tracking | Low | High: enables all other optimization |
| 2 | Circuit breaker | Low | Medium: prevents waste on stuck versions |
| 3 | Prompt integration test | Medium | High: catches silent regressions |
| 4 | Split prompt modules | High | Medium: better maintainability long-term |
| 5 | Domain-specific review | Medium | Medium: catches real bugs in PRs |
| 6 | Tool allowlist | Medium | Low-Medium: security hardening |
