[Suggestion] Voice agent behavioral testing harness with YAML-driven scenarios and automated quality scoring (Python) #276

@deepgram-robot

What to build

A Python testing framework that lets developers define voice agent test scenarios in YAML, run them against Deepgram's Voice Agent API, and evaluate results using automated quality scoring — enabling CI/CD testing of voice agent behavior.
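
To make the idea of a YAML-defined scenario concrete, here is a rough sketch of what one might look like once loaded. The field names (`name`, `turns`, `user`, `expect`, `max_latency_ms`) are assumptions made for illustration, not a Deepgram schema. The YAML in the comment is what PyYAML's `yaml.safe_load` would parse into the dict shown; the loader itself uses only the standard library.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical scenario file (field names are illustrative only):
#
#   name: greeting
#   max_latency_ms: 1500
#   turns:
#     - user: "Hi there"
#       expect:
#         - "greets the caller"
#
# Parsing that file with yaml.safe_load() would yield a dict shaped like RAW.
RAW = {
    "name": "greeting",
    "max_latency_ms": 1500,
    "turns": [
        {"user": "Hi there", "expect": ["greets the caller"]},
    ],
}


@dataclass
class Turn:
    user: str                                         # simulated user utterance
    expect: List[str] = field(default_factory=list)   # criteria to evaluate against


@dataclass
class Scenario:
    name: str
    turns: List[Turn]
    max_latency_ms: Optional[float] = None            # per-scenario threshold


def load_scenario(raw: dict) -> Scenario:
    """Validate a parsed scenario dict and convert it to dataclasses."""
    if not raw.get("name"):
        raise ValueError("scenario needs a non-empty 'name'")
    if not raw.get("turns"):
        raise ValueError(f"scenario {raw['name']!r} needs at least one turn")
    turns = [
        Turn(user=t["user"], expect=list(t.get("expect", [])))
        for t in raw["turns"]
    ]
    return Scenario(
        name=raw["name"],
        turns=turns,
        max_latency_ms=raw.get("max_latency_ms"),
    )


scenario = load_scenario(RAW)
```

Validating at load time means a malformed scenario fails before any API call is made, which keeps CI failures easy to attribute.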

Why this matters

Developers deploying voice agents to production need automated tests to catch regressions: verifying that the agent greets correctly, handles edge cases, uses tools appropriately, and stays on-topic. Today, testing a voice agent means holding conversations with it by hand, which doesn't scale and can't run in a CI/CD pipeline. A YAML-driven test harness that simulates user turns (as text or audio), captures agent responses, and evaluates them against expected behaviors would be the foundation for production-grade voice agent development. This kind of testing is becoming standard practice in the voice AI ecosystem and is a key retention lever for the platform.

Suggested scope

  • Language: Python
  • Deepgram APIs: Voice Agent API
  • Key features:
    • YAML scenario definitions with user turns, expected agent behaviors, and success criteria
    • Text-mode testing (inject text as simulated STT) for fast iteration
    • Audio-mode testing (play audio files through the Voice Agent) for end-to-end validation
    • Automated scoring: LLM-as-judge evaluation of agent responses against criteria
    • Latency assertions: fail if time-to-first-response exceeds threshold
    • Function calling validation: verify correct tools were called with expected arguments
    • JUnit XML output for CI/CD integration
    • CLI runner: python -m dg_eval --scenarios ./tests/
  • Complexity: Medium-high — testing framework with LLM evaluation
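
The latency assertion and function calling validation listed above could be sketched as two small checks over a recorded turn. This is a minimal illustration under stated assumptions: `TurnResult`, `ToolCall`, `check_latency`, and `check_tool_calls` are hypothetical names, and how the runner actually captures timings and tool calls from the Voice Agent API is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TurnResult:
    """What a (hypothetical) runner recorded for one agent turn."""
    first_response_ms: float
    tool_calls: List[ToolCall] = field(default_factory=list)


def check_latency(result: TurnResult, max_ms: Optional[float]) -> List[str]:
    """Return failure messages; an empty list means the check passed."""
    if max_ms is not None and result.first_response_ms > max_ms:
        return [
            f"time-to-first-response {result.first_response_ms:.0f} ms "
            f"exceeds {max_ms:.0f} ms"
        ]
    return []


def _args_match(actual: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    # Subset match: extra arguments supplied by the agent are tolerated.
    return all(k in actual and actual[k] == v for k, v in expected.items())


def check_tool_calls(result: TurnResult, expected: List[ToolCall]) -> List[str]:
    """Each expected call must appear with at least the expected arguments."""
    failures = []
    for exp in expected:
        same_name = [c for c in result.tool_calls if c.name == exp.name]
        if not same_name:
            failures.append(f"tool {exp.name!r} was not called")
        elif not any(_args_match(c.arguments, exp.arguments) for c in same_name):
            failures.append(
                f"tool {exp.name!r} called without expected arguments {exp.arguments}"
            )
    return failures


# Example: the tool call is correct, but the response was too slow.
result = TurnResult(
    first_response_ms=1820.0,
    tool_calls=[ToolCall("lookup_order", {"order_id": "A17", "verbose": True})],
)
failures = check_latency(result, 1500) + check_tool_calls(
    result, [ToolCall("lookup_order", {"order_id": "A17"})]
)
```

Returning lists of messages rather than raising lets the runner collect every failure for a scenario and report them together.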

Acceptance criteria

  • Runnable with minimal setup (clone, add API key, run)
  • README explains the YAML scenario format and evaluation methodology
  • Uses the current version of the Deepgram Python SDK
  • At least 5 example scenarios included (greeting, FAQ, tool use, edge case, off-topic)
  • Text-mode testing works without audio hardware
  • Produces pass/fail results with detailed scoring per scenario
  • JUnit XML output compatible with GitHub Actions / CI systems
  • Latency thresholds configurable per scenario
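
For the JUnit XML criterion, one possible shape is sketched below using only the standard library. JUnit XML has no single official schema, so this emits the `testsuite` / `testcase` / `failure` elements that CI tools commonly accept; `to_junit_xml` and the results mapping are hypothetical, and the suite name simply reuses `dg_eval` from the suggested CLI.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def to_junit_xml(results: Dict[str, List[str]], suite_name: str = "dg_eval") -> str:
    """Turn {scenario name: failure messages} into a JUnit-style XML string."""
    failed = sum(1 for msgs in results.values() if msgs)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failed),
    )
    for name, msgs in results.items():
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if msgs:
            # First message as the summary, all messages in the element body.
            failure = ET.SubElement(case, "failure", message=msgs[0])
            failure.text = "\n".join(msgs)
    return ET.tostring(suite, encoding="unicode")


report = to_junit_xml(
    {
        "greeting": [],
        "tool_use": ["tool 'lookup_order' was not called"],
    }
)
```

A scenario with no failure messages becomes a bare `testcase`, which CI systems treat as a pass.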

Raised by the DX intelligence system.
