[Suggestion] Voice agent behavioral testing harness with YAML-driven scenarios and automated quality scoring (Python)

## What to build

A Python testing framework that lets developers define voice agent test scenarios in YAML, run them against Deepgram's Voice Agent API, and score the results automatically, so that voice agent behavior can be tested in CI/CD.

## Why this matters

Developers deploying voice agents to production need automated tests to catch regressions: verifying that the agent greets correctly, handles edge cases, uses tools appropriately, and stays on topic. Today, testing a voice agent means holding a manual conversation, which doesn't scale and can't run in a CI/CD pipeline. A YAML-driven test harness that simulates user turns (as text or audio), captures agent responses, and evaluates them against expected behaviors would provide a foundation for production-grade voice agent development. Automated behavioral testing is becoming standard practice in the voice AI ecosystem, and offering it is a key retention lever for the platform.

## Suggested scope

- **Language:** Python
- **Deepgram APIs:** Voice Agent API
- **Key features:**
  - YAML scenario definitions with user turns, expected agent behaviors, and success criteria
  - Text-mode testing (inject text as simulated STT) for fast iteration
  - Audio-mode testing (play audio files through the Voice Agent) for end-to-end validation
  - Automated scoring: LLM-as-judge evaluation of agent responses against criteria
  - Latency assertions: fail if time-to-first-response exceeds threshold
  - Function calling validation: verify correct tools were called with expected arguments
  - JUnit XML output for CI/CD integration
  - CLI runner: `python -m dg_eval --scenarios ./tests/`
- **Complexity:** Medium-high — testing framework with LLM evaluation
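
To make the latency and function-calling checks above concrete, here is a minimal sketch of how a loaded scenario might be checked against one run. Everything in it is an assumption for illustration: the field names (`user_turns`, `max_first_response_ms`, `expected_tool_calls`), the `Scenario` and `ExpectedToolCall` classes, and the shape of the observed tool calls are hypothetical, not part of any Deepgram SDK. The dict stands in for what a YAML loader would return.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExpectedToolCall:
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Scenario:
    # Hypothetical schema; the real YAML format is still to be designed.
    name: str
    user_turns: list[str]
    max_first_response_ms: float
    expected_tool_calls: list[ExpectedToolCall] = field(default_factory=list)


def scenario_from_dict(raw: dict[str, Any]) -> Scenario:
    """Build a Scenario from the mapping a YAML loader would return."""
    return Scenario(
        name=raw["name"],
        user_turns=list(raw["user_turns"]),
        max_first_response_ms=float(raw["max_first_response_ms"]),
        expected_tool_calls=[
            ExpectedToolCall(c["name"], dict(c.get("arguments", {})))
            for c in raw.get("expected_tool_calls", [])
        ],
    )


def check_scenario(
    scenario: Scenario,
    first_response_ms: float,
    observed_tool_calls: list[dict[str, Any]],
) -> list[str]:
    """Return failure messages; an empty list means the run passed."""
    failures: list[str] = []

    # Latency assertion: fail if time-to-first-response exceeds the threshold.
    if first_response_ms > scenario.max_first_response_ms:
        failures.append(
            f"first response took {first_response_ms:.0f} ms, "
            f"limit is {scenario.max_first_response_ms:.0f} ms"
        )

    # Function-calling validation: each expected call must appear with at
    # least the expected arguments (extra observed arguments are allowed).
    for expected in scenario.expected_tool_calls:
        matched = False
        for call in observed_tool_calls:
            args = call.get("arguments", {})
            if call.get("name") == expected.name and all(
                key in args and args[key] == value
                for key, value in expected.arguments.items()
            ):
                matched = True
                break
        if not matched:
            failures.append(f"expected tool call not observed: {expected.name}")

    return failures


raw_scenario = {
    "name": "tool_use",
    "user_turns": ["Where is my order A1?"],
    "max_first_response_ms": 1000,
    "expected_tool_calls": [
        {"name": "lookup_order", "arguments": {"order_id": "A1"}}
    ],
}
scenario = scenario_from_dict(raw_scenario)
observed = [{"name": "lookup_order", "arguments": {"order_id": "A1"}}]
print(check_scenario(scenario, 640.0, observed))  # prints []
```

Returning a list of failure messages rather than raising on the first problem would let a single run report both a latency failure and a missing tool call, which fits the per-scenario detailed scoring asked for in the acceptance criteria.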

## Acceptance criteria

- [ ] Runnable with minimal setup (clone, add API key, run)
- [ ] README explains the YAML scenario format and evaluation methodology
- [ ] Uses the current version of the Deepgram Python SDK
- [ ] At least 5 example scenarios included (greeting, FAQ, tool use, edge case, off-topic)
- [ ] Text-mode testing works without audio hardware
- [ ] Produces pass/fail results with detailed scoring per scenario
- [ ] JUnit XML output compatible with GitHub Actions / CI systems
- [ ] Latency thresholds configurable per scenario

---
*Raised by the DX intelligence system.*
