Running Simulations
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| system_prompt | str | Yes | - | System prompt for the voice agent |
| tools | list | Yes | - | List of tool definitions |
| personas | list | Yes | - | List of persona dicts with 'characteristics', 'gender', 'language', and optional 'interruption_sensitivity' |
| scenarios | list | Yes | - | List of scenario dicts with 'description' |
| evaluation_criteria | list | Yes | - | List of criteria dicts with 'name' and 'description' |
| output_dir | str | No | "./out" | Output directory for results |
| stt | STTConfig | No |  | Speech to Text configuration |
| tts | TTSConfig | No |  | Text to Speech configuration |
| llm | LLMConfig | No | OpenRouter/gpt-4.1 | LLM configuration |
| agent_speaks_first | bool | No | True | Whether the agent initiates the conversation |
| max_turns | int | No | 50 | Maximum number of assistant turns |
| port | int | No | 8765 | Base WebSocket port |
Supported providers:
- Speech to Text: deepgram, google, openai, elevenlabs, sarvam, cartesia
- LLM: openrouter, openai
- Text to Speech: cartesia, google, openai, elevenlabs, sarvam, deepgram
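As a minimal sketch of how these parameters fit together, the example below builds the persona, scenario, and criteria dicts described in the table above. The `run_simulations` entry point and its import path are assumptions for illustration, not the package's documented API.

```python
# Illustrative sketch only: `run_simulations` and its import path are assumed,
# not taken from the library's documented API.
from voice_sim import run_simulations  # hypothetical import

personas = [
    {
        "characteristics": "Elderly caller, speaks slowly, easily confused",
        "gender": "female",
        "language": "en",
        "interruption_sensitivity": 0.3,  # optional; value type assumed
    }
]

scenarios = [
    {"description": "Caller wants to update the address on their account"}
]

evaluation_criteria = [
    {
        "name": "question_completeness",
        "description": "The assistant collects the caller's full name, address, and phone number",
    }
]

run_simulations(
    system_prompt="You are a polite customer-support voice agent.",
    tools=[],                  # tool definitions, if any
    personas=personas,
    scenarios=scenarios,
    evaluation_criteria=evaluation_criteria,
    output_dir="./out",        # default
    agent_speaks_first=True,   # default
    max_turns=50,              # default
    port=8765,                 # default base WebSocket port
)
```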
Metrics
Voice simulations evaluate multiple aspects:
- Evaluation Criteria Match: Each criterion evaluated as True/False with reasoning
- Speech to Text LLM Judge Score: Accuracy of transcriptions during simulation
- Latency Metrics: TTFB and processing time for each component
Learn more about metrics: see the detailed explanation of all metrics and the LLM Judge.
Output Structure
Directory Contents
Each simulation_persona_*_scenario_* directory contains:
| File | Description |
|---|---|
| audios/ | Alternating *_user.wav and *_bot.wav clips for every turn |
| logs | Full logs of the simulation, including all pipecat logs |
| results.log | Terminal output of the simulation |
| evaluation_results.csv | Per-criterion evaluation results, including latency metrics |
| stt_results.csv | Detailed per-row Speech to Text evaluation results |
| metrics.json | Latency traces for the Speech to Text, LLM, and Text to Speech providers |
| stt_outputs.json | Output of the Speech to Text step for each turn |
| tool_calls.json | Chronologically ordered tool calls made by the agent |
| transcript.json | Full conversation transcript |
| config.json | Persona and scenario used for this simulation |
| conversation.wav | Combined audio of the entire conversation |
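A rough sketch of consuming these artifacts programmatically is shown below. The file names come from the table above; the internal structure of the JSON files is an assumption, not a documented schema.

```python
import json
from pathlib import Path

# Illustrative sketch: walk each simulation directory under the output dir
# and load two of the artifacts listed above. The internal structure of
# transcript.json and tool_calls.json is assumed here.
out_dir = Path("./out")
for sim_dir in sorted(out_dir.glob("simulation_persona_*_scenario_*")):
    transcript = json.loads((sim_dir / "transcript.json").read_text())
    tool_calls = json.loads((sim_dir / "tool_calls.json").read_text())
    print(sim_dir.name, "turns:", len(transcript), "tool calls:", len(tool_calls))
```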
config.json
Contains the persona and scenario used for each simulation:
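For illustration, an assumed shape of this file is shown below as an equivalent Python dict; the exact field names are not a documented schema.

```python
# Assumed shape of config.json; field names are illustrative only.
example_config = {
    "persona": {
        "characteristics": "Elderly caller, speaks slowly, easily confused",
        "gender": "female",
        "language": "en",
        "interruption_sensitivity": 0.3,  # optional
    },
    "scenario": {
        "description": "Caller wants to update the address on their account"
    },
}
```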
evaluation_results.csv
Contains evaluation results for each criterion, latency metrics, and the Speech to Text score:
| name | value | reasoning |
|---|---|---|
| question_completeness | 1 | The assistant asked for the user’s full name, address, and telephone number… |
| assistant_behavior | 1 | The assistant asked one concise question per turn… |
| ttft | 0.6209 | |
| processing_time | 0.6209 | |
| stt_llm_judge_score | 0.95 | |
For latency metrics (ttft and processing_time), one row is added per processor with the mean value. Evaluation criteria use a value of 1 for True (match) and 0 for False (no match).
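A small sketch of splitting the criterion rows from the latency rows in this file (the column names follow the example above; pandas usage is illustrative):

```python
import pandas as pd

# Illustrative: separate evaluation-criterion rows (value is 0/1) from the
# latency rows (ttft, processing_time) and the STT judge score.
df = pd.read_csv("./out/simulation_persona_1_scenario_1/evaluation_results.csv")

latency_rows = df[df["name"].isin(["ttft", "processing_time"])]
criterion_rows = df[~df["name"].isin(["ttft", "processing_time", "stt_llm_judge_score"])]

print("criteria passed:", int(criterion_rows["value"].sum()), "/", len(criterion_rows))
print(latency_rows[["name", "value"]])
```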
stt_results.csv
Contains detailed per-row Speech to Text evaluation results:
| reference | prediction | score | reasoning |
|---|---|---|---|
| Hi. | Hi. | 1 | The transcription matches exactly. |
| Geeta Prasad. | Gita Prasad. | 0 | The name ‘Geeta’ was transcribed as ‘Gita’… |
results.csv
Aggregates match scores across all simulations:
| name | question_completeness | assistant_behavior | stt_llm_judge_score |
|---|---|---|---|
| simulation_persona_1_scenario_1 | 1.0 | 1.0 | 0.95 |
| simulation_persona_1_scenario_2 | 1.0 | 0.0 | 0.92 |
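To summarize results.csv across simulations, something like the following would give a per-criterion pass rate and the average STT judge score. The path is assumed to be at the root of output_dir, and the column names follow the example above; treat this as a sketch.

```python
import pandas as pd

# Illustrative: the column-wise mean across all simulations gives the pass
# rate per criterion and the average STT judge score.
results = pd.read_csv("./out/results.csv")  # location assumed
print(results.drop(columns=["name"]).mean())
```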