Eval Files
Evaluation files define the test cases, graders, workspace lifecycle, and run
controls for an evaluation run. The reserved tags.experiment key is the
run/result grouping label, top-level target identifies the system under test,
and fields such as evaluate_options.repeat, threshold, timeout_seconds,
evaluate_options.budget_usd, and evaluate_options.max_concurrency control repeated
attempts and gates. Workspace lifetime belongs under workspace.scope;
repository provenance belongs under workspace.repos; Docker/container binding
belongs under workspace.docker. Non-provisioning setup commands belong in
top-level extensions; reset policy stays under
workspace.hooks.after_each.reset; runner-specific setup belongs in the
target object, in targets, or in project config. AgentV supports two eval
data formats: YAML and JSONL.
YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract.
Eval files describe the task, target binding, and run controls. Use
evaluate_options.max_concurrency for authored suite concurrency. Operators can still
override concurrency with agentv eval --workers N; do not author legacy
workers fields in eval YAML.
For Promptfoo-style authoring, AgentV uses the same broad prompt, test, vars,
default test, evaluate-options, and assertion contract with snake_case wire
fields. See the Promptfoo parity matrix for
the exact alignments, intentional AgentV extensions, and future-scope Promptfoo
surfaces.
Authoring Shapes
Eval YAML is AgentV’s composable and runnable authoring primitive. It is a
focused, shareable slice of the same config graph as .agentv/config.yaml.
Use ordinary *.eval.yaml files for direct task suites and for wrapper evals
that compose other suites. Raw case files are reusable data inputs, not a
second runnable experiment format.
- A task suite is eval YAML that owns task context: workspace, shared input, shared assert, fixtures, graders, and test cases. It can run directly or be imported through imports.suites.
- A raw case file is a YAML, JSON, JSONL, CSV, script-backed dataset, directory, or glob of cases. Import it with imports.tests, tests: ./cases.yaml, tests: file://cases.csv, or string shorthand; parent suite context applies because raw cases do not carry their own suite context.
- A wrapper eval is eval YAML that imports one or more suites with imports.suites and binds run controls with top-level target, threshold, timeout_seconds, and evaluate_options. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with imports.suites must not define parent workspace; imported suites own task environment. Machine-local existing workspace paths belong in CLI flags or config.local.yaml, not eval YAML.
For example, a reusable task suite can keep the task contract in one file:

    suite: refunds
    workspace:
      repos:
        - path: ./support-app
          repo: acme/support-app
          commit: main
    input: Answer using the refund policy in the workspace.
    assert:
      - Applies the refund policy correctly
    tests:
      - id: missing-receipt
        input: Can this customer get a refund without a receipt?

Raw cases are just case data:

    - id: damaged-item
      input: The item arrived damaged. What should support do?
      expected_output: Offer a replacement or refund path.

A wrapper eval stays ordinary eval YAML while choosing a target and run controls:

    name: refunds-codex
    target: codex-gpt5
    evaluate_options:
      max_concurrency: 3
      repeat:
        count: 2
        strategy: pass_any

    imports:
      suites:
        - path: ../evals/suites/refunds.eval.yaml
      tests:
        - path: ../evals/cases/refund-smoke.cases.yaml

    tests:
      - id: local-edge-case
        vars:
          question: Can a final-sale item be refunded after damage in transit?
        expected_output: Explain the final-sale exception for damaged transit.

The experiments/ directory in that example is optional and user-owned. AgentV
does not infer behavior from the path; the wrapper runs because it is eval YAML
with tests or imports. The wrapper owns target selection and run controls. Put
workspace setup in imported child suites. Parent workspace-affecting fields,
including top-level workspace, are for parent-owned raw cases, including
cases imported with imports.tests. Runtime workspace path overrides belong in
CLI flags or .agentv/config.local.yaml; repos, hooks, templates, Docker
config, env checks, and workspace scope belong in top-level or case-level
workspace.
YAML Format
The primary format. A single file contains metadata, inline runtime config, and tests:

    description: Math problem solving evaluation
    target: default

    prompts:
      - "{{ question }}"

    assert:
      - Correctly calculates the answer
      - Explains the calculation briefly

    tests:
      - id: addition
        vars:
          question: What is 15 + 27?
        expected_output: "42"

Top-level Fields
| Field | Description |
|---|---|
| description | Human-readable description of the evaluation |
| suite | Optional suite identifier |
| category | Optional slash-delimited analytics taxonomy path. Overrides the category derived from the eval file path. |
| target | System under test by configured target id or inline target object |
| tags | Optional metadata map. Use tags.experiment as the run/result grouping label. |
| prompts | Optional top-level prompt matrix. Entries can be strings, chat message arrays, files, or generated prompt functions rendered with tests[].vars and default_test.vars. |
| targets | Optional target matrix. Entries reference target ids or inline target objects. |
| evaluate_options.repeat | Optional repeat policy as a positive integer shorthand or object with count, strategy, early_exit, and cost_limit_usd |
| evaluate_options | Optional evaluation runtime options such as budget_usd, repeat, and max_concurrency |
| timeout_seconds | Optional per-case timeout |
| threshold | Optional suite quality threshold |
| workspace | Suite-level task environment: inline object or string path to an external workspace file. Repo entries declare identity and checkout pins; acquisition is covered in Workspace Architecture. |
| extensions | Lifecycle hooks: file://path/to/hooks.mjs:beforeAll, beforeEach, afterEach, afterAll, plus the built-in agentv:agent-rules. Hooks run after workspace.repos materializes. |
| imports | Optional import groups. imports.suites imports full child eval suites with their task context. imports.tests imports raw test rows into this file’s context. Import entries may use scoped run: overrides for threshold, repeat, timeout_seconds, and budget_usd. |
| tests | Inline raw tests or a string path to an external raw-case file or directory. Legacy tests[].include entries still load with a migration warning; prefer imports.suites or imports.tests. |
| assert | Suite-level graders appended to each test unless execution.skip_defaults: true is set on the test |

workspace is what the agent can inspect or modify through tools, not prompt
input. Put task instructions and chat/system/user messages in prompts; put
repos, templates, Docker config, env checks, scope, and repo provenance in
workspace. Put lifecycle setup that does not acquire repos in extensions.
For historical or repo-state evals, put the checkout under
workspace.repos[].commit. A commit SHA in the prompt or metadata is useful
context, but it does not materialize a repo for the agent to inspect.
Prompts, Vars, and Target Expansion
Use top-level prompts when you want the Promptfoo-compatible authoring shape:
prompt templates at the top level, test data in tests[].vars, and shared
test-data defaults in default_test.vars. AgentV renders each prompt with the
merged vars for each test, then expands the run as prompts x targets x tests x repeat before execution. Each expanded row keeps the original test_id plus
prompt and target identity for Dashboard filtering, reruns, and comparisons.

    description: Release-note summarization
    tags:
      experiment: prompt-matrix

    prompts:
      - id: direct
        label: Direct
        prompt: "Summarize {{ topic }} for {{ audience }}."
      - id: terse
        label: Terse
        prompt: "In one sentence, summarize {{ topic }} for {{ audience }}."

    default_test:
      vars:
        audience: engineers

    targets:
      - id: local-mini
        provider: openai
        runtime: host
        config:
          model: gpt-5.4-mini
      - id: local-codex
        provider: codex-app-server
        runtime: host
        config:
          command: ["codex", "app-server"]
          model: gpt-5-codex

    tests:
      - id: release-notes
        vars:
          topic: the July release notes
        expected_output: concise release-note summary
        assert:
          - Identifies the most important change
          - Avoids unsupported details
      - id: roadmap
        vars:
          audience: executives
          topic: the next roadmap phase
        expected_output: concise roadmap summary
        assert:
          - Identifies the main product direction
          - Uses the requested audience framing

For prompt matrices, put per-case data in tests[].vars and shared defaults in
default_test.vars. tests[].vars overrides default_test.vars by key.
Prompt templates can use either {{ name }} or {{ vars.name }} placeholders;
the top-level form matches Promptfoo-style prompt templates, while the
vars.* namespace is explicit and useful when a key might collide with other
template context.
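
The expansion and override rules can be pictured with a small Python sketch. It is only an illustration of the documented semantics, not AgentV's implementation: the helper names (merge_vars, expand_rows) and the row keys other than test_id are hypothetical, and targets are reduced to their ids.

```python
from itertools import product

def merge_vars(default_vars, test_vars):
    """tests[].vars overrides default_test.vars key by key."""
    return {**default_vars, **test_vars}

def expand_rows(prompts, targets, tests, default_vars, repeat=1):
    """Expand prompts x targets x tests x repeat into flat rows (hypothetical row shape)."""
    rows = []
    for prompt, target, test, attempt in product(prompts, targets, tests, range(repeat)):
        rows.append({
            "test_id": test["id"],          # the original test id is kept on every row
            "prompt_id": prompt["id"],
            "target_id": target,
            "attempt": attempt,
            "vars": merge_vars(default_vars, test.get("vars", {})),
        })
    return rows

prompts = [{"id": "direct"}, {"id": "terse"}]
targets = ["local-mini", "local-codex"]
tests = [
    {"id": "release-notes", "vars": {"topic": "the July release notes"}},
    {"id": "roadmap", "vars": {"audience": "executives", "topic": "the next roadmap phase"}},
]

rows = expand_rows(prompts, targets, tests, {"audience": "engineers"}, repeat=2)
roadmap_audiences = {row["vars"]["audience"] for row in rows if row["test_id"] == "roadmap"}
print(len(rows), roadmap_audiences)
```

With two prompts, two targets, two tests, and a repeat count of two, the sketch yields 16 rows; the roadmap rows carry the overridden audience while release-notes rows fall back to the default.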
Do not author direct input fields in normal eval YAML. tests[].input and
top-level input are removed from public authored eval files. Put prompt text,
system messages, and user messages in top-level prompts; put row data in
tests[].vars and shared defaults in default_test.vars.
For simple task text, use a prompt template such as "{{ input }}" or
"{{ vars.input }}" with:

    tests:
      - id: direct
        vars:
          input: Summarize the July release notes.

External raw-case files imported through tests: file://... or
imports.tests may still contain raw internal input rows for compatibility.
Keep that compatibility out of normal eval YAML authoring.
Lifecycle Extensions
extensions uses AgentV lifecycle names. File hooks are local
JavaScript or TypeScript modules resolved relative to the eval file:

    extensions:
      - file://scripts/setup.mjs:beforeAll
      - file://scripts/setup.mjs:beforeEach
      - file://scripts/setup.mjs:afterEach
      - file://scripts/setup.mjs:afterAll

Each exported function receives a context object with snake_case keys such as
workspace_path, test_id, eval_run_id, case_input, and case_metadata.
Setup hook failures (beforeAll, beforeEach) fail the affected run; teardown
hook failures (afterEach, afterAll) are non-fatal.
agentv:agent-rules is the only built-in extension in this slice. It runs after
workspace materialization and exposes staged rule paths to providers and result
metadata as agent_rules_paths:

    extensions:
      - id: agentv:agent-rules
        hook: beforeAll
        skills: agent-rules/skills
        hooks: agent-rules/hooks
        agents: agent-rules/agents
        rules: agent-rules/AGENTS.md

If agentv:agent-rules is authored as a string, it defaults to beforeAll and
discovers conventional rule locations already present in the materialized
workspace. It does not clone repositories or replace workspace.repos.
Metadata Fields
You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:
| Field | Description |
|---|---|
| name | Machine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing. |
| description | Human-readable description (max 1024 chars) |
| version | Eval version string (e.g., "1.0") |
| author | Author or team identifier |
| tags | Array of string tags for categorization |
| license | License identifier (e.g., "MIT", "Apache-2.0") |
| requires | Dependency constraints (e.g., agentv: ">=0.30.0") |


    name: export-screening
    description: Evaluates export control screening accuracy
    version: "1.0"
    author: acme-compliance
    tags: [compliance, agents]
    license: Apache-2.0
    requires:
      agentv: ">=0.30.0"

    tests:
      - id: denied-party
        criteria: Identifies denied parties correctly
        input: Screen "Acme Corp" against denied parties list

When category is omitted, AgentV derives it from the eval file path. Generic
filenames do not add a leaf: security/eval.yaml becomes security, and
security/network/suite.yaml becomes security/network. A meaningfully
named eval file contributes a leaf, so security/network.eval.yaml becomes
security/network. Existing flat category strings remain valid one-node
category paths.
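
The path-to-category rule can be sketched in Python as follows. This is a rough model of the examples above, not AgentV's code: derive_category is a hypothetical helper, and the set of generic filename stems is an assumption limited to the two names the examples show (eval and suite).

```python
from pathlib import PurePosixPath

# Assumption: only the stems shown in the examples are treated as generic.
GENERIC_STEMS = {"eval", "suite"}

def derive_category(eval_path: str) -> str:
    """Derive a slash-delimited category from an eval file path (illustrative only)."""
    path = PurePosixPath(eval_path)
    name = path.name
    # Strip the YAML extension, then an optional ".eval" marker.
    for ext in (".yaml", ".yml"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    if name.endswith(".eval"):
        name = name[: -len(".eval")]
    parts = list(path.parent.parts)
    # Generic filenames do not add a leaf; meaningful names do.
    if name.lower() not in GENERIC_STEMS:
        parts.append(name)
    return "/".join(parts)

for example in ("security/eval.yaml", "security/network/suite.yaml", "security/network.eval.yaml"):
    print(example, "->", derive_category(example))
```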
Suite-level Assert
The assert field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test’s graders unless a test sets execution.skip_defaults: true.
For semantic or agent-behavior checks, prefer plain assertion strings first;
AgentV treats them as rubric criteria. Use deterministic assertions or script
graders when the expected output is exact or requires programmatic inspection.
If the assertion strings already state the grading contract, omit a duplicate
criteria field on each test. Use explicit type: llm-rubric entries only
when you need a custom prompt, a custom grader target, or a deliberately
separate grader panel.

    description: API response validation
    assert:
      - type: is-json
        required: true
      - type: contains
        value: "status"
      - Correctly answers the user's question
      - Explains the reasoning clearly

    tests:
      - id: health-check
        input: Check API health

assert supports rubric shorthand strings, deterministic assertion types
(contains, regex, is-json, equals), llm-rubric, and script
graders. See Tests for
per-test assert usage.
Assert Includes
Reusable assertion sets can be factored into template files and referenced from any assert array:

    assert:
      - include: safe-response
      - include: ./shared/format.yaml

Resolution rules:
- include: name resolves to .agentv/templates/{name}.yaml with the closest matching directory winning
- Relative paths resolve from the eval file location, so include: ./shared/format.yaml works as expected
- Nested includes are allowed up to depth 3 to keep cycles and runaway recursion bounded
- Suite-level includes follow the same merge behavior as other suite-level assertions and still respect execution.skip_defaults: true
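
One way to read the first two resolution rules is sketched below in Python. The function name resolve_include is hypothetical, and the sketch assumes that "closest matching directory" means walking upward from the eval file's directory until a .agentv/templates/{name}.yaml is found. Nested include expansion and the depth limit are not modeled.

```python
import tempfile
from pathlib import Path
from typing import Optional

def resolve_include(ref: str, eval_file: Path) -> Optional[Path]:
    """Resolve an include reference to a template path, or None if nothing matches."""
    eval_dir = eval_file.resolve().parent
    # Relative paths resolve from the eval file location.
    if ref.startswith("./") or ref.startswith("../"):
        candidate = (eval_dir / ref).resolve()
        return candidate if candidate.is_file() else None
    # Bare names: assumed upward walk, so the closest .agentv/templates wins.
    for directory in (eval_dir, *eval_dir.parents):
        candidate = directory / ".agentv" / "templates" / f"{ref}.yaml"
        if candidate.is_file():
            return candidate
    return None

# Demo on a throwaway tree with a repo-level and a nearer suite-level template.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    suite_dir = root / "evals" / "api"
    for base in (root, suite_dir):
        (base / ".agentv" / "templates").mkdir(parents=True)
        (base / ".agentv" / "templates" / "safe-response.yaml").write_text("[]\n")
    (suite_dir / "shared").mkdir()
    (suite_dir / "shared" / "format.yaml").write_text("[]\n")
    eval_file = suite_dir / "suite.eval.yaml"
    eval_file.write_text("")

    named = resolve_include("safe-response", eval_file)
    relative = resolve_include("./shared/format.yaml", eval_file)
    named_is_closest = named == suite_dir / ".agentv" / "templates" / "safe-response.yaml"
    relative_ok = relative == suite_dir / "shared" / "format.yaml"

print(named_is_closest, relative_ok)
```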
Shared Prompt Context
Put shared prompt instructions in top-level prompts, with row-specific values
in tests[].vars and shared values in default_test.vars.

    description: Travel assistant evaluation
    prompts:
      - - role: system
          content: Answer as a concise travel assistant.
        - role: user
          content: "{{ question }}"

    tests: ./cases.yaml

Use a block scalar for multi-line shared instructions:

    prompts:
      - - role: system
          content: |
            Read AGENTS.md before answering.
            Explain the tradeoffs clearly.
        - role: user
          content: "{{ question }}"

    tests: ./cases.yaml

Each test in cases.yaml only needs its own data:

    - id: japan-spring
      criteria: Recommends spring for cherry blossoms
      vars:
        question: When is the best time to visit Japan?

File-Backed Prompt Context
Normal eval YAML should model file-backed context through prompts and vars.
For example, put a path or file:// reference in tests[].vars, then render it
from a prompt entry next to the task text.

    description: Schema review evaluation

    prompts:
      - - role: user
          content:
            - type: file
              value: "{{ context_file }}"
            - type: text
              value: "{{ question }}"

    tests:
      - id: summarize
        vars:
          context_file: ./shared-context.md
          question: Summarize the important constraints.
      - id: validate
        vars:
          context_file: ./schema.json
          question: What validation is missing?

PROMPT.md Fallback
For directory-style raw cases, a test may omit direct input and keep the task prompt in Markdown instead. AgentV resolves the prompt in this order:
- If the effective input_files contains a file named exactly PROMPT.md, that file becomes the test prompt.
- Otherwise, if a PROMPT.md exists beside the EVAL.yaml, that file becomes the test prompt.
- Other input_files remain attachments. PROMPT.md is removed from the attachment list so the prompt is not duplicated.
Example layout:

    agent-001-fix-bug/
      EVAL.yaml
      PROMPT.md
      fixtures/
        failing-test.log

EVAL.yaml:

    tests:
      - id: fix-bug
        criteria: Fixes the regression described in the prompt
        input_files:
          - ./fixtures/failing-test.log

Use explicit input when the prompt is short or generated from YAML variables.
Use PROMPT.md when the task text is long enough that duplicating it inside
YAML would make the eval hard to review.
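
The fallback order can be sketched in Python. This is an illustration of the documented order only: resolve_prompt is a hypothetical helper, and it assumes the first PROMPT.md entry in input_files wins when more than one is listed.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

def resolve_prompt(eval_dir: Path, input_files: List[str]) -> Tuple[Optional[Path], List[str]]:
    """Return (prompt_file, attachments) following the PROMPT.md fallback order."""
    named = [f for f in input_files if Path(f).name == "PROMPT.md"]
    # PROMPT.md never stays in the attachment list, so the prompt is not duplicated.
    attachments = [f for f in input_files if Path(f).name != "PROMPT.md"]
    if named:
        return eval_dir / named[0], attachments
    sibling = eval_dir / "PROMPT.md"
    return (sibling if sibling.is_file() else None), attachments

# Demo: PROMPT.md sits beside EVAL.yaml and is not listed in input_files.
with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "agent-001-fix-bug"
    (case_dir / "fixtures").mkdir(parents=True)
    (case_dir / "PROMPT.md").write_text("Fix the regression.\n")
    prompt, attachments = resolve_prompt(case_dir, ["./fixtures/failing-test.log"])
    prompt_name = prompt.name if prompt else None

print(prompt_name, attachments)
```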
Raw Cases as String Paths
Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file of raw cases. This is the inverse of the sidecar pattern: here the metadata file references the test data.

    name: my-eval
    description: My evaluation suite
    target: default
    tests: ./cases.yaml

The path is resolved relative to the eval file’s directory. The external raw
case file can be a YAML or JSON array of test objects, a JSONL file with one
test per line, a CSV file with AgentV expected columns, or an explicit JavaScript or
Python dataset function such as file://generate-tests.mjs:createTests or
file://generate_tests.py:create_tests. String entries inside a tests: list
work the same way and may use direct paths, file:// paths, directories, or
globs:

    tests:
      - ./cases/*.cases.yaml

CSV datasets support magic columns. __expected and
__expectedN create AgentV assertions using the supported expected-column
mini-DSL (contains:*, icontains:*, contains-any:*, contains-all:*,
icontains-any:*, icontains-all:*, starts-with:*, ends-with:*,
regex:*, equals:*, is-json, latency(<ms>), cost(<usd>),
grade:*, llm-rubric:*, javascript:*, fn:*, eval:*, python:*, and
file://*.py; file paths inside CSV cells are resolved relative to the CSV
file). Unsupported assertion forms such as similar:* are rejected during
validation instead of being skipped at runtime.
__provider_output becomes first-class expected_output reference data,
__metric names the generated assertions, __threshold sets the test threshold,
__metadata:<key> adds metadata, and __config:__expectedN:threshold sets an
assertion min_score. Ordinary columns become vars, so CSV rows can rely on
suite-level input that interpolates those variables.
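
To make the column roles concrete, here is a Python sketch that partitions one CSV row by the magic-column prefixes described above. It is not AgentV's loader: split_csv_row and the output keys are hypothetical, the expected-column mini-DSL strings are passed through unparsed, and the sample CSV is invented for illustration.

```python
import csv
import io
import re

def split_csv_row(row: dict) -> dict:
    """Partition one CSV row into vars, expected assertions, metadata, and config."""
    case = {"vars": {}, "expected": [], "metadata": {}, "config": {}}
    for column, value in row.items():
        if re.fullmatch(r"__expected\d*", column):
            if value:
                case["expected"].append(value)    # mini-DSL string, left unparsed here
        elif column == "__provider_output":
            case["expected_output"] = value
        elif column == "__metric":
            case["metric"] = value
        elif column == "__threshold":
            if value:
                case["threshold"] = float(value)
        elif column.startswith("__metadata:"):
            case["metadata"][column[len("__metadata:"):]] = value
        elif column.startswith("__config:"):
            case["config"][column[len("__config:"):]] = value
        else:
            case["vars"][column] = value          # ordinary columns become vars
    return case

# Hypothetical sample data.
CSV_TEXT = """question,__expected,__expected2,__threshold,__metadata:owner
What is 15 + 27?,contains:42,icontains:sum,0.8,math-team
"""

rows = [split_csv_row(r) for r in csv.DictReader(io.StringIO(CSV_TEXT))]
print(rows[0])
```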
String shorthand is raw-case-only. Import reusable task suites through
imports.suites; use imports.tests when you want to drop suite context and
import only raw cases into the parent context:

    imports:
      suites:
        - path: ./suites/*.eval.yaml
      tests:
        - path: ./cases/regression.jsonl

    tests:
      - id: local-edge-case
        input: ...

Legacy tests[].include entries still load with a migration warning for older
eval files, but new evals should use imports.suites or imports.tests.
Raw Cases as Directory Paths
When tests points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a case.yaml (or case.yml) becomes a test case:

    my-eval/
      EVAL.yaml
      cases/
        fix-null-check/
          case.yaml
        add-greeting/
          case.yaml
          workspace/          # optional per-case workspace template
            setup-files...

EVAL.yaml:

    name: my-benchmark
    tests: ./cases/

Each case.yaml is a single YAML object (not an array) with the same fields as an inline test:

    criteria: Fixes the null reference bug in the parser module
    vars:
      task: Fix the null check bug in parser.ts

Behavior:
- Directory name as id: If case.yaml doesn’t specify an id, the directory name is used (e.g., fix-null-check)
- Alphabetical ordering: Subdirectories are sorted alphabetically for deterministic order
- Per-case workspace: A workspace/ subdirectory inside the case directory automatically sets workspace.template to that path, unless the case already defines a workspace field
- Skipped directories: Subdirectories without case.yaml are skipped with a warning
- Suite-level config applies: Suite-level assert, prompts, workspace, target, and top-level run controls still apply to directory-discovered cases
This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation. For guidance on keeping provenance metadata, patches, oracle files, and generated dataset rows out of oversized inline YAML, see Benchmark Provenance.
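
The discovery behavior can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the real loader: discover_cases and its return shape are hypothetical, and the case files are not parsed, so the sketch only reports the directory-name default id and cannot honor an id or workspace field declared inside case.yaml.

```python
import tempfile
from pathlib import Path

def discover_cases(cases_dir: Path):
    """Return (cases, skipped) for a directory of per-case subdirectories."""
    cases, skipped = [], []
    # Sorted for a deterministic, alphabetical order.
    for sub in sorted(p for p in cases_dir.iterdir() if p.is_dir()):
        case_file = next(
            (sub / name for name in ("case.yaml", "case.yml") if (sub / name).is_file()),
            None,
        )
        if case_file is None:
            skipped.append(sub.name)  # a real loader would also emit a warning
            continue
        workspace = sub / "workspace"
        cases.append({
            "id": sub.name,  # default id when case.yaml does not set one
            "case_file": case_file,
            "workspace_template": workspace if workspace.is_dir() else None,
        })
    return cases, skipped

# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cases"
    (root / "fix-null-check").mkdir(parents=True)
    (root / "fix-null-check" / "case.yaml").write_text("criteria: Fixes the null check\n")
    (root / "add-greeting" / "workspace").mkdir(parents=True)
    (root / "add-greeting" / "case.yml").write_text("criteria: Adds a greeting\n")
    (root / "notes").mkdir()
    cases, skipped = discover_cases(root)
    ids = [case["id"] for case in cases]
    has_template = [case["workspace_template"] is not None for case in cases]

print(ids, has_template, skipped)
```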
Environment Variable Interpolation
All string fields in eval files support {{ env.VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

    workspace:
      repos:
        - path: ./RepoA
          repo: "{{ env.REPO_A_URL }}"
          commit: "{{ env.REPO_A_COMMIT }}"

    prompts:
      - "{{ prompt }}"

    tests:
      - id: test-1
        vars:
          prompt: "Evaluate the code in {{ env.PROJECT_NAME }}"
        criteria: "{{ env.EVAL_CRITERIA }}"

Behavior
- Syntax: {{ env.VARIABLE_NAME }} with optional whitespace around the name
- Missing variables resolve to an empty string
- Partial interpolation is supported: {{ env.HOME }}/repos/{{ env.PROJECT }} becomes /home/user/repos/myproject
- Non-string values (numbers, booleans) are not affected
- Interpolation is applied recursively to all nested objects and arrays
- Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
- .env files in the directory hierarchy are loaded automatically before interpolation
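
These rules are easy to model with a regular expression and a recursive walk. The Python sketch below is an approximation for illustration: interpolate_env and ENV_PATTERN are hypothetical names, the environment is passed in as a plain dict, and .env file loading is left out.

```python
import re

# Matches {{ env.NAME }} with optional whitespace around the name.
ENV_PATTERN = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def interpolate_env(value, env):
    """Recursively replace {{ env.NAME }} in strings; other types pass through."""
    if isinstance(value, str):
        # Missing variables resolve to an empty string.
        return ENV_PATTERN.sub(lambda match: env.get(match.group(1), ""), value)
    if isinstance(value, dict):
        return {key: interpolate_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [interpolate_env(item, env) for item in value]
    return value  # numbers, booleans, None are not affected

env = {"HOME": "/home/user", "PROJECT": "myproject"}
config = {
    "workspace": {
        "repos": [
            {
                "path": "{{ env.HOME }}/repos/{{ env.PROJECT }}",
                "commit": "{{env.MISSING}}",
                "depth": 1,
                "shallow": True,
            }
        ]
    }
}

resolved = interpolate_env(config, env)
print(resolved["workspace"]["repos"][0])
```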
Example: Portable Workspace Config

    # workspace.yaml (works on any machine)
    repos:
      - path: ./my-repo
        repo: "{{ env.MY_REPO_URL }}"
        commit: "{{ env.MY_REPO_COMMIT }}"

.env:

    MY_REPO_URL=https://github.com/org/my-repo.git
    MY_REPO_COMMIT=main

Per-Test Template Variables
Eval YAML supports per-test vars for data-driven prompt suites. Use
{{ vars.name }} placeholders in prompt entries and test-facing text fields,
and AgentV resolves them when the suite loads. Shared defaults can live in
default_test.vars; per-test vars override those defaults by key.

    prompts:
      - "Answer clearly: {{ vars.question }}"

    tests:
      - id: capital
        vars:
          question: What is the capital of France?
          expected_answer: Paris
        criteria: "Answers {{ vars.question }} correctly"
        expected_output: "{{ vars.expected_answer }}"

Behavior
- vars is defined per test as an object, with optional defaults from default_test.vars
- {{ vars.name }} and dotted paths like {{ vars.user.name }} are supported
- Substitution applies to prompts, criteria, expected_output, assertion values/metrics, and conversation turn input / expected_output / assertions
- When the whole string is a single placeholder, the original JSON value is preserved
- Missing variables render as empty strings following Nunjucks semantics
- vars interpolation is separate from environment interpolation: {{ vars.question }} uses test data, {{ env.PROJECT_NAME }} uses environment variables
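
The whole-placeholder and missing-variable rules are the least obvious ones, so here is a small Python sketch of them. It is an approximation rather than the actual Nunjucks-based renderer: render, lookup, and PLACEHOLDER are hypothetical names, only the {{ vars.* }} form is handled, and non-string values inside a larger string are simply converted with str().

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*vars\.([A-Za-z0-9_.]+)\s*\}\}")

def lookup(vars_, dotted):
    """Follow a dotted path such as user.name; return None when any part is missing."""
    current = vars_
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def render(template, vars_):
    """Substitute {{ vars.* }} placeholders in one string."""
    whole = PLACEHOLDER.fullmatch(template)
    if whole:
        # A string that is exactly one placeholder keeps the original value's type.
        value = lookup(vars_, whole.group(1))
        return "" if value is None else value

    def replace(match):
        value = lookup(vars_, match.group(1))
        return "" if value is None else str(value)

    return PLACEHOLDER.sub(replace, template)

defaults = {"user": {"name": "Ada"}, "limits": {"max": 3}}
test_vars = {"question": "What is the capital of France?"}
merged = {**defaults, **test_vars}  # per-test vars override defaults by key

print(render("Answers {{ vars.question }} correctly", merged))
print(render("{{ vars.limits }}", merged))
```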
JSONL Format
For large-scale evaluations, AgentV supports JSONL (JSON Lines) format. Each line is a single test:

    {"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
    {"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

Sidecar Metadata
An optional YAML sidecar file provides metadata and execution config. Place it alongside the JSONL file with the same base name:
cases.jsonl + suite.yaml:

    description: Math evaluation dataset
    suite: math-tests
    target: azure-base
    assert:
      - name: correctness
        type: llm-rubric
        prompt: ./graders/correctness.md

Benefits of JSONL
- Streaming-friendly: process line by line
- Git-friendly: diffs show individual case changes
- Programmatic generation: easy to create from scripts
- Industry standard: compatible with DeepEval, LangWatch, Hugging Face datasets
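
As a sketch of programmatic generation, the Python snippet below builds a few cases and serializes them one JSON object per line. The field names (id, criteria, input) follow the JSONL example above; the to_jsonl helper and the generated cases are hypothetical.

```python
import json

def to_jsonl(cases):
    """Serialize test cases as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(case, ensure_ascii=False) for case in cases) + "\n"

# Hypothetical generated cases.
cases = [
    {"id": f"add-{a}-{b}", "criteria": "Calculates correctly", "input": f"What is {a}+{b}?"}
    for a, b in [(2, 2), (15, 27)]
]

text = to_jsonl(cases)
# Reading back line by line shows the streaming-friendly shape.
parsed = [json.loads(line) for line in text.splitlines()]
print(text, end="")
```

Writing text to a file such as cases.jsonl gives a dataset that diffs one case per line.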
Converting Between Formats
Use the convert command to switch between YAML and JSONL:

    agentv convert evals/suite.yaml --format jsonl
    agentv convert evals/cases.jsonl --format yaml