Eval Files

Evaluation files define the test cases, graders, workspace lifecycle, and run controls for an evaluation run. The reserved tags.experiment key is the run/result grouping label, top-level target identifies the system under test, and fields such as evaluate_options.repeat, threshold, timeout_seconds, evaluate_options.budget_usd, and evaluate_options.max_concurrency control repeated attempts and gates. Workspace lifetime belongs under workspace.scope; repository provenance belongs under workspace.repos; Docker/container binding belongs under workspace.docker. Non-provisioning setup commands belong in top-level extensions; reset policy stays under workspace.hooks.after_each.reset; runner-specific setup belongs in the target object, in targets, or in project config. AgentV supports two eval data formats: YAML and JSONL.

YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract. Eval files describe the task, target binding, and run controls. Use evaluate_options.max_concurrency for authored suite concurrency. Operators can still override concurrency with agentv eval --workers N; do not author legacy workers fields in eval YAML.

For Promptfoo-style authoring, AgentV uses the same broad prompt, test, vars, default test, evaluate-options, and assertion contract with snake_case wire fields. See the Promptfoo parity matrix for the exact alignments, intentional AgentV extensions, and future-scope Promptfoo surfaces.

Authoring Shapes

Eval YAML is AgentV’s composable and runnable authoring primitive. It is a focused, shareable slice of the same config graph as .agentv/config.yaml. Use ordinary *.eval.yaml files for direct task suites and for wrapper evals that compose other suites. Raw case files are reusable data inputs, not a second runnable experiment format.

A task suite is eval YAML that owns task context: workspace, shared prompts, shared assert, fixtures, graders, and test cases. It can run directly or be imported through imports.suites.
A raw case file is a YAML, JSON, JSONL, CSV, script-backed dataset, directory, or glob of cases. Import it with imports.tests, tests: ./cases.yaml, tests: file://cases.csv, or string shorthand; parent suite context applies because raw cases do not carry their own suite context.
A wrapper eval is eval YAML that imports one or more suites with imports.suites and binds run controls with top-level target, threshold, timeout_seconds, and evaluate_options. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with imports.suites must not define parent workspace; imported suites own task environment. Machine-local existing workspace paths belong in CLI flags or config.local.yaml, not eval YAML.

For example, a reusable task suite can keep the task contract in one file:

suite: refunds
workspace:
  repos:
    - path: ./support-app
      repo: acme/support-app
      commit: main
prompts:
  - - role: system
      content: Answer using the refund policy in the workspace.
    - role: user
      content: "{{ question }}"
assert:
  - Applies the refund policy correctly
tests:
  - id: missing-receipt
    vars:
      question: Can this customer get a refund without a receipt?

Raw cases are just case data:

- id: damaged-item
  input: The item arrived damaged. What should support do?
  expected_output: Offer a replacement or refund path.

A wrapper eval stays ordinary eval YAML while choosing a target and run controls:

name: refunds-codex
target: codex-gpt5
evaluate_options:
  max_concurrency: 3
  repeat:
    count: 2
    strategy: pass_any

imports:
  suites:
    - path: ../evals/suites/refunds.eval.yaml
  tests:
    - path: ../evals/cases/refund-smoke.cases.yaml

tests:
  - id: local-edge-case
    vars:
      question: Can a final-sale item be refunded after damage in transit?
    expected_output: Explain the final-sale exception for damaged transit.

The wrapper in that example can live in any directory; a dedicated experiments/ directory is optional and user-owned. AgentV does not infer behavior from the path; the wrapper runs because it is eval YAML with tests or imports. The wrapper owns target selection and run controls, so put workspace setup in the imported child suites. Parent workspace-affecting fields, including top-level workspace, are meant for parent-owned raw cases, including cases imported with imports.tests. Runtime workspace path overrides belong in CLI flags or .agentv/config.local.yaml; repos, hooks, templates, Docker config, env checks, and workspace scope belong in top-level or case-level workspace.

YAML Format

YAML is the primary format. A single file contains metadata, inline runtime config, and tests:

description: Math problem solving evaluation
target: default

prompts:
  - "{{ question }}"

assert:
  - Correctly calculates the answer
  - Explains the calculation briefly

tests:
  - id: addition
    vars:
      question: What is 15 + 27?
    expected_output: "42"

Top-level Fields

Field	Description
`description`	Human-readable description of the evaluation
`suite`	Optional suite identifier
`category`	Optional slash-delimited analytics taxonomy path. Overrides the category derived from the eval file path.
`target`	System under test by configured target `id` or inline target object
`tags`	Optional metadata map. Use `tags.experiment` as the run/result grouping label.
`prompts`	Optional top-level prompt matrix. Entries can be strings, chat message arrays, files, or generated prompt functions rendered with `tests[].vars` and `default_test.vars`.
`targets`	Optional target matrix. Entries reference target ids or inline target objects.
`evaluate_options.repeat`	Optional repeat policy as a positive integer shorthand or object with `count`, `strategy`, `early_exit`, and `cost_limit_usd`
`evaluate_options`	Optional evaluation runtime options such as `budget_usd`, `repeat`, and `max_concurrency`
`timeout_seconds`	Optional per-case timeout
`threshold`	Optional suite quality threshold
`workspace`	Suite-level task environment — inline object or string path to an external workspace file. Repo entries declare identity and checkout pins; acquisition is covered in Workspace Architecture.
`extensions`	Lifecycle hooks: `file://path/to/hooks.mjs:beforeAll`, `beforeEach`, `afterEach`, `afterAll`, plus the built-in `agentv:agent-rules`. Hooks run after `workspace.repos` materializes.
`imports`	Optional import groups. `imports.suites` imports full child eval suites with their task context. `imports.tests` imports raw test rows into this file’s context. Import entries may use scoped `run:` overrides for `threshold`, `repeat`, `timeout_seconds`, and `budget_usd`.
`tests`	Inline raw tests or a string path to an external raw-case file or directory. Legacy `tests[].include` entries still load with a migration warning; prefer `imports.suites` or `imports.tests`.
`assert`	Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test

workspace is what the agent can inspect or modify through tools, not prompt input. Put task instructions and chat/system/user messages in prompts; put repos, templates, Docker config, env checks, scope, and repo provenance in workspace. Put lifecycle setup that does not acquire repos in extensions.

For historical or repo-state evals, put the checkout under workspace.repos[].commit. A commit SHA in the prompt or metadata is useful context, but it does not materialize a repo for the agent to inspect.

Prompts, Vars, and Target Expansion

Use top-level prompts when you want the Promptfoo-compatible authoring shape: prompt templates at the top level, test data in tests[].vars, and shared test-data defaults in default_test.vars. AgentV renders each prompt with the merged vars for each test, then expands the run as prompts x targets x tests x repeat before execution. Each expanded row keeps the original test_id plus prompt and target identity for Dashboard filtering, reruns, and comparisons.

description: Release-note summarization
tags:
  experiment: prompt-matrix

prompts:
  - id: direct
    label: Direct
    prompt: "Summarize {{ topic }} for {{ audience }}."
  - id: terse
    label: Terse
    prompt: "In one sentence, summarize {{ topic }} for {{ audience }}."

default_test:
  vars:
    audience: engineers

targets:
  - id: local-mini
    provider: openai
    runtime: host
    config:
      model: gpt-5.4-mini
  - id: local-codex
    provider: codex-app-server
    runtime: host
    config:
      command: ["codex", "app-server"]
      model: gpt-5-codex

tests:
  - id: release-notes
    vars:
      topic: the July release notes
    expected_output: concise release-note summary
    assert:
      - Identifies the most important change
      - Avoids unsupported details
  - id: roadmap
    vars:
      audience: executives
      topic: the next roadmap phase
    expected_output: concise roadmap summary
    assert:
      - Identifies the main product direction
      - Uses the requested audience framing

For prompt matrices, put per-case data in tests[].vars and shared defaults in default_test.vars. tests[].vars overrides default_test.vars by key. Prompt templates can use either {{ name }} or {{ vars.name }} placeholders; the top-level form matches Promptfoo-style prompt templates, while the vars.* namespace is explicit and useful when a key might collide with other template context.
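The merge and expansion rules above can be illustrated with a short Python sketch. It is not AgentV's implementation: the function names, the row field names (prompt_id, target_id, attempt), and the row ordering are assumptions made for illustration. Only the key-by-key override and the prompts x targets x tests x repeat product come from the text.

```python
from itertools import product


def merged_vars(default_vars, test_vars):
    """Per-test vars override default_test.vars key by key."""
    return {**default_vars, **test_vars}


def expand_rows(prompts, targets, tests, default_vars, repeat=1):
    """Yield one row per prompt x target x test x repeat attempt.

    Row field names and iteration order are illustrative assumptions.
    """
    for prompt, target, test, attempt in product(prompts, targets, tests, range(repeat)):
        yield {
            "test_id": test["id"],
            "prompt_id": prompt,
            "target_id": target,
            "attempt": attempt,
            "vars": merged_vars(default_vars, test.get("vars", {})),
        }


# Values mirror the release-note example: two prompts, two targets, two tests.
rows = list(
    expand_rows(
        prompts=["direct", "terse"],
        targets=["local-mini", "local-codex"],
        tests=[
            {"id": "release-notes", "vars": {"topic": "the July release notes"}},
            {"id": "roadmap", "vars": {"audience": "executives", "topic": "the next roadmap phase"}},
        ],
        default_vars={"audience": "engineers"},
    )
)
print(len(rows))  # 2 prompts x 2 targets x 2 tests x 1 repeat = 8
```

In this sketch the release-notes rows inherit audience: engineers from the defaults, while the roadmap rows keep their own audience: executives.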

Do not author direct input fields in normal eval YAML. tests[].input and top-level input are removed from public authored eval files. Put prompt text, system messages, and user messages in top-level prompts; put row data in tests[].vars and shared defaults in default_test.vars.

For simple task text, use a prompt template such as "{{ input }}" or "{{ vars.input }}" with:

tests:
  - id: direct
    vars:
      input: Summarize the July release notes.

External raw-case files imported through tests: file://... or imports.tests may still contain raw internal input rows for compatibility. Keep that compatibility out of normal eval YAML authoring.

Lifecycle Extensions

extensions uses AgentV lifecycle names. File hooks are local JavaScript or TypeScript modules resolved relative to the eval file:

extensions:
  - file://scripts/setup.mjs:beforeAll
  - file://scripts/setup.mjs:beforeEach
  - file://scripts/setup.mjs:afterEach
  - file://scripts/setup.mjs:afterAll

Each exported function receives a context object with snake_case keys such as workspace_path, test_id, eval_run_id, case_input, and case_metadata. Setup hook failures (beforeAll, beforeEach) fail the affected run; teardown hook failures (afterEach, afterAll) are non-fatal.

agentv:agent-rules is currently the only built-in extension. It runs after workspace materialization and exposes staged rule paths to providers and result metadata as agent_rules_paths:

extensions:
  - id: agentv:agent-rules
    hook: beforeAll
    skills: agent-rules/skills
    hooks: agent-rules/hooks
    agents: agent-rules/agents
    rules: agent-rules/AGENTS.md

If agentv:agent-rules is authored as a string, it defaults to beforeAll and discovers conventional rule locations already present in the materialized workspace. It does not clone repositories or replace workspace.repos.

Metadata Fields

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:

Field	Description
`name`	Machine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing.
`description`	Human-readable description (max 1024 chars)
`version`	Eval version string (e.g., `"1.0"`)
`author`	Author or team identifier
`tags`	Array of string tags for categorization
`license`	License identifier (e.g., `"MIT"`, `"Apache-2.0"`)
`requires`	Dependency constraints (e.g., `agentv: ">=0.30.0"`)

name: export-screening
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
  agentv: ">=0.30.0"

prompts:
  - "{{ input }}"

tests:
  - id: denied-party
    criteria: Identifies denied parties correctly
    vars:
      input: Screen "Acme Corp" against denied parties list

When category is omitted, AgentV derives it from the eval file path. Generic filenames do not add a leaf: security/eval.yaml becomes security, and security/network/suite.yaml becomes security/network. A meaningfully named eval file contributes a leaf, so security/network.eval.yaml becomes security/network. Existing flat category strings remain valid one-node category paths.
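The path-to-category rule can be sketched in Python as follows. This is an illustration rather than AgentV's code: the set of generic filenames is an assumption limited to the two shown in the text (eval and suite), and the recognized suffixes are likewise assumed.

```python
from pathlib import PurePosixPath

# Assumption: only the generic stems that appear in the documented examples.
GENERIC_STEMS = {"eval", "suite"}
# Assumption: suffixes stripped before deciding whether the name is meaningful.
SUFFIXES = (".eval.yaml", ".eval.yml", ".yaml", ".yml")


def derive_category(eval_path: str) -> str:
    """Derive a slash-delimited category from an eval file path."""
    path = PurePosixPath(eval_path)
    stem = path.name
    for suffix in SUFFIXES:
        if path.name.endswith(suffix):
            stem = path.name[: -len(suffix)]
            break
    parts = list(path.parent.parts)
    # A generic filename adds no leaf; a meaningful name becomes the leaf.
    if stem and stem.lower() not in GENERIC_STEMS:
        parts.append(stem)
    return "/".join(parts)


print(derive_category("security/eval.yaml"))
print(derive_category("security/network/suite.yaml"))
print(derive_category("security/network.eval.yaml"))
```

Under these assumptions the three calls print security, security/network, and security/network, matching the examples in the paragraph above.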

Suite-level Assert

The assert field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test’s graders unless a test sets execution.skip_defaults: true. For semantic or agent-behavior checks, prefer plain assertion strings first; AgentV treats them as rubric criteria. Use deterministic assertions or script graders when the expected output is exact or requires programmatic inspection. If the assertion strings already state the grading contract, omit a duplicate criteria field on each test. Use explicit type: llm-rubric entries only when you need a custom prompt, a custom grader target, or a deliberately separate grader panel.

description: API response validation
assert:
  - type: is-json
    required: true
  - type: contains
    value: "status"
  - Correctly answers the user's question
  - Explains the reasoning clearly

prompts:
  - "{{ input }}"

tests:
  - id: health-check
    vars:
      input: Check API health

assert supports rubric shorthand strings, deterministic assertion types (contains, regex, is-json, equals), llm-rubric, and script graders. See Tests for per-test assert usage.
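As a rough model of how suite-level assertions combine with a test's own graders, consider the Python sketch below. The function name is hypothetical, and placing the test's assertions before the suite-level ones is one reading of "appended", not a documented ordering guarantee.

```python
def effective_graders(suite_assert, test):
    """Return the graders that would apply to one test.

    Suite-level assertions are appended after the test's own assertions
    unless the test sets execution.skip_defaults: true.
    """
    test_assert = list(test.get("assert", []))
    if test.get("execution", {}).get("skip_defaults"):
        return test_assert
    return test_assert + list(suite_assert)


suite_assert = [{"type": "is-json", "required": True}, "Explains the reasoning clearly"]

with_defaults = effective_graders(
    suite_assert,
    {"id": "health-check", "assert": [{"type": "contains", "value": "status"}]},
)
without_defaults = effective_graders(
    suite_assert,
    {
        "id": "raw-text",
        "assert": ["Answers in plain text"],
        "execution": {"skip_defaults": True},
    },
)
print(len(with_defaults), len(without_defaults))  # 3 1
```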

Assert Includes

Reusable assertion sets can be factored into template files and referenced from any assert array:

assert:
  - include: safe-response
  - include: ./shared/format.yaml

Resolution rules:

include: name resolves to .agentv/templates/{name}.yaml with the closest matching directory winning
Relative paths resolve from the eval file location, so include: ./shared/format.yaml works as expected
Nested includes are allowed up to depth 3 to keep cycles and runaway recursion bounded
Suite-level includes follow the same merge behavior as other suite-level assertions and still respect execution.skip_defaults: true
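One way to picture the first two resolution rules is the Python sketch below. It assumes that "closest matching directory" means searching upward from the eval file's directory for .agentv/templates/{name}.yaml; that interpretation, the function name, and the demo file layout are assumptions, and the nesting depth limit is not modeled.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_include(ref: str, eval_file: Path) -> Optional[Path]:
    """Resolve an include reference to a template file, or None if not found."""
    eval_dir = eval_file.resolve().parent
    if ref.startswith(("./", "../")):
        # Relative paths resolve from the eval file location.
        candidate = eval_dir / ref
        return candidate if candidate.is_file() else None
    # Assumption: walk upward so the nearest .agentv/templates directory wins.
    for directory in (eval_dir, *eval_dir.parents):
        candidate = directory / ".agentv" / "templates" / f"{ref}.yaml"
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    far = root / ".agentv" / "templates"
    near = root / "team" / ".agentv" / "templates"
    for templates_dir in (far, near):
        templates_dir.mkdir(parents=True)
        (templates_dir / "safe-response.yaml").write_text("- type: is-json\n")

    eval_file = root / "team" / "evals" / "api.eval.yaml"
    eval_file.parent.mkdir(parents=True)
    eval_file.write_text("assert:\n  - include: safe-response\n")
    shared = eval_file.parent / "shared"
    shared.mkdir()
    (shared / "format.yaml").write_text("- type: is-json\n")

    named = resolve_include("safe-response", eval_file).relative_to(root).as_posix()
    relative = resolve_include("./shared/format.yaml", eval_file).relative_to(root).as_posix()

print(named)
print(relative)
```

In the demo layout, the template under team/ is chosen over the one at the repository root because it is nearer to the eval file.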

Shared Prompt Context

Put shared prompt instructions in top-level prompts, with row-specific values in tests[].vars and shared values in default_test.vars.

description: Travel assistant evaluation
prompts:
  - - role: system
      content: Answer as a concise travel assistant.
    - role: user
      content: "{{ question }}"

tests: ./cases.yaml

Use a block scalar for multi-line shared instructions:

prompts:
  - - role: system
      content: |
        Read AGENTS.md before answering.
        Explain the tradeoffs clearly.
    - role: user
      content: "{{ question }}"

tests: ./cases.yaml

Each test in cases.yaml only needs its own data:

- id: japan-spring
  criteria: Recommends spring for cherry blossoms
  vars:
    question: When is the best time to visit Japan?

File-Backed Prompt Context

Normal eval YAML should model file-backed context through prompts and vars. For example, put a path or file:// reference in tests[].vars, then render it from a prompt entry next to the task text.

description: Schema review evaluation

prompts:
  - - role: user
      content:
        - type: file
          value: "{{ context_file }}"
        - type: text
          value: "{{ question }}"

tests:
  - id: summarize
    vars:
      context_file: ./shared-context.md
      question: Summarize the important constraints.
  - id: validate
    vars:
      context_file: ./schema.json
      question: What validation is missing?

PROMPT.md Fallback

For directory-style raw cases, a test may omit direct input and keep the task prompt in Markdown instead. AgentV resolves the prompt in this order:

If the effective input_files contains a file named exactly PROMPT.md, that file becomes the test prompt.
Otherwise, if a PROMPT.md exists beside the EVAL.yaml, that file becomes the test prompt.
Other input_files remain attachments. PROMPT.md is removed from the attachment list so the prompt is not duplicated.

agent-001-fix-bug/
  EVAL.yaml
  PROMPT.md
  fixtures/
    failing-test.log

tests:
  - id: fix-bug
    criteria: Fixes the regression described in the prompt
    input_files:
      - ./fixtures/failing-test.log

Use explicit input when the prompt is short or generated from YAML variables. Use PROMPT.md when the task text is long enough that duplicating it inside YAML would make the eval hard to review.
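The fallback order described in this section can be sketched in Python. The function name and return shape are hypothetical, and the sketch covers only the case where a test has no explicit input; it is an illustration of the documented order, not AgentV's resolver.

```python
import tempfile
from pathlib import Path


def resolve_prompt(input_files, eval_dir: Path):
    """Return (prompt_path_or_None, attachments) for a test without explicit input."""
    # 1. A file named exactly PROMPT.md among input_files becomes the prompt.
    for entry in input_files:
        if Path(entry).name == "PROMPT.md":
            # PROMPT.md is removed from attachments so it is not duplicated.
            attachments = [f for f in input_files if Path(f).name != "PROMPT.md"]
            return eval_dir / entry, attachments
    # 2. Otherwise fall back to a PROMPT.md beside the eval file.
    sibling = eval_dir / "PROMPT.md"
    if sibling.is_file():
        return sibling, list(input_files)
    return None, list(input_files)


with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "agent-001-fix-bug"
    (case_dir / "fixtures").mkdir(parents=True)
    (case_dir / "PROMPT.md").write_text("Fix the regression.\n")
    prompt, attachments = resolve_prompt(["./fixtures/failing-test.log"], case_dir)
    prompt_name = prompt.name

explicit_prompt, explicit_attachments = resolve_prompt(
    ["docs/PROMPT.md", "fixtures/a.log"], Path("case")
)
print(prompt_name, attachments)
```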

Raw Cases as String Paths

Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file of raw cases. This is the inverse of the sidecar pattern — the metadata file references the test data:

name: my-eval
description: My evaluation suite
target: default
tests: ./cases.yaml

The path is resolved relative to the eval file’s directory. The external raw case file can be a YAML or JSON array of test objects, a JSONL file with one test per line, a CSV file with AgentV expected columns, or an explicit JavaScript or Python dataset function such as file://generate-tests.mjs:createTests or file://generate_tests.py:create_tests. String entries inside a tests: list work the same way and may use direct paths, file:// paths, directories, or globs:

tests:
  - ./cases/*.cases.yaml

CSV datasets support magic columns:

__expected and __expectedN create AgentV assertions using the supported expected-column mini-DSL: contains:*, icontains:*, contains-any:*, contains-all:*, icontains-any:*, icontains-all:*, starts-with:*, ends-with:*, regex:*, equals:*, is-json, latency(<ms>), cost(<usd>), grade:*, llm-rubric:*, javascript:*, fn:*, eval:*, python:*, and file://*.py. File paths inside CSV cells are resolved relative to the CSV file. Unsupported assertion forms such as similar:* are rejected during validation instead of being skipped at runtime.
__provider_output becomes first-class expected_output reference data.
__metric names the generated assertions.
__threshold sets the test threshold.
__metadata:<key> adds metadata.
__config:__expectedN:threshold sets an assertion min_score.

Ordinary columns become vars, so CSV rows can rely on suite-level prompts that interpolate those variables.
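To make the column roles concrete, the following Python sketch sorts one CSV row into vars, assertions, and metadata. The output dictionary shape and function name are assumptions for illustration; the sketch does not parse the mini-DSL itself or reproduce AgentV's validation.

```python
import csv
import io
import re

EXPECTED_RE = re.compile(r"^__expected(\d*)$")
CONFIG_RE = re.compile(r"^__config:(__expected\d*):threshold$")


def classify_row(row: dict) -> dict:
    """Sort CSV cells into a hypothetical case structure by column name."""
    case = {"vars": {}, "assert": {}, "metadata": {}, "min_scores": {}}
    for column, value in row.items():
        config = CONFIG_RE.match(column)
        if EXPECTED_RE.match(column):
            case["assert"][column] = value          # e.g. "contains:42"
        elif column == "__provider_output":
            case["expected_output"] = value
        elif column == "__metric":
            case["metric"] = value
        elif column == "__threshold":
            case["threshold"] = float(value)
        elif column.startswith("__metadata:"):
            case["metadata"][column[len("__metadata:"):]] = value
        elif config:
            case["min_scores"][config.group(1)] = float(value)
        else:
            case["vars"][column] = value            # ordinary columns become vars
    return case


CSV_TEXT = (
    "question,__expected,__expected2,__threshold,__metadata:owner\n"
    "What is 15 + 27?,contains:42,icontains:forty-two,0.8,math-team\n"
)
row = next(csv.DictReader(io.StringIO(CSV_TEXT)))
case = classify_row(row)
print(case["vars"], case["assert"])
```

A prompt template such as "{{ question }}" would then render from the vars produced by the ordinary question column.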

String shorthand is raw-case-only. Import reusable task suites through imports.suites; use imports.tests when you want to drop suite context and import only raw cases into the parent context:

imports:
  suites:
    - path: ./suites/*.eval.yaml
  tests:
    - path: ./cases/regression.jsonl

tests:
  - id: local-edge-case
    input: ...

Legacy tests[].include entries still load with a migration warning for older eval files, but new evals should use imports.suites or imports.tests.

Raw Cases as Directory Paths

When tests points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a case.yaml (or case.yml) becomes a test case:

my-eval/
  EVAL.yaml
  cases/
    fix-null-check/
      case.yaml
    add-greeting/
      case.yaml
      workspace/        # optional per-case workspace template
        setup-files...

name: my-benchmark
tests: ./cases/

Each case.yaml is a single YAML object (not an array) with the same fields as an inline test:

criteria: Fixes the null reference bug in the parser module
vars:
  task: Fix the null check bug in parser.ts

Behavior:

Directory name as id: If case.yaml doesn’t specify an id, the directory name is used (e.g., fix-null-check)
Alphabetical ordering: Subdirectories are sorted alphabetically for deterministic order
Per-case workspace: A workspace/ subdirectory inside the case directory automatically sets workspace.template to that path, unless the case already defines a workspace field
Skipped directories: Subdirectories without case.yaml are skipped with a warning
Suite-level config applies: Suite-level assert, prompts, workspace, target, and top-level run controls still apply to directory-discovered cases

This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation. For guidance on keeping provenance metadata, patches, oracle files, and generated dataset rows out of oversized inline YAML, see Benchmark Provenance.
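The discovery behavior listed above can be approximated with a small Python sketch. It only inspects the directory layout: it does not parse case.yaml, so an id or workspace field declared inside a case file (which would take precedence per the rules above) is not modeled. The function name and result keys are hypothetical.

```python
import tempfile
from pathlib import Path


def discover_cases(cases_dir: Path):
    """Return (found, skipped) for subdirectories of cases_dir, in alphabetical order."""
    found, skipped = [], []
    for sub in sorted(p for p in cases_dir.iterdir() if p.is_dir()):
        case_file = next(
            (sub / name for name in ("case.yaml", "case.yml") if (sub / name).is_file()),
            None,
        )
        if case_file is None:
            skipped.append(sub.name)  # AgentV reports these with a warning
            continue
        workspace = sub / "workspace"
        found.append(
            {
                "default_id": sub.name,  # used only when case.yaml has no id
                "case_file": case_file,
                "workspace_template": workspace if workspace.is_dir() else None,
            }
        )
    return found, skipped


with tempfile.TemporaryDirectory() as tmp:
    cases = Path(tmp) / "cases"
    (cases / "fix-null-check").mkdir(parents=True)
    (cases / "fix-null-check" / "case.yaml").write_text("criteria: Fixes the null check\n")
    (cases / "add-greeting" / "workspace").mkdir(parents=True)
    (cases / "add-greeting" / "case.yaml").write_text("criteria: Adds a greeting\n")
    (cases / "notes").mkdir()  # no case.yaml, so it is skipped

    found, skipped = discover_cases(cases)
    ids = [c["default_id"] for c in found]
    has_template = [c["workspace_template"] is not None for c in found]

print(ids, has_template, skipped)
```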

Environment Variable Interpolation

All string fields in eval files support {{ env.VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

workspace:
  repos:
    - path: ./RepoA
      repo: "{{ env.REPO_A_URL }}"
      commit: "{{ env.REPO_A_COMMIT }}"

prompts:
  - "{{ prompt }}"

tests:
  - id: test-1
    vars:
      prompt: "Evaluate the code in {{ env.PROJECT_NAME }}"
    criteria: "{{ env.EVAL_CRITERIA }}"

Behavior

Syntax: {{ env.VARIABLE_NAME }} with optional whitespace around the name
Missing variables resolve to an empty string
Partial interpolation is supported: {{ env.HOME }}/repos/{{ env.PROJECT }} becomes /home/user/repos/myproject
Non-string values (numbers, booleans) are not affected
Interpolation is applied recursively to all nested objects and arrays
Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
.env files in the directory hierarchy are loaded automatically before interpolation
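The interpolation rules above can be modeled with a few lines of Python. This is a sketch under assumptions: the allowed variable-name pattern is a guess, the environment is passed in as a plain dict for determinism, and .env loading is left out.

```python
import re

# Assumption: variable names use letters, digits, and underscores.
ENV_RE = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def interpolate_env(value, env):
    """Recursively replace {{ env.NAME }} in strings; leave other types untouched."""
    if isinstance(value, str):
        # Missing variables resolve to an empty string.
        return ENV_RE.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {key: interpolate_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [interpolate_env(item, env) for item in value]
    return value  # numbers, booleans, and None are not affected


env = {"HOME": "/home/user", "PROJECT": "myproject"}
config = {
    "path": "{{ env.HOME }}/repos/{{env.PROJECT}}",
    "retries": 3,
    "notes": ["{{ env.MISSING }}", True],
}
resolved = interpolate_env(config, env)
print(resolved["path"])  # /home/user/repos/myproject
```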

Example: Portable Workspace Config

# workspace.yaml — works on any machine
repos:
  - path: ./my-repo
    repo: "{{ env.MY_REPO_URL }}"
    commit: "{{ env.MY_REPO_COMMIT }}"

# .env
MY_REPO_URL=https://github.com/org/my-repo.git
MY_REPO_COMMIT=main

Per-Test Template Variables

Eval YAML supports per-test vars for data-driven prompt suites. Use {{ vars.name }} placeholders in prompt entries and test-facing text fields, and AgentV resolves them when the suite loads. Shared defaults can live in default_test.vars; per-test vars override those defaults by key.

prompts:
  - "Answer clearly: {{ vars.question }}"

tests:
  - id: capital
    vars:
      question: What is the capital of France?
      expected_answer: Paris
    criteria: "Answers {{ vars.question }} correctly"
    expected_output: "{{ vars.expected_answer }}"

Behavior

vars is defined per test as an object, with optional defaults from default_test.vars
{{ vars.name }} and dotted paths like {{ vars.user.name }} are supported
Substitution applies to prompts, criteria, expected_output, assertion values/metrics, and conversation turn input / expected_output / assertions
When the whole string is a single placeholder, the original JSON value is preserved
Missing variables render as empty strings following Nunjucks semantics
vars interpolation is separate from environment interpolation: {{ vars.question }} uses test data, {{ env.PROJECT_NAME }} uses environment variables
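A simplified Python model of these substitution rules is shown below. It is not a Nunjucks implementation: it handles only {{ vars.* }} placeholders and dotted paths, the helper names are hypothetical, and values embedded in longer strings are stringified with Python's str(), which may differ from how AgentV formats non-string values.

```python
import re

VAR_RE = re.compile(r"\{\{\s*vars\.([\w.]+)\s*\}\}")


def lookup(variables, dotted):
    """Follow a dotted path such as user.name; return None when any key is missing."""
    current = variables
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def render(template, variables):
    """Substitute {{ vars.* }} placeholders in one string field."""
    whole = VAR_RE.fullmatch(template)
    if whole:
        # A string that is exactly one placeholder keeps the original value type.
        value = lookup(variables, whole.group(1))
        return "" if value is None else value

    def replace(match):
        value = lookup(variables, match.group(1))
        return "" if value is None else str(value)

    return VAR_RE.sub(replace, template)


test_vars = {"question": "What is the capital of France?", "limit": 3, "user": {"name": "Ada"}}
print(render("Answer clearly: {{ vars.question }}", test_vars))
print(render("{{ vars.limit }}", test_vars))
```

The second call returns the integer 3 rather than the string "3", which mirrors the rule that a whole-string placeholder preserves the original JSON value.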

JSONL Format

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format. Each line is a single test:

{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

Sidecar Metadata

An optional YAML sidecar file provides metadata and execution config. Place it alongside the JSONL file with the same base name:

cases.jsonl + cases.yaml:

description: Math evaluation dataset
suite: math-tests
target: azure-base
assert:
  - name: correctness
    type: llm-rubric
    prompt: ./graders/correctness.md

Benefits of JSONL

Streaming-friendly — process line by line
Git-friendly — diffs show individual case changes
Programmatic generation — easy to create from scripts
Industry standard — compatible with DeepEval, LangWatch, Hugging Face datasets
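As a small illustration of the streaming and programmatic-generation points, the Python sketch below writes cases to JSONL and reads them back one line at a time. It uses an in-memory buffer instead of a real file, the helper names are hypothetical, and skipping blank lines is an assumption rather than documented AgentV behavior. The id, criteria, and input fields follow the JSONL example in this section.

```python
import io
import json


def write_jsonl(cases, stream):
    """Write one JSON object per line."""
    for case in cases:
        stream.write(json.dumps(case, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one case per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Generate simple arithmetic cases from a script.
cases = [
    {"id": f"add-{a}-{b}", "criteria": "Calculates correctly", "input": f"What is {a}+{b}?"}
    for a, b in [(2, 2), (15, 27)]
]

buffer = io.StringIO()
write_jsonl(cases, buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
print(len(loaded))  # 2
```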

Converting Between Formats

Use the convert command to switch between YAML and JSONL:

agentv convert evals/suite.yaml --format jsonl
agentv convert evals/cases.jsonl --format yaml