Evals

OTerminus ships a deterministic, fixture-based eval harness for regression protection.

What evals cover

Fixture cases validate:

  • expected proposal mode (structured or experimental)
  • expected command family
  • expected risk level
  • expected acceptance/rejection
  • expected rendered command and argv
  • expected planner parse failures (when applicable)
  • expected ambiguity interception (when applicable)
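
As a rough illustration of how such checks can work, the sketch below compares only the expectations a case declares against an observed outcome. It is not the OTerminus implementation: the function name, the shape of the `actual` dictionary, and the rule that an `expected_<name>` field maps to an `actual["<name>"]` value are all assumptions made for the example.

```python
def compare_expectations(case, actual):
    """Return {field: (expected, got)} for each declared expectation that differs.

    Hypothetical helper: assumes every "expected_<name>" key in the case
    corresponds to a "<name>" key in the observed outcome.
    """
    mismatches = {}
    for key, expected in case.items():
        if not key.startswith("expected_"):
            continue  # id, user_input, planner_proposal, ... are not expectations
        name = key[len("expected_"):]
        got = actual.get(name)
        if got != expected:
            mismatches[key] = (expected, got)
    return mismatches
```

A case that declares no expectation for a given property is simply not checked for it, which is one way to read the "when applicable" items above.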

Fixture organization

Fixtures are stored as JSON arrays in evals/cases/*.json, one file per capability focus:

  • direct_commands.json
  • filesystem_inspection.json
  • filesystem_mutation.json
  • text_inspection.json
  • process_inspection.json
  • system_inspection.json
  • macos_desktop.json
  • unsafe_and_blocked.json
  • ambiguity.json
  • planner_normalization.json

All fixture IDs must be unique across all files. The eval loader reads every *.json file in sorted filename order and preserves per-file case order.
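
The loading rules above (sorted filename order, per-file case order preserved, IDs unique across files) can be sketched as follows. This is a minimal illustration under assumptions, not the actual OTerminus loader: `load_cases` and its error messages are hypothetical names chosen for the example.

```python
import json
from pathlib import Path


def load_cases(fixtures_dir):
    """Load all fixture cases from *.json files in a directory.

    Hypothetical helper illustrating the documented rules; the real
    loader may differ in naming and error handling.
    """
    cases = []
    seen = {}  # fixture id -> name of the file where it first appeared
    # Sort by filename so the overall case order is deterministic.
    for path in sorted(Path(fixtures_dir).glob("*.json"), key=lambda p: p.name):
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path.name}: expected a JSON array of cases")
        for case in data:  # keep the order in which cases appear in the file
            case_id = case["id"]
            if case_id in seen:
                raise ValueError(
                    f"duplicate fixture id {case_id!r} in {path.name} "
                    f"(first seen in {seen[case_id]})"
                )
            seen[case_id] = path.name
            cases.append(case)
    return cases
```

Failing fast on a duplicate ID keeps eval reports unambiguous, since every result can then be traced back to exactly one fixture.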

Fixture format

Core fields include:

  • id
  • user_input
  • optional planner_proposal
  • expected outputs (expected_* fields)

These evals are not live LLM tests; they are deterministic fixture checks. For planner-path cases, planner_proposal supplies the mocked planner output payload. Ambiguity cases assert that the request is intercepted before planner parsing and validation run.
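
To make the core fields concrete, here is a small shape check for a single case. It is a sketch under assumptions: `check_case_shape` is a hypothetical helper, treating `id` and `user_input` as non-empty strings is an assumption, and the `expected_risk_level` key and all values in the sample case are placeholders rather than confirmed fixture fields.

```python
def check_case_shape(case):
    """Return a list of problems found in one fixture case (empty if none).

    Hypothetical check: assumes id and user_input are non-empty strings.
    """
    problems = []
    for field in ("id", "user_input"):
        if not isinstance(case.get(field), str) or not case[field]:
            problems.append(f"missing or empty {field!r}")
    # planner_proposal is optional, so its absence is not reported.
    if not any(key.startswith("expected_") for key in case):
        problems.append("no expected_* fields")
    return problems


# Illustrative case only: the expected_* key name and all values are placeholders.
example_case = {
    "id": "example-001",
    "user_input": "list files in the current directory",
    "planner_proposal": {},  # stands in for a mocked planner payload; real shape not shown
    "expected_risk_level": "safe",
}
print(check_case_shape(example_case))  # prints []
```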

Running evals

poetry run oterminus-evals
poetry run oterminus-evals --fixtures-dir evals/cases

A non-zero exit code indicates at least one failing case.
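
Because failure is signalled through the exit code, a script or CI step can gate on it directly. The wrapper below is a minimal sketch; `evals_passed` is a hypothetical name, and it relies only on the documented behavior that a non-zero exit code means at least one case failed.

```python
import subprocess
import sys


def evals_passed(command):
    """Run the given command and return True only if it exits with code 0."""
    result = subprocess.run(command)
    return result.returncode == 0


# Example use (assumes poetry and the oterminus-evals entry point are installed):
#   if not evals_passed(["poetry", "run", "oterminus-evals"]):
#       sys.exit(1)
```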

When to add eval cases

Add or update fixture cases when any of the following change:

  • command support/capability behavior
  • planner payload shape or parsing behavior
  • validator or policy behavior
  • ambiguity detection behavior
  • direct-command detection behavior

Relationship to tests

  • unit tests verify module behavior and edge cases
  • eval fixtures verify end-to-end proposal/validation invariants across representative prompts