Evals
OTerminus ships a deterministic fixture-based eval harness for regression protection.
What evals cover
Fixture cases validate:
- expected proposal mode (structured or experimental)
- expected command family
- expected risk level
- expected acceptance/rejection
- expected rendered command and argv
- expected planner parse failures (when applicable)
- expected ambiguity interception (when applicable)
Fixture organization
Fixtures are JSON arrays under evals/cases/*.json, split by capability focus:
- direct_commands.json
- filesystem_inspection.json
- filesystem_mutation.json
- text_inspection.json
- process_inspection.json
- system_inspection.json
- macos_desktop.json
- unsafe_and_blocked.json
- ambiguity.json
- planner_normalization.json
All fixture IDs must be unique across all files. The eval loader reads every *.json file in sorted
filename order and preserves per-file case order.
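The loading rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual OTerminus loader: the function name load_cases and the error messages are assumptions, and only the sorted filename order, the per-file case order, and the cross-file ID uniqueness rule come from the description above.

```python
import json
from pathlib import Path


def load_cases(fixtures_dir):
    """Load every *.json fixture file in sorted filename order (hypothetical helper)."""
    cases = []
    seen = {}  # fixture id -> filename where it first appeared
    for path in sorted(Path(fixtures_dir).glob("*.json"), key=lambda p: p.name):
        with path.open(encoding="utf-8") as handle:
            file_cases = json.load(handle)
        if not isinstance(file_cases, list):
            raise ValueError(f"{path.name}: expected a JSON array of cases")
        for case in file_cases:  # per-file case order is preserved
            case_id = case["id"]
            if case_id in seen:
                raise ValueError(
                    f"duplicate fixture id {case_id!r} in {path.name} "
                    f"(first seen in {seen[case_id]})"
                )
            seen[case_id] = path.name
            cases.append(case)
    return cases
```

Failing fast on a duplicate ID keeps reports unambiguous, since a case ID is presumably how a failure is traced back to its fixture.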
Fixture format
Core fields include:
- id
- user_input
- optional planner_proposal
- expected outputs (expected_* fields)
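To make the shape concrete, here is a hypothetical case built in Python and serialized as a one-element JSON array. Only id, user_input, planner_proposal, and the expected_ prefix come from the field list above. The specific expected_* names, the values, and the contents of planner_proposal are assumptions for illustration; check the existing files under evals/cases/ for the real schema.

```python
import json

# Hypothetical fixture case. Field names after "expected_" and the
# planner_proposal payload are illustrative guesses, not the real schema.
case = {
    "id": "fs-inspect-list-home",
    "user_input": "list files in my home directory",
    "planner_proposal": {"command": "ls", "args": ["~"]},  # assumed payload shape
    "expected_mode": "structured",
    "expected_command_family": "ls",
    "expected_risk_level": "low",
    "expected_accepted": True,
}


def expected_fields(case):
    """Return only the expected_* assertions of a case."""
    return {key: value for key, value in case.items() if key.startswith("expected_")}


# A fixture file is a JSON array of such cases.
print(json.dumps([case], indent=2))
```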
These evals are not live LLM tests. They are deterministic fixture checks. For planner-path cases,
planner_proposal supplies the mocked planner output payload. Ambiguity cases assert request
interception before planner parsing/validation.
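The ordering described above, with ambiguity checked first and the fixture's planner_proposal standing in for a live model call, might look roughly like the sketch below. The function evaluate_case, its two callable parameters, and the outcome labels are all hypothetical; they only illustrate the control flow, not the harness's real API.

```python
def evaluate_case(case, is_ambiguous, validate_proposal):
    """Evaluate one fixture case without calling a live LLM (illustrative only)."""
    # Ambiguity is checked first, before any planner payload is parsed or validated.
    if is_ambiguous(case["user_input"]):
        return ("intercepted", None)
    # The fixture's planner_proposal replaces the planner's real output.
    proposal = case.get("planner_proposal")
    if proposal is None:
        # Cases without a mocked proposal are outside this sketch.
        return ("no_planner", None)
    return ("validated", validate_proposal(proposal))


# Stub collaborators, assumed purely for demonstration.
def looks_ambiguous(text):
    return "it" in text.split()


def fake_validate(proposal):
    return {"accepted": proposal.get("command") == "ls"}


print(evaluate_case({"user_input": "delete it"}, looks_ambiguous, fake_validate))
print(
    evaluate_case(
        {"user_input": "list files", "planner_proposal": {"command": "ls"}},
        looks_ambiguous,
        fake_validate,
    )
)
```

Because both collaborators are plain functions and the proposal comes from the fixture, the same input always yields the same outcome, which is what makes the checks deterministic.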
Running evals
poetry run oterminus-evals
poetry run oterminus-evals --fixtures-dir evals/cases
A non-zero exit code indicates at least one failing case.
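Since failure is signalled through the exit code, a wrapper script or CI step only needs to inspect it. A minimal sketch, assuming the command is available on the PATH; the helper name evals_passed is hypothetical:

```python
import subprocess


def evals_passed(command):
    """Run the given command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Example call (not executed here):
#   evals_passed(["poetry", "run", "oterminus-evals"])
```

Passing the command as a list avoids shell quoting issues, and check=False leaves the decision about a non-zero exit to the caller.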
When to add eval cases
Add or update fixture cases when any of the following change:
- command support/capability behavior
- planner payload shape or parsing behavior
- validator or policy behavior
- ambiguity detection behavior
- direct-command detection behavior
Relationship to tests
- unit tests verify module behavior and edge cases
- eval fixtures verify end-to-end proposal/validation invariants across representative prompts