Test-driver — Paul Barrick

The problem

Agentic features are non-deterministic. Snapshot tests pass or fail on the wrong things: they can break on harmless variation in model output and still miss a behavioral regression. Coverage isn't the question; repeatable evidence is. When an LLM-backed feature regresses, the operator wants to see what actually happened during a real walk of the UI: which screens loaded, which assertions held, how severe each failure was, and which commit made it worse.

Test-driver is the agent that produces that evidence on a schedule so the operator never has to ask "is the build still good?" without a recent answer in front of them.

How it works

Discovery. Reads feature manifests from Projects/<app>/docs/feature-discovery/ — each manifest declares table-stakes assertions and feature-level assertions.
Walk. Drives a real Playwright session against the live UI, captures DOM snapshots, console logs, and network errors per assertion.
Score. Each failed assertion is graded HIGH, MEDIUM, or LOW; a passing assertion is recorded as N/A. The walk gets an aggregate severity.
Persist. The full run manifest — entries, screenshots, severity, duration — gets written back to nexus_items with type test_runs. Now it's queryable alongside everything else the team produces.
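
The Score step does not spell out how the aggregate is derived. A minimal sketch, assuming the walk simply inherits the worst severity among its assertions; the names Severity, RANK, and aggregateSeverity are hypothetical, and the real rule may differ:

```typescript
// Severity labels as they appear in this document.
type Severity = "HIGH" | "MEDIUM" | "LOW" | "N/A";

// Assumed ordering: a higher rank is worse, and "N/A" ranks lowest.
const RANK: Record<Severity, number> = { "N/A": 0, LOW: 1, MEDIUM: 2, HIGH: 3 };

// Hypothetical rule: the walk takes the worst severity among its entries.
// An empty walk falls back to "N/A".
function aggregateSeverity(severities: Severity[]): Severity {
  let worst: Severity = "N/A";
  for (const s of severities) {
    if (RANK[s] > RANK[worst]) worst = s;
  }
  return worst;
}

console.log(aggregateSeverity(["N/A", "MEDIUM", "N/A"])); // prints MEDIUM
```

Worst-wins keeps a single HIGH from being averaged away by a long run of passes, which is the property an operator scanning for regressions would likely want.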

Schedule

A LaunchAgent fires npm run test-driver from the Optimus repo every Monday at 09:00. Output is pushed straight to the test-drive UI in Optimus. On-demand runs are a click away from the same UI when the operator wants confirmation before a release.
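
For reference, a Monday 09:00 schedule maps onto launchd's StartCalendarInterval key with Weekday 1, Hour 9, Minute 0. The sketch below renders such a property list from TypeScript. The plist keys are launchd's own; the label, paths, and the renderLaunchAgentPlist helper are placeholders for illustration, not the project's actual configuration:

```typescript
// Every value in WeeklyJob is a placeholder supplied by the caller.
interface WeeklyJob {
  label: string;            // hypothetical job label
  workingDirectory: string; // hypothetical path to the repo checkout
  npmPath: string;          // hypothetical absolute path; launchd jobs run with a minimal PATH
  weekday: number;          // launchd convention: 0 and 7 are Sunday, 1 is Monday
  hour: number;
  minute: number;
}

// Escape the characters that would break XML text content.
function xmlEscape(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderLaunchAgentPlist(job: WeeklyJob): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">',
    '<plist version="1.0">',
    "<dict>",
    `  <key>Label</key><string>${xmlEscape(job.label)}</string>`,
    `  <key>WorkingDirectory</key><string>${xmlEscape(job.workingDirectory)}</string>`,
    "  <key>ProgramArguments</key>",
    "  <array>",
    `    <string>${xmlEscape(job.npmPath)}</string>`,
    "    <string>run</string>",
    "    <string>test-driver</string>",
    "  </array>",
    "  <key>StartCalendarInterval</key>",
    "  <dict>",
    `    <key>Weekday</key><integer>${job.weekday}</integer>`,
    `    <key>Hour</key><integer>${job.hour}</integer>`,
    `    <key>Minute</key><integer>${job.minute}</integer>`,
    "  </dict>",
    "</dict>",
    "</plist>",
  ].join("\n");
}

// Example values only; none of these come from the Optimus repo.
const examplePlist = renderLaunchAgentPlist({
  label: "com.example.test-driver",
  workingDirectory: "/path/to/Optimus",
  npmPath: "/usr/local/bin/npm",
  weekday: 1,
  hour: 9,
  minute: 0,
});
console.log(examplePlist);
```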

What a manifest looks like

{
  "run_id": "2026-05-04T20-22-30",
  "kind": "walk",
  "started_at": "2026-05-04T20:22:30Z",
  "finished_at": "2026-05-04T20:24:11Z",
  "duration_ms": 101312,
  "feature_slug": "goals-rollup",
  "screen_count": 9,
  "link_count": 14,
  "exit_code": 0,
  "totals": {
    "high_severity": 0,
    "medium_severity": 1,
    "low_severity": 0,
    "not_applicable": 2
  },
  "entries": [
    {
      "id": "TS-1",
      "title": "Goal card opens detail pane on click",
      "category": "table-stakes",
      "rationale": "Regression risk after Svelte 5.50 store reactivity bugs.",
      "result": "pass",
      "severity": "N/A"
    },
    /* ... */
  ]
}
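
A consumer of the manifest might want to check that the totals block agrees with the entries. The sketch below assumes each total is a plain count of entries carrying that severity, which the sample suggests but does not state. The field names mirror the sample; the types are inferred from it, and ManifestEntry, SeverityTotals, and tallyTotals are hypothetical names:

```typescript
// Field names mirror the sample manifest; types are inferred from its values.
interface ManifestEntry {
  id: string;
  title: string;
  category: string;
  rationale: string;
  result: string; // only "pass" appears in the sample
  severity: "HIGH" | "MEDIUM" | "LOW" | "N/A";
}

interface SeverityTotals {
  high_severity: number;
  medium_severity: number;
  low_severity: number;
  not_applicable: number;
}

// Assumption: each total counts the entries that carry that severity.
function tallyTotals(entries: ManifestEntry[]): SeverityTotals {
  const totals: SeverityTotals = {
    high_severity: 0,
    medium_severity: 0,
    low_severity: 0,
    not_applicable: 0,
  };
  for (const entry of entries) {
    switch (entry.severity) {
      case "HIGH":
        totals.high_severity += 1;
        break;
      case "MEDIUM":
        totals.medium_severity += 1;
        break;
      case "LOW":
        totals.low_severity += 1;
        break;
      case "N/A":
        totals.not_applicable += 1;
        break;
    }
  }
  return totals;
}

// The single entry shown in the sample; the elided entries are not reconstructed.
const shownEntries: ManifestEntry[] = [
  {
    id: "TS-1",
    title: "Goal card opens detail pane on click",
    category: "table-stakes",
    rationale: "Regression risk after Svelte 5.50 store reactivity bugs.",
    result: "pass",
    severity: "N/A",
  },
];
console.log(tallyTotals(shownEntries));
```

Comparing the recomputed totals with the stored ones at read time is a cheap way to catch a manifest that was truncated or hand-edited.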

Dogfooding

Test-driver walks itself. The test-drive UI in Optimus is the first feature that gets exercised on every scheduled run, so a regression in the verification layer surfaces in its own evidence stream before it reaches anywhere else.