# Competitive Evaluation (Competitive Eval)
Automated comparison of multiple LLM models executing the same task, with ranking based on quality, efficiency, and cost.
Implemented in `src/architect/features/competitive.py`. Available since v1.0.0 (Base plan v4, Phase D, D3).
## Concept

`architect eval` runs the same task with multiple models in parallel (each in an isolated git worktree), then runs the same validation checks in each resulting worktree and generates a comparative ranking based on a composite score.
architect eval "implement JWT authentication" \
--models gpt-4o,claude-sonnet-4-6,gemini-2.0-flash \
--check "pytest tests/test_auth.py -q" \
--check "ruff check src/" \
--budget-per-model 1.0
## How it works
```text
architect eval TASK --models m1,m2,m3 --check "cmd1" --check "cmd2"
  │
  ├── Create CompetitiveConfig
  │     └── task, models, checks, agent, max_steps, budget, timeout
  │
  ├── CompetitiveEval.run()
  │     ├── ParallelRunner (reuses parallel infrastructure)
  │     │     └── Each model → git worktree → `architect run` as subprocess
  │     │
  │     ├── For each resulting worktree:
  │     │     └── _run_checks_in_worktree(checks) → (passed, total)
  │     │
  │     └── _rank_results() → calculate composite score
  │
  ├── CompetitiveEval.generate_report()
  │     └── Markdown table with ranking
  │
  └── Display report (stdout or --report-file)
```
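Each `--check` is an arbitrary shell command executed inside the candidate worktree and counted as passed when it exits 0. A minimal sketch of that step (shell execution via `subprocess` is an assumption here, not the confirmed internals of `_run_checks_in_worktree`):

```python
import subprocess

def run_checks_in_worktree(checks: list[str], worktree: str) -> tuple[int, int]:
    """Count how many check commands pass inside a candidate worktree.

    Sketch only: a check "passes" when its exit code is 0; running the
    command through the shell with subprocess is an assumption.
    """
    passed = 0
    for cmd in checks:
        result = subprocess.run(cmd, shell=True, cwd=worktree, capture_output=True)
        if result.returncode == 0:
            passed += 1
    return passed, len(checks)
```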
## Scoring system
The composite score is out of 100 points:
| Component | Weight | Calculation |
|---|---|---|
| Checks passed | 40 pts | (checks_passed / checks_total) * 40 |
| Status | 30 pts | success=30, partial=15, timeout=5, failed=0 |
| Efficiency | 20 pts | Fewer steps = higher score (normalized) |
| Cost | 10 pts | Lower cost = higher score (normalized) |
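Taken together, a result's score might be computed as in the sketch below. The weights come straight from the table above, but the exact normalization of steps and cost is not specified, so inverted min-max scaling over the candidate pool is an assumption:

```python
STATUS_POINTS = {"success": 30, "partial": 15, "timeout": 5, "failed": 0}

def composite_score(result, pool) -> float:
    """Sketch of the 0-100 composite score from the table above.

    `result` and `pool` are CompetitiveResult-like objects. Min-max
    normalization across the candidate pool is an assumption; the docs
    only state "fewer steps / lower cost = higher score (normalized)".
    """
    checks = 40 * (result.checks_passed / result.checks_total) if result.checks_total else 0.0
    status = STATUS_POINTS.get(result.status, 0)

    def inverted(value: float, values: list[float], weight: float) -> float:
        lo, hi = min(values), max(values)
        if hi == lo:            # all candidates tied: award full weight
            return weight
        return weight * (hi - value) / (hi - lo)

    efficiency = inverted(result.steps, [r.steps for r in pool], 20)
    cost = inverted(result.cost, [r.cost for r in pool], 10)
    return checks + status + efficiency + cost
```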
## CLI

```bash
architect eval PROMPT [options]
```
### Options

| Option | Description |
|---|---|
| `--models LIST` | Comma-separated list of models (required) |
| `--check CMD` | Validation check command (repeatable, required) |
| `--agent NAME` | Agent to use (default: `build`) |
| `--max-steps N` | Maximum steps per model (default: 50) |
| `--budget-per-model N` | Cost limit per model in USD |
| `--timeout-per-model N` | Time limit per model in seconds |
| `--report-file PATH` | Save the report to a file |
| `--config PATH` | YAML configuration file |
| `--api-base URL` | LLM API base URL |
## Examples
```bash
# Compare 3 models with checks
architect eval "refactor utils.py" \
  --models gpt-4o,claude-sonnet-4-6,deepseek-chat \
  --check "pytest tests/ -q" \
  --check "ruff check src/" \
  --budget-per-model 0.50

# Save the report to a file
architect eval "optimize SQL queries" \
  --models gpt-4o,claude-sonnet-4-6 \
  --check "pytest" \
  --report-file eval_report.md

# With a strict timeout
architect eval "implement feature" \
  --models gpt-4o-mini,claude-sonnet-4-6 \
  --check "pytest tests/" \
  --timeout-per-model 300 \
  --max-steps 30
```
## API
### CompetitiveConfig

```python
from dataclasses import dataclass

@dataclass
class CompetitiveConfig:
    task: str
    models: list[str]
    checks: list[str]
    agent: str = "build"
    max_steps: int = 50
    budget_per_model: float | None = None
    timeout_per_model: int | None = None
    config_path: str | None = None
    api_base: str | None = None
```
### CompetitiveResult

```python
@dataclass
class CompetitiveResult:
    model: str
    status: str            # success | partial | failed | timeout
    steps: int
    cost: float
    duration: float
    files_modified: list[str]
    checks_passed: int
    checks_total: int
    worktree_path: str
    score: float           # composite score (0-100)
```
### CompetitiveEval

```python
class CompetitiveEval:
    def __init__(self, config: CompetitiveConfig, workspace_root: str): ...
    def run(self) -> list[CompetitiveResult]: ...
    def generate_report(self, results: list[CompetitiveResult]) -> str: ...
```
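Programmatic use follows from these types; a hypothetical end-to-end call (argument values are illustrative):

```python
config = CompetitiveConfig(
    task="implement JWT authentication",
    models=["gpt-4o", "claude-sonnet-4-6"],
    checks=["pytest tests/test_auth.py -q", "ruff check src/"],
    budget_per_model=1.0,
)
evaluator = CompetitiveEval(config, workspace_root=".")
results = evaluator.run()                   # one CompetitiveResult per model
print(evaluator.generate_report(results))   # Markdown report with the ranking
```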
## Generated report
The report includes:
- Comparison table: model, status, steps, cost, time, checks passed, files modified
- Ranking: sorted by composite score (1st, 2nd, 3rd place)
- Check results: detail per model
- Worktree paths: for manual inspection of each result
Excerpt of a generated ranking:

```markdown
## Ranking

| # | Model | Score | Status | Steps | Cost | Checks |
|---|-------|-------|--------|-------|------|--------|
| 1 | gpt-4o | 85.0 | success | 12 | $0.42 | 3/3 |
| 2 | claude-sonnet-4-6 | 78.5 | success | 15 | $0.38 | 2/3 |
| 3 | deepseek-chat | 45.0 | partial | 30 | $0.12 | 1/3 |
```
## Relationship with Parallel

`CompetitiveEval` reuses the `ParallelRunner` infrastructure (git worktrees + `ProcessPoolExecutor`). The difference is that:

- `parallel` runs different tasks (or the same task) with possibly different models
- `eval` runs the same task with different models and adds validation with checks + ranking
## Files

| File | Contents |
|---|---|
| `src/architect/features/competitive.py` | `CompetitiveEval`, `CompetitiveConfig`, `CompetitiveResult` |
| `src/architect/cli.py` | `architect eval` command |
| `tests/test_competitive/` | Unit tests |