24 · BENCHMARKING · Development and operations · Medium

AI Model Evaluation and Benchmarking

Competitive execution of the same task with multiple models: security and provenance comparison.

intake ☆☆☆
architect ★★★
vigil ★★☆
licit ★★☆

Competitive execution: the same task is implemented by multiple models in parallel. vigil evaluates the security of each result; licit generates a provenance record per model.

Phase 01 architect

Competitive execution

The same task is implemented by four models in parallel.

△ architect
architect parallel "Implement /products CRUD with tests" \
  --models gpt-4.1,claude-sonnet-4,deepseek-chat,gemini-2.5-pro
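Before comparing results, it can help to confirm that each model's run actually produced a worktree. This is a hypothetical sketch: the `.architect/worktrees/parallel-N` layout is an assumption taken from the Phase 02 scan loop, and `worktree_ready` is an illustrative helper, not part of architect.

```shell
# Hypothetical sanity check: did each model's run produce a worktree?
# The .architect/worktrees/parallel-N path is assumed from Phase 02.
worktree_ready() {
  [ -d ".architect/worktrees/$1/src" ]
}

for w in parallel-{1,2,3,4}; do
  worktree_ready "$w" && echo "$w: ready" || echo "$w: missing"
done
```

A missing worktree usually means one model's run failed or timed out; excluding it early keeps the later security and provenance comparisons fair.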

Phase 02 vigil

Security per model

Evaluates the security of each model's result.

◇ vigil
for w in parallel-{1,2,3,4}; do
  vigil scan .architect/worktrees/$w/src/ --format json
done
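The per-model scans only become a comparison once their findings are tallied. A minimal sketch, assuming vigil's JSON report exposes a top-level `findings` array (not verified against any particular vigil version); `count_findings` is an illustrative helper:

```shell
# Hypothetical helper: count findings in a vigil JSON report read from stdin.
# Assumes a top-level "findings" array in the report (an assumption).
count_findings() {
  python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("findings", [])))'
}

# Rank the four models by number of findings, fewest first.
for w in parallel-{1,2,3,4}; do
  echo "$w $(vigil scan ".architect/worktrees/$w/src/" --format json | count_findings)"
done | sort -k2 -n
```

Finding counts are a coarse signal; weighting by severity, if vigil reports one, would give a fairer ranking.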

Phase 03 licit

Provenance per model

Generates a provenance record per model and compares the models against the OWASP Agentic framework.

⬡ licit
licit trace
licit report --framework owasp-agentic
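To get one report per model rather than a single aggregate, the Phase 03 commands can be run inside each worktree. This is a sketch under assumptions: the worktree layout comes from Phase 02, the licit flags mirror the commands above, and `report_model` is an illustrative wrapper, not part of licit.

```shell
# Hypothetical per-model provenance run. Worktree layout assumed from
# Phase 02; licit invocations mirror the Phase 03 commands above.
report_model() {
  ( cd ".architect/worktrees/$1" &&
    licit trace &&
    licit report --framework owasp-agentic )
}

for w in parallel-{1,2,3,4}; do
  report_model "$w" 2>/dev/null || echo "$w: skipped (no worktree or licit run failed)"
done
```

Keeping one report per model makes it easy to diff provenance side by side when picking the winning implementation.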