AI Model Evaluation and Benchmarking
Competitive execution of the same task with multiple models: security and provenance comparison.
Context
Competitive execution: the same task implemented by multiple models in parallel. vigil evaluates the security of each result; licit generates provenance per model.
Flow with 3 tools
△ Phase 01 — architect
Competitive execution
Same task implemented by 4 models in parallel.
△ architect
architect parallel "Implement /products CRUD with tests" \
  --models gpt-4.1,claude-sonnet-4,deepseek-chat,gemini-2.5-pro
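architect runs each model in its own isolated worktree. A minimal sanity check before scanning, assuming the parallel-1..4 worktree layout that the Phase 02 loop below relies on (the layout is inferred from that loop, not from architect's documentation):

for w in parallel-{1,2,3,4}; do
  # Hypothetical check: confirm each model actually produced a src/ tree
  test -d ".architect/worktrees/$w/src" && echo "$w: ok" || echo "$w: missing"
done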
◇ Phase 02 — vigil
Security per model
Evaluates the security of each model's result.
◇ vigil
for w in parallel-{1,2,3,4}; do
  vigil scan .architect/worktrees/$w/src/ --format json
done
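To rank the models rather than eyeball four scans, the JSON output can be aggregated. A hedged sketch using jq, assuming each scan emits a top-level findings array (the field name is an assumption; vigil's schema is not documented here):

for w in parallel-{1,2,3,4}; do
  # .findings is an assumed field in vigil's JSON output
  n=$(vigil scan ".architect/worktrees/$w/src/" --format json | jq '.findings | length')
  echo "$w: $n findings"
done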
⬡ Phase 03 — licit
Provenance per model
Generates provenance for each result and reports it against the OWASP Agentic framework.
⬡ licit
licit trace
licit report --framework owasp-agentic
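Putting the three phases together as one script. The tool commands are verbatim from this page; the shell options, output filenames, and comments are illustrative:

#!/usr/bin/env bash
set -euo pipefail

# Phase 01: same task, 4 models, one worktree each
architect parallel "Implement /products CRUD with tests" \
  --models gpt-4.1,claude-sonnet-4,deepseek-chat,gemini-2.5-pro

# Phase 02: one security scan per model result (filenames are illustrative)
for w in parallel-{1,2,3,4}; do
  vigil scan ".architect/worktrees/$w/src/" --format json > "vigil-$w.json"
done

# Phase 03: provenance plus OWASP Agentic report
licit trace
licit report --framework owasp-agentic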