24 · BENCHMARKING · Development and operations · Medium

AI Model Evaluation and Benchmarking

Competitive execution of the same task with multiple models: security and provenance comparison.

intake ☆☆☆
architect ★★★
vigil ★★☆
licit ★★☆

Competitive execution: the same task is implemented by multiple models in parallel. vigil evaluates the security of each result; licit generates a provenance record per model.

Phase 01 architect

Competitive execution

The same task is implemented by four models in parallel.

△ architect
architect parallel "Implement /products CRUD with tests" \
  --models gpt-4.1,claude-sonnet-4,deepseek-chat,gemini-2.5-pro
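Before comparing results, it can help to confirm that each model's run actually produced a worktree. This is a hypothetical sketch: the `.architect/worktrees/parallel-N` layout is an assumption taken from the Phase 02 scan loop, and `worktree_ready` is an illustrative helper, not part of architect.

```shell
# Hypothetical sanity check: did each model's run produce a worktree?
# The .architect/worktrees/parallel-N path is assumed from Phase 02.
worktree_ready() {
  [ -d ".architect/worktrees/$1/src" ]
}

for w in parallel-{1,2,3,4}; do
  worktree_ready "$w" && echo "$w: ready" || echo "$w: missing"
done
```

A missing worktree usually means one model's run failed or timed out; excluding it early keeps the later security and provenance comparisons fair.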

Phase 02 vigil

Security per model

Evaluates the security of each model's result.

◇ vigil
for w in parallel-{1,2,3,4}; do
  vigil scan .architect/worktrees/$w/src/ --format json
done
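The per-model scans only become a comparison once their findings are tallied. A minimal sketch, assuming vigil's JSON report exposes a top-level `findings` array (not verified against any particular vigil version); `count_findings` is an illustrative helper:

```shell
# Hypothetical helper: count findings in a vigil JSON report read from stdin.
# Assumes a top-level "findings" array in the report (an assumption).
count_findings() {
  python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("findings", [])))'
}

# Rank the four models by number of findings, fewest first.
for w in parallel-{1,2,3,4}; do
  echo "$w $(vigil scan ".architect/worktrees/$w/src/" --format json | count_findings)"
done | sort -k2 -n
```

Finding counts are a coarse signal; weighting by severity, if vigil reports one, would give a fairer ranking.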

Phase 03 licit

Provenance per model

Generates a provenance record per model and compares the models against the OWASP Agentic framework.

⬡ licit
licit trace
licit report --framework owasp-agentic
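To get one report per model rather than a single aggregate, the Phase 03 commands can be run inside each worktree. This is a sketch under assumptions: the worktree layout comes from Phase 02, the licit flags mirror the commands above, and `report_model` is an illustrative wrapper, not part of licit.

```shell
# Hypothetical per-model provenance run. Worktree layout assumed from
# Phase 02; licit invocations mirror the Phase 03 commands above.
report_model() {
  ( cd ".architect/worktrees/$1" &&
    licit trace &&
    licit report --framework owasp-agentic )
}

for w in parallel-{1,2,3,4}; do
  report_model "$w" 2>/dev/null || echo "$w: skipped (no worktree or licit run failed)"
done
```

Keeping one report per model makes it easy to diff provenance side by side when picking the winning implementation.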