Cost Management in Architect CLI
Introduction
Calls to language models (LLMs) have a direct cost measured in tokens consumed. In workflows where an autonomous agent executes dozens of steps, each with thousands of context tokens, spending can scale rapidly if it is not monitored and controlled.
Architect CLI includes a comprehensive cost management system that covers three needs:
- Tracking: recording the exact cost of each LLM call, broken down by model, tokens, and source.
- Budget: setting spending limits per execution with graceful shutdown when exceeded.
- Optimization: reducing costs through prompt caching, model selection, and local development cache.
This document explains how each component works, how to configure it, and how to apply optimization strategies to keep costs under control.
How cost tracking works
CostTracker: per-step recording
The core of the system is CostTracker (in src/architect/costs/tracker.py). Every time the agent makes an LLM call, the tracker records a StepCost with the following information:
- step: agent step number
- model: model used (e.g.,
gpt-4o,claude-sonnet-4-6) - input_tokens: input tokens (full prompt)
- output_tokens: output tokens (model response)
- cached_tokens: tokens served from the provider cache (reduced cost)
- cost_usd: calculated cost in dollars
- source: call origin:
"agent"(main loop),"eval"(self-evaluation), or"summary"(context compression)
# Internal example: how the agent records each call
cost_tracker.record(
step=5,
model="gpt-4o",
usage={"prompt_tokens": 8500, "completion_tokens": 1200, "cache_read_input_tokens": 3000},
source="agent",
)
PriceLoader: price resolution
PriceLoader (in src/architect/costs/prices.py) resolves the price for each model following a priority order:
- Exact match: the model name matches a key in the pricing table.
- Prefix match: the model starts with a registered key (e.g.,
gpt-4o-2024-08-06matchesgpt-4o). - Base name match: the base prefix is extracted and a match is searched.
- Generic fallback: if no match is found, conservative prices of $3.00 / $15.00 per million tokens (input/output) are applied.
Prices are loaded from src/architect/costs/default_prices.json at startup. Optionally, they can be overridden with a custom file via configuration.
Token counting: input, output, and cached
The cost of a call is calculated with this formula:
cost = (non_cached_tokens / 1M) * input_price
+ (cached_tokens / 1M) * cached_input_price
+ (output_tokens / 1M) * output_price
Where non_cached_tokens = input_tokens - cached_tokens.
If the model does not have a cached input price defined, cached tokens are charged at the normal input price.
BudgetExceededError: graceful shutdown
When the accumulated cost exceeds the configured budget (budget_usd), the tracker raises BudgetExceededError. The agent loop catches this exception and performs a graceful shutdown:
- Agent execution is stopped.
- The partial result is returned with
status: "partial". - The cost summary is included in the output.
Additionally, there is a warning threshold (warn_at_usd) that emits a log warning when reached, without stopping execution. This allows configuring alerts before the full budget is exhausted.
Per-model pricing table
Prices updated as of February 2026. All values are in USD per million tokens.
| Model | Input $/1M | Output $/1M | Cached Input $/1M |
|---|---|---|---|
| OpenAI | |||
gpt-4o | 2.50 | 10.00 | 1.25 |
gpt-4o-mini | 0.15 | 0.60 | 0.075 |
gpt-4.1 | 2.00 | 8.00 | 0.50 |
gpt-4.1-mini | 0.40 | 1.60 | 0.10 |
gpt-4.1-nano | 0.10 | 0.40 | 0.025 |
o1 | 15.00 | 60.00 | 7.50 |
o1-mini | 1.10 | 4.40 | 0.55 |
o3-mini | 1.10 | 4.40 | 0.55 |
| Anthropic | |||
claude-opus-4-6 | 15.00 | 75.00 | 1.50 |
claude-sonnet-4-6 | 3.00 | 15.00 | 0.30 |
claude-haiku-4-5 | 0.80 | 4.00 | 0.08 |
claude-opus-4 | 15.00 | 75.00 | 1.50 |
claude-sonnet-4 | 3.00 | 15.00 | 0.30 |
claude-haiku-4 | 0.80 | 4.00 | 0.08 |
claude-3-5-sonnet | 3.00 | 15.00 | 0.30 |
claude-3-5-haiku | 0.80 | 4.00 | 0.08 |
gemini/gemini-2.0-flash | 0.10 | 0.40 | 0.025 |
gemini/gemini-2.5-pro | 1.25 | 10.00 | 0.315 |
gemini/gemini-1.5-pro | 1.25 | 5.00 | 0.3125 |
| DeepSeek | |||
deepseek/deepseek-chat | 0.27 | 1.10 | 0.07 |
deepseek/deepseek-reasoner | 0.55 | 2.19 | 0.14 |
| Other | |||
ollama (local) | 0.00 | 0.00 | 0.00 |
together_ai | 0.90 | 0.90 | — |
| (generic fallback) | 3.00 | 15.00 | — |
Model selection guide by task type
| Task | Recommended model | Reason |
|---|---|---|
| Code review, linting | gpt-4o-mini, claude-haiku-4-5, gemini-2.0-flash | Simple tasks that do not require deep reasoning |
| Planning, design | gpt-4o, claude-sonnet-4-6, gemini-2.5-pro | Good balance between quality and cost |
| Complex refactoring | gpt-4.1, claude-sonnet-4-6 | High code quality at moderate cost |
| Critical architecture | claude-opus-4-6, o1 | Maximum reasoning capability |
| Iterative development | ollama (local) | Zero cost, ideal for experimentation |
| Low-cost tasks | gpt-4.1-nano, deepseek/deepseek-chat | Ultra-low cost for simple tasks |
Cost configuration
YAML configuration
In the project’s architect.yaml file:
costs:
enabled: true # Enable/disable cost tracking (default: true)
budget_usd: 1.00 # Spending limit in USD per execution (null = no limit)
warn_at_usd: 0.75 # Warning threshold (log warning when reached)
prices_file: ./my_prices.json # JSON file with custom prices (optional)
costs.enabled: when true (default), the cost of each LLM call is recorded. If disabled, no costs are calculated and no budget is enforced.
costs.budget_usd: maximum spending limit in dollars per execution. If the accumulated cost exceeds it, the agent stops with status: "partial". Setting it to null (default) disables the limit.
costs.warn_at_usd: warning threshold. When accumulated spending reaches this value, a log warning is emitted. It does not stop execution. Useful for anticipating that the budget is running out.
costs.prices_file: path to a JSON file with custom prices. It has the same format as default_prices.json. Custom prices override the defaults for the specified models.
CLI flags
# Set budget from the command line
architect run "refactoriza el modulo auth" --budget 0.50
# Show cost summary at the end
architect run "genera tests" --show-costs
# Combine budget and cost display
architect run "refactoriza todo" --budget 0.50 --show-costs
| Flag | Description |
|---|---|
--budget FLOAT | Spending limit in USD for this execution |
--show-costs | Show cost summary at the end |
The --budget flag overrides the costs.budget_usd value from the YAML file for that execution.
Environment variables
Architect supports the following environment variables relevant to costs:
| Variable | Effect |
|---|---|
ARCHITECT_MODEL | Overrides the default model (llm.model) |
ARCHITECT_API_BASE | Overrides the API base URL (llm.api_base) |
To use a local model via Ollama:
export ARCHITECT_MODEL=ollama/llama3
export ARCHITECT_API_BASE=http://localhost:11434
architect run "tu tarea" --show-costs
# Cost: $0.0000
Prompt caching — reduce costs by up to 90%
How it works
Prompt caching is a feature provided by LLM providers (primarily Anthropic) that allows caching the system prompt between consecutive calls. In a typical agent flow, the system prompt is identical across all steps; only the history messages change.
When prompt caching is active, Architect adds cache_control to the system message. The provider caches that content and serves it from cache in subsequent calls at a significantly reduced price.
Typical savings: Anthropic models charge cached input at 10% of the normal price. This means a 5,000-token system prompt reused 20 times costs ~90% less than without caching.
Supported providers
| Provider | Support | Savings ratio |
|---|---|---|
| Anthropic (Claude) | Full | ~90% on cached tokens |
| OpenAI (GPT-4o) | Full | ~50% on cached tokens |
| Google (Gemini) | Full | ~75% on cached tokens |
| DeepSeek | Full | ~74% on cached tokens |
| Ollama (local) | N/A | Cost is always $0 |
Configuration
llm:
model: claude-sonnet-4-6
prompt_caching: true # Enable prompt caching (default: false)
When to use it
- Recommended: projects where Architect is executed repeatedly with the same system prompt (iterative development, CI/CD).
- Especially useful: with Anthropic models where savings reach ~90%.
- Impact: greatest in long executions (many steps) where the system prompt is repeated in each call.
- No effect: with local models (Ollama) where cost is already $0.
Savings example: with claude-sonnet-4-6 and a 4,000-token system prompt in a 15-step execution:
- Without caching: 15 * 4,000 = 60,000 tokens at $3.00/M = $0.18
- With caching: 4,000 at $3.00/M + 14 * 4,000 at $0.30/M = $0.012 + $0.0168 = $0.029
- Savings: ~84% on system prompt cost
Local LLM response cache
What it is
LocalLLMCache (in src/architect/llm/cache.py) is a deterministic on-disk cache that stores complete LLM responses. When the messages and tools of a call are identical to a previous call, the cached response is returned without making any API call.
Important: this cache is exclusively for development. It should not be used in production because cached responses do not reflect changes in the project context.
How it works
- A SHA-256 key is generated from the canonical JSON of
(messages, tools). - A
{hash}.jsonfile is looked up in the cache directory. - If it exists and has not expired (TTL), the stored response is returned.
- If it does not exist or has expired, the LLM call is made and the response is saved.
Cache failures are silent: they never break the agent flow.
YAML configuration
llm_cache:
enabled: false # Enable local cache (default: false)
dir: ~/.architect/cache # Storage directory
ttl_hours: 24 # Validity hours for each entry (1-8760)
CLI flags
# Enable local cache for this execution
architect run "genera tests" --cache
# Disable cache even if enabled in YAML
architect run "genera tests" --no-cache
# Clear all cache before executing
architect run "genera tests" --cache-clear
| Flag | Description |
|---|---|
--cache | Enable local LLM cache for this execution |
--no-cache | Disable local cache even if enabled in config |
--cache-clear | Delete all cache entries before executing |
When to use it
- Iterative development: when testing the same prompt repeatedly and wanting to avoid paying for each test.
- Debugging: to reproduce exact agent behaviors.
- Not in production: cached responses do not account for changes in project files.
- Not with dynamic prompts: if the prompt changes on every execution, the cache will have a very low hit rate.
Cost estimates by task type
Estimates assume the use of gpt-4o as the main model. Actual costs vary depending on project complexity, file sizes, and prompt quality.
| Task type | Typical steps | Estimated cost | Recommended model | Suggested budget |
|---|---|---|---|---|
| Simple code review | 3-5 | $0.01 - $0.05 | gpt-4o-mini | $0.10 |
| Planning | 3-5 | $0.03 - $0.10 | gpt-4o | $0.20 |
| Small code change | 5-10 | $0.05 - $0.20 | gpt-4o | $0.50 |
| Test generation | 8-15 | $0.10 - $0.40 | gpt-4o | $0.75 |
| Documentation | 5-10 | $0.05 - $0.15 | gpt-4o-mini | $0.30 |
| Complex refactoring | 15-30 | $0.20 - $1.00 | claude-sonnet-4-6 | $2.00 |
| Large new feature | 20-40 | $0.50 - $2.00 | gpt-4.1 | $3.00 |
Note: these costs are for a single agent execution. Advanced features (Ralph Loop, Parallel, etc.) multiply these values as described in the following section.
Cost multipliers in advanced features
Advanced Architect features execute multiple LLM calls internally. It is crucial to account for these multipliers when setting budgets.
Ralph Loop (iterations)
Each Ralph Loop iteration executes a complete agent from scratch (clean context). The cost is multiplied by the number of iterations.
ralph_cost = base_cost * N_iterations
Example: a task with a base cost of $0.30 and --max-iterations 5 can cost up to $1.50.
architect loop "implementa feature X" --check "pytest" --max-iterations 5 --budget 2.00
Parallel (workers)
Each worker in parallel execution is a complete architect run subprocess in an isolated git worktree. The cost is multiplied by the number of workers.
parallel_cost = base_cost * N_workers
Example: 3 workers with a base cost of $0.20 each = $0.60 total.
architect parallel --budget-per-worker 0.50
Auto-review
Automatic review adds at least one extra LLM call to analyze the generated diff. If issues are detected and a fix pass is triggered, an additional agent execution is added.
review_cost = base_cost + review_call_cost + (fix_pass_cost if issues found)
Estimate: +10-30% over the base cost.
Self-evaluation
Agent self-evaluation adds extra calls depending on the mode:
- basic: +1 LLM call at the end of execution.
- full: +1 call per retry (up to
max_retriesretries).
eval_cost_basic = base_cost + 1_eval_call
eval_cost_full = base_cost + N_retries * 1_eval_call
Context compression
When the agent context grows too large, automatic compression is triggered to summarize the history. This requires an extra LLM call.
compression_cost = base_cost + N_compressions * 1_summary_call
General estimation formula
To estimate the total cost of a complex execution:
total_cost = (base_cost + eval_calls + compression_calls + review_calls) * loop_factor * parallel_factor
Where:
base_cost = steps * average_tokens * price_per_token
eval_calls = 0 (no eval), 1 (basic), or N (full with retries)
compression_calls = number of times compression is triggered
review_calls = 0 (no review), 1-2 (with review)
loop_factor = 1 (no Ralph), or N (Ralph iterations)
parallel_factor = 1 (no parallel), or N (number of workers)
Complete example: complex refactoring (~$0.40 base) with basic self-eval (+$0.05), auto-review (+$0.08), in a Ralph Loop of 3 iterations:
($0.40 + $0.05 + $0.08) * 3 = $1.59
Recommended budget: $2.00
Optimization strategies
1. Select the right model for each task
Not all tasks require the most powerful model. Using gpt-4o-mini or claude-haiku-4-5 for review and documentation tasks can reduce costs by 90% compared to gpt-4o.
# architect.yaml — economical model as default
llm:
model: gpt-4o-mini
# Override for tasks that require more capability
# architect run "refactoring complejo" --model gpt-4o
2. Budget as a safety net
Always set a budget. Not as a spending target, but as protection against runaway executions:
costs:
budget_usd: 2.00 # Absolute maximum
warn_at_usd: 1.50 # Warning at 75%
In CI/CD it is especially important to avoid unexpected costs:
architect run "$TASK" --budget 1.00 --confirm-mode yolo
3. Improve prompt quality
A clear and specific prompt reduces the number of steps the agent needs. Fewer steps = fewer LLM calls = lower cost.
Comparison:
- Vague prompt: “fix the bugs” — 15-20 steps, $0.40
- Precise prompt: “fix the null check in
auth.py:42that causes a crash whenuser.emailis None” — 3-5 steps, $0.08
4. Context management
Configure context compression to prevent prompts from growing indefinitely:
agent:
max_steps: 25
context:
summarize_after_steps: 15 # Compress context after N steps
max_tool_result_tokens: 4000 # Limit tool results
Fewer context tokens = lower cost per step.
5. Local models for development
For rapid iteration during development, using Ollama with a local model completely eliminates API cost:
export ARCHITECT_MODEL=ollama/llama3
export ARCHITECT_API_BASE=http://localhost:11434
architect run "experimenta con esta logica" --show-costs
6. Prompt caching for team deployments
In environments where multiple developers or CI pipelines run Architect with the same system prompt, enabling prompt caching significantly reduces aggregate cost:
llm:
model: claude-sonnet-4-6
prompt_caching: true
Monitoring for teams
—show-costs output
When using --show-costs, Architect displays a summary at the end:
Costs: $0.0342 (8,450 in / 2,100 out / 3,200 cached)
This compact format shows: total cost, input tokens, output tokens, and cached tokens (if any).
JSON output
When using --json, the output includes a detailed cost block:
{
"status": "completed",
"result": "...",
"costs": {
"total_input_tokens": 45200,
"total_output_tokens": 12800,
"total_cached_tokens": 18000,
"total_tokens": 58000,
"total_cost_usd": 0.042,
"by_source": {
"agent": 0.038,
"eval": 0.004
}
}
}
Reports with costs
Reports generated with --report include cost information:
# JSON report with costs included
architect run "tarea" --report json --report-file report.json --show-costs
# Markdown report for documentation
architect run "tarea" --report markdown --report-file report.md
Cost aggregation in CI
To aggregate costs across multiple executions in CI/CD, you can parse the JSON output:
# In a CI pipeline
architect run "$TASK" --json --budget 1.00 > result.json
# Extract cost
jq '.costs.total_cost_usd' result.json
To maintain a historical record, you can send it to a metrics system or simply accumulate in a file:
COST=$(architect run "$TASK" --json | jq '.costs.total_cost_usd')
echo "$(date -Iseconds) $TASK $COST" >> costs.log
Budget alerts with hooks
Architect supports budget_warning hooks that execute when accumulated spending reaches the warning threshold:
costs:
budget_usd: 2.00
warn_at_usd: 1.50
hooks:
budget_warning:
- run: "echo 'ALERT: budget at 75%' | slack-notify"
- run: "curl -X POST https://monitoring.example.com/alert -d 'budget_warning'"
This allows integrating cost alerts with team notification systems (Slack, PagerDuty, custom webhooks, etc.).
Local models — zero cost
Ollama configuration
Ollama allows running language models locally without any API cost. Architect supports it natively through LiteLLM.
Ollama installation:
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3
ollama pull codellama
Architect configuration:
llm:
model: ollama/llama3
api_base: http://localhost:11434
timeout: 120 # Local models can be slower
Or via environment variables:
export ARCHITECT_MODEL=ollama/llama3
export ARCHITECT_API_BASE=http://localhost:11434
architect run "tu tarea"
Registered prices
All models matching the ollama prefix have a $0.00 price in the pricing table:
{
"ollama": {
"input_per_million": 0.0,
"output_per_million": 0.0,
"cached_input_per_million": 0.0
}
}
Limitations
| Aspect | Cloud models | Local models (Ollama) |
|---|---|---|
| Cost | Variable based on usage | Always $0 |
| Quality | High (GPT-4o, Claude) | Variable, generally lower |
| Speed | Fast (GPU servers) | Depends on local hardware |
| Maximum context | 128K-200K tokens | Typically 4K-32K |
| Tool calling | Full | Limited support in some models |
| Availability | Requires internet | Works offline |
Usage recommendations
- Development and experimentation: ideal for iterating on prompts and flows at no cost.
- Simple tasks: variable renaming, formatting, boilerplate generation.
- Not recommended for: complex refactoring, architecture analysis, or any task where output quality is critical.
- Optimal combination: use Ollama during local development and a cloud model in CI/CD with a budget.
# Local development — $0 cost
ARCHITECT_MODEL=ollama/llama3 ARCHITECT_API_BASE=http://localhost:11434 architect run "prototipa esto"
# CI/CD — cloud model with budget
architect run "implementa feature" --budget 1.00 --show-costs