Overview
Dakora’s model comparison feature allows you to execute the same template across multiple LLM models simultaneously and compare their outputs, costs, and performance.
This is useful for:
- Model Selection: Find the best model for your use case
- Cost Optimization: Compare costs across providers
- Quality Assessment: Evaluate output quality side-by-side
- A/B Testing: Test different models in production scenarios
Quick Start
Python API
from dakora import Vault

vault = Vault("dakora.yaml")
template = vault.get("summarizer")

# Compare across 3 models
result = template.compare(
    models=["gpt-4", "claude-3-opus", "gemini-pro"],
    input_text="Your article content here..."
)

# Access results
for r in result.results:
    print(f"{r.model}: {r.output[:100]}...")
    print(f" Cost: ${r.cost_usd:.4f}")
    print(f" Latency: {r.latency_ms}ms")

# Aggregate metrics
print(f"\nTotal Cost: ${result.total_cost_usd:.4f}")
print(f"Success Rate: {result.successful_count}/{len(result.results)}")
CLI
dakora compare summarizer \
--models gpt-4,claude-3-opus,gemini-pro \
--input-text "Your article content here..."
Features
Parallel Execution
Models are executed in parallel using asyncio for maximum speed:
# All 5 models execute simultaneously
result = template.compare(
    models=["gpt-4", "gpt-4-turbo", "claude-3-opus", "claude-3-sonnet", "gemini-pro"],
    input_text="Text to analyze"
)
Parallel execution significantly reduces total wait time compared to sequential execution.
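Because the calls run concurrently, the total wall-clock time is roughly the latency of the slowest model rather than the sum of all latencies. A quick way to see the difference, reusing the result object from the example above and only the documented latency_ms and error fields:

# Estimate sequential vs. parallel wall-clock time from per-model latencies
latencies = [r.latency_ms for r in result.results if not r.error]

sequential_estimate_ms = sum(latencies)           # running the models one-by-one
parallel_estimate_ms = max(latencies, default=0)  # roughly what compare() waits for

print(f"Sequential estimate: {sequential_estimate_ms}ms")
print(f"Parallel estimate: {parallel_estimate_ms}ms")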
Graceful Failure Handling
If one model fails, others continue execution:
result = template.compare(
    models=["gpt-4", "nonexistent-model", "claude-3-opus"],
    input_text="Test"
)

# Check which succeeded
for r in result.results:
    if r.error:
        print(f"❌ {r.model}: {r.error}")
    else:
        print(f"✅ {r.model}: Success")

print(f"Successful: {result.successful_count}/{len(result.results)}")
Example output:
✅ gpt-4: Success
❌ nonexistent-model: Model not found
✅ claude-3-opus: Success
Successful: 2/3
Cost and Performance Tracking
Execution metadata (provider, cost, latency, tokens) is tracked automatically:
result = template.compare(
    models=["gpt-4", "claude-3-opus"],
    input_text="Test"
)

# Per-model metrics
for r in result.results:
    print(f"{r.model}:")
    print(f" Provider: {r.provider}")
    print(f" Cost: ${r.cost_usd:.4f}")
    print(f" Latency: {r.latency_ms}ms")
    print(f" Tokens: {r.tokens_in} → {r.tokens_out}")

# Aggregate metrics
print(f"\nTotal Cost: ${result.total_cost_usd:.4f}")
print(f"Total Tokens: {result.total_tokens_in} → {result.total_tokens_out}")
Usage Examples
Basic Comparison
from dakora import Vault

vault = Vault("dakora.yaml")
template = vault.get("summarizer")

result = template.compare(
    models=["gpt-4", "claude-3-opus"],
    input_text="Long article text..."
)

# Print outputs side-by-side
for r in result.results:
    print(f"\n{'='*60}")
    print(f"Model: {r.model}")
    print(f"{'='*60}")
    print(r.output)
With LLM Parameters
Pass parameters that apply to all models:
result = template.compare(
    models=["gpt-4", "claude-3-opus", "gemini-pro"],
    input_text="Article...",
    temperature=0.7,
    max_tokens=150
)
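Since these extra keyword arguments are forwarded to every model, you can also keep shared settings in a plain dict and unpack it into the call (ordinary Python, shown here only as a convenience pattern):

# One set of generation settings applied to every model in the comparison
shared_params = {"temperature": 0.7, "max_tokens": 150}

result = template.compare(
    models=["gpt-4", "claude-3-opus", "gemini-pro"],
    input_text="Article...",
    **shared_params,
)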
Cost Analysis
Find the most cost-effective model:
result = template.compare(
    models=["gpt-4", "gpt-3.5-turbo", "claude-3-haiku", "gemini-flash"],
    input_text="Test article"
)

# Sort by cost
sorted_results = sorted(result.results, key=lambda r: r.cost_usd)

print("Models ranked by cost:")
for r in sorted_results:
    if not r.error:
        print(f"{r.model}: ${r.cost_usd:.4f}")
Quality vs Cost Trade-off
result = template.compare(
    models=["gpt-4", "gpt-3.5-turbo", "claude-3-opus", "claude-3-haiku"],
    input_text="Complex article requiring nuanced understanding..."
)

# Analyze cost vs output length as proxy for quality
for r in result.results:
    if not r.error:
        cost_per_token = r.cost_usd / r.tokens_out if r.tokens_out > 0 else 0
        print(f"{r.model}:")
        print(f" Cost: ${r.cost_usd:.4f}")
        print(f" Output length: {len(r.output)} chars")
        print(f" Cost per output token: ${cost_per_token:.6f}")
CLI Comparison Modes
Default Table View
dakora compare summarizer \
--models gpt-4,claude-3-opus,gemini-pro \
--input-text "Article..."
Output:
─────────────────────────────────────────────────────────────────
Model │ Response (preview) │ Cost │ Latency
─────────────────────────────────────────────────────────────────
✅gpt-4 │ The article discusses... │ $0.0045 │ 1,234ms
✅claude-3-opus │ This article explores... │ $0.0038 │ 856ms
✅gemini-pro │ A comprehensive over... │ $0.0012 │ 2,145ms
─────────────────────────────────────────────────────────────────
Total Cost: $0.0095 | Success: 3/3
Verbose Mode (Full Responses)
dakora compare summarizer \
--models gpt-4,claude-3-opus \
--input-text "Article..." \
--verbose
Output:
✅ gpt-4 (openai)
Cost: $0.0045 | Latency: 1,234 ms | Tokens: 150 → 80
The article discusses the recent advances in artificial
intelligence and machine learning technologies...
(full response)
────────────────────────────────────────────────────────
✅ claude-3-opus (anthropic)
Cost: $0.0038 | Latency: 856 ms | Tokens: 150 → 75
This article explores cutting-edge developments in AI...
(full response)
────────────────────────────────────────────────────────
Total Cost: $0.0083
Success Rate: 2/2
JSON Mode (Programmatic)
dakora compare summarizer \
--models gpt-4,claude-3-opus \
--input-text "Article..." \
--json
Output:
{
  "results": [
    {
      "model": "gpt-4",
      "provider": "openai",
      "output": "The article discusses...",
      "tokens_in": 150,
      "tokens_out": 80,
      "cost_usd": 0.0045,
      "latency_ms": 1234,
      "error": null
    },
    {
      "model": "claude-3-opus",
      "provider": "anthropic",
      "output": "This article explores...",
      "tokens_in": 150,
      "tokens_out": 75,
      "cost_usd": 0.0038,
      "latency_ms": 856,
      "error": null
    }
  ],
  "total_cost_usd": 0.0083,
  "total_tokens_in": 300,
  "total_tokens_out": 155,
  "successful_count": 2,
  "failed_count": 0
}
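The JSON mode is handy for scripting. As an illustration (assuming the dakora CLI is on your PATH and prints the JSON document above to stdout), a short Python helper could parse the output and pick the cheapest successful model:

import json
import subprocess

# Run the CLI in JSON mode and parse its output (illustrative helper)
proc = subprocess.run(
    [
        "dakora", "compare", "summarizer",
        "--models", "gpt-4,claude-3-opus",
        "--input-text", "Article...",
        "--json",
    ],
    capture_output=True,
    text=True,
    check=True,
)
data = json.loads(proc.stdout)

# Pick the cheapest model that succeeded
ok = [r for r in data["results"] if r["error"] is None]
if ok:
    cheapest = min(ok, key=lambda r: r["cost_usd"])
    print(f"Cheapest successful model: {cheapest['model']} (${cheapest['cost_usd']:.4f})")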
API Reference
template.compare()
Execute template across multiple models in parallel.
Parameters:
- models (list[str], required): List of model identifiers
- **kwargs: Template inputs and LLM parameters
Returns:
ComparisonResult: Object containing results and aggregate metrics
Example:
result = template.compare(
    models=["gpt-4", "claude-3-opus"],
    input_text="Article...",
    temperature=0.7
)
ComparisonResult
Attributes:
- results (list[ExecutionResult]): List of execution results, one per model
- total_cost_usd (float): Sum of costs across successful executions
- total_tokens_in (int): Sum of input tokens
- total_tokens_out (int): Sum of output tokens
- successful_count (int): Number of successful executions
- failed_count (int): Number of failed executions
ExecutionResult
Attributes:
- output (str): LLM response text
- provider (str): Provider name (e.g., "openai", "anthropic")
- model (str): Model name
- tokens_in (int): Input token count
- tokens_out (int): Output token count
- cost_usd (float): Execution cost in USD
- latency_ms (int): Response latency in milliseconds
- error (str | None): Error message if execution failed
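As a small usage sketch, the documented attributes are enough to pick out the fastest and cheapest successful executions from a ComparisonResult (here called result, as in the earlier examples):

# Keep only executions that completed without an error
successful = [r for r in result.results if r.error is None]

if successful:
    fastest = min(successful, key=lambda r: r.latency_ms)
    cheapest = min(successful, key=lambda r: r.cost_usd)
    print(f"Fastest: {fastest.model} ({fastest.latency_ms}ms)")
    print(f"Cheapest: {cheapest.model} (${cheapest.cost_usd:.4f})")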
Best Practices
Model Selection Strategy
- Start broad: Compare diverse models (GPT-4, Claude, Gemini)
- Narrow down: Test variants of the best performers (e.g., claude-3-opus vs claude-3-sonnet), as sketched below
- Consider cost: Factor in both quality and price for your use case
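For example, the broad-then-narrow strategy above could be scripted as two rounds of compare(); the model lists here are only illustrative:

# Round 1: one representative model per provider
broad = template.compare(
    models=["gpt-4", "claude-3-opus", "gemini-pro"],
    input_text="Representative article...",
)

# Round 2: suppose the Claude output read best; compare its variants
narrow = template.compare(
    models=["claude-3-opus", "claude-3-sonnet", "claude-3-haiku"],
    input_text="Representative article...",
)

for r in narrow.results:
    if not r.error:
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.latency_ms}ms")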
Handling Failures
Always check for errors before using results:
result = template.compare(models=[...], input_text="...")

for r in result.results:
    if r.error:
        print(f"Warning: {r.model} failed: {r.error}")
        continue

    # Safe to use r.output
    process_output(r.output)
Performance Tips
- Limit concurrent models: Comparing more than 5-10 models at once may hit provider rate limits
- Use appropriate timeouts: Set a reasonable max_tokens to avoid long waits
- Cache results: Store comparison results for expensive evaluations (see the sketch below)
Cost Management
# Set a cost budget
MAX_COST = 0.10  # $0.10

result = template.compare(
    models=["gpt-4", "claude-3-opus", "gemini-pro"],
    input_text="Article..."
)

if result.total_cost_usd > MAX_COST:
    print(f"⚠️ Cost ${result.total_cost_usd:.4f} exceeds budget ${MAX_COST}")
else:
    print(f"✅ Within budget: ${result.total_cost_usd:.4f}")
Integration with Logging
Comparisons are automatically logged when logging is enabled:
# dakora.yaml
logging:
  enabled: true
  backend: sqlite
  db_path: ./dakora.db
Each model execution is logged separately with full metadata.
Next Steps