Token Usage & Cost Optimization

Strategies to manage and reduce API costs — provider failover, credential pools, context compression, model selection, and cost tracking for Hermes Agent.

TLDR: Control API costs by tracking spending per turn, compressing long contexts, routing routine tasks to cheaper models, pooling credentials to stretch free tiers, and running local models where possible.

Key Takeaways

  • Context compression cuts token usage by 50-80% on long sessions
  • Cheap models (Gemini Flash, DeepSeek, Haiku, Llama) cost 10-100x less than premium
  • Credential pools auto-rotate keys to stay within free tiers
  • Local models via Ollama cost zero API dollars
  • Enable show_cost to see per-turn spending
  • OpenRouter’s free tier models cost nothing

Track Your Spending

First, know what you’re spending:

# Show per-turn cost in the CLI
hermes config set display.show_cost true

# View usage analytics
hermes insights --days 7
hermes insights --days 30
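Per-turn cost is just token counts times per-model prices. A minimal sketch of the arithmetic, using illustrative token counts and prices (not current list prices for any provider):

```shell
# Estimated turn cost = (input_tokens * input_price + output_tokens * output_price) / 1M.
# All four numbers below are example values for illustration.
input_tokens=12000
output_tokens=800
input_price_per_m=3.00      # $ per 1M input tokens (example)
output_price_per_m=15.00    # $ per 1M output tokens (example)

cost=$(awk -v i="$input_tokens" -v o="$output_tokens" \
          -v ip="$input_price_per_m" -v op="$output_price_per_m" \
          'BEGIN { printf "%.4f", (i*ip + o*op) / 1000000 }')
echo "estimated turn cost: \$${cost}"
```

Note that input tokens usually dominate: here the 12,000-token prompt costs almost three times the 800-token reply, which is why the context-compression strategy below matters most.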

Strategy 1: Context Compression

The biggest cost driver is long context windows. Compress aggressively:

# Enable auto-compression
hermes config set compression.enabled true
hermes config set compression.threshold 0.50   # Trigger at 50% context
hermes config set compression.target_ratio 0.20 # Compress to 20%

# Manual compression mid-session
/compress

Auto-compression summarizes earlier turns when you approach the context limit, reducing tokens by 50-80% on long sessions.
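For example, a 100,000-token history compressed at the 20% target ratio configured above:

```shell
# Token savings from compression at the configured 20% target ratio.
history_tokens=100000
target_ratio=20   # percent, matches compression.target_ratio 0.20

compressed=$(( history_tokens * target_ratio / 100 ))
saved=$(( history_tokens - compressed ))
echo "compressed: ${compressed} tokens, saved: ${saved} tokens (80%)"
```

Because you pay for the full context on every turn, that 80,000-token saving is recovered on each subsequent message, not just once.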

Strategy 2: Smart Model Selection

Not every task needs a premium model:

Task                            | Recommended model       | Cost per 1M tokens
Complex coding, deep reasoning  | Claude Sonnet 4         | ~$15
General chat, research          | GPT-4o mini             | ~$0.60
Web search, summaries           | Gemini 2.0 Flash        | ~$0.10
Simple Q&A, data extraction     | DeepSeek Chat           | ~$0.27
Local, no API cost              | Llama 3.2 via Ollama    | $0.00

Switch models mid-session:

/model gemini-2.0-flash     # Cheap for simple tasks
/model deepseek/deepseek-chat  # Cheap for research

For cron jobs and subagents, set per-job models:

hermes cron create "30m" --model "gemini-2.0-flash"
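To see why routing matters, compare the approximate per-1M prices from the table above:

```shell
# How much cheaper is Gemini 2.0 Flash (~$0.10/1M) than Claude Sonnet 4 (~$15/1M)?
# Prices are the approximate figures from the table, not live pricing.
premium=15.00
cheap=0.10

ratio=$(awk -v p="$premium" -v c="$cheap" 'BEGIN { printf "%.0f", p / c }')
echo "Flash is ~${ratio}x cheaper per token"
```

A 150x price gap means a cron job that runs every 30 minutes on Flash costs roughly what a single daily run would cost on Sonnet.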

Strategy 3: Credential Pools

Avoid hitting expensive overage tiers by pooling multiple free-tier keys:

# Add multiple keys
hermes auth add              # Pick provider, paste key
hermes auth add              # Add another key for same provider

# When one hits its limit, Hermes auto-switches
hermes auth list openrouter

This is especially useful for OpenRouter’s free tier — each key gets a daily allowance, and rotating between them extends it.
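Hermes handles the rotation internally; the idea itself is simple round-robin over a pool. A simplified sketch (the key names and the pool format are illustrative, not how Hermes stores credentials):

```shell
# Round-robin over a pool of keys: take the first key, move it to the back.
# POSIX parameter expansion only; key names are placeholders.
pool="key-a key-b key-c"

next_key() {
  key=${pool%% *}       # first key in the pool
  rest=${pool#* }       # everything after it
  pool="$rest $key"     # rotate: append the used key to the back
  echo "$key"
}

next_key   # key-a
next_key   # key-b
next_key   # key-c
next_key   # key-a again
```

In practice Hermes rotates on limit errors rather than every call, but the effect is the same: daily allowances across N keys add up to N times the free quota.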

Strategy 4: Use the Free Tier

OpenRouter offers free models with rate limits:

hermes model
# Pick: openrouter/free or any model tagged "free"

Models often included in free tier:

  • Meta Llama 3.2 (3B, 8B, 70B)
  • Mistral 7B
  • Google Gemma 2
  • Microsoft Phi-3

These are slower and less capable than paid models, but cost zero.

Strategy 5: Local Models

For zero API cost, run models locally via Ollama:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.2

# Configure Hermes
hermes config set model.default "ollama/llama3.2"
hermes config set model.provider "ollama"

Local models use your own hardware (CPU/GPU) — no API calls, no token costs, no rate limits. The trade-off is speed: expect 10-50 tokens/second on CPU vs 100+ from an API.

Strategy 6: Reduce Output Tokens

# Suppress verbose output
/verbose off

# Use --quiet for one-shot queries
hermes chat -Q -q "Quick question"

# Short, direct prompts save tokens
# "What's the capital of France?" vs "Can you please tell me...?"
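A common rule of thumb for English text is about 4 characters per token. A quick estimator for comparing prompt lengths (a heuristic only, not the model's real tokenizer):

```shell
# Rough token estimate: English text averages ~4 characters per token.
# This is a heuristic; real tokenizers vary by model.
estimate_tokens() {
  n=${#1}
  echo $(( (n + 3) / 4 ))   # round up
}

estimate_tokens "What's the capital of France?"   # 29 chars -> ~8 tokens
estimate_tokens "Can you please tell me what the capital of France is, if you don't mind?"   # 72 chars -> ~18 tokens
```

The polite phrasing more than doubles the prompt's token count while asking the same question.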

Strategy 7: Provider Failover

Set up automatic failover to cheaper providers when expensive ones are down:

# Configure multiple providers
hermes config set model.default "anthropic/claude-sonnet-4"
hermes config set model.fallback "deepseek/deepseek-chat"

If Sonnet is rate-limited or unavailable, Hermes falls back to DeepSeek automatically.
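The failover logic reduces to "try the primary, fall back on failure." A sketch of the pattern (call_model is a stand-in that simulates a rate-limited primary, not a Hermes API):

```shell
# Failover sketch: try the primary model, fall back on a nonzero exit.
# call_model is a fake stand-in; here the primary always "fails" to
# demonstrate the fallback path.
call_model() {
  [ "$1" = "anthropic/claude-sonnet-4" ] && return 1   # simulate rate limit
  echo "answered by $1"
}

ask_with_failover() {
  call_model "anthropic/claude-sonnet-4" || call_model "deepseek/deepseek-chat"
}

ask_with_failover
```

Pairing a premium default with a cheap fallback also acts as a soft cost cap: outages and rate limits push traffic toward the cheaper provider instead of failing.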

Cost Comparison Table

Setup                              | Estimated monthly cost (moderate use)
All Claude Sonnet 4                | $50-200
Mix of Sonnet + DeepSeek           | $20-80
Mix of Gemini Flash + DeepSeek     | $5-20
OpenRouter free tier only          | $0-5
Local Ollama only                  | $0 (electricity only)
All of the above with failover     | Minimal
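A back-of-envelope check of the figures above; the 300,000 tokens/day volume is an assumption standing in for "moderate use", and the prices are the approximate per-1M figures from the model table:

```shell
# Monthly cost = tokens/day * days * price-per-1M / 1M.
# tokens_per_day is an assumed "moderate use" figure.
tokens_per_day=300000
days=30

monthly_cost() {  # $1 = price per 1M tokens
  awk -v t="$tokens_per_day" -v d="$days" -v p="$1" \
    'BEGIN { printf "%.2f", t * d * p / 1000000 }'
}

echo "All Sonnet (~\$15/1M):  \$$(monthly_cost 15.00)"
echo "All Flash (~\$0.10/1M): \$$(monthly_cost 0.10)"
```

Under these assumptions an all-Sonnet month lands near the middle of the $50-200 row, while the same volume on Flash costs under a dollar.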

FAQ

Q: How do I see how many tokens I’ve used? Use hermes insights --days N for overall stats, or enable display.show_cost for per-turn costs.

Q: Is context compression lossy? Yes — it summarizes earlier turns. Essential context is preserved, but specific details from 50 messages ago may be condensed. Use /compress only when approaching the limit.

Q: Can I set different models for different tools? Not per-tool, but you can per-job (cron) and per-subagent (delegation). Within a session, /model switches everything.

Q: Are free models good enough for daily use? For simple tasks (web search, file operations, Q&A), yes. For complex reasoning and code generation, paid models are noticeably better. Use the hybrid approach — cheap model for routine, premium model for hard tasks.

Next Steps