What is LLM inference cost?
Definition
Inference cost is the price of each LLM call, typically billed in dollars per million input tokens and dollars per million output tokens; rates vary widely between models and providers.
A typical agent workload (1,500 input + 500 output tokens per call, 1,000 calls/day) processes about 60M tokens/month and costs roughly $18–$360/month per agent depending on model selection. Self-hosted Llama on a shared GPU is roughly $0.30/M tokens; managed APIs like Claude Sonnet are $3/M input and $15/M output; flagship models like Claude Opus run higher still. Use the [Prompt Cost Estimator](/tools/prompt-cost-estimator) to project monthly spend for a specific workload.
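At these rates the arithmetic is simple enough to script. A minimal sketch, with prices hardcoded for illustration only (real pricing changes, so check your provider before relying on these numbers):

```python
# Illustrative prices in USD per million tokens -- not authoritative.
PRICES = {
    "self-hosted-llama": {"input": 0.30, "output": 0.30},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def monthly_cost(model, input_tokens, output_tokens, calls_per_day, days=30):
    """Projected monthly spend for one agent workload, in USD."""
    p = PRICES[model]
    calls = calls_per_day * days
    return (calls * input_tokens * p["input"]
            + calls * output_tokens * p["output"]) / 1_000_000

# The example workload: 1,500 input + 500 output tokens, 1,000 calls/day.
print(round(monthly_cost("self-hosted-llama", 1500, 500, 1000), 2))  # 18.0
print(round(monthly_cost("claude-sonnet", 1500, 500, 1000), 2))      # 360.0
```

The spread between the cheapest and most expensive row is the whole point: model selection dominates every other lever at this volume.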
Levers for reducing inference cost
- Smaller models for routing — use cheap models (Haiku, GPT-4o-mini) to decide whether a request needs an expensive model at all
- Prompt caching — providers cache repeated input tokens at lower rates
- Streaming + early termination — stop generation when the answer is complete
- RAG over long context — retrieve only the relevant 2K tokens instead of stuffing 200K
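The routing lever can be sketched as a thin layer in front of the expensive model. The heuristic and model names below are illustrative placeholders (in practice the decision would itself be a call to a cheap model), not a real provider API:

```python
# Sketch of model routing: send easy requests to a cheap model and
# escalate only when needed. Names and heuristic are hypothetical.
CHEAP, EXPENSIVE = "haiku", "opus"

def needs_expensive_model(prompt: str) -> bool:
    # Stand-in for a cheap router-model call: here, a trivial heuristic
    # that escalates long prompts or code-editing requests.
    return len(prompt) > 500 or "refactor" in prompt.lower()

def route(prompt: str) -> str:
    """Return the model that should handle this prompt."""
    return EXPENSIVE if needs_expensive_model(prompt) else CHEAP

print(route("What's the capital of France?"))        # haiku
print(route("Please refactor this billing module"))  # opus
```

Even a router that escalates 20% of traffic cuts the bill substantially when the cheap model is an order of magnitude less expensive per token.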
Check the [Prompt Cost Estimator](/tools/prompt-cost-estimator) to model these trade-offs.