What is LLM inference cost?

Definition

Inference cost is the per-call cost of running an LLM, typically priced in dollars per million input tokens and dollars per million output tokens. Rates vary widely between models and providers, often by two orders of magnitude.

A typical agent workload (1,500 input + 500 output tokens per call, 1,000 calls/day) processes about 45M input and 15M output tokens per month. Self-hosted Llama on a shared GPU at roughly $0.30/M tokens comes to about $18/month per agent; a managed API like Claude Sonnet at $3/M input / $15/M output comes to about $360/month; flagship models like Claude Opus or GPT-4o run higher still. Use the [Prompt Cost Estimator](/tools/prompt-cost-estimator) to project monthly spend for a specific workload.
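The arithmetic behind these projections can be sketched as a small helper. The rates used below are the illustrative figures quoted on this page, not live price lists; check provider pricing before budgeting:

```python
def monthly_cost(input_tokens_per_call: int,
                 output_tokens_per_call: int,
                 calls_per_day: int,
                 usd_per_m_input: float,
                 usd_per_m_output: float,
                 days: int = 30) -> float:
    """Project monthly LLM spend from per-call token counts and $/M-token rates."""
    monthly_input = input_tokens_per_call * calls_per_day * days    # tokens/month
    monthly_output = output_tokens_per_call * calls_per_day * days
    return (monthly_input / 1e6) * usd_per_m_input + \
           (monthly_output / 1e6) * usd_per_m_output

# Claude Sonnet at the rates quoted above ($3/M input, $15/M output):
sonnet = monthly_cost(1500, 500, 1000, 3.0, 15.0)       # → 360.0
# Self-hosted Llama at a blended ~$0.30/M tokens:
self_hosted = monthly_cost(1500, 500, 1000, 0.30, 0.30)  # → 18.0
```

The same function drops straight into a spreadsheet-style comparison across candidate models before committing to one.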

Levers for reducing inference cost

  1. Smaller models for routing — use cheap models (Haiku, GPT-4o-mini) to decide which expensive model to call
  2. Prompt caching — providers cache repeated input tokens at lower rates
  3. Streaming + early termination — stop generation when the answer is complete
  4. RAG over long context — retrieve only the relevant 2K tokens instead of stuffing 200K

Check the [Prompt Cost Estimator](/tools/prompt-cost-estimator) to model these trade-offs.
