Self-hosted LLM vs OpenAI API — when to choose what

The longer answer

The self-hosted-LLM vs commercial-API decision is one of the most consequential choices in production AI architecture, and the right answer is usually less exciting than the conversation around it suggests.

The default is commercial APIs

Anthropic Claude and OpenAI GPT-4-class models, called via API, are the right default for production AI workloads in 2026. Three reasons. Capability gap. Frontier commercial models (Claude Opus 4.7, GPT-5-class) outperform every open-source alternative on the benchmarks that matter for business workloads — instruction-following, long-context reasoning, code generation, multilingual handling. Operational simplicity. No GPU infrastructure to maintain, no model serving stack, no quantization choices, no scaling problems. Cost predictability. Per-token pricing with no fixed infrastructure cost. A workload that spikes 10x in volume costs 10x more; a workload that drops 90% costs 90% less.

When self-hosting is the right answer

Compliance-driven. HIPAA workloads where the BAA with the commercial provider is insufficient (rare in 2026 but real for some buyers); SOC 2 workloads where the buyer's auditors require data-never-leaves-perimeter; FedRAMP and similar government compliance; data-residency requirements (the data cannot leave a specific geography for regulatory reasons). This is the most common honest reason to self-host.

Specific capability requirements. When the use case requires a fine-tuned model in a domain commercial APIs do not serve well (some legal-specific tasks, some highly-specialized scientific domains, some niche-language work). Less common but real.

Genuinely high volume at predictable load. A workload doing 100M+ queries/month at predictable steady-state load might amortize the GPU infrastructure cost favorably vs per-token API pricing. The threshold is high; below 10M queries/month the API math almost always wins.

When self-hosting disappoints

Buyers chasing self-hosting for cost reasons are usually disappointed in the first 18 months. The reasons: GPU infrastructure operational complexity is real (one engineer-month per month is a reasonable starting estimate); the open-source model quality lags the frontier by 6-18 months on benchmarks that matter; the cost math only works at sustained high volume. For most production workloads in 2026, commercial APIs win on total cost of ownership for the first 2-3 years.

The hybrid pattern

Many production deployments use commercial APIs for the long tail of capability-demanding queries and a self-hosted small model (Llama 3.x 8B-class, Mistral 7B-class) for high-volume low-complexity queries. The router decides which model handles each query based on complexity heuristics. This pattern captures most of the cost-savings benefit without paying the full operational tax of self-hosting everything.

Common follow-up questions

What about open-source models I can self-host?

Llama 3.x, Mistral, DeepSeek, Qwen, and others are credible self-hosted options in 2026. The capability gap to frontier commercial models has narrowed but is not closed. For workloads where the frontier capability is not needed, the open-source models are production-ready.

Is Ollama enough for production?

For small-scale internal deployments, yes. For production customer-facing workloads with real throughput requirements, no — vLLM, TGI (Text Generation Inference), or a managed equivalent is the production serving layer.

Can I switch from API to self-hosted later?

Yes, and many buyers do. Start on commercial APIs, prove the use case, instrument the production traffic, then evaluate whether self-hosting makes sense based on actual usage data. Avoid making the decision before you have production data.