Self-Hosted LLMs: Total Cost of Ownership Beyond the GPU Bill

Ravinder · 8 min read
AI · LLM · Cost Optimization · GPU · Self-Hosting

Every month, a team somewhere calculates that their API spend on GPT-4 is $40k/month, discovers that an A100 server costs $10k/month to rent, and decides to self-host. Six months later they are spending $35k/month on compute, $20k/month in engineering time, and the system handles half the throughput they budgeted for. The break-even math looked clean on a spreadsheet. Reality added line items.

This is a guide to doing the math honestly.

What You Are Actually Buying

When you self-host an LLM, you are buying four things, not one:

  1. Compute capacity: GPU hours, memory, interconnect
  2. Inference infrastructure: the serving layer that converts GPU capacity into a usable API
  3. Operational overhead: the human time to keep it running, secure, scaled, and current
  4. Model flexibility: the ability to use models that are not available via API

The first is visible and easy to budget. The other three are where TCO calculations go wrong.

GPU Sizing: The Memory Constraint

GPU memory is the binding constraint. You need enough VRAM to hold the model weights plus the KV cache for concurrent requests.

def calculate_vram_requirement(
    model_params_billion: float,
    precision: str = "fp16",          # "fp32", "fp16", "int8", "int4"
    max_concurrent_requests: int = 32,
    max_context_length: int = 4096,
    num_layers: int = 32,              # model-specific
    num_kv_heads: int = 8,
    head_dim: int = 128,
) -> dict:
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}[precision]
 
    # Model weights
    weights_gb = model_params_billion * 1e9 * bytes_per_param / 1e9
 
    # KV cache: 2 (K+V) * layers * kv_heads * head_dim * seq_len * concurrent * bytes_per_element
    # (assumes the cache uses the same precision as the weights; in practice it often stays fp16)
    kv_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * max_context_length * max_concurrent_requests * bytes_per_param
    )
    kv_cache_gb = kv_bytes / 1e9
 
    # Overhead: activations, CUDA buffers, etc. (~15%)
    overhead_gb = (weights_gb + kv_cache_gb) * 0.15
 
    total_gb = weights_gb + kv_cache_gb + overhead_gb
 
    gpus_80gb = int(max(1, -(-total_gb // 80)))   # ceiling division

    return {
        "weights_gb": round(weights_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "overhead_gb": round(overhead_gb, 1),
        "total_required_gb": round(total_gb, 1),
        "a100_80gb_count": gpus_80gb,   # A100 80GB and H100 80GB hold the same VRAM,
        "h100_80gb_count": gpus_80gb,   # so the card counts are identical
    }
 
# Llama 3.1 70B at fp16, 32 concurrent requests, 4k context
result = calculate_vram_requirement(
    model_params_billion=70,
    precision="fp16",
    max_concurrent_requests=32,
    max_context_length=4096,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128
)
print(result)
# weights_gb: 140.0, kv_cache_gb: ~42.9, total: ~210.4GB → 3× A100 80GB minimum

This is why "just rent one A100" does not work for serious models. A 70B model at fp16 needs 140GB for the weights alone, which already takes two A100 80GB cards, and a third before there is headroom for the KV cache at even modest concurrency.
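
For comparison, here is the same sizing at int4 quantization, run through the function above. Note that the function prices the KV cache at int4 as well, while in practice the cache is often kept at fp16, so treat the total as a lower bound.

# Same 70B model with int4 weights (hypothetical quantized deployment)
result_int4 = calculate_vram_requirement(
    model_params_billion=70,
    precision="int4",
    max_concurrent_requests=32,
    max_context_length=4096,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128
)
print(result_int4)
# weights_gb: 35.0, kv_cache_gb: ~10.7, total: ~52.6GB → fits on a single 80GB card

Quantization changes the GPU math dramatically, which is why the A10G configuration in the next section is only viable with a quantized model.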

Throughput Math: Tokens Per Second Per Dollar

Raw tokens/second is not the number you should optimize for. Optimize for tokens/second/$. This lets you compare across GPU SKUs, quantization levels, and serving frameworks.

def compute_tps_per_dollar(
    gpu_name: str,
    gpu_count: int,
    hourly_cost_usd: float,           # total cluster cost per hour
    measured_output_tps: float,        # tokens per second at target concurrency
) -> dict:
    monthly_cost = hourly_cost_usd * 24 * 30
    tpm = measured_output_tps * 60    # tokens per minute
    monthly_tokens = tpm * 60 * 24 * 30
 
    return {
        "config": f"{gpu_count}× {gpu_name}",
        "monthly_cost_usd": round(monthly_cost, 0),
        "output_tps": measured_output_tps,
        "tps_per_dollar_per_hour": round(measured_output_tps / hourly_cost_usd, 2),
        "cost_per_million_output_tokens": round(monthly_cost / (monthly_tokens / 1e6), 2),
    }
 
# Typical benchmarks for Llama 3.1 70B with vLLM (approximate, verify for your workload)
configs = [
    compute_tps_per_dollar("A100 80GB", 4, 14.0, 180),    # ~$3.50/hr per A100
    compute_tps_per_dollar("H100 80GB", 2, 12.0, 220),    # ~$6.00/hr per H100
    compute_tps_per_dollar("A10G 24GB", 8, 9.6, 95),      # ~$1.20/hr per A10G (quantized)
]
 
for c in configs:
    print(c)

Compare this against your current API costs. If you are paying $15/million output tokens for GPT-4, and your self-hosted setup comes in at $8/million output tokens but requires 0.5 FTE to maintain, the break-even is not where you think it is.
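To make that concrete under those example numbers (the loaded engineer cost is an assumption, roughly $240k/yr):

# Extra fixed cost from 0.5 FTE of maintenance at an assumed $20k/month loaded cost
ops_cost_monthly = 0.5 * 20_000                   # $10,000/month
savings_per_million = 15.0 - 8.0                  # API rate minus self-hosted rate, $/M output tokens
tokens_to_cover_ops_millions = ops_cost_monthly / savings_per_million
print(f"~{tokens_to_cover_ops_millions:.0f}M output tokens/month just to cover the maintenance time")
# ~1429M: roughly 1.4B output tokens/month before the cheaper per-token rate nets out
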

vLLM vs TGI: Choosing Your Serving Stack

The two dominant open-source serving frameworks are vLLM and Text Generation Inference (TGI) from Hugging Face. Both are production-grade. The choice comes down to your constraints.

flowchart TD
    A[Choose Inference Stack] --> B{Primary concern?}
    B -- Maximum throughput\nand concurrency --> C[vLLM]
    B -- Hugging Face model\necosystem integration --> D[TGI]
    B -- Multi-model serving\nor LoRA hot-swap --> E[vLLM with LoRA support]
    B -- Quantized models\non consumer GPUs --> F[llama.cpp or Ollama]
    C --> G{Continuous batching\nand paged attention\nare requirements?}
    G -- Yes --> H[vLLM — best-in-class\nfor these features]
    G -- No --> I[Either works]
    D --> J{Need tensor parallelism\nacross > 2 GPUs?}
    J -- Yes --> K[vLLM handles this\nmore reliably at scale]
    J -- No --> L[TGI is fine]

vLLM excels at throughput. Its paged attention implementation and continuous batching mean that requests are batched dynamically as they arrive — you are not waiting for a fixed batch to fill before processing. At high concurrency (>50 concurrent requests), vLLM typically outperforms TGI by 20–40%.

TGI integrates tightly with the Hugging Face Hub and has better support for some exotic model architectures. The quantization support (AWQ, GPTQ) is mature.

For most production deployments at scale: use vLLM.

# vLLM server launch — production configuration
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000

Note --max-num-seqs 256: this controls max concurrent sequences in the KV cache. Set it too high and you OOM. Set it too low and you underutilize GPU memory. Start at 128, load test, and increase until you see memory pressure.
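
A minimal load-test sketch along those lines, assuming the server above is running locally and httpx is installed; the prompt, output length, and concurrency are placeholders to tune for your workload:

import asyncio
import time

import httpx

VLLM_URL = "http://localhost:8000/v1/completions"   # OpenAI-compatible route served by vLLM
MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct"
CONCURRENCY = 64                                     # increase between runs while watching GPU memory
PROMPT = "Summarize the trade-offs of self-hosting large language models."

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(
        VLLM_URL,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 256},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        counts = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{CONCURRENCY} concurrent requests, {total} output tokens in {elapsed:.1f}s "
          f"-> {total / elapsed:.1f} output tok/s")

asyncio.run(main())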

The Hidden Ops Bill

This is the section that the "self-hosting saves money" posts skip.

On-call burden. GPU inference servers crash in ways that are not well-documented. NCCL hangs during tensor parallelism. OOM conditions are not always graceful. Model loading fails after CUDA driver updates. Someone needs to be paged at 2am. Budget 0.25–0.5 FTE per cluster.

Model update cycles. New model versions drop frequently. Evaluating a new model, quantizing it, testing throughput, updating your serving config, and deploying it takes 2–5 engineering days per update. If you want to stay within 1–2 major versions of state-of-the-art, this is quarterly or more frequent work.

Security and compliance. Your self-hosted inference endpoint is a network service that needs authentication, rate limiting, audit logging, and network isolation. If you are in a regulated industry, add compliance overhead on top.
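
As a rough illustration of the minimum viable wrapper, here is a sketch (not a hardened design) of an API-key check and per-key rate limit in front of the inference endpoint, using FastAPI and httpx. The keys, limits, and upstream URL are placeholders, and audit logging and network isolation still need to be handled separately.

import time

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

UPSTREAM = "http://localhost:8000"           # the vLLM server from the previous section
API_KEYS = {"team-a-key", "team-b-key"}      # placeholder; load from a real secret store
RATE_LIMIT_PER_MINUTE = 120

app = FastAPI()
_request_log: dict[str, list[float]] = {}    # naive in-memory request timestamps per key

@app.post("/v1/completions")
async def proxy(request: Request, authorization: str = Header(default="")):
    key = authorization.removeprefix("Bearer ").strip()
    if key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")

    # Sliding one-minute window rate limit per key
    now = time.time()
    recent = [t for t in _request_log.get(key, []) if now - t < 60]
    if len(recent) >= RATE_LIMIT_PER_MINUTE:
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    _request_log[key] = recent + [now]

    body = await request.json()
    async with httpx.AsyncClient() as client:
        upstream = await client.post(f"{UPSTREAM}/v1/completions", json=body, timeout=120)
    return upstream.json()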

Hardware failures. A100s and H100s fail. Provisioning replacements from cloud providers takes hours to days. Build this into your availability SLA calculations.
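
A back-of-the-envelope way to fold this into an SLA; the failure rate and replacement time below are assumptions to replace with your own provider's numbers:

# Rough availability impact of GPU/node failures on a single tensor-parallel replica
failures_per_gpu_per_year = 0.3        # assumed annualized failure rate per GPU
gpu_count = 4
mean_hours_to_replace = 12             # assumed time to provision a replacement and reload the model
expected_downtime_hours = failures_per_gpu_per_year * gpu_count * mean_hours_to_replace
availability = 1 - expected_downtime_hours / (365 * 24)
print(f"~{expected_downtime_hours:.0f}h expected downtime/year -> {availability:.3%} availability")
# ~14h/year of hardware-driven downtime, before any software incidents

Putting all of these line items together: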

def full_tco_monthly(
    gpu_cost_monthly: float,          # pure compute cost
    engineering_fte_fraction: float,  # e.g. 0.4 for 40% of one engineer's time
    avg_engineer_monthly_cost: float, # loaded cost including benefits
    networking_storage_monthly: float,
    monitoring_tools_monthly: float,
    compliance_overhead_monthly: float,
) -> dict:
    ops_cost = engineering_fte_fraction * avg_engineer_monthly_cost
    total = (gpu_cost_monthly + ops_cost + networking_storage_monthly
             + monitoring_tools_monthly + compliance_overhead_monthly)
 
    return {
        "gpu_compute": gpu_cost_monthly,
        "engineering_ops": ops_cost,
        "networking_storage": networking_storage_monthly,
        "monitoring": monitoring_tools_monthly,
        "compliance": compliance_overhead_monthly,
        "total_monthly": total,
        "gpu_as_pct_of_tco": round(gpu_cost_monthly / total * 100, 1),
    }
 
# Realistic example: 4× A100 cluster for 70B model
result = full_tco_monthly(
    gpu_cost_monthly=10_080,          # 4× A100 on-demand at $3.50/hr
    engineering_fte_fraction=0.4,
    avg_engineer_monthly_cost=20_000, # $240k/yr loaded
    networking_storage_monthly=400,
    monitoring_tools_monthly=300,
    compliance_overhead_monthly=500,
)
# total_monthly: ~$19,280, gpu_as_pct_of_tco: 52.3%

The GPU cost is typically 45–60% of actual TCO. Engineers like to forget the rest.

Break-Even Analysis

The break-even against API providers depends on your output token volume and the gap between your self-hosted cost-per-token and the API rate.

def break_even_tokens_per_month(
    self_hosted_monthly_fixed_cost: float,
    self_hosted_cost_per_million_tokens: float,
    api_cost_per_million_tokens: float,
) -> float:
    """
    At break-even: fixed_cost + self_hosted_rate × volume = api_rate × volume
    Solving for volume:
    fixed_cost = (api_rate - self_hosted_rate) × volume
    volume = fixed_cost / (api_rate - self_hosted_rate)
    """
    if api_cost_per_million_tokens <= self_hosted_cost_per_million_tokens:
        return float("inf")  # Never breaks even
 
    savings_per_million = api_cost_per_million_tokens - self_hosted_cost_per_million_tokens
    breakeven_millions = self_hosted_monthly_fixed_cost / savings_per_million
    return breakeven_millions * 1e6
 
# Example: $19k/month TCO, $3/M token self-hosted vs $15/M API
breakeven = break_even_tokens_per_month(19_000, 3.0, 15.0)
print(f"Break-even at {breakeven/1e9:.1f}B tokens/month")
# Break-even at 1.6B tokens/month

1.6 billion output tokens per month is roughly 3.2 million requests at 500 output tokens each. That is a meaningful scale. Below that volume, the API is likely cheaper when you account for full TCO.

When Self-Hosting Is the Right Call

Despite the cost math, self-hosting genuinely wins in specific situations:

  • Data residency requirements: regulated industries where data cannot leave your infrastructure
  • Model customization: you have fine-tuned weights that are not deployable on provider APIs
  • Extreme latency requirements: sub-100ms inference that requires co-located GPU and application
  • Long-running heavy workloads: batch processing jobs that run 24/7 at consistent utilization

At consistent 70%+ GPU utilization 24/7, reserved or spot instances make the economics work. The trap is budgeting for peak capacity that sits idle 16 hours a day.
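
Here is a sketch of how idle hours inflate the effective rate, using the $3/M figure from the break-even example above and hypothetical utilization levels:

# Fixed monthly cost does not shrink when GPUs sit idle: effective $/M output tokens
# scales inversely with utilization.
full_utilization_cost_per_million = 3.0          # from the break-even example above
for utilization in (1.0, 0.7, 0.33):             # hypothetical utilization levels
    effective = full_utilization_cost_per_million / utilization
    print(f"{utilization:.0%} utilization -> ~${effective:.2f} per million output tokens")
# 100% -> $3.00, 70% -> $4.29, 33% -> $9.09; the per-token advantage erodes quickly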

Key Takeaways

  • GPU compute is typically 45–60% of actual self-hosting TCO; engineering ops, compliance, and infrastructure make up the rest — budget for all of it.
  • VRAM is the binding constraint: a 70B model at fp16 needs 140GB for weights alone before any KV cache, requiring 3–4 A100 80GB GPUs minimum.
  • vLLM's paged attention and continuous batching deliver 20–40% higher throughput than TGI at high concurrency — use it for production serving at scale.
  • The break-even volume against cloud APIs is typically 1–3 billion output tokens per month when full TCO is accounted for; most teams are below this threshold.
  • Self-hosting is clearly justified for data residency requirements, custom fine-tuned weights, and workloads with consistent 70%+ GPU utilization.
  • Build an on-call rotation into your operational plan before launch — GPU inference servers fail in undocumented ways and need human response at odd hours.