Quantization for Engineers, Not Researchers
Last quarter I shipped a self-hosted Mistral 7B deployment that ran beautifully in dev, then silently started hallucinating street addresses in production. The root cause: I had swapped to a 4-bit GPTQ checkpoint without re-running our factual recall evals. The model looked fine on general benchmarks. It was broken on the specific distribution that mattered to us.
Quantization is one of those topics where the research literature talks about perplexity on WikiText-2 and engineers ask "will my app break?" This post is the latter.
Why You Quantize in the First Place
A 70B parameter model in FP16 weighs roughly 140 GB. That doesn't fit on a single A100. It doesn't fit on two A100s with the KV cache included. Once you move to INT8 you're at 70 GB. INT4 gets you to 35 GB — suddenly a two-GPU setup becomes viable for inference.
The arithmetic is simple: halving bit-width halves memory. But the accuracy story is not linear, and that's where most teams get burned.
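If you want that arithmetic as code, the whole sizing exercise is a few lines (weights only; KV cache and activation memory come on top):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB; KV cache and activations come on top."""
    return params_billion * bytes_per_param  # (params * 1e9) * bytes / 1e9

for fmt, bpp in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"70B @ {fmt}: {weight_gb(70, bpp):.0f} GB of weights")
# BF16: 140 GB, INT8: 70 GB, INT4: 35 GB
```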
- FP32 → 4 bytes/param (training reference, almost never used for inference)
- BF16 → 2 bytes/param (default training and inference format today)
- FP16 → 2 bytes/param (same size, worse range — avoid for >7B)
- INT8 → 1 byte/param (safe for most tasks with per-channel calibration)
- INT4 → 0.5 bytes/param (fast, risky, requires task-specific validation)
- INT2 → 0.25 bytes/param (research only — too fragile for production)

The Precision Tier Map
```mermaid
flowchart TD
    B{Memory<br>Constraint?}
    B -- No --> C[BF16 / FP16<br>Full Precision]
    B -- Yes --> D{Acceptable<br>Quality Floor?}
    D -- Strict --> E[INT8<br>bitsandbytes / LLM.int8]
    D -- Relaxed --> F{Calibration<br>Data Available?}
    F -- Yes --> G[AWQ INT4<br>Best accuracy/size]
    F -- No --> H[GPTQ INT4<br>Decent default]
    C --> I[Validate on<br>Task Evals]
    E --> I
    G --> I
    H --> I
    I --> J{Regression?}
    J -- Yes --> K[Step Up Precision<br>or Retune]
    J -- No --> L[Ship It]
```
The decision is always driven by your eval suite, not by a benchmark someone else ran on different data.
INT8: The Safe Default
INT8 quantization with per-channel calibration is almost always safe. The accuracy loss on most tasks is under 1%. The two dominant approaches are:
- LLM.int8() (bitsandbytes) — quantizes weights and activations to INT8, with an outlier-detection mechanism that routes the few high-magnitude activation dimensions (and the weights they multiply) through FP16. It's slow on throughput but rock solid on accuracy. Good for low-traffic endpoints where correctness matters more than speed.
- SmoothQuant — migrates quantization difficulty from activations to weights via a per-channel scaling factor. Faster than LLM.int8(), compatible with more serving runtimes (vLLM, TGI).
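The SmoothQuant trick is compact enough to show directly. A minimal sketch of the per-channel scale from the paper (alpha = 0.5 is the paper's default migration strength; this is the idea, not the library's implementation):

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    Dividing activations by s and multiplying weights by s leaves X @ W
    unchanged but flattens activation outliers into the weights.
    """
    return act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
```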
Loading INT8 with bitsandbytes looks like this:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold; 6.0 is the paper default
    llm_int8_has_fp16_weight=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
```

Run this, run your evals, compare against the BF16 baseline. If the numbers are within tolerance, done.
INT4: Where It Gets Interesting
INT4 cuts memory in half again but introduces enough rounding error that task-specific accuracy degradation becomes real. Two methods dominate:
GPTQ (Post-Training Quantization)
GPTQ uses second-order information (the Hessian) to minimize reconstruction error layer by layer. It doesn't need calibration data from your domain — it calibrates against a generic corpus. That's its strength and its weakness.
```python
# Quantize with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,   # smaller group = better quality, bigger checkpoint
    damp_percent=0.1,
    desc_act=True,    # activation order — slower calibration, better quality
)

model = AutoGPTQForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantize_config=quantize_config,
)

# calibration_dataset: list of tokenized sequences, ~512 samples is enough
model.quantize(calibration_dataset)
model.save_quantized("mistral-7b-gptq-4bit", use_safetensors=True)
```

Key lever: group_size. Default 128 is a good tradeoff. Set it to 32 for better accuracy at the cost of a larger checkpoint. Set it to -1 for column-wise quantization — maximum compression, maximum risk.
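Loading the result back for serving is the mirror image; a minimal sketch using AutoGPTQ's from_quantized, assuming a single GPU:

```python
from auto_gptq import AutoGPTQForCausalLM

# load the 4-bit checkpoint produced above onto one GPU
model = AutoGPTQForCausalLM.from_quantized(
    "mistral-7b-gptq-4bit",
    device="cuda:0",
    use_safetensors=True,
)
```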
AWQ (Activation-Aware Weight Quantization)
AWQ generally beats GPTQ on task accuracy at the same bit-width. It identifies the ~1% of weights that matter most (those on high-magnitude activation channels) and protects them from aggressive quantization via per-channel scaling.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-awq-4bit"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # GEMM for throughput, GEMV for single-token latency
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

If you have representative samples from your task domain, pass them as calibration data to AWQ. The improvement over a generic calibration set is measurable on narrow tasks (legal, medical, code-specific).
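In AutoAWQ that goes through the calib_data argument to quantize(), which recent versions accept as a list of raw text strings; domain_samples below is a hypothetical list of your production inputs:

```python
# domain_samples: hypothetical list[str] of representative production inputs
calib = [s for s in domain_samples if len(s.split()) > 20]  # drop trivial strings
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib)
```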
Performance Gains — What's Real
On an A10G (24 GB VRAM), Mistral 7B serving numbers look roughly like this:
| Format | VRAM | Tokens/sec (bs=1) | Tokens/sec (bs=8) |
|---|---|---|---|
| BF16 | 16 GB | 42 | 210 |
| INT8 | 9 GB | 38 | 195 |
| INT4 (GPTQ) | 5.5 GB | 68 | 340 |
| INT4 (AWQ) | 5.5 GB | 74 | 380 |
Two takeaways: INT8 gives you memory savings without throughput gains (sometimes a slight regression due to dequantization overhead). INT4 gives you both memory savings and throughput, because decoding is memory-bandwidth-bound: moving half as many weight bytes per token more than pays for the dequantization work in the fused INT4 kernels.
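The bs=1 column is mostly bandwidth arithmetic: each decoded token streams every weight through the GPU once, so memory bandwidth divided by weight bytes gives a ceiling. A rough sketch (the ~600 GB/s A10G figure is a spec-sheet assumption):

```python
bandwidth_gb_s = 600  # A10G spec-sheet bandwidth (assumption)
for fmt, weight_gb in [("BF16", 14), ("INT8", 7), ("INT4", 3.5)]:
    print(f"{fmt}: <= {bandwidth_gb_s / weight_gb:.0f} tokens/sec at bs=1")
# BF16 measures near its ~43 tok/s ceiling; INT8 lands below its ceiling
# (dequant overhead), and INT4 captures only part of its headroom
```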
Accuracy Regression Detection
The mistake I made with Mistral was running only perplexity and MMLU benchmarks. Here's what your eval harness actually needs:
1. Task-specific few-shot accuracy: Run the same prompts you use in production. If you're a code assistant, run HumanEval. If you're a customer support bot, run your labeled ticket classification suite.
2. Calibration-set leakage test: GPTQ calibration can memorize calibration examples. Verify your eval set has zero overlap with the calibration data.
3. Factual precision regression: Use a small factual QA set (a TriviaQA subset is fine). A jump in "I don't know" responses is the benign failure mode; a drop in accuracy with no change in abstention rate means the model is confidently wrong more often. The harness below automates this check.
```python
from transformers import pipeline

def eval_factual_accuracy(model, tokenizer, qa_pairs, threshold=0.85):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    correct = 0
    results = []
    for item in qa_pairs:
        output = pipe(
            item["prompt"],
            max_new_tokens=50,
            do_sample=False,         # greedy decoding for reproducible evals
            return_full_text=False,  # strip the prompt from the output
        )
        answer = output[0]["generated_text"].strip()
        hit = item["answer"].lower() in answer.lower()
        correct += int(hit)
        results.append({"prompt": item["prompt"], "expected": item["answer"],
                        "got": answer, "correct": hit})
    accuracy = correct / len(qa_pairs)
    print(f"Accuracy: {accuracy:.2%} (threshold: {threshold:.2%})")
    if accuracy < threshold:
        print("WARNING: below threshold — do not deploy this checkpoint")
    return accuracy, results
```

4. Output distribution shift: Compare output length distributions, refusal rates, and confidence calibration between the quantized and full-precision models. A shift in any of these is worth investigating even if task accuracy holds.
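For point 4, even a crude comparison catches a lot. A hypothetical helper that checks only length stats and a naive refusal-marker rate:

```python
import statistics

REFUSAL_MARKERS = ("i can't", "i cannot", "i don't know", "as an ai")

def distribution_report(name, outputs):
    """Print length stats and a naive refusal rate for a list of output strings."""
    lengths = [len(o.split()) for o in outputs]
    refusals = sum(any(m in o.lower() for m in REFUSAL_MARKERS) for o in outputs)
    print(f"{name}: len {statistics.mean(lengths):.1f} "
          f"± {statistics.stdev(lengths):.1f} words, "
          f"refusal rate {refusals / len(outputs):.1%}")

# run side by side: distribution_report("bf16", baseline_outputs) vs. quantized
```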
Deploy Patterns
vLLM (production throughput)
```bash
python -m vllm.entrypoints.openai.api_server \
    --model ./mistral-7b-awq-4bit \
    --quantization awq \
    --dtype half \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1
```
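Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM defaults to port 8000 and registers the model under the --model path):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "./mistral-7b-awq-4bit", "prompt": "2 + 2 =", "max_tokens": 8},
)
print(resp.json()["choices"][0]["text"])
```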
Ollama (developer local)

```bash
# Ollama uses GGUF, which has its own quantization scheme, so convert the
# original FP16 checkpoint (assumed here to live in ./Mistral-7B-Instruct-v0.2),
# not the AWQ one. convert-hf-to-gguf.py can't emit q4_K_M directly; quantize
# in a second step with llama.cpp's llama-quantize (older builds: quantize)
python convert-hf-to-gguf.py ./Mistral-7B-Instruct-v0.2 \
    --outfile mistral-7b-f16.gguf --outtype f16
./llama-quantize mistral-7b-f16.gguf mistral-7b-q4_K_M.gguf q4_K_M
ollama create mistral-custom -f ./Modelfile
```
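The Modelfile itself isn't shown in this post; a minimal hypothetical one just points Ollama at the GGUF:

```
# Modelfile (minimal, hypothetical)
FROM ./mistral-7b-q4_K_M.gguf
PARAMETER temperature 0.7
```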
Choosing Between GPTQ and AWQ at Deploy Time

AWQ is generally the right default. GPTQ has wider pre-quantized checkpoint availability (TheBloke's HuggingFace repos have most models already quantized), which matters if you can't afford the quantization compute.
If you're quantizing yourself: AWQ for quality; GPTQ when your serving runtime doesn't ship AWQ kernels yet, or when you need desc_act=True.
When to Step Back Up the Precision Ladder
Don't treat quantization as a one-way door. Set up automated eval gating in your deployment pipeline:
```yaml
# .github/workflows/model-eval-gate.yml (simplified)
- name: Run eval suite against quantized checkpoint
  run: |
    # --max-regression 0.02 allows up to a 2% accuracy drop vs. the BF16 baseline
    python evals/run_suite.py \
      --model-path ${{ env.QUANTIZED_MODEL_PATH }} \
      --baseline-results evals/baseline_bf16.json \
      --max-regression 0.02 \
      --fail-on-regression
```

If the quantized checkpoint fails, fall back to INT8. If INT8 fails, serve BF16 from a larger instance and revisit the quantization strategy with your calibration dataset.
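If you want that ladder in code rather than prose, a sketch (the tier names, checkpoint mapping, and eval_regression callable are all hypothetical):

```python
PRECISION_LADDER = ["int4-awq", "int8-bnb", "bf16"]  # cheapest first

def pick_deployable(checkpoints: dict, eval_regression, max_regression: float = 0.02):
    """Return the cheapest checkpoint whose regression vs. BF16 is acceptable."""
    for tier in PRECISION_LADDER:
        if eval_regression(checkpoints[tier]) <= max_regression:
            return tier
    raise RuntimeError("no checkpoint passes the eval gate")
```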
Key Takeaways
- INT8 (bitsandbytes / SmoothQuant) is almost always safe — run it before reaching for INT4.
- AWQ INT4 beats GPTQ INT4 on most tasks if you have calibration data; GPTQ is fine when you need a pre-quantized checkpoint.
- Throughput gains from INT4 are real (~1.7x on A10G at batch size 8); INT8 saves memory without improving throughput.
- Perplexity and MMLU are insufficient evals — build task-specific accuracy regression tests for your actual distribution.
- Calibration data quality matters for AWQ; domain-specific samples outperform generic corpora for narrow tasks.
- Quantization is reversible — build precision fallback into your deployment pipeline rather than assuming INT4 will always work.