Quantization for Engineers, Not Researchers
Last quarter I shipped a self-hosted Mistral 7B deployment that ran beautifully in dev, then silently started hallucinating street addresses in production. The root cause: I had swapped to a 4-bit GPTQ checkpoint without re-running our factual recall evals. The model looked fine on general benchmarks. It was broken on the specific distribution that mattered to us.
Quantization is one of those topics where the research literature talks about perplexity on WikiText-2 and engineers ask "will my app break?" This post is the latter.
Why You Quantize in the First Place
A 70B parameter model in FP16 weighs roughly 140 GB. That doesn't fit on a single A100. It doesn't fit on two A100s with the KV cache included. Once you move to INT8 you're at 70 GB. INT4 gets you to 35 GB — suddenly a two-GPU setup becomes viable for inference.
The arithmetic is simple: halving bit-width halves memory. But the accuracy story is not linear, and that's where most teams get burned.
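If you want that arithmetic as code, the whole sizing exercise is a few lines (weights only; KV cache and activation memory come on top):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB; KV cache and activations come on top."""
    return params_billion * bytes_per_param  # (params * 1e9) * bytes / 1e9

for fmt, bpp in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"70B @ {fmt}: {weight_gb(70, bpp):.0f} GB of weights")
# BF16: 140 GB, INT8: 70 GB, INT4: 35 GB
```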
- FP32 → 4 bytes/param (training reference, almost never used for inference)
- BF16 → 2 bytes/param (default training and inference format today)
- FP16 → 2 bytes/param (same size, worse range — avoid for >7B)
- INT8 → 1 byte/param (safe for most tasks with per-channel calibration)
- INT4 → 0.5 bytes/param (fast, risky, requires task-specific validation)
- INT2 → 0.25 bytes/param (research only — too fragile for production)

The Precision Tier Map
```mermaid
flowchart TD
    B{Memory<br>Constraint?}
    B -- No --> C[BF16 / FP16<br>Full Precision]
    B -- Yes --> D{Acceptable<br>Quality Floor?}
    D -- Strict --> E[INT8<br>bitsandbytes / LLM.int8]
    D -- Relaxed --> F{Calibration<br>Data Available?}
    F -- Yes --> G[AWQ INT4<br>Best accuracy/size]
    F -- No --> H[GPTQ INT4<br>Decent default]
    C --> I[Validate on<br>Task Evals]
    E --> I
    G --> I
    H --> I
    I --> J{Regression?}
    J -- Yes --> K[Step Up Precision<br>or Retune]
    J -- No --> L[Ship It]
```
The decision is always driven by your eval suite, not by a benchmark someone else ran on different data.
INT8: The Safe Default
INT8 quantization with per-channel calibration is almost always safe. The accuracy loss on most tasks is under 1%. The two dominant approaches are:
- LLM.int8() (bitsandbytes) — quantizes weights and activations to INT8, with an outlier-detection mechanism that routes the few high-magnitude activation dimensions (and the weights they multiply) through FP16. It's slow on throughput but rock solid on accuracy. Good for low-traffic endpoints where correctness matters more than speed.
- SmoothQuant — migrates quantization difficulty from activations to weights via a per-channel scaling factor. Faster than LLM.int8(), compatible with more serving runtimes (vLLM, TGI).
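The SmoothQuant trick is compact enough to show directly. A minimal sketch of the per-channel scale from the paper (alpha = 0.5 is the paper's default migration strength; this is the idea, not the library's implementation):

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    Dividing activations by s and multiplying weights by s leaves X @ W
    unchanged but flattens activation outliers into the weights.
    """
    return act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
```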
Loading INT8 with bitsandbytes looks like this:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold; 6.0 is the paper default
    llm_int8_has_fp16_weight=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
```

Run this, run your evals, compare against the BF16 baseline. If the numbers are within tolerance, done.
INT4: Where It Gets Interesting
INT4 cuts memory in half again but introduces enough rounding error that task-specific accuracy degradation becomes real. Two methods dominate:
GPTQ (Post-Training Quantization)
GPTQ uses second-order information (the Hessian) to minimize reconstruction error layer by layer. It doesn't need calibration data from your domain — it calibrates against a generic corpus. That's its strength and its weakness.
```python
# Quantize with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,   # smaller group = better quality, bigger checkpoint
    damp_percent=0.1,
    desc_act=True,    # activation order — slower calibration, better quality
)

model = AutoGPTQForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantize_config=quantize_config,
)

# calibration_dataset: list of tokenized sequences, ~512 samples is enough
model.quantize(calibration_dataset)
model.save_quantized("mistral-7b-gptq-4bit", use_safetensors=True)
```

Key lever: group_size. Default 128 is a good tradeoff. Set it to 32 for better accuracy at the cost of a larger checkpoint. Set it to -1 for column-wise quantization — maximum compression, maximum risk.
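Loading the result back for serving is the mirror image; a minimal sketch using AutoGPTQ's from_quantized, assuming a single GPU:

```python
from auto_gptq import AutoGPTQForCausalLM

# load the 4-bit checkpoint produced above onto one GPU
model = AutoGPTQForCausalLM.from_quantized(
    "mistral-7b-gptq-4bit",
    device="cuda:0",
    use_safetensors=True,
)
```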
AWQ (Activation-Aware Weight Quantization)
AWQ generally beats GPTQ on task accuracy at the same bit-width. It identifies the ~1% of weights that matter most (those on high-magnitude activation channels) and protects them from aggressive quantization via per-channel scaling.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-awq-4bit"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # GEMM for throughput, GEMV for single-token latency
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

If you have representative samples from your task domain, pass them as calibration data to AWQ. The improvement over a generic calibration set is measurable on narrow tasks (legal, medical, code-specific).
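In AutoAWQ that goes through the calib_data argument to quantize(), which recent versions accept as a list of raw text strings; domain_samples below is a hypothetical list of your production inputs:

```python
# domain_samples: hypothetical list[str] of representative production inputs
calib = [s for s in domain_samples if len(s.split()) > 20]  # drop trivial strings
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib)
```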
Performance Gains — What's Real
On an A10G (24 GB VRAM), Mistral 7B serving numbers look roughly like this:
| Format | VRAM | Tokens/sec (bs=1) | Tokens/sec (bs=8) |
|---|---|---|---|
| BF16 | 16 GB | 42 | 210 |
| INT8 | 9 GB | 38 | 195 |
| INT4 (GPTQ) | 5.5 GB | 68 | 340 |
| INT4 (AWQ) | 5.5 GB | 74 | 380 |
Two takeaways: INT8 gives you memory savings without throughput gains (sometimes a slight regression due to dequantization overhead). INT4 gives you both memory savings and throughput, because decoding is memory-bandwidth-bound: moving half as many weight bytes per token more than pays for the dequantization work in the fused INT4 kernels.
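The bs=1 column is mostly bandwidth arithmetic: each decoded token streams every weight through the GPU once, so memory bandwidth divided by weight bytes gives a ceiling. A rough sketch (the ~600 GB/s A10G figure is a spec-sheet assumption):

```python
bandwidth_gb_s = 600  # A10G spec-sheet bandwidth (assumption)
for fmt, weight_gb in [("BF16", 14), ("INT8", 7), ("INT4", 3.5)]:
    print(f"{fmt}: <= {bandwidth_gb_s / weight_gb:.0f} tokens/sec at bs=1")
# BF16 measures near its ~43 tok/s ceiling; INT8 lands below its ceiling
# (dequant overhead), and INT4 captures only part of its headroom
```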
Accuracy Regression Detection
The mistake I made with Mistral was running only perplexity and MMLU benchmarks. Here's what your eval harness actually needs:
1. Task-specific few-shot accuracy: Run the same prompts you use in production. If you're a code assistant, run HumanEval. If you're a customer support bot, run your labeled ticket classification suite.
2. Calibration-set leakage test: GPTQ calibration can memorize calibration examples. Verify your eval set has zero overlap with the calibration data.
3. Factual precision regression: Use a small factual QA set (a TriviaQA subset is fine). A jump in "I don't know" responses is the benign failure mode; a drop in accuracy with no change in abstention rate means the model is confidently wrong more often. The harness below automates this check.
```python
from transformers import pipeline

def eval_factual_accuracy(model, tokenizer, qa_pairs, threshold=0.85):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    correct = 0
    results = []
    for item in qa_pairs:
        output = pipe(
            item["prompt"],
            max_new_tokens=50,
            do_sample=False,         # greedy decoding for reproducible evals
            return_full_text=False,  # strip the prompt from the output
        )
        answer = output[0]["generated_text"].strip()
        hit = item["answer"].lower() in answer.lower()
        correct += int(hit)
        results.append({"prompt": item["prompt"], "expected": item["answer"],
                        "got": answer, "correct": hit})
    accuracy = correct / len(qa_pairs)
    print(f"Accuracy: {accuracy:.2%} (threshold: {threshold:.2%})")
    if accuracy < threshold:
        print("WARNING: below threshold — do not deploy this checkpoint")
    return accuracy, results
```

4. Output distribution shift: Compare output length distributions, refusal rates, and confidence calibration between the quantized and full-precision models. A shift in any of these is worth investigating even if task accuracy holds.
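For point 4, even a crude comparison catches a lot. A hypothetical helper that checks only length stats and a naive refusal-marker rate:

```python
import statistics

REFUSAL_MARKERS = ("i can't", "i cannot", "i don't know", "as an ai")

def distribution_report(name, outputs):
    """Print length stats and a naive refusal rate for a list of output strings."""
    lengths = [len(o.split()) for o in outputs]
    refusals = sum(any(m in o.lower() for m in REFUSAL_MARKERS) for o in outputs)
    print(f"{name}: len {statistics.mean(lengths):.1f} "
          f"± {statistics.stdev(lengths):.1f} words, "
          f"refusal rate {refusals / len(outputs):.1%}")

# run side by side: distribution_report("bf16", baseline_outputs) vs. quantized
```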
Deploy Patterns
vLLM (production throughput)
```bash
python -m vllm.entrypoints.openai.api_server \
    --model ./mistral-7b-awq-4bit \
    --quantization awq \
    --dtype half \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1
```
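Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM defaults to port 8000 and registers the model under the --model path):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "./mistral-7b-awq-4bit", "prompt": "2 + 2 =", "max_tokens": 8},
)
print(resp.json()["choices"][0]["text"])
```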
Ollama (developer local)

```bash
# Ollama uses GGUF, which has its own quantization scheme, so convert the
# original FP16 checkpoint (assumed here to live in ./Mistral-7B-Instruct-v0.2),
# not the AWQ one. convert-hf-to-gguf.py can't emit q4_K_M directly; quantize
# in a second step with llama.cpp's llama-quantize (older builds: quantize)
python convert-hf-to-gguf.py ./Mistral-7B-Instruct-v0.2 \
    --outfile mistral-7b-f16.gguf --outtype f16
./llama-quantize mistral-7b-f16.gguf mistral-7b-q4_K_M.gguf q4_K_M
ollama create mistral-custom -f ./Modelfile
```
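The Modelfile itself isn't shown in this post; a minimal hypothetical one just points Ollama at the GGUF:

```
# Modelfile (minimal, hypothetical)
FROM ./mistral-7b-q4_K_M.gguf
PARAMETER temperature 0.7
```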
Choosing Between GPTQ and AWQ at Deploy Time

AWQ is generally the right default. GPTQ has wider pre-quantized checkpoint availability (TheBloke's HuggingFace repos have most models already quantized), which matters if you can't afford the quantization compute.
If you're quantizing yourself: AWQ for quality; GPTQ when your serving runtime doesn't ship AWQ kernels yet, or when you need desc_act=True.
When to Step Back Up the Precision Ladder
Don't treat quantization as a one-way door. Set up automated eval gating in your deployment pipeline:
```yaml
# .github/workflows/model-eval-gate.yml (simplified)
- name: Run eval suite against quantized checkpoint
  run: |
    # --max-regression 0.02 allows up to a 2% accuracy drop vs. the BF16 baseline
    python evals/run_suite.py \
      --model-path ${{ env.QUANTIZED_MODEL_PATH }} \
      --baseline-results evals/baseline_bf16.json \
      --max-regression 0.02 \
      --fail-on-regression
```

If the quantized checkpoint fails, fall back to INT8. If INT8 fails, serve BF16 from a larger instance and revisit the quantization strategy with your calibration dataset.
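If you want that ladder in code rather than prose, a sketch (the tier names, checkpoint mapping, and eval_regression callable are all hypothetical):

```python
PRECISION_LADDER = ["int4-awq", "int8-bnb", "bf16"]  # cheapest first

def pick_deployable(checkpoints: dict, eval_regression, max_regression: float = 0.02):
    """Return the cheapest checkpoint whose regression vs. BF16 is acceptable."""
    for tier in PRECISION_LADDER:
        if eval_regression(checkpoints[tier]) <= max_regression:
            return tier
    raise RuntimeError("no checkpoint passes the eval gate")
```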
Key Takeaways
- INT8 (bitsandbytes / SmoothQuant) is almost always safe — run it before reaching for INT4.
- AWQ INT4 beats GPTQ INT4 on most tasks if you have calibration data; GPTQ is fine when you need a pre-quantized checkpoint.
- Throughput gains from INT4 are real (~1.7x on A10G at batch size 8); INT8 saves memory without improving throughput.
- Perplexity and MMLU are insufficient evals — build task-specific accuracy regression tests for your actual distribution.
- Calibration data quality matters for AWQ; domain-specific samples outperform generic corpora for narrow tasks.
- Quantization is reversible — build precision fallback into your deployment pipeline rather than assuming INT4 will always work.