Choosing an Embedding Model in 2026
The Default Choice Is Probably Wrong
Most teams reach for OpenAI's text-embedding-ada-002 or its successors because the RAG tutorial they followed used it. Then they wonder why retrieval quality is mediocre on their domain-specific corpus, or why multilingual queries return English results, or why their vector index costs $800/month when they only have 2 million documents.
Embedding model selection has real engineering tradeoffs. Getting them wrong costs money, accuracy, or both. This post cuts through the marketing and gives you a decision framework grounded in what matters: dimensionality, language coverage, fine-tuning ROI, and the caveats behind MTEB scores.
The Decision Tree
Before picking a model, answer four questions in order:

1. Do you have data residency or compliance requirements that rule out a hosted API? If yes, you are choosing among open models only.
2. How many languages are in your corpus, and do queries cross languages? Serious multilingual needs eliminate most English-first models.
3. Is your domain vocabulary specialized (medical, legal, financial, code)? If yes, weight fine-tunable open models heavily.
4. What is your scale? Under ~10M documents, per-token API pricing is tolerable; at 50M+, self-hosting economics dominate.

Work through this tree before benchmarking anything. It narrows the candidate pool from dozens to two or three models worth testing.
Open vs. Closed Models: The Real Tradeoffs
The closed/open distinction matters less than people think. What matters is: data residency requirements, fine-tuning capability, and latency SLAs.
Closed models (OpenAI, Cohere, Voyage):
- No self-hosting burden
- No fine-tuning on your data (with some exceptions via Cohere's fine-tune API)
- Per-token pricing adds up fast at scale
- You cannot version-pin; the provider can change model behavior silently
Open models (BAAI/bge, E5, Nomic, GTE):
- Self-host on GPU or use batch inference APIs
- Fine-tune on your domain data
- Fixed behavior — you control the version
- Operational overhead: serving, monitoring, upgrades
For most teams processing under 10M documents with no special compliance requirements, a closed model is fine. At 50M+ documents, the per-token cost of a closed model typically exceeds the cost of running a medium-sized GPU instance within 3–4 months.
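A back-of-envelope check of that claim: express the API bill for one embedding pass over the corpus in months of GPU rent. The token count, API rate, and GPU price below are assumptions for illustration; substitute your own figures, and remember that re-embeds, document churn, and query-time embedding all add to the API side.

```python
def months_of_gpu_covered(total_mtok: float, price_per_mtok: float, gpu_monthly: float) -> float:
    """Months of GPU rent that the equivalent API embedding bill would pay for."""
    return total_mtok * price_per_mtok / gpu_monthly

# Assumed figures: 50M docs x 500 tokens = 25,000 Mtok to embed,
# $0.13 per 1M tokens API rate, $1,000/month for a medium GPU instance.
months = months_of_gpu_covered(total_mtok=25_000, price_per_mtok=0.13, gpu_monthly=1_000)
# A single embedding pass already costs roughly a quarter's worth of GPU rent,
# before any ongoing re-embedding or query traffic is counted.
```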
Dimensionality: More Is Not Always Better
The instinct is to pick the highest-dimensional model available. That instinct is frequently wrong.
Higher dimensionality means:
- Larger index size (linear in dimension)
- Slower ANN search at the same recall target
- Usually a larger model behind it → slower inference → higher embedding cost
The practical cutoffs for most tasks:
| Use case | Recommended dimension |
|---|---|
| Short passage retrieval (< 512 tokens) | 256–512 |
| Long document retrieval | 768–1024 |
| Cross-modal or cross-lingual | 1024+ |
| Real-time similarity at query time | 128–256 |
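The index-size cost is easy to quantify: raw float32 vector storage grows linearly with dimension, and ANN graph overhead comes on top of it. For the 2-million-document corpus mentioned earlier:

```python
def raw_index_gib(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage in GiB, excluding ANN graph and metadata overhead."""
    return num_vectors * dim * bytes_per_float / 2**30

# 2M documents: a 1536d index holds 6x the raw vector data of a 256d one
large = raw_index_gib(2_000_000, 1536)  # ~11.4 GiB
small = raw_index_gib(2_000_000, 256)   # ~1.9 GiB
```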
Matryoshka Representation Learning (MRL) models solve this elegantly: a single model produces embeddings where any prefix of dimensions is a valid, independently useful embedding. You can truncate to 256d for fast retrieval and re-rank with 1536d for precision.
```python
from sentence_transformers import SentenceTransformer

# MRL-capable model: 768d full embeddings whose prefixes are themselves valid
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Nomic embedding models expect a task prefix on the input text
query = "search_query: What is the refund policy?"

# Fast retrieval embedding: first 256 dimensions only (sentence-transformers >= 2.7)
with model.truncate_sentence_embeddings(truncate_dim=256):
    embedding_256 = model.encode(query, normalize_embeddings=True)

# Precision rerank embedding: full 768 dimensions
embedding_768 = model.encode(query, normalize_embeddings=True)
```

If your vector DB supports matryoshka-style two-stage retrieval, use it. You cut index storage 3–4x (depending on the full dimension) and ANN search cost significantly, with minimal recall loss on most tasks.
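The two-stage pattern itself is simple enough to sketch in plain NumPy. This is a brute-force stand-in for illustration (a real deployment runs stage 1 against an ANN index of truncated vectors), and it assumes all embeddings come from an MRL model so that dimension prefixes are meaningful; truncated prefixes are re-normalized before cosine scoring.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_search(query, passages, d_fast=256, k_fast=100, k_final=10):
    """Stage 1: cheap scoring on the first d_fast dims to get k_fast candidates.
    Stage 2: rerank only those candidates with full-dimension cosine similarity."""
    fast_scores = normalize(passages[:, :d_fast]) @ normalize(query[:d_fast])
    candidates = np.argsort(-fast_scores)[:k_fast]
    full_scores = normalize(passages[candidates]) @ normalize(query)
    return candidates[np.argsort(-full_scores)[:k_final]]
```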
Multilingual: Do Not Trust "Supports 100 Languages"
Every multilingual embedding model claims broad language support. That claim hides enormous variance. "Supports" often means "was trained on some text in that language" — not "performs comparably to English retrieval in that language."
Check three things for each language in your corpus:
1. Token coverage. Run your corpus through the model's tokenizer and measure the average tokens per word. High ratios (> 3) indicate the language is tokenized into sub-word fragments, which degrades semantic coherence.

2. Cross-lingual recall. Embed a bilingual golden set: 50 query–passage pairs where the query is in language A and the passage is in language B. Compute recall@5. Below 60% means the model is not production-ready for cross-lingual retrieval.

3. Script handling. CJK (Chinese, Japanese, Korean), Arabic, and Devanagari scripts require specific tokenization. Models trained predominantly on Latin-script corpora often underperform here regardless of what the model card claims.
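The token-coverage and recall checks reduce to a few lines. The tokenizer is passed in as a callable (for a real model, something like `AutoTokenizer.from_pretrained(...).tokenize` from `transformers` — an assumption, not shown here); embeddings are assumed L2-normalized, with row i of the query matrix paired with row i of the passage matrix.

```python
import numpy as np

def tokens_per_word(texts, tokenize):
    """Average subword tokens per whitespace-separated word; > 3 is a red flag."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

def recall_at_k(query_emb, passage_emb, k=5):
    """Fraction of queries whose paired passage (same row index) lands in the top k."""
    sims = query_emb @ passage_emb.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(i in top_k[i] for i in range(len(query_emb)))
    return hits / len(query_emb)
```

Run `recall_at_k` on the bilingual golden set; below 0.6, keep looking.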
For serious multilingual needs (5+ languages, cross-lingual queries), intfloat/multilingual-e5-large and Alibaba-NLP/gte-multilingual-base consistently outperform text-embedding-3-large on non-English retrieval despite lower MTEB aggregate scores.
Domain Fine-Tuning: When the ROI Is Positive
Fine-tuning an embedding model on domain-specific data is high-leverage when:
- Your corpus contains specialized terminology not present in general web crawls (medical, legal, financial, code)
- Retrieval recall on your golden set is below 70% with the best off-the-shelf model
- You have at least 1,000 labeled query–positive passage pairs
The fine-tuning setup is straightforward with sentence-transformers:
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Load a strong base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Your labeled pairs: anchor (query), positive passage, optional hard negative
train_dataset = Dataset.from_list([
    {
        "anchor": "What is the max loan-to-value for a jumbo mortgage?",
        "positive": "Jumbo mortgages typically allow LTV ratios up to 80%...",
        "negative": "Conforming loan limits are set annually by the FHFA...",
    },
    # ... 1000+ examples
])

# In-batch negatives; also uses the explicit "negative" column when present
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./fine-tuned-mortgage-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

Expect a 5–15 percentage point recall improvement on domain queries. If you are not seeing that, your negative examples are too easy — use hard negatives mined from top-k BM25 results.
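Hard-negative mining is scorer-agnostic: take each query's top-ranked passages under some retrieval scorer (BM25 scores, as suggested above, or dense similarities) and keep the best-scoring ones that are not the labeled positive. A minimal sketch over a precomputed score matrix:

```python
import numpy as np

def mine_hard_negatives(scores, positive_idx, k=3):
    """scores[i, j]: retrieval score of passage j for query i (e.g. from BM25).
    Returns, per query, the k top-scoring passage indices that are not its positive."""
    hard_negatives = []
    for i, pos in enumerate(positive_idx):
        ranked = np.argsort(-scores[i])
        hard_negatives.append([int(j) for j in ranked if j != pos][:k])
    return hard_negatives
```

The mined indices feed the `negative` column of the labeled training pairs.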
When the ROI is negative: if your domain vocabulary is covered by general models and your labeled data is sparse (< 500 pairs), fine-tuning often hurts generalization more than it helps precision. Stick with the best off-the-shelf model and invest in retrieval pipeline improvements instead.
Reading MTEB Scores Honestly
MTEB (Massive Text Embedding Benchmark) is the best public reference we have. It is also frequently misread.
What MTEB measures: performance across a diverse set of retrieval, clustering, classification, and semantic textual similarity tasks, averaged into a single score.
What MTEB does not measure:
- Performance on your domain
- Latency at your serving scale
- Behavior on your input length distribution
- Cross-lingual quality on low-resource languages
A model ranked #5 on MTEB may outperform #1 on your specific task. Use MTEB scores as a first filter, then benchmark the top 3–4 candidates on a sample of your actual corpus.
The most common MTEB misuse: comparing models with different context windows. A model with a 512-token context will truncate your 2,000-token passages. Its retrieval score is not comparable to a model with an 8k context on the same task.
Quick Reference: Models Worth Evaluating in 2026
| Model | Dim | Context | Strengths | Use when |
|---|---|---|---|---|
| `text-embedding-3-small` | 1536 (MRL) | 8191 | Ease of use, no infra | Prototyping, < 5M docs |
| `text-embedding-3-large` | 3072 (MRL) | 8191 | Best closed-model quality | Quality-first, budget flexible |
| `BAAI/bge-large-en-v1.5` | 1024 | 512 | Strong English retrieval | English-only, self-hosted |
| `intfloat/e5-large-v2` | 1024 | 512 | Instruction-following | Asymmetric retrieval tasks |
| `nomic-ai/nomic-embed-text-v1.5` | 768 (MRL) | 8192 | Long context + MRL | Long docs, cost-sensitive |
| `multilingual-e5-large` | 1024 | 512 | Multilingual | 3–10 language corpora |
| `Alibaba-NLP/gte-multilingual-base` | 768 | 8192 | Cross-lingual + long ctx | Cross-lingual retrieval |
Key Takeaways
- Default model selection (ada-002 or equivalent) is almost never the optimal choice — use the decision tree before benchmarking.
- Dimensionality is a cost lever: Matryoshka models let you use 256d for ANN retrieval and 1024d for reranking, cutting storage 4x with minimal recall loss.
- Multilingual support claims are marketing; always verify token coverage and cross-lingual recall on your specific languages.
- Domain fine-tuning has positive ROI when you have 1,000+ labeled pairs and a specialized vocabulary — otherwise it often hurts generalization.
- MTEB scores are a shortlist filter, not a decision — always benchmark on a sample of your actual corpus before committing.
- At 50M+ vectors, the economics strongly favor self-hosted open models over per-token closed APIs.