Choosing an Embedding Model in 2026

Ravinder · 7 min read
AI · Embeddings · RAG · Vector Search

The Default Choice Is Probably Wrong

Most teams reach for OpenAI's text-embedding-ada-002 or its successors because the RAG tutorial they followed used it. Then they wonder why retrieval quality is mediocre on their domain-specific corpus, or why multilingual queries return English results, or why their vector index costs $800/month when they only have 2 million documents.

Embedding model selection has real engineering tradeoffs. Getting them wrong costs money, accuracy, or both. This post cuts through the marketing and gives you a decision framework grounded in what matters: dimensionality, language coverage, fine-tuning ROI, and the caveats behind MTEB scores.


The Decision Tree

Before picking a model, answer four questions in order:

flowchart TD
    A[Start] --> B{Multilingual corpus?}
    B -- Yes --> C{> 5 languages?}
    C -- Yes --> D[multilingual-e5-large\nor mGTE-base]
    C -- No --> E{Budget-sensitive?}
    B -- No --> F{Domain-specific vocabulary?}
    F -- Yes --> G{Labeled pairs available?}
    G -- Yes --> H[Fine-tune BAAI/bge-base\nor e5-base]
    G -- No --> I[BAAI/bge-large-en\nor text-embedding-3-small]
    F -- No --> J{Scale > 50M vectors?}
    J -- Yes --> K[Matryoshka model\ntruncate to 256–512d]
    J -- No --> L[text-embedding-3-large\nor e5-large-v2]
    E -- Yes --> M[nomic-embed-text\nor all-MiniLM-L6-v2]
    E -- No --> D

Work through this tree before benchmarking anything. It narrows the candidate pool from dozens to two or three models worth testing.


Open vs. Closed Models: The Real Tradeoffs

The closed/open distinction matters less than people think. What matters is: data residency requirements, fine-tuning capability, and latency SLAs.

Closed models (OpenAI, Cohere, Voyage):

  • No self-hosting burden
  • No fine-tuning on your data (with some exceptions via Cohere's fine-tune API)
  • Per-token pricing adds up fast at scale
  • You cannot version-pin; the provider can change model behavior silently

Open models (BAAI/bge, E5, Nomic, GTE):

  • Self-host on GPU or use batch inference APIs
  • Fine-tune on your domain data
  • Fixed behavior — you control the version
  • Operational overhead: serving, monitoring, upgrades

For most teams processing under 10M documents with no special compliance requirements, a closed model is fine. At 50M+ documents, the per-token cost of a closed model typically exceeds the cost of running a medium-sized GPU instance within 3–4 months.
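A back-of-envelope script makes that crossover concrete. Every constant below is an illustrative assumption, not a quote; substitute your provider's actual per-token price, your GPU rate, and your measured throughput:

# Back-of-envelope embedding cost comparison. All constants are ASSUMED
# placeholder values for illustration -- plug in real numbers.
DOCS = 50_000_000
TOKENS_PER_DOC = 500            # assumed average document length
PRICE_PER_M_TOKENS = 0.13       # assumed closed-model price, $ per 1M tokens
GPU_HOURLY = 1.50               # assumed on-demand rate for a mid-size GPU
DOCS_PER_GPU_HOUR = 400_000     # assumed batch throughput, open model

api_cost = DOCS * TOKENS_PER_DOC / 1e6 * PRICE_PER_M_TOKENS
gpu_cost = DOCS / DOCS_PER_GPU_HOUR * GPU_HOURLY

print(f"Closed API, one-time embed:  ${api_cost:,.0f}")   # ~$3,250
print(f"Self-hosted, one-time embed: ${gpu_cost:,.0f}")   # ~$188

And the one-time embed is the smaller half of the story: re-embeds on model upgrades, document churn, and query-time embedding all recur, which is why the gap widens with scale.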


Dimensionality: More Is Not Always Better

The instinct is to pick the highest-dimensional model available. That instinct is frequently wrong.

Higher dimensionality means:

  • Larger index size (linear in dimension)
  • Slower ANN search at the same recall target
  • More parameters → slower inference → higher embedding cost
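The index-size point is worth quantifying: raw float32 storage is n_vectors × dim × 4 bytes, before any ANN index overhead.

def index_size_gb(n_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    # Raw vector storage only; graph-based ANN indexes (e.g. HNSW)
    # add further overhead on top of this.
    return n_vectors * dim * bytes_per_float / 1e9

print(index_size_gb(10_000_000, 1536))  # ~61 GB
print(index_size_gb(10_000_000, 256))   # ~10 GB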

The practical cutoffs for most tasks:

Use case                                  Recommended dimension
Short passage retrieval (< 512 tokens)    256–512
Long document retrieval                   768–1024
Cross-modal or cross-lingual              1024+
Real-time similarity at query time        128–256

Matryoshka Representation Learning (MRL) models solve this elegantly: a single model produces embeddings where any prefix of dimensions is a valid, independently useful embedding. You can truncate to 256d for fast retrieval and re-rank with 1536d for precision.

from sentence_transformers import SentenceTransformer

# MRL-capable model: truncate to 256d for ANN, 1024d for rerank
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

query = "What is the refund policy?"

# Fast retrieval embedding. Matryoshka truncation is applied via the
# truncate_sentence_embeddings context manager (encode itself does not
# take a truncate_dim argument).
with model.truncate_sentence_embeddings(truncate_dim=256):
    embedding_256 = model.encode(query, prompt_name="search_query")

# Precision rerank embedding
with model.truncate_sentence_embeddings(truncate_dim=1024):
    embedding_1024 = model.encode(query, prompt_name="search_query")

If your vector DB supports matryoshka-style two-stage retrieval, use it. You cut index storage by 4x and ANN search cost significantly, with minimal recall loss on most tasks.
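If it doesn't, the pattern is simple to hand-roll. A minimal numpy sketch, assuming docs_256 and docs_1024 are precomputed matryoshka embeddings of the same corpus at both dimensions (re-normalized after truncation), with a brute-force scan standing in for the real ANN index:

import numpy as np

def two_stage_search(q_256, q_1024, docs_256, docs_1024,
                     k_coarse=100, k_final=10):
    # Stage 1: cheap 256d dot-product scan (stand-in for an ANN index).
    coarse = np.argsort(-(docs_256 @ q_256))[:k_coarse]
    # Stage 2: exact 1024d rerank over the shortlist only.
    reranked = np.argsort(-(docs_1024[coarse] @ q_1024))[:k_final]
    return coarse[reranked]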


Multilingual: Do Not Trust "Supports 100 Languages"

Every multilingual embedding model claims broad language support. That claim hides enormous variance. "Supports" often means "was trained on some text in that language" — not "performs comparably to English retrieval in that language."

Check three things for each language in your corpus:

  1. Token coverage. Run your corpus through the model's tokenizer and measure the average tokens per word. High ratios (> 3) indicate the language is tokenized into sub-word fragments, which degrades semantic coherence. (A sketch covering this check and the next follows the list.)

  2. Cross-lingual recall. Embed a bilingual golden set: 50 query–passage pairs where the query is in language A and the passage is in language B. Compute recall@5. Below 60% means the model is not production-ready for cross-lingual retrieval.

  3. Script handling. CJK (Chinese, Japanese, Korean), Arabic, and Devanagari scripts require specific tokenization. Models trained predominantly on Latin-script corpora often underperform here regardless of what the model card claims.
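Checks 1 and 2 take a few lines each. A sketch using multilingual-e5-large as the candidate; the per-language corpus samples and the bilingual golden set are assumed to be yours:

import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

CANDIDATE = "intfloat/multilingual-e5-large"

# Check 1: token coverage. Whitespace word counts are meaningless for
# CJK -- use a language-appropriate segmenter for those scripts.
tokenizer = AutoTokenizer.from_pretrained(CANDIDATE)

def tokens_per_word(texts: list[str]) -> float:
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

# Check 2: cross-lingual recall@5 on a bilingual golden set, where
# queries[i] (language A) should retrieve passages[i] (language B).
model = SentenceTransformer(CANDIDATE)

def recall_at_5(queries: list[str], passages: list[str]) -> float:
    # E5 models expect "query: " / "passage: " prefixes.
    q = model.encode([f"query: {t}" for t in queries], normalize_embeddings=True)
    p = model.encode([f"passage: {t}" for t in passages], normalize_embeddings=True)
    top5 = np.argsort(-(q @ p.T), axis=1)[:, :5]
    return sum(i in row for i, row in enumerate(top5)) / len(queries)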

For serious multilingual needs (5+ languages, cross-lingual queries), intfloat/multilingual-e5-large and Alibaba-NLP/gte-multilingual-base consistently outperform text-embedding-3-large on non-English retrieval despite lower MTEB aggregate scores.


Domain Fine-Tuning: When the ROI Is Positive

Fine-tuning an embedding model on domain-specific data is high-leverage when:

  • Your corpus contains specialized terminology not present in general web crawls (medical, legal, financial, code)
  • Retrieval recall on your golden set is below 70% with the best off-the-shelf model
  • You have at least 1,000 labeled query–positive passage pairs

The fine-tuning setup is straightforward with sentence-transformers:

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from datasets import Dataset
 
# Load a strong base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
 
# Your labeled pairs: {query, positive, negative (optional)}
train_dataset = Dataset.from_list([
    {
        "anchor": "What is the max loan-to-value for a jumbo mortgage?",
        "positive": "Jumbo mortgages typically allow LTV ratios up to 80%...",
        "negative": "Conforming loan limits are set annually by the FHFA...",
    },
    # ... 1000+ examples
])
 
loss = losses.MultipleNegativesRankingLoss(model)
 
args = SentenceTransformerTrainingArguments(
    output_dir="./fine-tuned-mortgage-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)
 
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

Expect 5–15 percentage point recall improvement on domain queries. If you are not seeing that, your negative examples are too easy — use hard negatives mined from top-k BM25 results.
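A minimal mining sketch with the rank_bm25 package: for each query, keep top-ranked BM25 hits that are not the labeled positive. They are lexically close but semantically wrong, which is exactly what makes them hard:

from rank_bm25 import BM25Okapi

# `corpus` is your passage pool; positives come from the labeled pairs.
corpus = [
    "Jumbo mortgages typically allow LTV ratios up to 80%...",
    "Conforming loan limits are set annually by the FHFA...",
    # ... rest of the passage pool
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negatives(query: str, positive_idx: int, k: int = 10, n: int = 3):
    # Rank the whole pool by BM25 score for this query.
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    # Top hits that are NOT the positive become hard negatives.
    return [corpus[i] for i in ranked[:k] if i != positive_idx][:n]

Recent sentence-transformers releases also ship a mine_hard_negatives helper in sentence_transformers.util if you prefer mining with the embedding model itself rather than BM25.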

When the ROI is negative: if your domain vocabulary is covered by general models and your labeled data is sparse (< 500 pairs), fine-tuning often hurts generalization more than it helps precision. Stick with the best off-the-shelf model and invest in retrieval pipeline improvements instead.


Reading MTEB Scores Honestly

MTEB (Massive Text Embedding Benchmark) is the best public reference we have. It is also frequently misread.

What MTEB measures: performance across a diverse set of retrieval, clustering, classification, and semantic textual similarity tasks, averaged into a single score.

What MTEB does not measure:

  • Performance on your domain
  • Latency at your serving scale
  • Behavior on your input length distribution
  • Cross-lingual quality on low-resource languages

A model ranked #5 on MTEB may outperform #1 on your specific task. Use MTEB as a first-pass filter, then benchmark the top 3–4 candidates on a sample of your actual corpus.

The most common MTEB misuse: comparing models with different context windows. A model with a 512-token context will truncate your 2,000-token passages. Its retrieval score is not comparable to a model with an 8k context on the same task.
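A quick way to catch this before benchmarking: count how many of your passages overflow each candidate's context window. A sketch, assuming passages is a representative sample of your corpus:

from sentence_transformers import SentenceTransformer

# `passages` is assumed to be a sample drawn from your own corpus.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
max_len = model.max_seq_length  # 512 for this model

n_over = sum(len(model.tokenizer.tokenize(p)) > max_len for p in passages)
print(f"{n_over}/{len(passages)} passages exceed the context window")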


Quick Reference: Models Worth Evaluating in 2026

Model                               Dim          Context   Strengths                   Use when
text-embedding-3-small              1536 (MRL)   8191      Ease of use, no infra       Prototyping, < 5M docs
text-embedding-3-large              3072 (MRL)   8191      Best closed-model quality   Quality-first, budget flexible
BAAI/bge-large-en-v1.5              1024         512       Strong English retrieval    English-only, self-hosted
intfloat/e5-large-v2                1024         512       Instruction-following       Asymmetric retrieval tasks
nomic-ai/nomic-embed-text-v1.5      768 (MRL)    8192      Long context + MRL          Long docs, cost-sensitive
multilingual-e5-large               1024         512       Multilingual                3–10 language corpora
Alibaba-NLP/gte-multilingual-base   768          8192      Cross-lingual + long ctx    Cross-lingual retrieval

Key Takeaways

  • Default model selection (ada-002 or equivalent) is almost never the optimal choice — use the decision tree before benchmarking.
  • Dimensionality is a cost lever: Matryoshka models let you use 256d for ANN retrieval and 1024d for reranking, cutting storage 4x with minimal recall loss.
  • Multilingual support claims are marketing; always verify token coverage and cross-lingual recall on your specific languages.
  • Domain fine-tuning has positive ROI when you have 1,000+ labeled pairs and a specialized vocabulary — otherwise it often hurts generalization.
  • MTEB scores are a shortlist filter, not a decision — always benchmark on a sample of your actual corpus before committing.
  • At 50M+ vectors, the economics strongly favor self-hosted open models over per-token closed APIs.