Choosing an Embedding Model in 2026

Ravinder · 7 min read
AI · Embeddings · RAG · Vector Search

The Default Choice Is Probably Wrong

Most teams reach for OpenAI's text-embedding-ada-002 or its successors because the RAG tutorial they followed used it. Then they wonder why retrieval quality is mediocre on their domain-specific corpus, or why multilingual queries return English results, or why their vector index costs $800/month when they only have 2 million documents.

Embedding model selection has real engineering tradeoffs. Getting them wrong costs money, accuracy, or both. This post cuts through the marketing and gives you a decision framework grounded in what matters: dimensionality, language coverage, fine-tuning ROI, and the caveats behind MTEB scores.


The Decision Tree

Before picking a model, answer four questions in order:

flowchart TD
    A[Start] --> B{Multilingual corpus?}
    B -- Yes --> C{> 5 languages?}
    C -- Yes --> D[multilingual-e5-large\nor mGTE-base]
    C -- No --> E{Budget-sensitive?}
    B -- No --> F{Domain-specific vocabulary?}
    F -- Yes --> G{Labeled pairs available?}
    G -- Yes --> H[Fine-tune BAAI/bge-base\nor e5-base]
    G -- No --> I[BAAI/bge-large-en\nor text-embedding-3-small]
    F -- No --> J{Scale > 50M vectors?}
    J -- Yes --> K[Matryoshka model\ntruncate to 256–512d]
    J -- No --> L[text-embedding-3-large\nor e5-large-v2]
    E -- Yes --> M[nomic-embed-text\nor all-MiniLM-L6-v2]
    E -- No --> D

Work through this tree before benchmarking anything. It narrows the candidate pool from dozens to two or three models worth testing.


Open vs. Closed Models: The Real Tradeoffs

The closed/open distinction matters less than people think. What matters is: data residency requirements, fine-tuning capability, and latency SLAs.

Closed models (OpenAI, Cohere, Voyage):

  • No self-hosting burden
  • No fine-tuning on your data (with some exceptions via Cohere's fine-tune API)
  • Per-token pricing adds up fast at scale
  • You cannot version-pin; the provider can change model behavior silently

Open models (BAAI/bge, E5, Nomic, GTE):

  • Self-host on GPU or use batch inference APIs
  • Fine-tune on your domain data
  • Fixed behavior — you control the version
  • Operational overhead: serving, monitoring, upgrades

For most teams processing under 10M documents with no special compliance requirements, a closed model is fine. At 50M+ documents, the per-token cost of a closed model typically exceeds the cost of running a medium-sized GPU instance within 3–4 months.
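A back-of-envelope script makes that crossover concrete. Every constant below is an illustrative assumption, not a quote; substitute your provider's actual per-token price, your GPU rate, and your measured throughput:

# Back-of-envelope embedding cost comparison. All constants are ASSUMED
# placeholder values for illustration -- plug in real numbers.
DOCS = 50_000_000
TOKENS_PER_DOC = 500            # assumed average document length
PRICE_PER_M_TOKENS = 0.13       # assumed closed-model price, $ per 1M tokens
GPU_HOURLY = 1.50               # assumed on-demand rate for a mid-size GPU
DOCS_PER_GPU_HOUR = 400_000     # assumed batch throughput, open model

api_cost = DOCS * TOKENS_PER_DOC / 1e6 * PRICE_PER_M_TOKENS
gpu_cost = DOCS / DOCS_PER_GPU_HOUR * GPU_HOURLY

print(f"Closed API, one-time embed:  ${api_cost:,.0f}")   # ~$3,250
print(f"Self-hosted, one-time embed: ${gpu_cost:,.0f}")   # ~$188

And the one-time embed is the smaller half of the story: re-embeds on model upgrades, document churn, and query-time embedding all recur, which is why the gap widens with scale.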


Dimensionality: More Is Not Always Better

The instinct is to pick the highest-dimensional model available. That instinct is frequently wrong.

Higher dimensionality means:

  • Larger index size (linear in dimension)
  • Slower ANN search at the same recall target
  • More parameters → slower inference → higher embedding cost
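The index-size point is worth quantifying: raw float32 storage is n_vectors × dim × 4 bytes, before any ANN index overhead.

def index_size_gb(n_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    # Raw vector storage only; graph-based ANN indexes (e.g. HNSW)
    # add further overhead on top of this.
    return n_vectors * dim * bytes_per_float / 1e9

print(index_size_gb(10_000_000, 1536))  # ~61 GB
print(index_size_gb(10_000_000, 256))   # ~10 GB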

The practical cutoffs for most tasks:

Use case                                  Recommended dimension
Short passage retrieval (< 512 tokens)    256–512
Long document retrieval                   768–1024
Cross-modal or cross-lingual              1024+
Real-time similarity at query time        128–256

Matryoshka Representation Learning (MRL) models solve this elegantly: a single model produces embeddings where any prefix of dimensions is a valid, independently useful embedding. You can truncate to 256d for fast retrieval and re-rank with 1536d for precision.

from sentence_transformers import SentenceTransformer

# MRL-capable model: truncate to 256d for ANN, 1024d for rerank
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

query = "What is the refund policy?"

# Fast retrieval embedding. Matryoshka truncation is applied via the
# truncate_sentence_embeddings context manager (encode itself does not
# take a truncate_dim argument).
with model.truncate_sentence_embeddings(truncate_dim=256):
    embedding_256 = model.encode(query, prompt_name="search_query")

# Precision rerank embedding
with model.truncate_sentence_embeddings(truncate_dim=1024):
    embedding_1024 = model.encode(query, prompt_name="search_query")

If your vector DB supports matryoshka-style two-stage retrieval, use it. You cut index storage by 4x and ANN search cost significantly, with minimal recall loss on most tasks.
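If it doesn't, the pattern is simple to hand-roll. A minimal numpy sketch, assuming docs_256 and docs_1024 are precomputed matryoshka embeddings of the same corpus at both dimensions (re-normalized after truncation), with a brute-force scan standing in for the real ANN index:

import numpy as np

def two_stage_search(q_256, q_1024, docs_256, docs_1024,
                     k_coarse=100, k_final=10):
    # Stage 1: cheap 256d dot-product scan (stand-in for an ANN index).
    coarse = np.argsort(-(docs_256 @ q_256))[:k_coarse]
    # Stage 2: exact 1024d rerank over the shortlist only.
    reranked = np.argsort(-(docs_1024[coarse] @ q_1024))[:k_final]
    return coarse[reranked]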


Multilingual: Do Not Trust "Supports 100 Languages"

Every multilingual embedding model claims broad language support. That claim hides enormous variance. "Supports" often means "was trained on some text in that language" — not "performs comparably to English retrieval in that language."

Check three things for each language in your corpus:

  1. Token coverage. Run your corpus through the model's tokenizer and measure the average tokens per word. High ratios (> 3) indicate the language is tokenized into sub-word fragments, which degrades semantic coherence. (A sketch covering this check and the next follows the list.)

  2. Cross-lingual recall. Embed a bilingual golden set: 50 query–passage pairs where the query is in language A and the passage is in language B. Compute recall@5. Below 60% means the model is not production-ready for cross-lingual retrieval.

  3. Script handling. CJK (Chinese, Japanese, Korean), Arabic, and Devanagari scripts require specific tokenization. Models trained predominantly on Latin-script corpora often underperform here regardless of what the model card claims.
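Checks 1 and 2 take a few lines each. A sketch using multilingual-e5-large as the candidate; the per-language corpus samples and the bilingual golden set are assumed to be yours:

import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

CANDIDATE = "intfloat/multilingual-e5-large"

# Check 1: token coverage. Whitespace word counts are meaningless for
# CJK -- use a language-appropriate segmenter for those scripts.
tokenizer = AutoTokenizer.from_pretrained(CANDIDATE)

def tokens_per_word(texts: list[str]) -> float:
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

# Check 2: cross-lingual recall@5 on a bilingual golden set, where
# queries[i] (language A) should retrieve passages[i] (language B).
model = SentenceTransformer(CANDIDATE)

def recall_at_5(queries: list[str], passages: list[str]) -> float:
    # E5 models expect "query: " / "passage: " prefixes.
    q = model.encode([f"query: {t}" for t in queries], normalize_embeddings=True)
    p = model.encode([f"passage: {t}" for t in passages], normalize_embeddings=True)
    top5 = np.argsort(-(q @ p.T), axis=1)[:, :5]
    return sum(i in row for i, row in enumerate(top5)) / len(queries)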

For serious multilingual needs (5+ languages, cross-lingual queries), intfloat/multilingual-e5-large and Alibaba-NLP/gte-multilingual-base consistently outperform text-embedding-3-large on non-English retrieval despite lower MTEB aggregate scores.


Domain Fine-Tuning: When the ROI Is Positive

Fine-tuning an embedding model on domain-specific data is high-leverage when:

  • Your corpus contains specialized terminology not present in general web crawls (medical, legal, financial, code)
  • Retrieval recall on your golden set is below 70% with the best off-the-shelf model
  • You have at least 1,000 labeled query–positive passage pairs

The fine-tuning setup is straightforward with sentence-transformers:

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from datasets import Dataset
 
# Load a strong base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
 
# Your labeled pairs: {query, positive, negative (optional)}
train_dataset = Dataset.from_list([
    {
        "anchor": "What is the max loan-to-value for a jumbo mortgage?",
        "positive": "Jumbo mortgages typically allow LTV ratios up to 80%...",
        "negative": "Conforming loan limits are set annually by the FHFA...",
    },
    # ... 1000+ examples
])
 
loss = losses.MultipleNegativesRankingLoss(model)
 
args = SentenceTransformerTrainingArguments(
    output_dir="./fine-tuned-mortgage-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)
 
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

Expect 5–15 percentage point recall improvement on domain queries. If you are not seeing that, your negative examples are too easy — use hard negatives mined from top-k BM25 results.
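A minimal mining sketch with the rank_bm25 package: for each query, keep top-ranked BM25 hits that are not the labeled positive. They are lexically close but semantically wrong, which is exactly what makes them hard:

from rank_bm25 import BM25Okapi

# `corpus` is your passage pool; positives come from the labeled pairs.
corpus = [
    "Jumbo mortgages typically allow LTV ratios up to 80%...",
    "Conforming loan limits are set annually by the FHFA...",
    # ... rest of the passage pool
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negatives(query: str, positive_idx: int, k: int = 10, n: int = 3):
    # Rank the whole pool by BM25 score for this query.
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    # Top hits that are NOT the positive become hard negatives.
    return [corpus[i] for i in ranked[:k] if i != positive_idx][:n]

Recent sentence-transformers releases also ship a mine_hard_negatives helper in sentence_transformers.util if you prefer mining with the embedding model itself rather than BM25.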

When the ROI is negative: if your domain vocabulary is covered by general models and your labeled data is sparse (< 500 pairs), fine-tuning often hurts generalization more than it helps precision. Stick with the best off-the-shelf model and invest in retrieval pipeline improvements instead.


Reading MTEB Scores Honestly

MTEB (Massive Text Embedding Benchmark) is the best public reference we have. It is also frequently misread.

What MTEB measures: performance across a diverse set of retrieval, clustering, classification, and semantic textual similarity tasks, averaged into a single score.

What MTEB does not measure:

  • Performance on your domain
  • Latency at your serving scale
  • Behavior on your input length distribution
  • Cross-lingual quality on low-resource languages

A model ranked #5 on MTEB may outperform #1 on your specific task. Use MTEB as a first-pass filter, then benchmark the top 3–4 candidates on a sample of your actual corpus.

The most common MTEB misuse: comparing models with different context windows. A model with a 512-token context will truncate your 2,000-token passages. Its retrieval score is not comparable to a model with an 8k context on the same task.
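A quick way to catch this before benchmarking: count how many of your passages overflow each candidate's context window. A sketch, assuming passages is a representative sample of your corpus:

from sentence_transformers import SentenceTransformer

# `passages` is assumed to be a sample drawn from your own corpus.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
max_len = model.max_seq_length  # 512 for this model

n_over = sum(len(model.tokenizer.tokenize(p)) > max_len for p in passages)
print(f"{n_over}/{len(passages)} passages exceed the context window")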


Quick Reference: Models Worth Evaluating in 2026

Model                               Dim          Context   Strengths                   Use when
text-embedding-3-small              1536 (MRL)   8191      Ease of use, no infra       Prototyping, < 5M docs
text-embedding-3-large              3072 (MRL)   8191      Best closed-model quality   Quality-first, budget flexible
BAAI/bge-large-en-v1.5              1024         512       Strong English retrieval    English-only, self-hosted
intfloat/e5-large-v2                1024         512       Instruction-following       Asymmetric retrieval tasks
nomic-ai/nomic-embed-text-v1.5      768 (MRL)    8192      Long context + MRL          Long docs, cost-sensitive
multilingual-e5-large               1024         512       Multilingual                3–10 language corpora
Alibaba-NLP/gte-multilingual-base   768          8192      Cross-lingual + long ctx    Cross-lingual retrieval

Key Takeaways

  • Default model selection (ada-002 or equivalent) is almost never the optimal choice — use the decision tree before benchmarking.
  • Dimensionality is a cost lever: Matryoshka models let you use 256d for ANN retrieval and 1024d for reranking, cutting storage 4x with minimal recall loss.
  • Multilingual support claims are marketing; always verify token coverage and cross-lingual recall on your specific languages.
  • Domain fine-tuning has positive ROI when you have 1,000+ labeled pairs and a specialized vocabulary — otherwise it often hurts generalization.
  • MTEB scores are a shortlist filter, not a decision — always benchmark on a sample of your actual corpus before committing.
  • At 50M+ vectors, the economics strongly favor self-hosted open models over per-token closed APIs.