# Hybrid Search: BM25 + Vectors Without the Hand-Waving
Every RAG tutorial says "use hybrid search." Almost none of them tell you how to fuse the results, how to weight the signals, or when BM25 alone is the right answer and vector search is just overhead.
Hybrid search is not "run BM25 and vectors and combine somehow." The fusion method and weights are the product. Get them wrong and you get something worse than either modality alone.
## What Each Modality Actually Does
Before you fuse anything, you need a precise mental model of what you're fusing.
BM25 is a probabilistic term-frequency model. It scores documents by exact and near-exact term overlap with the query. It's fast, deterministic, cheap to serve (no GPU), and its latency is low and predictable. A minimal sketch of its scoring function follows the list below. It handles:
- Exact product names, error codes, serial numbers
- Rare vocabulary not seen during embedding training
- Short, keyword-style queries
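To make "probabilistic term-frequency model" concrete, here is the core of the BM25 scoring function as a minimal sketch. The `idf` value and length statistics are assumed to be precomputed from your corpus; `k1` and `b` are the usual defaults:

```python
def bm25_term_score(
    tf: float,           # term frequency in the document
    idf: float,          # inverse document frequency (precomputed per term)
    doc_len: int,        # document length in tokens
    avg_doc_len: float,  # mean document length across the corpus
    k1: float = 1.5,     # term-frequency saturation
    b: float = 0.75,     # document-length normalization
) -> float:
    """One term's contribution to a document's BM25 score.

    A document's full score is the sum of this over all query terms.
    """
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

The saturation in the denominator is why BM25 rewards exact term matches without letting a single repeated term dominate the score.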
Vector search embeds both query and documents into a continuous semantic space. Similar meaning clusters together regardless of vocabulary overlap; a short demonstration follows the list. It handles:
- Paraphrase and synonym matching
- Conceptual questions where the user doesn't know the exact terminology
- Cross-lingual retrieval when using multilingual embedders
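You can see this in one line of arithmetic. A minimal sketch, using sentence-transformers purely as an illustrative embedder (any embedding model works the same way):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is illustrative; substitute whatever embedder you deploy.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero vocabulary overlap, high semantic similarity.
va, vb = model.encode(["heart attack", "myocardial infarction"])
print(cosine(va, vb))
```

BM25 scores this pair at zero; the embedder places the two phrases close together.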
Neither dominates. The failures are complementary:
| Query | BM25 | Vector |
|---|---|---|
| "ERR_SSL_PROTOCOL_ERROR" | Exact match | May miss if rare in training |
| "Why does my connection keep dropping?" | Misses paraphrases | Semantic match across phrasing |
| "myocardial infarction" vs "heart attack" | Misses synonym | Handles it well |
| "GPT-4o release date" | Finds it if in corpus | May confuse with similar entities |
## The Four Fusion Strategies
### 1. Reciprocal Rank Fusion (RRF)
RRF is the default choice. It's simple, robust, requires no tuning, and outperforms weighted sum on most benchmarks unless you have a lot of domain-specific calibration data.
```python
def reciprocal_rank_fusion(
    result_lists: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """
    result_lists: each is an ordered list of doc_ids from one retriever.
    k: constant that dampens the effect of high rankings (default 60).
    Returns: [(doc_id, rrf_score)] sorted descending.
    """
    scores: dict[str, float] = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Usage
bm25_results = ["doc_3", "doc_1", "doc_7", "doc_2"]  # BM25 ranking
vec_results = ["doc_1", "doc_5", "doc_3", "doc_8"]   # Vector ranking
fused = reciprocal_rank_fusion([bm25_results, vec_results])
# doc_1 and doc_3 get boosted: they appeared in both lists
```

RRF works because it's rank-based, not score-based. BM25 scores and cosine similarities live in completely different numerical ranges; you cannot add them directly without normalization. RRF sidesteps this entirely.
### 2. Weighted Score Sum
When you have calibrated scores and domain-specific tuning data, weighted sum can outperform RRF. The catch: you must normalize scores first.
```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ScoredResult:
    doc_id: str
    score: float

def normalize_minmax(results: list[ScoredResult]) -> list[ScoredResult]:
    if not results:
        return results
    scores = np.array([r.score for r in results])
    min_s, max_s = scores.min(), scores.max()
    if max_s == min_s:
        return [ScoredResult(r.doc_id, 1.0) for r in results]
    normalized = (scores - min_s) / (max_s - min_s)
    return [ScoredResult(r.doc_id, float(s)) for r, s in zip(results, normalized)]

def weighted_fusion(
    bm25_results: list[ScoredResult],
    vec_results: list[ScoredResult],
    alpha: float = 0.5,  # weight for vector; (1 - alpha) for BM25
) -> list[ScoredResult]:
    bm25_norm = normalize_minmax(bm25_results)
    vec_norm = normalize_minmax(vec_results)
    combined: dict[str, float] = {}
    for r in bm25_norm:
        combined[r.doc_id] = combined.get(r.doc_id, 0.0) + (1 - alpha) * r.score
    for r in vec_norm:
        combined[r.doc_id] = combined.get(r.doc_id, 0.0) + alpha * r.score
    return sorted(
        [ScoredResult(doc_id, score) for doc_id, score in combined.items()],
        key=lambda x: x.score,
        reverse=True,
    )
```

**Choosing alpha:** Start at 0.5. Bias toward BM25 (`alpha=0.3`) for technical queries with precise terminology. Bias toward vector (`alpha=0.7`) for conversational or conceptual queries. Tune on your eval set; don't guess.
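Here's what that tuning loop can look like: a minimal sketch assuming an eval set of (query, relevant_doc_ids) pairs and the `weighted_fusion` and `ScoredResult` definitions above. `bm25_fn` and `vec_fn` stand in for your actual retrievers:

```python
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of the relevant docs that appear in the top k.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

def sweep_alpha(eval_set, bm25_fn, vec_fn, alphas=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """eval_set: list of (query, set_of_relevant_doc_ids).
    bm25_fn / vec_fn: query -> list[ScoredResult]."""
    best_alpha, best_recall = None, -1.0
    for alpha in alphas:
        total = 0.0
        for query, relevant in eval_set:
            fused = weighted_fusion(bm25_fn(query), vec_fn(query), alpha=alpha)
            total += recall_at_k([r.doc_id for r in fused], relevant)
        avg = total / len(eval_set)
        if avg > best_recall:
            best_alpha, best_recall = alpha, avg
    return best_alpha, best_recall
```

Swap recall@10 for whatever metric you actually track (nDCG, MRR); the loop is the same.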
### 3. Learned Fusion
With enough labeled data (>500 query-relevance pairs), you can train a small model to learn the optimal per-query fusion weight.
```python
# Minimal learned fusion with a linear model
from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: [bm25_score_normalized, vec_score_normalized, bm25_rank, vec_rank, query_length]
# Label: 1 if doc is relevant, 0 if not (from human annotation)
class LearnedFusion:
    def __init__(self):
        self.model = LogisticRegression()

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.model.fit(X, y)

    def score(
        self,
        bm25_score: float,
        vec_score: float,
        bm25_rank: int,
        vec_rank: int,
        query_length: int,
    ) -> float:
        features = np.array([[bm25_score, vec_score, bm25_rank, vec_rank, query_length]])
        return float(self.model.predict_proba(features)[0][1])
```

Learned fusion is worth the complexity only if you have the annotation budget and a consistent query distribution. For most teams, RRF is better than a poorly calibrated learned model.
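If you do go this route, usage looks something like the toy sketch below. The feature rows follow the layout in the comment above; real training data comes from your annotation pipeline, not hand-written arrays:

```python
# Toy data, purely illustrative: [bm25_norm, vec_norm, bm25_rank, vec_rank, query_len]
X = np.array([
    [0.9, 0.8, 1, 2, 4],    # relevant: strong on both signals
    [0.1, 0.2, 40, 35, 4],  # irrelevant: weak on both
])
y = np.array([1, 0])

fusion = LearnedFusion()
fusion.fit(X, y)
print(fusion.score(0.7, 0.6, 3, 5, 4))  # fused relevance probability
```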
### 4. Cascade (Sequential Filtering)
Cascade runs BM25 first as a cheap filter, then runs vector search only over the filtered set. It's not real fusion — it's cost reduction.
Use cascade when:
- Your document corpus is millions of records and full vector search is too slow
- The query has strong keyword signals (product IDs, error codes) that make BM25 pre-filtering accurate
```python
def cascade_search(
    query: str,
    bm25_index,    # assumed: any BM25 index exposing .search(query, top_n=...)
    vector_store,  # assumed: a vector store that can search within a candidate set
    bm25_top_n: int = 200,
    final_top_k: int = 10,
) -> list[str]:
    # Stage 1: Cheap BM25 pre-filter
    bm25_candidates = bm25_index.search(query, top_n=bm25_top_n)
    candidate_ids = [r.doc_id for r in bm25_candidates]
    # Stage 2: Vector re-rank within BM25 candidates
    results = vector_store.similarity_search_in_set(
        query=query,
        doc_ids=candidate_ids,
        top_k=final_top_k,
    )
    return results
```

## When BM25 Alone Is Right
There are cases where adding vector search hurts you. Use BM25 only when:
- Corpus is primarily structured data: Part numbers, IDs, codes, model names. Vector semantics don't help and add latency.
- Query distribution is narrow and predictable: If 90% of queries are variations of 5 patterns, BM25 with good tokenization and field weighting will match a vector solution.
- Latency budget is tight: BM25 at <10ms; vector search with HNSW at 30–200ms depending on index size. If you're at 50ms p99 budget, vector search may blow it.
- Index updates are high-frequency: BM25 indexes update in milliseconds. Vector indexes need re-embedding — even with incremental HNSW, updates are slower.
```python
# Quick BM25-only setup with rank_bm25
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r'\b\w+\b', text.lower())

corpus = [
    "ERR_SSL_PROTOCOL_ERROR occurs when TLS handshake fails",
    "Connection timeout after 30 seconds of inactivity",
    "Authentication failed: invalid API key format",
]
tokenized_corpus = [tokenize(doc) for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "SSL handshake failure"
scores = bm25.get_scores(tokenize(query))
top_idx = scores.argsort()[::-1][:3]  # indices of the top-scoring docs
```

## The Weaviate / Elasticsearch Implementation Reality
Most teams use a managed hybrid search — Weaviate, Elasticsearch, or Pinecone. The abstractions hide the fusion details, which is a trap.
```python
# Weaviate hybrid search (v3 Python client): alpha controls BM25 vs vector weight
import weaviate
from weaviate.gql.get import HybridFusion  # import path varies across client versions

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Article", ["title", "content"])
    .with_hybrid(
        query="SSL certificate error",
        alpha=0.5,  # 0 = pure BM25, 1 = pure vector
        fusion_type=HybridFusion.RELATIVE_SCORE,  # or HybridFusion.RANKED
    )
    .with_limit(10)
    .do()
)
```

Weaviate's RANKED fusion is RRF. RELATIVE_SCORE is normalized weighted sum. Know which one you're using, and test both on your eval set. Don't rely on the server default either: it has changed across Weaviate versions (rankedFusion in older releases, relativeScoreFusion in newer ones), so set `fusion_type` explicitly.
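What "test both on your eval set" can look like, as a hedged sketch: it reuses the v3 `client` above and the `recall_at_k` / `eval_set` shapes from the alpha-sweep example. The `doc_id` property is an assumption about your schema:

```python
from weaviate.gql.get import HybridFusion

def weaviate_top_ids(query: str, fusion) -> list[str]:
    result = (
        client.query
        .get("Article", ["doc_id"])  # assumes your schema stores a doc_id property
        .with_hybrid(query=query, alpha=0.5, fusion_type=fusion)
        .with_limit(10)
        .do()
    )
    return [a["doc_id"] for a in result["data"]["Get"]["Article"]]

for fusion in (HybridFusion.RANKED, HybridFusion.RELATIVE_SCORE):
    avg = sum(
        recall_at_k(weaviate_top_ids(q, fusion), rel) for q, rel in eval_set
    ) / len(eval_set)
    print(fusion, round(avg, 3))
```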
## Diagnosing Hybrid Search Problems
When hybrid search underperforms a single modality, the fusion is usually the problem, not the retrievers.
Symptom: Results are worse than BM25 alone.
- Vector results may be polluting the fused set: run with `alpha=0` to confirm the BM25 baseline.
- Your embedding model may be poorly calibrated for this domain.
- Check if semantic neighbors are actually semantically related using a similarity sanity test (see the sketch below).
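A similarity sanity test doesn't need infrastructure: embed a few pairs you know are related and a few you know aren't, and check that the scores separate. This sketch reuses `model` and `cosine` from the embedding example earlier; the pairs are placeholders for your domain:

```python
# Related pairs should score clearly above unrelated ones.
# If they don't, the embedder is a poor fit for this domain.
pairs = [
    ("reset my password", "how do I change my login credentials", True),
    ("reset my password", "quarterly revenue report", False),
]
for a, b, related in pairs:
    va, vb = model.encode([a, b])
    print(f"{cosine(va, vb):.2f}  related={related}  ({a!r} vs {b!r})")
```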
Symptom: Results are worse than vector alone.
- BM25 is returning off-topic keyword matches that dilute the fused set.
- Your BM25 tokenization doesn't match your query patterns.
- Try stemming or domain-specific tokenization (example below).
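For example, a stemming tokenizer (NLTK's Snowball stemmer here; any stemmer works) collapses "failing", "failed", and "failure" onto one term, so keyword queries stop missing morphological variants:

```python
import re
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def tokenize_stemmed(text: str) -> list[str]:
    # Same word-boundary tokenizer as the rank_bm25 example, plus stemming.
    return [stemmer.stem(t) for t in re.findall(r"\b\w+\b", text.lower())]

print(tokenize_stemmed("SSL handshake failures"))  # e.g. ['ssl', 'handshak', 'failur']
```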
Symptom: Latency increased without quality improvement.
- You're running full vector search on a large index when BM25-only would be fine.
- Consider cascade: BM25 pre-filter → vector re-rank on candidates.
## Key Takeaways
- Use RRF as your default fusion — it requires no score normalization and outperforms weighted sum without tuning.
- BM25 and vector scores cannot be added directly — they live in different ranges; normalize first if doing weighted sum.
- BM25-only is the right choice for high-frequency index updates, tight latency budgets, or primarily structured query patterns.
- Cascade search (BM25 filter → vector re-rank) cuts cost on large corpora without sacrificing quality.
- Tune `alpha` on your actual eval set; 0.5 is a starting point, not an answer.
- Know which fusion algorithm your managed search provider uses under the hood; defaults are not always what you think.