Hybrid Search
Combine vector and text search with RRF
Hybrid search combines vector similarity with text matching using Reciprocal Rank Fusion (RRF). This improves results for queries containing exact identifiers or keywords.
How It Works
Query: "handleAuth authentication"
│
├─► Vector Search ──► [chunk_a, chunk_b, chunk_c]
│ rank 0 rank 1 rank 2
│
└─► Text Search ────► [chunk_b, chunk_d, chunk_a]
rank 0 rank 1 rank 2
│
▼
RRF Fusion: score = Σ 1/(k + rank)
│
▼
[chunk_b, chunk_a, chunk_c, chunk_d]
(appears in both lists = higher score)
- Vector search: Semantic similarity via embeddings (existing behavior)
- Text search: Simple keyword matching in chunk content
- RRF fusion: Combines rankings from both sources
Configuration
Disabled by default. Enable in .grepai/config.yaml:
search:
hybrid:
enabled: true
k: 60 # RRF constant (default: 60)
The k parameter
The RRF formula is: score(doc) = Σ 1/(k + rank_i)
- Higher k (e.g., 100): More weight to documents appearing in multiple lists
- Lower k (e.g., 30): More weight to top-ranked documents in each list
- Default (60): Balanced weighting
When to Enable
Enable hybrid search when:
- Queries often include exact function/class names
- You mix natural language with identifiers (e.g., “handleAuth function”)
- Vector-only search misses obvious keyword matches
- You search for specific variable or method names
Examples
| Query | Vector-only | Hybrid |
|---|---|---|
handleUserLogin | May miss if embedding doesn’t capture identifier | Finds exact matches |
authentication flow | Good semantic match | Same quality |
validateEmail function in user module | Partial match | Better: combines semantic + keyword |
Performance Note
Hybrid search loads all chunks into memory for text matching. For very large indexes (100k+ chunks), this may add latency.
Consider keeping it disabled for:
- Very large monorepos
- Purely semantic queries (no identifiers)
- Performance-critical use cases
Technical Details
Text Search
Simple keyword matching:
- Query is tokenized into words (lowercase, min 2 chars)
- Each chunk is scored by:
matches / total_words - Results sorted by score
RRF Fusion
Reciprocal Rank Fusion merges ranked lists:
score(doc) = Σ 1/(k + rank_i) for each source
Benefits:
- No need to normalize scores between sources
- Robust to outliers
- Simple and effective