Optimize retrieval latency using semantic caching for repeated queries, KV-cache pre-fetching for overlapping contexts, and index partitioning across CPU/GPU with HNSW to balance accuracy and speed.
Production retrieval latency optimization combines three complementary strategies. First, semantic caching stores computed embeddings for frequent queries, so repeated or near-duplicate queries skip redundant computation. Second, KV-cache pre-fetching anticipates likely next queries from the conversation history and preloads the relevant key-value representations. Third, index partitioning splits vector search across CPU and GPU resources: hot data sits on the GPU for low latency, cold data on the CPU for cost efficiency. HNSW indexing gives approximately logarithmic search complexity in the number of indexed vectors, while techniques like head+tail truncation keep long documents within the token budget.
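As a rough illustration of the semantic caching idea, the sketch below keeps entries keyed by query embedding and returns a stored result when a new query is similar enough and the entry has not expired. The names SemanticCache, threshold, ttl_seconds, and clock are hypothetical, the linear scan stands in for whatever nearest-neighbour lookup a real system would use, and storing the retrieval result next to the embedding is an assumption of this example rather than something the text above prescribes.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    embedding: List[float]
    result: object
    expires_at: float


class SemanticCache:
    """Hypothetical in-memory cache with a TTL and a similarity threshold."""

    def __init__(
        self,
        threshold: float = 0.9,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries: List[CacheEntry] = []

    def put(self, query_embedding: Sequence[float], result: object) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self.entries.append(CacheEntry(list(query_embedding), result, expires_at))

    def get(self, query_embedding: Sequence[float]) -> Optional[object]:
        now = self.clock()
        # Drop expired entries before searching.
        self.entries = [e for e in self.entries if e.expires_at > now]
        best: Optional[CacheEntry] = None
        best_score = self.threshold
        # Linear scan: return the most similar entry at or above the threshold.
        for entry in self.entries:
            score = cosine_similarity(query_embedding, entry.embedding)
            if score >= best_score:
                best, best_score = entry, score
        return best.result if best is not None else None
```

A caller would check get() with the query embedding first and fall through to the full retrieval path only on a miss, then store the fresh result with put(). A higher threshold lowers the risk of serving a result for a query that only looks similar, at the cost of fewer cache hits.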
Semantic caching: Cache embeddings for repeated queries with configurable TTL and similarity thresholds
KV-cache pre-fetching: Preload key-value representations for likely next queries based on conversation state
Index partitioning: Place hot data on GPU for low latency, cold data on CPU for cost efficiency
HNSW indexing: Approximately logarithmic search complexity, with a configurable ef_search for the recall-speed trade-off (higher values improve recall at the cost of latency)
Head+tail truncation: Preserve beginning and end of long documents while discarding middle sections
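Head+tail truncation can be sketched in a few lines. The example below assumes the document is already tokenized into a list; the function name head_tail_truncate and the head_fraction parameter are hypothetical, and the even split between head and tail is just a default, not a recommendation taken from the text above.

```python
from typing import List, Sequence


def head_tail_truncate(
    tokens: Sequence[str], budget: int, head_fraction: float = 0.5
) -> List[str]:
    """Keep the first and last tokens of a document, dropping the middle.

    budget is the maximum number of tokens to return; head_fraction is the
    share of that budget given to the beginning of the document.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not 0.0 <= head_fraction <= 1.0:
        raise ValueError("head_fraction must be between 0 and 1")
    if len(tokens) <= budget:
        return list(tokens)  # already within budget, nothing to discard

    head_len = int(budget * head_fraction)
    tail_len = budget - head_len
    head = list(tokens[:head_len])
    # Slicing from len(tokens) - tail_len also handles tail_len == 0 correctly.
    tail = list(tokens[len(tokens) - tail_len:])
    return head + tail
```

For a ten-token document and a budget of four, this keeps the first two and last two tokens. In practice the token list would come from the same tokenizer the downstream model uses, so that the budget matches what the model actually counts.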