Query-Aware Caching for Multi-Tenant Vector Search with SLA-Driven Eviction and Fairness Guarantees
Published 2018-03-04
Copyright (c) 2018 authors

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Abstract
Multi-tenant vector search systems increasingly serve heterogeneous applications such as retrieval-augmented generation, recommendation, and deduplication, where each tenant issues embedding-based similarity queries over shared infrastructure. These workloads exhibit burstiness, temporal locality in query intent, and heavy-tailed latency sensitivity, while operators must satisfy per-tenant service-level agreements under tight memory and compute budgets. Caching is a natural lever, yet conventional policies optimize global hit rate or average latency and often ignore how approximate nearest-neighbor execution paths depend on the query embedding, index structure, and tenant-specific objectives. This paper studies query-aware caching for multi-tenant vector search with eviction driven by explicit SLA risk and fairness guarantees. We formalize cacheable objects beyond raw results, including centroid routes, candidate lists, graph neighborhoods, and quantization side data that reduce compute along typical ANN pipelines. We introduce a utility model that predicts marginal reductions in tail latency and SLA violation probability as a function of query similarity, routing state, and resource contention. Eviction is posed as an online constrained optimization problem balancing memory, energy, and tail latency, while enforcing tenant fairness via proportional or max-min style constraints. We develop practical algorithms that combine sketch-based query clustering, low-rank tenant intent models, and primal-dual updates that track shadow prices for fairness and memory. We also discuss performance engineering and distributed execution considerations, and outline an evaluation methodology emphasizing tail metrics, reproducibility, and robustness under tenant churn and workload shifts.
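As an illustration of the primal-dual idea summarized in the abstract, the following is a minimal sketch, not the authors' algorithm. It assumes each cached object carries a scalar utility (standing in for a predicted reduction in SLA violation risk) and a memory size, and that fairness is expressed as a minimum memory share per tenant. The names Entry, evict_to_fit, update_prices, run, min_share, and lr are hypothetical and chosen only for this example. The primal step evicts the objects with the lowest price-adjusted utility per unit of memory until the cache fits; the dual step raises a tenant's shadow price when that tenant holds less than its minimum share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    key: str
    tenant: str
    size: int        # memory footprint, arbitrary units (assumption)
    utility: float   # assumed predicted reduction in SLA-violation risk


def evict_to_fit(entries, capacity, mu):
    """Primal step: evict lowest price-adjusted utility per unit of memory."""
    def score(e):
        # Utility density plus the tenant's fairness shadow price.
        return e.utility / e.size + mu.get(e.tenant, 0.0)

    resident = sorted(entries, key=score)  # ascending: worst candidates first
    used = sum(e.size for e in resident)
    while resident and used > capacity:
        used -= resident.pop(0).size
    return resident


def update_prices(mu, resident, min_share, lr):
    """Dual step: raise a tenant's price while it is below its minimum share."""
    new_mu = {}
    for tenant, share in min_share.items():
        held = sum(e.size for e in resident if e.tenant == tenant)
        # Projected subgradient step; prices never go negative.
        new_mu[tenant] = max(0.0, mu[tenant] + lr * (share - held) / share)
    return new_mu


def run(entries, capacity, min_share, rounds=5, lr=0.5):
    """Alternate primal eviction and dual price updates for a few rounds."""
    mu = {tenant: 0.0 for tenant in min_share}
    resident = evict_to_fit(entries, capacity, mu)
    for _ in range(rounds):
        mu = update_prices(mu, resident, min_share, lr)
        resident = evict_to_fit(entries, capacity, mu)
    return resident, mu


if __name__ == "__main__":
    demo = [
        Entry("a1", "A", 4, 8.0),
        Entry("a2", "A", 4, 6.0),
        Entry("a3", "A", 4, 4.0),
        Entry("b1", "B", 4, 2.4),
        Entry("b2", "B", 4, 1.0),
    ]
    kept, prices = run(demo, capacity=12, min_share={"A": 4, "B": 4})
    print(sorted(e.key for e in kept), prices)
```

In this toy setting, a purely utility-driven policy would evict every object belonging to tenant B. Once B's shadow price rises, one of B's objects displaces the least valuable object of tenant A, while the memory budget is still respected. A real system would replace the scalar utility with the paper's utility model and would need to handle ties, object churn, and step-size selection, none of which this sketch addresses.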