Vol. 2 No. 3 (2018): JABADP-2-3
Articles

Query-Aware Caching for Multi-Tenant Vector Search with SLA-Driven Eviction and Fairness Guarantees

Quang Nguyen
Department of Computer Science and Engineering, Mekong Institute of Technology, Đường Hoa Phượng 12, Cần Thơ, Vietnam
Khai Vo
Department of Computer Science and Engineering, Red River University of Computing, Đường Lê Quý Đôn 88, Hà Nội, Vietnam

Published 2018-03-04

How to Cite

Nguyen, Q., & Vo, K. (2018). Query-Aware Caching for Multi-Tenant Vector Search with SLA-Driven Eviction and Fairness Guarantees. Journal of Applied Big Data Analytics, Decision-Making, and Predictive Modelling Systems, 2(3), 1-17. https://polarpublications.com/index.php/JABADP/article/view/2018-03-04

Abstract

Multi-tenant vector search systems increasingly serve heterogeneous applications such as retrieval-augmented generation, recommendation, and deduplication, where each tenant issues embedding-based similarity queries over shared infrastructure. These workloads exhibit burstiness, temporal locality in query intent, and heavy-tailed latency sensitivity, while operators must satisfy per-tenant service-level agreements under tight memory and compute budgets. Caching is a natural lever, yet conventional policies optimize global hit rate or average latency and often ignore how approximate nearest-neighbor execution paths depend on the query embedding, index structure, and tenant-specific objectives. This paper studies query-aware caching for multi-tenant vector search with eviction driven by explicit SLA risk and fairness guarantees. We formalize cacheable objects beyond raw results, including centroid routes, candidate lists, graph neighborhoods, and quantization side data that reduce compute along typical ANN pipelines. We introduce a utility model that predicts marginal reductions in tail latency and SLA violation probability as a function of query similarity, routing state, and resource contention. Eviction is posed as an online constrained optimization problem balancing memory, energy, and tail latency, while enforcing tenant fairness via proportional or max-min style constraints. We develop practical algorithms that combine sketch-based query clustering, low-rank tenant intent models, and primal-dual updates that track shadow prices for fairness and memory. We also discuss performance engineering and distributed execution considerations, and outline an evaluation methodology emphasizing tail metrics, reproducibility, and robustness under tenant churn and workload shifts.
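The abstract's eviction mechanism — scoring cached objects by predicted SLA benefit, discounted by a memory shadow price and credited by per-tenant fairness shadow prices updated via dual ascent — can be illustrated with a minimal sketch. This is not the authors' implementation; the class, parameter names, and the specific utility/price update rules below are hypothetical simplifications for exposition.

```python
# Illustrative sketch only (not the paper's algorithm): primal-dual eviction
# with a memory shadow price and per-tenant fairness shadow prices.
# All names, step sizes, and update rules here are assumptions.
from dataclasses import dataclass


@dataclass
class CacheEntry:
    tenant: str
    size: int       # bytes held in cache
    utility: float  # predicted marginal reduction in SLA-violation risk


class PrimalDualEvictor:
    def __init__(self, capacity: int, fair_share: dict, step: float = 0.1):
        self.capacity = capacity        # total memory budget (bytes)
        self.fair_share = fair_share    # tenant -> target fraction of capacity
        self.step = step                # dual-ascent learning rate
        self.mem_price = 0.0            # shadow price on memory
        self.fair_price = {t: 0.0 for t in fair_share}  # fairness prices
        self.entries: list[CacheEntry] = []

    def usage(self, tenant: str) -> int:
        return sum(e.size for e in self.entries if e.tenant == tenant)

    def score(self, e: CacheEntry) -> float:
        # Lagrangian score: SLA utility minus memory cost, plus a fairness
        # credit so entries of under-served tenants survive eviction longer.
        return e.utility - self.mem_price * e.size + self.fair_price[e.tenant]

    def update_prices(self) -> None:
        # Dual ascent: raise the memory price while over budget; raise a
        # tenant's fairness price while it holds less than its target share.
        total = sum(e.size for e in self.entries)
        self.mem_price = max(
            0.0, self.mem_price + self.step * (total - self.capacity) / self.capacity
        )
        for t, share in self.fair_share.items():
            deficit = share * self.capacity - self.usage(t)
            self.fair_price[t] = max(
                0.0, self.fair_price[t] + self.step * deficit / self.capacity
            )

    def admit(self, entry: CacheEntry) -> None:
        self.entries.append(entry)
        self.update_prices()
        # Evict lowest-scoring entries until back within the memory budget.
        while sum(e.size for e in self.entries) > self.capacity:
            victim = min(self.entries, key=self.score)
            self.entries.remove(victim)
```

In this toy form, `utility` stands in for the paper's predicted tail-latency/SLA benefit, and the fairness credit plays the role of the proportional or max-min constraint multipliers; the actual paper couples these with query clustering and tenant intent models that the sketch omits.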