Q-Filters: Leveraging QK geometry for efficient KV cache compression
By Miniml Research, August 12, 2025
KV cache size is a major bottleneck for long-context inference. Q-Filters address this with a training-free method: keys are scored by their projection onto a single, context-agnostic direction estimated from the geometry of query representations, and the lowest-scoring KV pairs are evicted.
Because the scoring never needs the materialized attention matrix, the method stays compatible with FlashAttention and can be adopted without retraining or retooling the model. The paper reports strong quality retention, including 32x compression with 99% accuracy on needle-in-a-haystack tests.
Q-Filters are a practical option for teams that need long-context throughput without sacrificing reliability.
Paper: https://arxiv.org/abs/2503.02812
Stay ahead with research-backed solutions
From papers to production, we translate cutting-edge AI research into practical systems that give your business a competitive edge.
See how we work