Q-Filters: Leveraging QK geometry for efficient KV cache compression

By Miniml Research, August 12, 2025

KV cache size is a major bottleneck for long-context inference. Q-Filters address this with a training-free method that scores and prunes KV pairs using context-agnostic filter directions derived from the geometry of query and key representations: the filters are estimated once per head from calibration queries and then reused for any input.
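To make the idea concrete, here is a minimal NumPy sketch of this style of compression, not the authors' implementation: a per-head filter direction is estimated from sampled queries via SVD, each cached key is scored by its projection onto that direction, and only the top-scoring fraction of KV pairs is kept. The function names (`estimate_filter`, `compress_kv`) and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def estimate_filter(q_samples):
    # q_samples: (n, d) query vectors gathered from calibration prompts.
    # Take the principal right singular vector as the filter direction,
    # sign-aligned with the mean query so higher scores mean "more attended".
    _, _, vt = np.linalg.svd(q_samples, full_matrices=False)
    v = vt[0]
    if np.dot(v, q_samples.mean(axis=0)) < 0:
        v = -v
    return v  # unit-norm direction of shape (d,)

def compress_kv(keys, values, filt, keep_ratio=1 / 32):
    # Score each key by its projection on the filter; keep the top fraction.
    scores = keys @ filt
    k = max(1, int(len(keys) * keep_ratio))
    idx = np.argsort(scores)[-k:]
    idx.sort()  # preserve original token order in the compressed cache
    return keys[idx], values[idx]

# Toy demo with random data (hypothetical shapes).
rng = np.random.default_rng(0)
d = 64
q_calib = rng.normal(size=(256, d))
filt = estimate_filter(q_calib)
keys = rng.normal(size=(128, d))
values = rng.normal(size=(128, d))
ck, cv = compress_kv(keys, values, filt)  # 128 -> 4 KV pairs at 32x
```

Because the filter is fixed ahead of time, scoring is a single matrix-vector product per head and needs no access to attention weights at inference.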

Because the method is training-free and compatible with FlashAttention (unlike eviction schemes that need the attention weights materialized), it can be adopted without modifying or retraining the model. The paper reports strong quality retention, including 99% accuracy on needle-in-a-haystack tests at 32x compression.

Q-Filters are a practical option for teams that need long-context throughput without sacrificing reliability.

Paper: https://arxiv.org/abs/2503.02812

Stay ahead with research-backed solutions

From papers to production, we translate cutting-edge AI research into practical systems that give your business a competitive edge.

See how we work