TurboQuant

New LLM compression algorithm by Google

Website: research.google

What it is

A set of theoretically grounded quantization algorithms that compress the key-value cache of large language models and the vectors stored by vector search engines.

Intent

I need it when

Benchmark or research state-of-the-art quantization algorithms for LLMs and vector databases

TurboQuant, QJL, and PolarQuant are peer-reviewed algorithms (ICLR 2026, AISTATS 2026) evaluated on standard benchmarks (LongBench, RULER, ZeroSCROLLS, L-Eval) using open-source LLMs (Gemma, Mistral, Llama-3.1), providing reproducible baselines for academic and applied ML research.

Reduce LLM inference memory usage by compressing the key-value cache without retraining the model

TurboQuant compresses KV cache entries to as few as 3 bits per value with no fine-tuning required. On benchmarks such as LongBench and Needle In A Haystack it is reported to reduce KV memory by at least 6x with no loss of accuracy, which directly cuts memory costs and allows longer context windows.
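To give a feel for the memory at stake, here is a rough sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128), the 128K-token context, and the 16-bit baseline are all illustrative assumptions, not figures from the source. The source does not state the baseline behind its 6x figure, so this is not an attempt to reproduce it.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size in bytes for a single sequence."""
    # Both keys and values are cached, hence the factor of 2.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * bits_per_value / 8


# Hypothetical model shape and context length, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
GIB = 1024 ** 3

baseline_gib = kv_cache_bytes(**config, bits_per_value=16) / GIB
quantized_gib = kv_cache_bytes(**config, bits_per_value=3) / GIB

print(f"16-bit cache: {baseline_gib:.1f} GiB")  # 16.0 GiB
print(f" 3-bit cache: {quantized_gib:.1f} GiB")  # 3.0 GiB
```

The function ignores any side information a quantizer might store, so it describes the payload bits only.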

Apply theoretically grounded quantization that eliminates memory overhead from quantization constants

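Traditional quantization stores quantization constants alongside the compressed values, which adds 1 to 2 extra bits per number. For example, a 16-bit scale and a 16-bit zero point stored for every block of 32 values cost (16 + 16) / 32 = 1 extra bit per value. TurboQuant's PolarQuant and QJL components eliminate this overhead entirely: PolarQuant through a polar coordinate mapping, and QJL through a zero-overhead 1-bit sign representation, so the compression ratio is not diluted by side information.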
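The 1-bit sign idea can be illustrated with a small sketch. This is a simplified, pure-Python illustration of a sign-based random-projection estimator in the spirit of QJL, not the published implementation. The function names, the projection size, and the test vectors are hypothetical. Note that the sketch keeps one norm per key vector but no per-block scale or zero point.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_projection(m, d, rng):
    """Return an m x d matrix of i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def quantize_key(S, key):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(key, key))


def estimate_inner_product(S, query, signs, key_norm):
    """Estimate <query, key> from a full-precision query and the key's sign bits.

    For Gaussian g, E[<g, q> * sign(<g, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    """
    acc = sum(dot(row, query) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / len(S)


rng = random.Random(0)  # fixed seed so the demo is repeatable
key = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5, 0.0, -0.75]
query = [1.0, 0.5, -0.5, 1.0, 0.0, 0.5, -1.0, 0.25]

S = make_projection(20000, len(key), rng)  # large m only to make the demo tight
signs, key_norm = quantize_key(S, key)

print("exact   :", dot(query, key))
print("estimate:", estimate_inner_product(S, query, signs, key_norm))
```

The estimate is noisy for small projection sizes. The large value used here is only to make the agreement visible and says nothing about the settings used in the paper.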

Improve recall and speed of high-dimensional vector search (approximate nearest neighbor) at scale

TurboQuant achieves higher 1@k recall on datasets such as GloVe (d=200) than state-of-the-art baselines (PQ, RaBitQ), without dataset-specific tuning or large codebooks. It also speeds up index building, which makes it suitable for large-scale AI search engines.
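For readers unfamiliar with the metric, the following sketch shows one common way to compute 1@k recall. It assumes the term means the fraction of queries whose exact nearest neighbor appears among the top k candidates returned by the approximate index. The source does not define it, and the ids below are made up.

```python
def recall_1_at_k(true_nearest, retrieved, k):
    """Fraction of queries whose exact top-1 neighbor is in the top-k retrieved ids."""
    if len(true_nearest) != len(retrieved):
        raise ValueError("need one ground-truth id and one candidate list per query")
    hits = sum(
        1 for truth, candidates in zip(true_nearest, retrieved)
        if truth in candidates[:k]
    )
    return hits / len(true_nearest)


# Hypothetical ground truth (exact nearest neighbor id per query) and
# ranked candidate ids returned by some approximate index.
true_nearest = [7, 2, 9, 4]
retrieved = [[7, 1, 3], [5, 2, 8], [0, 6, 1], [4, 9, 2]]

print(recall_1_at_k(true_nearest, retrieved, k=1))  # 0.5
print(recall_1_at_k(true_nearest, retrieved, k=2))  # 0.75
```

Recall can only stay the same or rise as k grows, which is why comparisons between methods are made at a fixed k.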

Speed up attention computation in large language models running on GPU infrastructure

4-bit TurboQuant achieves up to 8x speedup in computing attention logits over 32-bit unquantized keys on H100 GPUs with negligible runtime overhead, making it a drop-in acceleration technique for transformer inference pipelines.

Drop

Not a fit when

You need a commercially licensed, production-ready software package with vendor support — TurboQuant is a research algorithm with no known commercial release.
Your model requires fine-tuning or training-based quantization — TurboQuant is explicitly training-free and may not suit workflows that depend on quantization-aware training.
You are working with low-dimensional vectors — TurboQuant is designed for high-dimensional vector compression and KV cache scenarios.
You need quantization below 3 bits with no accuracy loss: the paper demonstrates lossless compression only down to 3 bits, and sub-3-bit use cases are not evidenced.
You are not running on GPU hardware such as the H100: the demonstrated 8x speedup was benchmarked on H100 GPUs, and gains on other hardware are not evidenced.
You require dataset-specific tuning or large codebooks for maximum recall — TurboQuant is data-oblivious and does not use dataset-specific codebooks, which may be a limitation for highly specialized retrieval tasks.

Commercials

Pricing

No pricing information available. TurboQuant is a research algorithm published by Google Research (to be presented at ICLR 2026); no commercial product or pricing page exists.