Article · April 3, 2026

TurboQuant in .NET: When Compressing Embeddings Actually Makes Sense

We built and open-sourced a .NET implementation of TurboQuant, a near-optimal vector quantization algorithm. This article covers when it makes sense, when it doesn't, and what we learned working through the trade-offs with a real client.

Outmatic

Engineering Team

When working with semantic search at scale, sooner or later the same problem shows up: memory. Not the model's memory, but the vectors it produces.

A 768-dimension embedding in float32 takes about 3 KB (768 values at 4 bytes each). Multiply that by a million documents and you're looking at roughly 3 GB just for the vector index. At tens of millions of documents, the cost of holding those vectors in memory can start to outweigh the cost of running the model itself.
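The arithmetic behind those figures is easy to reproduce. The snippet below is a minimal sketch that counts only the raw vector payload; the function name is ours, and any per-index overhead (graph links, IDs, metadata) is ignored.

```python
BYTES_PER_FLOAT32 = 4


def index_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw payload size of a dense vector index, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


per_vector = index_bytes(1, 768)           # 3072 bytes, about 3 KB
one_million = index_bytes(1_000_000, 768)  # 3_072_000_000 bytes, about 3 GB
ten_million = index_bytes(10_000_000, 768)

print(per_vector, one_million / 1e9, ten_million / 1e9)
```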

TurboQuant by Google

In April 2025, researchers from Google Research, NYU, and Google DeepMind published a paper introducing TurboQuant, later presented at ICLR 2026. The core idea isn't new: reduce the numerical precision of vectors to save memory. What's new is that TurboQuant gets close to the information-theoretic lower bound, with zero calibration data and no metadata overhead.

In practice, that means 4 bits per coordinate instead of 32: an 8x memory reduction, with near-identical quality on many modern embedding models.
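The 8x figure follows from the bit widths alone. Here is a small sketch of that arithmetic, counting only the packed coordinates; the helper name is hypothetical and chosen for illustration.

```python
def packed_bytes(dims, bits):
    """Bytes needed to store `dims` coordinates at `bits` bits each, rounded up."""
    return (dims * bits + 7) // 8


dims = 768
float32_bytes = packed_bytes(dims, 32)  # 3072
four_bit_bytes = packed_bytes(dims, 4)  # 384
ratio = float32_bytes / four_bit_bytes  # 8.0

# The same helper covers the other bit widths mentioned in this article.
sizes = {bits: packed_bytes(dims, bits) for bits in (2, 3, 4, 16, 32)}
```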

Figure: float32 to 4-bit vector compression, with 8x memory reduction.

The ecosystem responded quickly with implementations in Python, Rust, and C. In .NET, unfortunately, there was nothing.

What we explored

Working with a client that uses Vespa as its search engine with 768-dimension E5 embeddings, we asked ourselves whether it was worth bringing TurboQuant to C#. The short answer: it depends.

For that specific case, no. E5 models show measurable quality loss with aggressive quantization, and Vespa natively supports bfloat16, which halves memory with zero additional complexity. One line in the schema, no new code.
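To make the bfloat16 option concrete: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32, so a value can be converted by keeping the upper 16 bits of its encoding. The sketch below uses plain truncation for simplicity. It illustrates the format only and makes no claim about how Vespa converts values internally; the function names are hypothetical.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bfloat16_bits_to_float32(b):
    """Expand 16 stored bits back to float32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


value = 0.123456789
restored = bfloat16_bits_to_float32(float32_to_bfloat16_bits(value))
relative_error = abs(restored - value) / abs(value)
print(restored, relative_error)
```

With 7 stored mantissa bits, the relative error from truncation stays below 2^-7 (under 1%) for normal values, which is why halving memory this way tends to be a low-risk first step.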

But the question led us to think about when TurboQuant actually makes sense in a .NET architecture:

It makes sense when you keep a vector index in-process: feature stores, semantic caches, batch deduplication over tens of millions of documents. In these scenarios there's no Vespa or external vector store handling compression for you.

It doesn't make sense when a system like Vespa already manages persistence and search. It adds complexity with no tangible benefit.

So we built it

There was no C# implementation of TurboQuant. For anyone working with .NET at vector scale, that meant either bridging to Python (introducing latency and operational complexity) or giving up on the technique entirely.

We built and open-sourced TurboQuant for .NET: a paper-correct implementation of Algorithm 1, published on NuGet.

What's inside:

Paper-correct algorithm: random orthogonal rotation via QR decomposition (default), with optional Hadamard rotation for power-of-2 dimensions
Lloyd-Max codebook computed on the exact Beta(d/2, 1/2) distribution, not a Gaussian approximation
Real bit-packing (2/3/4-bit) with zero-allocation quantization via ArrayPool and Span<T>
LUT-based approximate similarity for 4-bit: cosine similarity directly on packed bytes, zero heap allocation
KV cache compression with asymmetric quantization (e.g. keys at 4-bit, values at 2-bit) and a residual window that keeps recent tokens in full precision
Serialization built into PackedVector for storage and transport
SIMD acceleration via System.Runtime.Intrinsics (AVX2, SSE2, NEON)
.NET 8 and 10, AOT and trimming compatible, zero external dependencies
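Two of the items above, bit-packing and table-based similarity on packed bytes, can be illustrated with a language-agnostic sketch. This is not the library's code or API: the function names, the evenly spaced codebook, and the low-nibble-first layout are all assumptions made for the example.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2 != 0:
        raise ValueError("this sketch expects an even number of codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the 4-bit codes in order."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


# Hypothetical 16-level codebook (evenly spaced, for illustration only).
CODEBOOK = [-1.0 + (2 * i + 1) / 16 for i in range(16)]
# 16 x 16 table holding the product of every pair of levels.
PRODUCT_LUT = [[a * b for b in CODEBOOK] for a in CODEBOOK]


def packed_dot(x, y):
    """Approximate dot product computed directly on packed bytes via the table."""
    total = 0.0
    for bx, by in zip(x, y):
        total += PRODUCT_LUT[bx & 0x0F][by & 0x0F]
        total += PRODUCT_LUT[bx >> 4][by >> 4]
    return total


a = pack_nibbles([1, 15, 0, 7])
b = pack_nibbles([2, 14, 3, 7])
print(a.hex(), unpack_nibbles(a), packed_dot(a, b))
```

The point of the table is that no vector is ever decoded back to floats: each pair of 4-bit codes is resolved with one lookup. Dividing the result by the two vector norms would turn the dot product into a cosine similarity.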

Figure: TurboQuant pipeline (Input Vector, Orthogonal Rotation, Lloyd-Max Codebook, Bit Packing, Packed Output).

We validated the implementation against the paper's theoretical bounds. D_mse matches the Lloyd-Max optimal distortion within 1%, and cosine similarity stays above 0.995 at 4-bit across all tested dimensions. The test suite covers 125 cases including paper validation, end-to-end scenarios, thread safety, and robustness edge cases.
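The rotate, quantize, and compare flow described above can be approximated in a few lines. This is a rough sketch under stated assumptions, not the library's implementation. It uses a small dimension (64) to stay fast in pure Python, builds the rotation with Gram-Schmidt on Gaussian rows (similar in effect to taking the orthogonal factor of a QR decomposition), and fits the codebook to sampled coordinates of random unit vectors rather than to the exact distribution. All names are hypothetical.

```python
import math
import random


def random_rotation(d, rng):
    """Random orthogonal matrix: Gram-Schmidt applied to Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def nearest(codebook, value):
    """Index of the codebook level closest to `value`."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def lloyd_max(samples, levels, iterations=20):
    """1-D Lloyd-Max codebook fitted to empirical samples (k-means in one dimension)."""
    ordered = sorted(samples)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iterations):
        sums, counts = [0.0] * levels, [0] * levels
        for s in ordered:
            k = nearest(centroids, s)
            sums[k] += s
            counts[k] += 1
        centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                     for k in range(levels)]
    return centroids


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
d, bits = 64, 4
rotation = random_rotation(d, rng)

# Training samples: coordinates of random unit vectors in d dimensions.
samples = []
for _ in range(32):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    samples.extend(v / norm for v in g)
codebook = lloyd_max(samples, 2 ** bits)

# A deliberately awkward input: all the energy in one coordinate.
x = [1.0] + [0.0] * (d - 1)
rotated = [sum(r * v for r, v in zip(row, x)) for row in rotation]
codes = [nearest(codebook, v) for v in rotated]
dequantized = [codebook[c] for c in codes]
# Undo the rotation with the transpose of the orthogonal matrix.
restored = [sum(rotation[i][j] * dequantized[i] for i in range(d)) for j in range(d)]
similarity = cosine(x, restored)
print(similarity)
```

The rotation is what makes a single fixed codebook workable: after it, even a one-hot input has coordinates that look like those of a random unit vector, so they fall inside the range the codebook was fitted to.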

What we learned

The most useful takeaway isn't technical. It's methodological: before implementing an optimization technique, it's worth understanding whether the problem it solves is your actual problem.

In our client's case, the answer was native bfloat16: available, free, already in the system. TurboQuant would have been the right answer to a slightly different question.

This kind of reasoning, starting from real context before choosing the technology, is what we try to bring to every project.
