News

Nvidia likely to offload reasoning model KV cache to high-speed SSDs due to SRAM capacity limits

Thursday, December 11, 2025 at 10:17 PM

As reasoning models scale to context lengths exceeding 1 million tokens, storing the KV cache in SRAM becomes impractical due to memory density limits. DeepSeek V3 data indicates that at a 100 million-token context length, the KV cache would require roughly 3.2 TB of storage. This shift may lead Nvidia to offload the cache to high-speed SSDs capable of 100 million IOPS, while Groq's architecture remains competitive for real-time voice latency requirements.

Context

Nvidia is reportedly shifting its hardware strategy to offload the massive Key-Value (KV) cache of long-context reasoning models from limited SRAM to high-speed SSDs. As next-generation models push context lengths toward 1 million tokens and beyond, the cache footprint far exceeds the capacity of current on-chip memory. For example, DeepSeek V3 requires 34.3 KB of storage per token; at a 100 million-token context length, the KV cache alone balloons to roughly 3.2 terabytes, making SRAM-only storage both economically and physically unfeasible. This pivot underscores a critical evolution in the AI supply chain, where ultra-fast storage becomes the primary bottleneck for inference. Nvidia is expected to utilize SSDs capable of 100 million IOPS to bridge this memory gap. While specialized, high-speed architectures like those from Groq remain superior for real-time, low-latency tasks such as voice, Nvidia's integration of high-capacity storage targets the burgeoning market for long-form reasoning. This move signals that future data center demand will increasingly hinge on massive storage throughput to support "deep-thinking" AI models.
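For readers who want to check the arithmetic, the sketch below is a minimal illustration: the 34.3 KB-per-token figure and the 1 million and 100 million token context lengths come from the report above, while the helper names and the use of binary units (KiB/GiB/TiB) are assumptions made for the example, not anything Nvidia or DeepSeek has published.

# Illustrative only: estimate KV-cache footprint from the cited
# per-token figure for DeepSeek V3 (34.3 KB per token).
PER_TOKEN_KV_BYTES = 34.3 * 1024  # assumes 1 KB = 1024 bytes

def kv_cache_bytes(context_tokens: int, per_token_bytes: float = PER_TOKEN_KV_BYTES) -> float:
    """Total KV-cache size in bytes for a given context length."""
    return context_tokens * per_token_bytes

def human_readable(n_bytes: float) -> str:
    """Format a byte count in binary units (KiB, MiB, GiB, TiB, ...)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if n_bytes < 1024:
            return f"{n_bytes:.2f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:.2f} EiB"

if __name__ == "__main__":
    # Context lengths cited in the report: 1 million and 100 million tokens.
    for ctx in (1_000_000, 100_000_000):
        print(f"{ctx:>11,} tokens -> {human_readable(kv_cache_bytes(ctx))}")

Running this reproduces the article's figures: roughly 33 GiB of KV cache at a 1 million-token context and roughly 3.2 TiB at 100 million tokens, which is far beyond what on-chip SRAM can hold.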

Related Companies

Nvidia (NVDA, US)
SK Hynix (000660, KR)