Nvidia triples Llama 3 70B inference performance through software stack optimizations
News

Tuesday, March 10, 2026 at 05:12 PM

Nvidia has achieved a 3x increase in inference performance for Llama 3 70B within 60 days by optimizing its software stack, including TensorRT-LLM (TRT-LLM), MNNVL MoE dispatch, and the Dynamo inference framework, significantly improving infrastructure efficiency.

Context

In a major software-driven performance leap, Nvidia has tripled the inference performance of Meta’s Llama 3 70B model. This optimization was achieved in less than 60 days through critical updates to the TensorRT-LLM stack, incorporating advanced techniques like MNNVL MoE dispatch, optimized collective communications, and the Nvidia Dynamo inference framework. These improvements allow existing hardware to process significantly more tokens per second, effectively lowering the total cost of ownership for AI researchers and enterprise developers. This development underscores the importance of software-hardware co-design in the AI supply chain.

By utilizing speculative decoding and FP8 quantization, the Nvidia HGX H200 platform demonstrated throughput speedups of up to 3.55x for Llama models. For investors, this highlights Nvidia’s ability to extract massive value from current-generation Hopper and next-generation Blackwell architectures without requiring hardware replacements, reinforcing its competitive moat against emerging ASIC and GPU rivals.
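The speculative decoding mentioned above can be illustrated with a toy sketch. Everything here is a stand-in: `target_next` and `draft_next` are hypothetical deterministic next-token functions, not TensorRT-LLM APIs, and the verification loop compares tokens one at a time where a real engine would score all drafted tokens in a single batched target pass. The sketch shows the technique's key property: the output is identical to decoding with the target model alone, but the expensive model is consulted in bursts rather than once per token.

```python
def target_next(seq):
    # Hypothetical "large" model: next token is the sum of the last two mod 10.
    return (seq[-1] + seq[-2]) % 10

def draft_next(seq):
    # Hypothetical cheap draft model: agrees with the target except
    # whenever the last token is 0, where it guesses wrong on purpose.
    return 1 if seq[-1] == 0 else (seq[-1] + seq[-2]) % 10

def speculative_decode(seq, n_tokens, k=4):
    """Generate n_tokens: draft k tokens cheaply, verify against the
    target, accept the matching prefix, and patch the first mismatch
    with the target's own token before re-drafting."""
    seq = list(seq)
    accepted = 0
    while accepted < n_tokens:
        # 1. Draft up to k candidate tokens with the cheap model.
        draft, work = [], list(seq)
        for _ in range(min(k, n_tokens - accepted)):
            t = draft_next(work)
            draft.append(t)
            work.append(t)
        # 2. Verify. A real engine scores all k drafts in one batched
        #    target pass; here we just compare token by token.
        for t in draft:
            correct = target_next(seq)
            seq.append(correct)
            accepted += 1
            if correct != t:
                break  # mismatch: discard the rest, re-draft from here
    return seq

print(speculative_decode([3, 5], 8))  # → [3, 5, 8, 3, 1, 4, 5, 9, 4, 3]
```

Because every accepted token equals the target model's own choice, throughput improves without changing the generated text, which is why the technique raises tokens per second on unchanged hardware.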
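FP8 quantization, the other technique the article names, can likewise be sketched. This is a toy simulation of per-tensor scaling onto an E4M3-style grid (4 significand bits, largest finite value 448), not Nvidia's implementation; subnormal and saturation handling are deliberately ignored.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format

def quantize_fp8_e4m3(xs):
    """Toy per-tensor FP8-style quantization: choose one scale so the
    tensor's absmax lands on the FP8 max, then round every scaled value
    to 4 significant bits (E4M3 keeps a 4-bit significand).
    Returns (values on the simulated FP8 grid, scale); multiplying them
    back together dequantizes."""
    scale = max(abs(x) for x in xs) / FP8_E4M3_MAX
    out = []
    for x in xs:
        y = x / scale
        if y == 0.0:
            out.append(0.0)
            continue
        m, e = math.frexp(y)       # y = m * 2**e with 0.5 <= |m| < 1
        m = round(m * 16) / 16     # keep 4 significand bits
        out.append(math.ldexp(m, e))
    return out, scale
```

Halving weight and activation precision this way roughly doubles effective memory bandwidth and matrix-math throughput on FP8-capable hardware, at the cost of the bounded rounding error visible above, which is one lever behind the reported speedups.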

Related Companies

Nvidia (NVDA, US)