In a recent development that is making waves in the world of artificial intelligence, NVIDIA’s Blackwell platform has achieved significant milestones in the MLPerf Inference V5.0 benchmarks. This achievement is noteworthy as it represents NVIDIA’s inaugural submission using the NVIDIA GB200 NVL72 system – a robust, rack-scale solution meticulously crafted for advanced AI reasoning.
The modern era of AI innovation demands a new type of computational infrastructure, aptly termed “AI factories.” These facilities transcend the traditional roles of data centers, which primarily focus on data storage and processing. Instead, AI factories are engineered to generate intelligence on a grand scale by converting raw data into actionable, real-time insights. The primary objective of these AI factories is clear-cut: provide users with accurate and rapid responses to their queries, all while minimizing costs and maximizing user reach.
Behind the scenes, the complexity of these operations is immense. As AI models expand to billions and even trillions of parameters, the computational power needed to generate each “token” – a unit of processed output – increases. This rising per-token compute demand reduces the number of tokens an AI factory can produce in a given time and drives up the cost per token. To maintain high inference throughput and reduce token costs, innovation must occur swiftly across every layer of the technology stack, from silicon and network systems to software.
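The relationship between throughput and cost per token can be made concrete with a small back-of-the-envelope sketch. The function and all numbers below are hypothetical illustrations, not NVIDIA benchmark figures: at a fixed hardware cost, doubling token throughput halves the cost per token.

```python
# Illustrative sketch: how sustained throughput drives cost per token.
# All prices and throughput numbers here are hypothetical examples.

def cost_per_million_tokens(gpu_hour_cost_usd: float,
                            num_gpus: int,
                            tokens_per_second: float) -> float:
    """Cost to generate one million tokens at a sustained cluster throughput."""
    cluster_cost_per_second = gpu_hour_cost_usd * num_gpus / 3600.0
    seconds_per_million = 1_000_000 / tokens_per_second
    return cluster_cost_per_second * seconds_per_million

# Same hardware budget, twice the throughput -> half the cost per token.
base = cost_per_million_tokens(gpu_hour_cost_usd=2.0, num_gpus=8,
                               tokens_per_second=5_000.0)
fast = cost_per_million_tokens(gpu_hour_cost_usd=2.0, num_gpus=8,
                               tokens_per_second=10_000.0)
print(base, fast)
```

This is why the article frames per-GPU performance gains and larger NVLink domains as directly improving AI factory economics: every throughput multiple divides the cost per token by the same factor.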
The latest updates to the MLPerf Inference benchmarks now include Llama 3.1 405B, one of the largest and most challenging models to run with open weights. Additionally, the new Llama 2 70B Interactive benchmark introduces stricter latency requirements compared to its predecessor, aligning more closely with the demands of production environments to enhance user experience.
Beyond the Blackwell platform, NVIDIA’s Hopper platform has also demonstrated remarkable performance enhancements. Over the past year, performance on the Llama 2 70B benchmark has seen significant improvements, thanks to comprehensive optimizations throughout the technology stack.
### NVIDIA Blackwell: A New Benchmark Standard
The NVIDIA GB200 NVL72 system, a powerhouse connecting 72 NVIDIA Blackwell GPUs to function as a singular, colossal GPU, has achieved up to 30 times higher throughput on the Llama 3.1 405B benchmark compared to the previous NVIDIA H200 NVL8 submission. This remarkable feat was accomplished by more than tripling the performance per GPU and expanding the NVIDIA NVLink interconnect domain by ninefold.
While numerous companies utilize MLPerf benchmarks to assess hardware performance, only NVIDIA and its collaborators have submitted and published results for the Llama 3.1 405B benchmark. This highlights NVIDIA’s commitment to pushing the envelope in AI performance.
In production inference deployments, latency constraints are critical and affect two main metrics. The first is “time to first token” (TTFT), which measures how quickly a user starts seeing responses to their queries when interacting with large language models. The second metric is “time per output token” (TPOT), which assesses how rapidly tokens are delivered to users.
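The two latency metrics can be illustrated with a minimal sketch. The function below and its sample timestamps are hypothetical, not MLPerf's official measurement methodology: TTFT is the delay from request to the first generated token, and TPOT is the average interval between subsequent tokens.

```python
# Minimal sketch of computing TTFT and TPOT from per-token timestamps.
# The timestamps are invented for illustration; MLPerf defines these
# metrics precisely in its benchmark rules.

def ttft_and_tpot(request_time: float,
                  token_times: list[float]) -> tuple[float, float]:
    """Return (time to first token, average time per output token)."""
    ttft = token_times[0] - request_time
    # TPOT: mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

request = 0.0
arrivals = [0.45, 0.49, 0.53, 0.57, 0.61]  # token timestamps in seconds
ttft, tpot = ttft_and_tpot(request, arrivals)
print(ttft, tpot)  # roughly 0.45 s to first token, 0.04 s per token
```

Tightening these two numbers, as the Llama 2 70B Interactive benchmark does, is what makes a deployment feel responsive rather than merely high-throughput.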
The new Llama 2 70B Interactive benchmark introduces a fivefold reduction in TPOT and a 4.4 times decrease in TTFT, fostering a more responsive user experience. NVIDIA’s submission using an NVIDIA DGX B200 system equipped with eight Blackwell GPUs has tripled its performance compared to systems using eight NVIDIA H200 GPUs. This sets a high standard for the more demanding version of the Llama 2 70B benchmark.
By combining the Blackwell architecture with its optimized software stack, NVIDIA is ushering in a new era of inference performance. This development paves the way for AI factories to deliver heightened intelligence, increased throughput, and accelerated token rates, thereby enhancing overall efficiency.
### The Increasing Value of NVIDIA Hopper AI Factories
Introduced in 2022, the NVIDIA Hopper architecture has become a cornerstone for many of today’s AI inference factories and continues to play a crucial role in model training. Through relentless software optimization, NVIDIA has managed to bolster the throughput of Hopper-based AI factories, thereby enhancing their value.
On the Llama 2 70B benchmark, first introduced a year ago as part of MLPerf Inference v4.0, the throughput of H100 GPUs has increased by 1.5 times. The H200 GPU, which is built on the same Hopper architecture but boasts larger and faster memory, extends this improvement to 1.6 times.
The Hopper architecture has been instrumental in running every benchmark, including the newly incorporated Llama 3.1 405B, Llama 2 70B Interactive, and graph neural network tests. This versatility ensures that Hopper can handle a broad array of workloads and keep pace with the evolving challenges of expanding models and usage scenarios.
### A Collaborative Ecosystem
In the latest MLPerf round, 15 partners – including ASUS, Cisco, CoreWeave, Dell Technologies, Fujitsu, and Google Cloud – submitted impressive results on the NVIDIA platform. This breadth of submissions underscores the extensive reach of the NVIDIA platform, which is available through all major cloud service providers and server manufacturers globally.
The ongoing efforts by MLCommons to evolve the MLPerf Inference benchmark suite ensure that it keeps pace with the latest advancements in AI, providing the ecosystem with rigorous, peer-reviewed performance data. This is crucial in helping IT decision-makers select optimal AI infrastructure.
For those interested in delving deeper into the specifics of MLPerf, further information is available on NVIDIA’s official resources.
This remarkable achievement and the collaborative efforts behind it illustrate the dynamic and evolving landscape of AI technology, where innovation is continuously pushing the boundaries of what is possible.
Image and video documentation for this development were captured at an Equinix data center in Silicon Valley.
For more information, refer to this article.