NVIDIA to Showcase Cutting-Edge Innovations at Hot Chips 2024
The annual Hot Chips conference, a premier event for processor and system architects from industry and academia, has become a pivotal forum for the trillion-dollar data center computing market. Scheduled for August 25-27 at Stanford University and online, Hot Chips 2024 promises to be a landmark event, with NVIDIA taking center stage to present its latest advancements in data center technology.
NVIDIA’s Blackwell Platform: A Leap Forward in AI Computing
At the forefront of NVIDIA’s presentations will be a deep dive into the NVIDIA Blackwell platform. Senior NVIDIA engineers will provide an in-depth look at this state-of-the-art technology, which integrates multiple chips, systems, and NVIDIA CUDA software to drive the next generation of artificial intelligence (AI) across industries and applications worldwide.
The Blackwell platform is a comprehensive solution that incorporates several advanced NVIDIA components, including the Blackwell GPU, Grace CPU, BlueField data processing unit, ConnectX network interface card, NVLink Switch, Spectrum Ethernet switch, and Quantum InfiniBand switch. This full-stack approach aims to push the boundaries of AI and accelerated computing performance while significantly enhancing energy efficiency.
Revolutionizing AI System Design with NVIDIA GB200 NVL72
One of the standout features of the Blackwell platform is the NVIDIA GB200 NVL72, a multi-node, liquid-cooled, rack-scale solution that connects 72 Blackwell GPUs and 36 Grace CPUs. This innovative system is designed to meet the demanding requirements of AI system design, offering unprecedented performance and efficiency.
The GB200 NVL72 acts as a unified system to deliver up to 30 times faster inference for large language model (LLM) workloads. This capability is crucial for running trillion-parameter models in real time, making it a game-changer for AI research and applications.
NVLink: Enabling High-Performance GPU Communication
A key component of the Blackwell platform is the NVLink interconnect technology, which facilitates all-to-all GPU communication. This technology enables record high throughput and low-latency inference for generative AI, making it possible to achieve new levels of performance and efficiency in AI computing.
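The all-to-all communication pattern that NVLink accelerates can be illustrated with a toy sketch. This is plain Python with no GPUs involved, not NVIDIA code: each of N workers holds a distinct chunk destined for every peer, and after the exchange every worker holds exactly one chunk from each peer.

```python
# Toy illustration of an all-to-all exchange among N workers.
# This sketches the communication pattern NVLink accelerates in
# hardware; it is not NVIDIA code and involves no actual GPUs.

def all_to_all(send_buffers):
    """send_buffers[i][j] is the chunk worker i sends to worker j.
    Returns recv_buffers where recv_buffers[j][i] is that same chunk,
    i.e. every worker ends up with one chunk from every peer."""
    n = len(send_buffers)
    return [[send_buffers[i][j] for i in range(n)] for j in range(n)]

# Example: 4 workers, chunk "i->j" travels from worker i to worker j.
n = 4
send = [[f"{i}->{j}" for j in range(n)] for i in range(n)]
recv = all_to_all(send)
print(recv[2])  # chunks received by worker 2
```

In a real system each chunk crosses the interconnect, so the pattern's cost scales with the slowest pairwise link, which is why a high-bandwidth, low-latency fabric matters for inference throughput.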
The NVIDIA Quasar Quantization System
Another highlight of the Blackwell platform is the NVIDIA Quasar Quantization System. This system brings together algorithmic innovations, NVIDIA software libraries, and tools, along with Blackwell’s second-generation Transformer Engine, to support high accuracy on low-precision models. This capability is particularly beneficial for applications involving LLMs and visual generative AI.
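As a rough illustration of the idea behind low-precision inference, here is a generic symmetric int8 quantization sketch. This is not the Quasar Quantization System itself, which targets formats such as FP8 with additional algorithmic machinery; it only shows the basic trade: scale values into a small integer range, scale back at use time, and accept a bounded error in exchange for large memory and bandwidth savings.

```python
# Generic symmetric int8 quantization sketch -- illustrative only, not
# NVIDIA's Quasar Quantization System (which works with low-precision
# formats like FP8 plus algorithmic and library support).

def quantize(values, bits=8):
    """Map floats to signed integers with a shared per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values."""
    return [x * scale for x in q]

weights = [0.42, -1.31, 0.07, 0.98, -0.55]
q, scale = quantize(weights)
approx = dequantize(q, scale)
max_err = max(abs(w - a) for w, a in zip(weights, approx))
# Rounding guarantees the error stays within half a quantization step.
assert max_err <= scale / 2 + 1e-12
```

The engineering challenge the talk addresses is keeping model accuracy high as the bit width shrinks, which is where the algorithmic innovations and the second-generation Transformer Engine come in.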
Liquid Cooling: The Future of Data Center Efficiency
As data centers continue to evolve, the need for more efficient and sustainable cooling solutions becomes increasingly important. Traditional air-cooled data centers are gradually being replaced by hybrid cooling systems that combine air and liquid cooling for optimal performance.
Liquid cooling techniques are more effective at dissipating heat than air cooling, allowing systems to stay cool even under heavy workloads. This method also reduces the physical footprint and power consumption of cooling equipment, enabling data centers to accommodate more server racks and, consequently, more computing power.
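A back-of-envelope calculation shows why liquid is so much more effective: water's volumetric heat capacity (density times specific heat) is over 3,000 times that of air, so the same coolant flow removes vastly more heat per degree of temperature rise. The property values below are standard room-temperature textbook figures, not NVIDIA data.

```python
# Back-of-envelope comparison of heat removal per unit volumetric flow:
# Q = rho * V_dot * c_p * dT  (steady-state sensible heat transfer).
# Property values are standard room-temperature textbook figures.

RHO_AIR, CP_AIR = 1.2, 1005        # kg/m^3, J/(kg*K)
RHO_WATER, CP_WATER = 997.0, 4186  # kg/m^3, J/(kg*K)

def heat_removed_watts(rho, cp, flow_m3_per_s, delta_t_k):
    """Heat carried away by a coolant stream, in watts."""
    return rho * flow_m3_per_s * cp * delta_t_k

flow = 0.001   # 1 liter per second of coolant
dT = 10.0      # 10 K coolant temperature rise

q_air = heat_removed_watts(RHO_AIR, CP_AIR, flow, dT)
q_water = heat_removed_watts(RHO_WATER, CP_WATER, flow, dT)
print(f"air:   {q_air:.1f} W")
print(f"water: {q_water:.1f} W")
print(f"ratio: {q_water / q_air:.0f}x")  # water carries thousands of times more heat
```

This gap is why dense racks like the GB200 NVL72 are liquid-cooled: air simply cannot carry the heat away at practical flow rates.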
Ali Heydari, director of data center cooling and infrastructure at NVIDIA, will present several designs for hybrid-cooled data centers. These designs range from retrofitting existing air-cooled data centers with liquid-cooling units to installing piping for direct-to-chip liquid cooling and fully submerging servers in immersion cooling tanks. Although these options require a larger initial investment, they offer significant long-term savings in energy consumption and operational costs.
Heydari’s team is also involved in the COOLERCHIPS program, a U.S. Department of Energy initiative to develop advanced data center cooling technologies. Using the NVIDIA Omniverse platform, the team creates physics-informed digital twins to model energy consumption and cooling efficiency, optimizing data center designs for maximum performance and sustainability.
AI Models: Enhancing Processor Design
Designing cutting-edge processors is a complex task that requires fitting as much computing power as possible onto a small piece of silicon. AI models are playing a crucial role in this process by improving design quality and productivity, automating time-consuming tasks, and assisting with optimization and prediction.
Mark Ren, director of design automation research at NVIDIA, will provide an overview of these AI models and their applications in processor design. He will also discuss agent-based AI systems that can complete tasks autonomously, an approach with broad applications across industries.
In microprocessor design, NVIDIA researchers are developing agent-based systems that can reason and take action using customized circuit design tools. These systems can interact with experienced designers, learn from a database of human and agent experiences, and significantly enhance the efficiency and accuracy of the design process.
Ren will share examples of how engineers can use AI agents for tasks such as timing report analysis, cell cluster optimization, and code generation. Notably, the cell cluster optimization work recently won the best paper award at the first IEEE International Workshop on LLM-Aided Design.
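To make the timing-analysis task concrete, here is a miniature example of the kind of chore such an agent might automate: scanning a static-timing report for paths with negative slack. The report format here is invented for illustration; it is not the output of any real EDA tool or NVIDIA system.

```python
# Miniature illustration of a task an AI agent might automate:
# scanning a static-timing report for violating (negative-slack) paths.
# The report format below is invented for illustration only.

REPORT = """\
path: clk_core -> alu/out      slack:  0.12 ns
path: clk_core -> mul/out      slack: -0.35 ns
path: clk_mem  -> cache/tag    slack: -0.08 ns
path: clk_mem  -> cache/data   slack:  0.41 ns
"""

def violating_paths(report):
    """Return (path, slack_ns) pairs with negative slack, worst first."""
    rows = []
    for line in report.splitlines():
        name = line.split("path:")[1].split("slack:")[0].strip()
        slack = float(line.split("slack:")[1].replace("ns", "").strip())
        if slack < 0:
            rows.append((name, slack))
    return sorted(rows, key=lambda r: r[1])

for name, slack in violating_paths(REPORT):
    print(f"{name}: {slack} ns")
```

An agent-based system would go further: reasoning about why each path fails, invoking circuit design tools to try fixes, and learning from the outcomes, which is what distinguishes it from a simple script like this.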
Conclusion: A Glimpse into the Future of Data Center Computing
The presentations at Hot Chips 2024 will showcase the innovative ways in which NVIDIA engineers are pushing the boundaries of data center computing and design. From the advanced capabilities of the Blackwell platform to the efficiency of hybrid cooling solutions and the transformative potential of AI models in processor design, NVIDIA is setting new standards for performance, efficiency, and optimization in the industry.
These advancements not only promise to revolutionize AI computing but also pave the way for more sustainable and efficient data centers, addressing the growing demands of the digital age. As we look forward to Hot Chips 2024, it is clear that NVIDIA’s contributions will play a significant role in shaping the future of data center technology.
For those interested in learning more about these developments, registration for Hot Chips 2024 is available online, with attendance options both in person at Stanford University and virtually. Don’t miss the opportunity to gain insights from leading experts and witness the future of data center computing unfold.
Stay tuned to our blog for more updates on the latest tech innovations.