In a significant development, Cerebras Systems has announced a major update to its Cerebras Inference platform, marking the most substantial enhancement since its inception. This update enables Cerebras Inference to run the Llama 3.1-70B model at an impressive speed of 2,100 tokens per second, which represents a threefold increase in performance compared to previous versions. To put this into perspective, this advancement makes Cerebras Inference:
– 16 times faster than the most efficient GPU solutions available.
– 8 times faster than GPUs running Llama 3.1-3B, a model 23 times smaller.
– Comparable to a full generation upgrade in GPU hardware performance, such as moving from the A100 to the H100, achieved through a single software update.
The significance of fast inference cannot be overstated, as it is the cornerstone for the development of the next generation of artificial intelligence (AI) applications. Whether it’s voice recognition, video processing, or complex problem-solving, rapid inference is the key to creating responsive and intelligent applications that were previously unattainable. Industry pioneers like Tavus, who are revolutionizing video generation, and GSK, who are enhancing drug discovery processes, are already leveraging Cerebras Inference to push the limits of innovation. To experience the capabilities of Cerebras Inference firsthand, users can access it via chat or API at inference.cerebras.ai.
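For developers who would rather start from the API, the following is a minimal sketch of a streaming request. It assumes the OpenAI-compatible chat-completions interface that Cerebras Inference exposes; the base URL, model identifier, and environment variable shown here are illustrative assumptions and should be checked against the current Cerebras documentation.

```python
# Minimal sketch of a streaming request to Cerebras Inference.
# Assumes an OpenAI-compatible chat-completions endpoint; the base URL,
# model name, and env var below are illustrative, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # your Cerebras Inference API key
)

stream = client.chat.completions.create(
    model="llama3.1-70b",  # assumed identifier for Llama 3.1-70B
    messages=[{"role": "user", "content": "Explain wafer-scale inference in one sentence."}],
    stream=True,           # stream tokens as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```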
### Benchmarking the Performance
The performance claims of Cerebras Inference have been rigorously evaluated by Artificial Analysis, an independent benchmarking organization. The results of these evaluations are impressive. In terms of output speed per user, Cerebras Inference stands out as:
– 16 times faster than even the most optimized GPU solutions.
– 68 times quicker than large-scale cloud platforms.
– 4 to 8 times faster than other AI accelerators.
One of the key metrics for real-time applications is the time to the first token. Cerebras Inference ranks second in this area, demonstrating the advantages of a wafer-scale integrated solution over more complex, networked alternatives. Additionally, the total response time—which measures the complete cycle of input and output—is a crucial factor for multi-step workflows. In this regard, Cerebras Inference completes a request in just 0.4 seconds, compared to the 1.1 to 4.2 seconds needed by GPU-based solutions. For AI agents, this translates to accomplishing up to ten times more work within the same time frame, and for reasoning models, it allows for ten times more reasoning steps without extending the response time.
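The arithmetic behind the agent claim is straightforward: when an agent chains sequential requests, end-to-end latency grows linearly with per-request response time, so cutting each request from seconds to fractions of a second multiplies the work an agent can complete in a fixed window. The short sketch below works through that calculation using the response times cited above; the ten-step workflow is an illustrative assumption.

```python
# Sketch: how per-request response time compounds across a multi-step agent workflow.
# The 0.4 s and 1.1-4.2 s figures come from the benchmark discussion above;
# the 10-step workflow is purely illustrative.
STEPS = 10  # hypothetical number of sequential LLM calls in an agent loop

response_times = {
    "Cerebras Inference": 0.4,
    "GPU (fastest cited)": 1.1,
    "GPU (slowest cited)": 4.2,
}

for name, per_request in response_times.items():
    total = STEPS * per_request
    print(f"{name:22s}: {per_request:.1f} s/request -> {total:.1f} s for {STEPS} steps")
```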
Cerebras Inference running the Llama 3.1-70B model is so fast that it surpasses the speed of GPU-based systems running the much smaller Llama 3.1-3B model. The Wafer Scale Engine executes an AI model 23 times larger at 8 times the speed, resulting in a combined performance advantage of 184 times.
### Enhancements in Kernels, Stack, and Machine Learning
The initial release of Cerebras Inference in August set new records for speed and made the Llama 3.1-70B model an instantaneous experience. While it was already exceptionally fast, it utilized only a fraction of the Wafer Scale Engine’s peak bandwidth, computing power, and input/output capacity. The latest release is the culmination of numerous improvements in software, hardware, and machine learning (ML), enhancing both the utilization and real-world performance of Cerebras Inference.
Several critical kernels have been rewritten or optimized, including those for matrix multiplication (MatMul), reduce/broadcast operations, element-wise operations, and activations. The input/output processes have been streamlined to run asynchronously alongside computation. Additionally, this release incorporates speculative decoding, a widely used technique that employs a small model in conjunction with a large model to generate responses more quickly. As a result, users may notice greater variance in output speed; fluctuations of roughly 20% above or below the average of 2,100 tokens per second are expected.
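Speculative decoding, as commonly described, has a small draft model cheaply propose a few tokens ahead and the large target model verify them, so several tokens can be emitted for roughly the cost of one large-model step. The sketch below shows a simplified greedy variant of that idea; the `draft_model` and `target_model` objects are hypothetical, and this is not a description of Cerebras's actual implementation.

```python
# Simplified, greedy sketch of speculative decoding (not Cerebras's implementation).
# `draft_model` and `target_model` are hypothetical objects exposing a
# `next_token(tokens)` method that returns the greedy next-token id.

def speculative_decode(draft_model, target_model, prompt_tokens, n_draft=4, max_new=256):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        # 1. Draft model cheaply proposes a short run of tokens.
        proposed, ctx = [], list(tokens)
        for _ in range(n_draft):
            t = draft_model.next_token(ctx)
            proposed.append(t)
            ctx.append(t)

        # 2. Target model verifies the proposals; in a real system this is a
        #    single batched forward pass rather than a Python loop.
        accepted, ctx = 0, list(tokens)
        for t in proposed:
            if target_model.next_token(ctx) == t:
                ctx.append(t)
                accepted += 1
            else:
                break

        # 3. Keep the accepted prefix, then take one token from the target
        #    model so progress is guaranteed even when nothing is accepted.
        tokens.extend(proposed[:accepted])
        tokens.append(target_model.next_token(tokens))
    return tokens
```

Because the number of draft tokens accepted varies from one step to the next, throughput naturally fluctuates around its average, which is why the roughly 20% variance noted above is expected behavior.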
Despite these enhancements, model precision remains unchanged: all models continue to use the original 16-bit weights, and Artificial Analysis has verified that output accuracy is unaffected.
### The Power of Fast Inference
The unparalleled speed of Cerebras Inference is already transforming how organizations develop and deploy AI applications. In the pharmaceutical industry, Kim Branson, Senior Vice President of AI and Machine Learning at GSK, remarked, “With the speed of Cerebras Inference, GSK is developing innovative AI applications, such as intelligent research agents, that will fundamentally enhance the productivity of our researchers and the drug discovery process.”
The dramatic improvement in speed is a game changer for real-time AI applications, exemplified by LiveKit, which powers ChatGPT’s voice mode. As CEO Russ d’Sa notes, “When building voice AI, inference is typically the slowest step in the pipeline. With Cerebras Inference, it’s now the fastest. A complete pass through a pipeline that includes cloud-based speech-to-text, 70-billion-parameter inference using Cerebras Inference, and text-to-speech, is faster than just the inference step alone on other providers. This is a groundbreaking development for developers building voice AI that can respond with human-level speed and precision.”
Fast inference is a critical enabler for next-generation AI applications that use more test-time compute to improve model capability. As demonstrated by models like OpenAI's o1, the ability to carry out extensive chain-of-thought reasoning translates directly into significant improvements on reasoning, coding, and research tasks. With Cerebras Inference, models can think more deeply before responding without the usual delay of several minutes, making it an ideal platform for developers building systems that offer both greater runtime intelligence and responsive user experiences.
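To make that latency budget concrete, the sketch below estimates how long a lengthy chain of thought takes at the output speeds discussed in this article; the 4,000-token reasoning budget is an illustrative assumption, and the GPU and cloud rates are simply derived from the 16x and 68x comparisons cited earlier.

```python
# Sketch: time to generate a long chain of thought at the speeds cited above.
# 2,100 tokens/s is from this announcement; the GPU and hyperscale figures are
# derived from the 16x and 68x speedups cited earlier; the 4,000-token
# reasoning budget is an illustrative assumption.
REASONING_TOKENS = 4_000

speeds = {
    "Cerebras Inference": 2_100,
    "Fastest GPU solution (2,100 / 16)": 2_100 / 16,
    "Hyperscale cloud (2,100 / 68)": 2_100 / 68,
}

for name, tok_per_s in speeds.items():
    print(f"{name:34s}: {REASONING_TOKENS / tok_per_s:6.1f} s")
```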
### Conclusion
The latest update, with its threefold performance improvement, showcases the potential of the Wafer Scale Engine for inference tasks. Delivering 2,100 tokens per second for the Llama 3.1-70B model, Cerebras has achieved a performance leap equivalent to more than a full generation of hardware advancements in a single software release. The team continues to work on optimizing both software and hardware capabilities, with plans to expand model selection, context lengths, and API features in the near future. For more information, you can visit the Cerebras website.