Meta Launches Llama API, Achieving 18x Speed Increase Over OpenAI: Cerebras Collaboration Yields 2,600 Tokens Per Second

Meta Partners with Cerebras to Launch Llama API
Introduction to the Partnership
Meta has recently teamed up with Cerebras Systems to launch a new product called the Llama API. This collaboration was announced during Meta’s inaugural LlamaCon developer conference held in Menlo Park, California. With this new API, developers will have access to inference speeds that can be as much as 18 times faster than traditional GPU-based solutions.
Competing in the AI Market
This partnership positions Meta to compete directly with major players in the AI market, such as OpenAI, Anthropic, and Google. As demand for AI inference services grows, developers are eager to purchase tokens in huge quantities to power their applications, and the ability to offer rapid inference is crucial in this competitive landscape.
Quotes from Industry Leaders
Julie Shin Choi, the Chief Marketing Officer at Cerebras, expressed excitement about this strategic partnership. She stated, "Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API." Additionally, James Wang, a senior executive at Cerebras, noted the transformation of Meta’s Llama models into a commercial service, marking a significant shift in their business strategy.
The Significance of Llama Model Downloads
Meta’s Llama models have seen impressive engagement, boasting over one billion downloads. Until this partnership, however, Meta lacked a first-party cloud infrastructure for developers. This move not only solidifies their role as AI model developers but also brings a new revenue stream through AI computation.
Speed Advantages: Breaking Performance Barriers
One of the standout features of the Llama API is the speed at which it operates, enabled by Cerebras’ specialized AI chips. The system reportedly delivers over 2,600 tokens per second with the Llama 4 Scout model, far surpassing competitors like ChatGPT and DeepSeek.
- Speed comparisons:
  - Llama 4 Scout (via Cerebras): 2,600 tokens per second
  - ChatGPT: 130 tokens per second
  - DeepSeek: 25 tokens per second
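The practical impact of these throughput figures can be shown with simple arithmetic. The sketch below uses only the rates reported above; the model names and a flat 1,000-token response are illustrative, and real-world latency would also include prompt processing and network overhead.

```python
# Reported decode throughputs (tokens per second), per the article's figures.
# Actual end-to-end latency also depends on prompt processing and networking.
throughput = {
    "Llama 4 Scout (Cerebras)": 2600,
    "ChatGPT": 130,
    "DeepSeek": 25,
}

def generation_time(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second

for name, tps in throughput.items():
    print(f"{name}: {generation_time(1000, tps):.1f}s for 1,000 tokens")
```

At these rates, a 1,000-token response streams in well under a second on the Cerebras-backed service, versus several seconds to nearly a minute at the slower rates.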
Wang highlights that traditional GPU-based services for models like Gemini and GPT run at roughly 100 tokens per second, a rate that can hinder performance in reasoning tasks and other complex, multi-step interactions.
New Applications Enabled by Speed
The impressive speed of the Llama API opens doors to innovative applications that were previously impractical, such as:
- Real-time agents
- Interactive voice systems
- Code generation
- Instant multi-step reasoning
These advancements require chaining multiple large language model calls, which can now be completed in just seconds, enhancing user experience significantly.
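Why chaining amplifies the speed advantage can be sketched with a back-of-the-envelope calculation: in a sequential agent loop, each call must finish before the next begins, so per-call latency multiplies. The step count and tokens-per-step below are hypothetical; the two decode rates are the article's figures for Cerebras-class and typical GPU-based serving.

```python
# Hypothetical agent loop: several sequentially dependent LLM calls,
# each producing ~500 tokens. Decode rates (tokens/second) are the
# article's figures; step counts are illustrative.
def chain_latency(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total decode time when each call must complete before the next starts."""
    return steps * tokens_per_step / tokens_per_second

fast = chain_latency(5, 500, 2600)  # Cerebras-class throughput
slow = chain_latency(5, 500, 100)   # typical GPU throughput, per Wang
print(f"fast: {fast:.2f}s, slow: {slow:.2f}s")
```

Under these assumptions a five-step chain drops from roughly 25 seconds to about one second, which is the difference between a batch-style workflow and an interactive one.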
Transition to a Full-Service AI Provider
By introducing the Llama API, Meta is transitioning from being solely a model provider to offering full-service AI infrastructure. This transition allows them to tap into the commercial potential of their AI investments while still maintaining a focus on open models. The API will provide tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model.
Importantly, Meta assures developers that customer data will not be used for training its own models, and users can transfer their models to other hosts, preserving their freedom to choose.
Infrastructure Support from Cerebras
Cerebras will support Meta’s new service through its network of data centers across North America, including facilities in Dallas and Montreal, which will balance workloads to maintain service delivery. Choi noted that all data centers currently serving the API are located in North America.
Future Directions: Multiple High-Performance Options
Besides its partnership with Cerebras, Meta has announced a collaboration with Groq to provide additional fast inference options, giving developers more high-performance alternatives beyond conventional GPU-based solutions.
Competitive Positioning in the AI Space
Meta’s entry into the inference API market, armed with high-performance metrics, could reshape the competitive landscape currently dominated by OpenAI, Google, and Anthropic. With a user base of around three billion and advanced data centers, Meta is positioning itself as a formidable force in the commercial AI field.
This focus on speed through specialized silicon may signal a new era in AI, one in which how quickly a model delivers results is as critical as what the model knows. The solutions offered by Meta and Cerebras hint at developments in AI applications that could redefine industry standards.