Running the DeepSeek-R1 671B Model at FP16 Alongside Virtualized Workloads – Part 2 of 3

Performance of a Dual AMD EPYC 9965 Server with DeepSeek-R1 671B FP16
Introduction to the System
When exploring high-performance CPU inference for large AI models, a dual AMD EPYC 9965 server running the DeepSeek-R1 671B FP16 model is an intriguing combination, but it is important to set realistic expectations for token generation speed. Out of the box, the system achieves roughly 1.4 tokens per second: not fast, but workable for applications that can tolerate long processing times.
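To put that figure in context, a quick back-of-the-envelope calculation shows what 1.4 tokens per second means for response latency. The 500- and 2,000-token response lengths below are illustrative assumptions, not measurements from this test:

```python
# Rough response-time arithmetic at a fixed generation rate. The
# response lengths are illustrative assumptions, not measured values.

def response_time_minutes(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes to generate `tokens` at `tokens_per_second`."""
    return tokens / tokens_per_second / 60

for tokens in (500, 2000):
    minutes = response_time_minutes(tokens, 1.4)
    print(f"{tokens} tokens at 1.4 tok/s ~ {minutes:.1f} minutes")
# 500 tokens  -> ~6 minutes; 2000 tokens -> ~24 minutes
```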
Deep Model Loading and Performance Insights
Loading the 1.27TB DeepSeek-R1 model into memory takes a few minutes, and changing most of its default settings forces a full reload, so each tuning iteration carries that cost. Here are some performance highlights at the initial settings:
- Default Performance: Approximately 1.4 tokens/second.
- Reasoning Overhead: As a reasoning model, R1 generates a lengthy chain-of-thought phase before its final answer, which increases the total tokens produced per response and adds substantial computational work.
The tests used the Ollama interface, which loads the model into memory and spreads inference across the available CPU cores.
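For reference, here is a minimal sketch of driving such a deployment through Ollama's REST API from Python. It assumes an `ollama serve` instance on the default port and that the model has already been pulled; the model tag shown is a hypothetical name and may differ from the tag used in testing:

```python
# Minimal sketch: query a locally served model via Ollama's REST API.
# Assumes `ollama serve` is running on the default port; the model
# tag below is a hypothetical name for the 671B FP16 build.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:671b-fp16",  # hypothetical tag
        "prompt": "Summarize the benefits of CPU inference.",
        "stream": False,
    },
    timeout=3600,  # at ~1.4 tok/s, long responses take many minutes
)
print(resp.json()["response"])
```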
Tuning for Improved Performance
Even minor adjustments can significantly enhance performance. For instance:
- Batch Size Adjustment: Raising the default batch size from 512 to 1024 yields better performance.
- Thread Control: Raising the thread count to 256 on a server that exposes 768 hardware threads boosted throughput to 1.9 tokens per second.
Counterintuitively, reducing the thread count helped even more: dropping it to 128 threads improved performance to between 2.0 and 2.1 tokens per second. In short, the default settings leave meaningful headroom that slight tuning can recover; a sketch of passing these options through the Ollama API follows below.
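Here is a hedged sketch of applying a tuned batch size and thread count as per-request options through Ollama's API, then computing tokens per second from the response metadata. The `num_batch` and `num_thread` options follow llama.cpp conventions and may vary by Ollama version, and the model tag remains the assumption from the earlier sketch:

```python
# Sketch: re-run a prompt with a tuned batch size and thread count
# via Ollama's per-request options, then compute tokens/second from
# the response metadata. Option names follow llama.cpp conventions
# and may vary by Ollama version; the model tag is hypothetical.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:671b-fp16",  # hypothetical tag
        "prompt": "Summarize the benefits of CPU inference.",
        "stream": False,
        "options": {
            "num_batch": 1024,   # up from the default of 512
            "num_thread": 128,   # the best-performing count in testing
        },
    },
    timeout=3600,
)
data = resp.json()
# eval_duration is in nanoseconds, eval_count in generated tokens.
print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.2f} tok/s")
```

The `eval_count` and `eval_duration` fields in the response make it straightforward to compare tokens per second across each configuration you try.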
Virtualization Impact on Performance
Running the model in a virtualized environment offers flexibility for managing multiple workloads. Testing showed that virtualization introduced only a 3-6% performance penalty. Companies like Microsoft tune their hypervisor stacks to minimize such losses, but the testing here deliberately used an untuned configuration for ease of setup.
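One simple way to quantify overhead like that 3-6% figure is to run an identical request against a bare-metal instance and a VM instance and compare tokens per second. In this sketch, the endpoint URLs and model tag are placeholder assumptions:

```python
# Sketch: measure tokens/second for one prompt against two Ollama
# endpoints (bare metal vs. VM) to estimate virtualization overhead.
# The endpoint URLs and model tag are placeholder assumptions.
import requests

ENDPOINTS = {
    "bare-metal": "http://localhost:11434",
    "vm": "http://192.168.1.50:11434",  # hypothetical VM address
}
PROMPT = "Explain memory bandwidth in one paragraph."

def tokens_per_second(base_url: str) -> float:
    r = requests.post(
        f"{base_url}/api/generate",
        json={"model": "deepseek-r1:671b-fp16",  # hypothetical tag
              "prompt": PROMPT, "stream": False},
        timeout=3600,
    ).json()
    # eval_duration is reported in nanoseconds.
    return r["eval_count"] / (r["eval_duration"] / 1e9)

rates = {name: tokens_per_second(url) for name, url in ENDPOINTS.items()}
overhead = 1 - rates["vm"] / rates["bare-metal"]
print(rates, f"VM overhead: {overhead:.1%}")
```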
Coexisting Workloads
The performance of virtual machines running alongside the DeepSeek-R1 model varied significantly by workload type:
- High Memory Bandwidth Applications: VMs that demand substantial memory bandwidth slowed noticeably, because they compete with the model for the same DRAM bandwidth (see the sketch after this list).
- Low Throughput Tasks: Lighter services such as web servers and DHCP, which need little bandwidth, ran with minimal impact on either their own performance or the model’s.
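To see the bandwidth effect for yourself, one approach is to generate a deliberately memory-bandwidth-heavy background load and re-measure the model's throughput. This sketch sweeps NumPy buffers far larger than the CPU caches; the buffer size and process count are arbitrary assumptions:

```python
# Sketch: generate a memory-bandwidth-heavy background load so the
# model's tokens/second can be re-measured under contention. Buffer
# size (~512 MB) and the number of hog processes are arbitrary
# assumptions chosen to overflow CPU caches and hit DRAM.
import multiprocessing as mp

import numpy as np

def bandwidth_hog(stop):
    # Stream over a buffer much larger than L3 cache so every pass
    # goes to DRAM, competing with inference for memory bandwidth.
    buf = np.ones(512 * 1024 * 1024 // 8)  # ~512 MB of float64
    while not stop.is_set():
        buf += 1.0  # read-modify-write sweep over the whole buffer

if __name__ == "__main__":
    stop = mp.Event()
    hogs = [mp.Process(target=bandwidth_hog, args=(stop,)) for _ in range(8)]
    for p in hogs:
        p.start()
    # ...re-run the tokens_per_second() measurement from the earlier
    # sketch here and compare it against the unloaded baseline...
    stop.set()
    for p in hogs:
        p.join()
```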
Practical Implications
The AMD EPYC 9965’s core count and memory bandwidth make it possible to run an extensive AI model alongside other applications without a drastic performance hit. The key is choosing co-resident VMs that will not saturate memory bandwidth.
Optimization Points
It’s worth noting that CPU inference software is not yet fully optimized, so organizations may find a more thorough tuning pass worthwhile in the following areas:
- BIOS Settings: Explore firmware options that affect memory and NUMA behavior, such as the NUMA-per-socket (NPS) configuration on EPYC platforms.
- Operating System Optimization: Confirm the OS is tuned for AI inference workloads, starting with NUMA topology and scheduler placement (see the sketch after this list).
- AI Software Configuration: Investigate inference-stack settings beyond batch size and thread count that could support better performance.
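As a starting point for the OS-level items, the NUMA layout the kernel sees can be inspected directly from sysfs on Linux. This sketch reads only standard sysfs paths; note that the node count it reports will reflect whatever NPS setting is active in the BIOS:

```python
# Sketch: inspect the NUMA topology the kernel exposes, a common
# first step in OS-level tuning on dual-socket EPYC systems. Only
# standard Linux sysfs paths are read; the node count reflects the
# BIOS NPS (NUMA-per-socket) setting.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    mem_total = (node / "meminfo").read_text().splitlines()[0].strip()
    print(f"{node.name}: CPUs {cpulist}")
    print(f"  {mem_total}")
```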
Current Versus Future Hardware
The testing used AMD EPYC 9965 processors rather than Intel’s platforms, with a focus on maximizing the potential for virtualization and application consolidation. With its abundance of cores and memory bandwidth, the EPYC 9965 remains a strong fit for many demanding applications.
Potential Pitfalls to Avoid
While venturing into this setup, a few issues can erode efficiency or complicate the deployment: scheduling bandwidth-hungry VMs next to the model, leaving thread and batch settings at their defaults, and underestimating the multi-minute reload cost of every configuration change. Thoughtful planning around these points goes a long way toward getting the most out of the system’s capabilities.