Executing the Deepseek-R1 671B Model at FP16 Precision with Virtualized Workloads – Part 2 of 3

Performance of Dual AMD EPYC 9965 Server with Deepseek-R1 671B FP16

Introduction to the System

When exploring high-performance computing for AI models, the dual AMD EPYC 9965 server paired with the Deepseek-R1 671B FP16 model is an intriguing combination, but it is essential to set realistic expectations for token generation speed. Out of the box, the system achieves around 1.4 tokens per second, which is not fast but can be workable for applications that tolerate long processing times.
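
To put that rate in perspective, a quick back-of-the-envelope calculation (using the ~1.4 tokens/second figure above) shows what it means in wall-clock time:

```python
# Rough wall-clock estimate for CPU inference at the rate quoted above.
def response_time_minutes(tokens: int, tokens_per_second: float) -> float:
    """Return the time in minutes to generate `tokens` output tokens."""
    return tokens / tokens_per_second / 60.0

# A 1,000-token answer at the default ~1.4 tok/s takes roughly 12 minutes.
print(round(response_time_minutes(1000, 1.4), 1))  # 11.9
```

For batch jobs, overnight report generation, or asynchronous pipelines, a ten-minute response can be perfectly acceptable; for interactive chat, it usually is not.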

Model Loading and Performance Insights

Loading the 1.27TB Deepseek-R1 model takes a few minutes, and any change to its default settings forces a full reload, so iterating on configuration is slow. Here are some performance highlights at the initial settings:

  • Default Performance: Approximately 1.4 tokens/second.
  • Reasoning Overhead: As a reasoning model, Deepseek-R1 emits a chain-of-thought phase before its final answer, so each response requires many more generated tokens, and therefore more wall-clock time, than the answer alone would suggest.
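
The 1.27TB figure is consistent with simple arithmetic: FP16 stores each parameter in 2 bytes, so the raw weight footprint of a 671B-parameter model can be estimated directly (the exact on-disk size also depends on file metadata, tensor layout, and whether the reported number uses decimal TB or binary TiB):

```python
# FP16 stores each parameter in 2 bytes, so the raw weight footprint of a
# 671B-parameter model is straightforward to estimate.
PARAMS = 671e9
BYTES_PER_PARAM = 2  # FP16

raw_bytes = PARAMS * BYTES_PER_PARAM
print(f"{raw_bytes / 1e12:.2f} TB (decimal)")   # 1.34 TB
print(f"{raw_bytes / 2**40:.2f} TiB (binary)")  # 1.22 TiB
```

The quoted 1.27TB sits between these two figures, which is the range you would expect depending on unit convention and file overhead.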

Using the Ollama interface allows the model to utilize all available CPU cores and loads the model into memory efficiently.
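
A minimal sketch of driving a local Ollama server over its HTTP API is shown below. The model tag is an assumption for illustration; substitute whatever `ollama list` reports for your FP16 Deepseek-R1 build.

```python
# Minimal sketch of calling a local Ollama server's /api/generate endpoint.
# The model tag below is an assumption; check `ollama list` on your system.
import json
import urllib.request

def build_request(prompt: str, model: str = "deepseek-r1:671b-fp16") -> dict:
    """Assemble the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?") would return the model's reply
# once the server has finished loading the 1.27TB of weights.
```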

Tuning for Improved Performance

Even minor adjustments can significantly enhance performance. For instance:

  • Batch Size Adjustment: Changing the default batch size from 512 to 1024 can yield better performance.
  • Thread Control: Increasing the number of threads to 256 on a server that exposes 768 hardware threads boosted throughput to 1.9 tokens per second.

In practice, reducing the thread count helped even more: dropping it by another 128, to 128 threads, improved throughput to between 2.0 and 2.1 tokens per second. Such results indicate that the defaults leave meaningful performance on the table and that slight tuning pays off.
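
The two knobs above map to Ollama's per-request `options` field. A sketch, using the values from the experiments described here (the best settings for your system will differ, and the model tag is again an assumption):

```python
# Sketch of the tuning knobs discussed above, passed per-request through
# Ollama's "options" field rather than baked into a Modelfile.
def tuned_options(num_thread: int = 128, num_batch: int = 1024) -> dict:
    """Options mirroring the article's tuning: fewer threads, bigger batches."""
    return {"num_thread": num_thread, "num_batch": num_batch}

request_body = {
    "model": "deepseek-r1:671b-fp16",  # assumed tag; check `ollama list`
    "prompt": "Summarize the findings.",
    "stream": False,
    "options": tuned_options(),
}
print(request_body["options"])  # {'num_thread': 128, 'num_batch': 1024}
```

Passing options per request avoids the multi-minute reload that editing the model's defaults would trigger.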

Virtualization Impact on Performance

Running the system in a virtualized environment is an option that can offer flexibility for managing multiple workloads. Testing showed that virtualization introduced only a 3-6% decrease in performance. While companies like Microsoft optimize their setups to minimize these losses, the testing here used an untuned configuration for ease of setup.
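
Applying that observed 3-6% penalty to the tuned bare-metal rate gives the range a VM-hosted instance might see:

```python
# Apply the observed 3-6% virtualization penalty to the ~2.1 tok/s
# tuned bare-metal figure from the previous section.
bare_metal_tps = 2.1
for overhead in (0.03, 0.06):
    print(round(bare_metal_tps * (1 - overhead), 2))  # 2.04, then 1.97
```

In other words, even an untuned virtual machine stays within about 0.1 tokens per second of bare metal on this workload.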

Coexisting Workloads

The performance of virtual machines alongside the Deepseek-R1 model varies significantly:

  • High Memory Bandwidth Applications: Applications that demand substantial memory bandwidth suffered in performance due to competition for resources.
  • Low Throughput Tasks: However, services such as web servers and DHCP that require less bandwidth operated effectively with minimal impact on both their performance and that of the AI model.
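
The sensitivity to bandwidth-hungry neighbors follows from the fact that CPU token generation is largely memory-bandwidth-bound. A rough roofline sketch illustrates this, under two assumptions that are mine rather than the article's (24 channels of DDR5-6000 across both sockets, and roughly 37B active parameters per token for this mixture-of-experts model):

```python
# Why bandwidth-hungry neighbors hurt: token generation on CPU is largely
# memory-bandwidth-bound. A rough roofline under two stated assumptions:
# 24 channels of DDR5-6000, and ~37B active parameters per token (MoE).
CHANNELS = 24                # 12 per socket x 2 sockets (assumption)
GBPS_PER_CHANNEL = 48.0      # DDR5-6000: 6000 MT/s x 8 bytes (assumption)
ACTIVE_PARAMS = 37e9         # MoE: only a subset of 671B fires per token
BYTES_PER_PARAM = 2          # FP16

peak_bw = CHANNELS * GBPS_PER_CHANNEL * 1e9       # bytes/second
bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM # bytes read per token
print(round(peak_bw / bytes_per_token, 1), "tokens/s ceiling")  # 15.6
```

Every token requires streaming the active weights through the memory controllers, so any co-located VM that consumes bandwidth eats directly into that budget; the observed ~2 tokens per second sits well below the theoretical ceiling, which also hints at software-side headroom.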

Practical Implications

Leveraging the AMD EPYC 9965's architecture allows an extensive AI model and additional applications to run side by side without drastic performance loss. The key is to co-locate VMs that do not compete for memory bandwidth.

Optimization Points

It’s crucial to note that the current state of CPU inference technology isn’t fully optimized. Organizations might find it worthwhile to conduct a more thorough tuning process in the following areas:

  • BIOS Settings: Explore settings that could enhance performance.
  • Operating System Optimization: Ensure the OS is fine-tuned for workloads related to AI inference.
  • AI Software Configuration: Look into adjustments in the AI inference software stack that could support better performance.
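
As one concrete example of the OS-level checks above: Linux exposes its transparent huge page policy in sysfs, with the active mode bracketed. A small parser (a sketch I am adding here, not part of the article's testing) makes the current setting easy to assert on in provisioning scripts; whether THP helps this particular workload is something to measure rather than assume:

```python
# Parse /sys/kernel/mm/transparent_hugepage/enabled, whose contents look
# like "[always] madvise never" with the active mode bracketed.
def active_thp_mode(sysfs_text: str) -> str:
    """Return the bracketed (active) transparent huge page mode."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found")

print(active_thp_mode("[always] madvise never"))  # always
```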

Current Versus Future Hardware

The testing conducted utilized the AMD EPYC 9965 processors rather than Intel’s platforms, focusing on maximizing the potential for virtualization and application consolidation. Given the abundance of cores and memory bandwidth offered by the EPYC 9965, it remains a suitable choice for many demanding applications.

Potential Pitfalls to Avoid

While venturing into this setup, be mindful of the issues raised above, chiefly memory-bandwidth contention from co-located VMs and the multi-minute reloads triggered by configuration changes. Thoughtful planning around these constraints helps the system deliver on its capabilities.
