Benchmarks Reveal DeepSeek-V3-0324’s Greater Vulnerability Compared to Qwen2.5-Max

Launched on January 28, 2025, Qwen2.5-Max is a Mixture-of-Experts (MoE) language model created by Alibaba. The model can generate text, work across multiple languages, and perform complex reasoning tasks. Recent evaluations show that Qwen2.5-Max is more secure than its competitor, DeepSeek-V3-0324.
Assessing Security with Recon Tool
The security firm Protect AI used its vulnerability scanning tool, Recon, to compare the defensive capabilities of Qwen2.5-Max and DeepSeek-V3-0324. The findings show that Qwen2.5-Max is less vulnerable than its rival: DeepSeek-V3-0324 recorded a nearly 25% higher attack success rate (ASR) under Recon.
Although it is the more secure option, Qwen2.5-Max is not immune to attack. The tests showed that it is most vulnerable to prompt injection, where approximately 48% of attacks succeeded. Other attack types, such as evasion and jailbreak attempts, had lower success rates of about 40% each.
Identifying Vulnerabilities in DeepSeek-V3
Recon employs an extensive Attack Library to analyze AI models and pinpoint weaknesses across six key categories:
- Evasion Techniques
- System Prompt Leaks
- Prompt Injection Attacks
- AI Jailbreak Attempts
- General Safety Controls
- Adversarial Suffix Resistance
In addition to simulating cyberattacks, Recon evaluates an AI model’s resistance to producing harmful or illegal content. The adversarial suffix resistance tests target this directly: Recon attempts to compel the model to generate inappropriate material.
The Protect AI team assessed both Qwen2.5-Max and DeepSeek-V3, and Qwen2.5-Max showed a lower attack success rate across several attack types, including jailbreaks and prompt injections. For example, Qwen2.5-Max showed a 47% ASR against prompt injection attacks, far lower than the 77% for DeepSeek-V3. Similarly, Qwen2.5-Max had a 39.4% ASR against evasion tactics, while DeepSeek-V3 reached 69.2%.
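To make the comparison concrete, the sketch below computes the gap between the two models in percentage points, using only the ASR figures quoted above. It assumes the common definition of ASR as successful attacks divided by attempted attacks; the article does not say exactly how Recon computes it, and the function and variable names here are illustrative, not part of Recon.

```python
# Illustrative only: ASR is assumed to be successes / attempts * 100.
# Recon's exact methodology is not described in the article.

def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the attack success rate as a percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


# ASR values (percent) as reported in the article.
REPORTED_ASR = {
    "prompt_injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(category: str) -> float:
    """Percentage-point gap: DeepSeek-V3 ASR minus Qwen2.5-Max ASR."""
    scores = REPORTED_ASR[category]
    return round(scores["DeepSeek-V3"] - scores["Qwen2.5-Max"], 1)


if __name__ == "__main__":
    for category in REPORTED_ASR:
        print(f"{category}: {asr_gap(category)} percentage points")
```

A positive gap means attacks in that category succeeded more often against DeepSeek-V3 than against Qwen2.5-Max.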
Examining the Strengths of DeepSeek-V3
Even though DeepSeek-V3-0324 has notable security vulnerabilities, it still outperforms Qwen2.5-Max on several capability benchmarks. On each of these benchmarks, a higher score indicates better performance.
Benchmark | DeepSeek-V3-0324 | Qwen2.5-Max |
---|---|---|
MMLU-Pro | 81.2 | 75.9 |
GPQA Diamond | 68.4 | 59.1 |
MATH-500 | 94.0 | 90.2 |
AIME 2024 | 59.4 | 39.6 |
LiveCodeBench | 49.2 | 39.2 |
These benchmarks indicate that DeepSeek-V3-0324 leads across several disciplines: general knowledge and reasoning (MMLU-Pro), graduate-level science questions in biology, physics, and chemistry (GPQA Diamond), mathematics (MATH-500), competition mathematics from the American Invitational Mathematics Examination (AIME 2024), and programming (LiveCodeBench).
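As a rough illustration of how large these leads are, the sketch below computes the per-benchmark margin from the scores in the table. The scores are copied from the article; the dictionary and function names are hypothetical and used only for this example.

```python
# Scores copied from the article's table: (DeepSeek-V3-0324, Qwen2.5-Max).
BENCHMARK_SCORES = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}


def score_margins(scores: dict) -> dict:
    """Return DeepSeek-V3-0324's lead over Qwen2.5-Max, in points, per benchmark."""
    return {
        name: round(deepseek - qwen, 1)
        for name, (deepseek, qwen) in scores.items()
    }


if __name__ == "__main__":
    margins = score_margins(BENCHMARK_SCORES)
    for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: +{margin}")
```

On these figures the widest gap is on AIME 2024 and the narrowest on MATH-500.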