Benchmarks Reveal DeepSeek-V3-0324’s Greater Vulnerability Compared to Qwen2.5-Max

Overview of Qwen2.5-Max

Qwen2.5-Max, released on January 28, 2025, is a Mixture-of-Experts (MoE) language model developed by Alibaba. It excels at text generation, language comprehension, and advanced logical reasoning. Recent tests indicate that Qwen2.5-Max is more resistant to adversarial attacks than its competitor, DeepSeek-V3-0324.

Vulnerability Scanning with Recon

Purpose of Recon Tool

Recon is Protect AI's platform for red teaming AI models and identifying their security vulnerabilities. Protect AI analysts recently used it to compare Qwen2.5-Max with DeepSeek-V3 and analyze potential security flaws in each.

Key Findings

According to the analysis, Qwen2.5-Max demonstrates stronger security than DeepSeek-V3. The report states that Recon's attack success rate (ASR), the share of attack attempts that succeed, was nearly 25% higher against DeepSeek-V3 than against Qwen2.5-Max. Qwen2.5-Max is not without vulnerabilities, however. It appears particularly vulnerable to prompt injection attacks, which accounted for approximately 48% of all successful intrusions, while evasion and jailbreak attacks each had an ASR of about 40%.

Identifying Vulnerabilities in DeepSeek-V3

Categories of Vulnerabilities

Recon employs a detailed Attack Library to evaluate modern AI models, focusing on six critical categories:

  • Evasion techniques
  • System prompt leaks
  • Prompt injection attacks
  • AI jailbreak attempts
  • General safety controls
  • Adversarial suffix resistance

The tool not only simulates cyber threats but also examines the models’ resilience against generating potentially harmful content. In adversarial suffix resistance tests, Recon assessed how manipulable the AI models are in terms of content generation.
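ASR is simply the fraction of attack attempts in a category that succeed. The following is a minimal sketch of that tally, assuming a hypothetical list of (category, succeeded) records; the source does not describe Recon's actual output format or scoring, so the record shape, the function name `asr_by_category`, and the sample data are illustrative only.

```python
from collections import defaultdict


def asr_by_category(results):
    """Return the attack success rate (in percent) for each category.

    `results` is an iterable of (category, succeeded) pairs. This record
    shape is a hypothetical stand-in, not Recon's real output format.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in results:
        attempts[category] += 1
        if succeeded:
            successes[category] += 1
    return {c: 100.0 * successes[c] / attempts[c] for c in attempts}


# Made-up sample records, for illustration only.
sample = [
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("jailbreak", False),
    ("jailbreak", False),
    ("evasion", True),
]
rates = asr_by_category(sample)
```

A lower rate in a category means the model resisted more of the attacks thrown at it in that category.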

Results of the Assessment

The Protect AI team ran Recon tests on both Qwen2.5-Max and DeepSeek-V3. The results revealed that Qwen2.5-Max had a lower ASR across various attack types, such as jailbreak methods and prompt injections. For example, Qwen2.5-Max recorded an ASR of 47% against prompt injection, while DeepSeek-V3 had a significantly higher ASR of 77%. Similarly, Qwen2.5-Max had an ASR of 39.4% for evasion techniques compared to DeepSeek-V3’s 69.2%.
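To make the size of these gaps concrete, the short sketch below computes both the percentage-point difference and the relative increase from the two pairs of figures quoted above. The helper name `asr_gap` is an assumption introduced for illustration; only the four ASR values come from the report.

```python
# ASR figures (percent) as quoted in the article.
reported_asr = {
    "prompt injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(qwen, deepseek):
    """Return (percentage-point gap, relative increase in percent)."""
    points = deepseek - qwen
    relative = 100.0 * points / qwen
    return round(points, 1), round(relative, 1)


gaps = {
    name: asr_gap(v["Qwen2.5-Max"], v["DeepSeek-V3"])
    for name, v in reported_asr.items()
}
for name, (points, relative) in gaps.items():
    print(f"{name}: +{points} points, +{relative}% relative")
```

For both categories the gap is close to 30 percentage points. The two measures differ considerably, so it matters which one a headline figure refers to; the article does not say whether its overall "nearly 25%" figure is in percentage points or relative terms.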

Evaluating DeepSeek-V3’s Strengths

Despite its security weaknesses, DeepSeek-V3-0324 leads on several capability benchmarks. On these tests a higher score is better, the opposite of ASR, where a lower value indicates a more secure model.

Performance Benchmarks

Benchmark        DeepSeek-V3-0324    Qwen2.5-Max
MMLU-Pro         81.2                75.9
GPQA Diamond     68.4                59.1
MATH-500         94.0                90.2
AIME 2024        59.4                39.6
LiveCodeBench    49.2                39.2

From these benchmarks, DeepSeek-V3-0324 outperforms Qwen2.5-Max in general knowledge and reasoning (MMLU-Pro), graduate-level science questions in biology, chemistry, and physics (GPQA Diamond), mathematics (MATH-500), competition mathematics (AIME 2024, the American Invitational Mathematics Examination), and coding (LiveCodeBench).
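As a quick check on where the lead is largest, this small sketch computes DeepSeek-V3-0324's margin on each benchmark from the scores in the table. The variable names are illustrative; the numbers are those reported above.

```python
# Scores from the benchmark table: (DeepSeek-V3-0324, Qwen2.5-Max).
scores = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}

# Margin in points by which DeepSeek-V3-0324 leads on each benchmark.
margins = {name: round(ds - qw, 1) for name, (ds, qw) in scores.items()}
widest = max(margins, key=margins.get)
narrowest = min(margins, key=margins.get)
```

On these figures the widest margin is on AIME 2024 and the narrowest on MATH-500.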

Taken together, the results show a trade-off: Qwen2.5-Max was harder to attack in Recon's tests, while DeepSeek-V3-0324 scored higher on every capability benchmark listed.
