Qwen3 Surpasses OpenAI’s o1 and o3-mini, Matching Gemini 2.5 Pro Performance

Alibaba Unveils the Qwen3 Family of AI Models

Chinese tech giant Alibaba has announced the Qwen3 family of open-weight artificial intelligence (AI) models. The new lineup includes a flagship model with 235 billion parameters as well as several smaller variants aimed at different applications.

Models and Variants

The Qwen3 family provides a range of model sizes, including:

  • 0.6 billion parameters
  • 1.7 billion parameters
  • 4 billion parameters
  • 8 billion parameters
  • 14 billion parameters
  • 32 billion parameters
  • 30 billion parameters (a mixture-of-experts model with 3 billion activated parameters)
  • 235 billion parameters (the flagship, a mixture-of-experts model with 22 billion activated parameters)

These models can be run locally through tools like Ollama and LM Studio, or accessed online via Qwen Chat, giving users a variety of options for engaging with the technology.
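For a sense of what local deployment looks like in practice, here is a minimal sketch that queries a Qwen3 model served by Ollama through its default REST endpoint on localhost:11434. The `qwen3:8b` model tag is an assumption; check `ollama list` for the tags actually available on your install.

```python
import requests

# Minimal sketch: prompt a locally served Qwen3 model via Ollama's
# REST API (Ollama listens on http://localhost:11434 by default).
# The "qwen3:8b" tag is an assumption -- pull it first with
# `ollama pull qwen3:8b`, or substitute whatever `ollama list` shows.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b",
        "prompt": "Summarize mixture-of-experts models in two sentences.",
        "stream": False,  # return a single JSON object rather than a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```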

Operational Modes

One of the more interesting aspects of the Qwen3 lineup is its dual operational modes. Users can toggle between the two (a short code sketch follows this list):

  • Thinking Mode: Ideal for tasks that require in-depth reasoning and problem-solving.
  • Non-Thinking Mode: Designed for quick responses and straightforward queries.
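As an illustration of how this toggle is exposed to developers, the sketch below follows the pattern in the Qwen3 model cards on Hugging Face, where the chat template accepts an `enable_thinking` flag; the 0.6B checkpoint is used here for quick local experimentation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of toggling Qwen3's thinking mode via Hugging Face transformers.
# The enable_thinking flag follows the Qwen3 model cards; set it to
# False for fast, direct answers in non-thinking mode.
model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # thinking mode: a reasoning trace precedes the answer
)
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```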

Performance Insights

Alibaba’s 235-billion-parameter Qwen3 model has shown strong results across a range of benchmarks. It outperformed OpenAI’s o1 and o3-mini reasoning models on tasks such as mathematics and programming, and it delivers results comparable to Google’s Gemini 2.5 Pro in several assessments.

Benchmark Comparisons

  • On the LiveCodeBench coding benchmark, Qwen3 scored 70.7%, while OpenAI’s o4-mini (high) model achieved 80%.
  • On the AIME 2024 math benchmark, Qwen3 recorded 85.7%, trailing OpenAI’s o4-mini (high), which scored 94%.
  • Additional benchmark scores can be found at Artificial Analysis.

Notably, the smaller Qwen3 models also surpassed their predecessors, showing significant gains in capability. Alibaba says the 30-billion-parameter model alone outperforms both DeepSeek-V3 and OpenAI’s GPT-4o on benchmarks.

Community Response

Simon Willison, a co-creator of the Django web framework, shared his thoughts on the release, highlighting the coordinated effort across the large language model (LLM) ecosystem. Willison pointed out that the models worked across popular LLM serving frameworks immediately upon release, a level of preparedness he called unprecedented for a model launch.

Willison remarked, “This is an extraordinary level of coordination for a model release! I haven’t seen any other model providers make this level of effort…” He emphasized that the smaller models, such as the 0.6B and 1.7B, are efficient enough to run on devices like an iPhone, while larger models are compatible with modern desktops.

Evolution from Previous Models

The Qwen3 models are successors to the earlier Qwen2.5 models. Last month, Alibaba introduced the QwQ model with 32 billion parameters, claiming it could match DeepSeek-R1’s performance despite its smaller size. The company also released the QwQ-Max-Preview model, built on Qwen2.5-Max and designed specifically for mathematics and coding tasks.

With its range of sizes and capabilities, the Qwen3 family marks a significant step forward in AI model development, catering to various needs and applications across the technology landscape.
