A new AI test is stumping models from OpenAI and Google.

AI Models and the Quest for Artificial General Intelligence (AGI)
Recent results on a new benchmark from the Arc Prize Foundation suggest that leading AI developers, including Google, OpenAI, and DeepSeek, are still far from achieving Artificial General Intelligence (AGI). AGI refers to highly autonomous systems that outperform humans across a wide variety of tasks.
The ARC-AGI Benchmark
The Arc Prize Foundation, a nonprofit dedicated to measuring progress toward AGI, recently introduced a new test known as ARC-AGI-2. The benchmark evaluates an AI model’s capacity for general intelligence through visual puzzles that require pattern recognition, contextual clues, and reasoning. The goal is to assess a model’s ability to adapt to novel problems rather than rely on knowledge memorized during training.
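To make the setup concrete, here is a minimal Python sketch of how ARC-style tasks are typically represented and scored. It follows the publicly documented format of the original ARC dataset (JSON files with "train" and "test" lists of integer grids, where the integers 0-9 encode colors); the exact ARC-AGI-2 schema and the file name used below are assumptions for illustration, not details from the announcement.

```python
# Illustrative sketch only: ARC-style tasks are JSON files with "train" and
# "test" lists of {"input": grid, "output": grid} pairs, where each grid is a
# list of rows of integers 0-9 (colors). ARC-AGI-2 may differ in detail.
import json

def load_task(path: str) -> dict:
    """Load one ARC-style task: {'train': [...], 'test': [...]}."""
    with open(path) as f:
        return json.load(f)

def is_correct(predicted: list[list[int]], expected: list[list[int]]) -> bool:
    """ARC scoring is all-or-nothing: the predicted grid must match exactly."""
    return predicted == expected

# Hypothetical usage with an example task file:
# task = load_task("example_task.json")
# for pair in task["train"]:
#     print(pair["input"], "->", pair["output"])
```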
AI Model Scores on the Benchmark
According to the current standings on the ARC-AGI leaderboard, the results from leading AI models reveal significant shortcomings:
- OpenAI’s o3-low: Scored 4% on the second edition, a drastic decline from its 75.7% score on the first edition.
- Google’s Gemini 2.0 Flash: Achieved a score of 1.3%.
- DeepSeek R1: Also landed at 1.3%.
- Anthropic’s Claude 3.7: Scored 0.9%.
These scores underscore both the difficulty of the benchmark and how far current systems remain from human-like general intelligence.
Opinions on the Timeline for AGI
The topic of how soon we might see AGI is contentious. Notable opinions include:
- Dario Amodei (CEO of Anthropic): Believes AGI could emerge in the next two to three years.
- Sam Altman (CEO of OpenAI): Suggests achieving AGI is feasible with current hardware.
- Experts like Gary Marcus and Yann LeCun argue that the current technology is insufficient, emphasizing the need for caution against overhyped expectations that could mislead investors and the public.
Challenges of Current AI Models
The ARC-AGI benchmark is designed so that models cannot succeed by memorizing answers or producing elaborate responses without genuine understanding. Instead, it focuses on puzzles that humans solve easily thanks to our cognitive flexibility and our ability to learn from limited experience. The test is meant to expose gaps in current AI capabilities and to highlight the difference between narrow, specialized intelligence and true general intelligence.
As the foundation put it in its announcement, "Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations." While current AI systems excel at specific tasks, such as playing strategy games or recognizing images, they still lack the versatility needed to tackle diverse real-world problems.
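To illustrate the kind of generalization the quote describes, here is a toy Python sketch (not the foundation's evaluation code): a tiny solver checks a handful of hand-written candidate rules against two demonstration pairs and applies the only consistent one to a new input. The candidate rules and the example task are invented for illustration; real ARC-AGI-2 tasks are far harder and cannot be solved from a small fixed rule set.

```python
# Minimal sketch of "learning from a few examples": infer a transformation
# from two demonstration pairs, then apply it to an unseen input.

def flip_horizontal(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

CANDIDATE_RULES = {"flip_horizontal": flip_horizontal, "transpose": transpose}

# Toy task: both demonstrations are consistent with mirroring each row.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]
test_input = [[7, 8], [9, 0]]

# Keep only rules that reproduce every demonstration exactly, then apply one.
consistent = [
    name for name, rule in CANDIDATE_RULES.items()
    if all(rule(x) == y for x, y in train_pairs)
]
print(consistent)                                   # ['flip_horizontal']
print(CANDIDATE_RULES[consistent[0]](test_input))   # [[8, 7], [0, 9]]
```

The design point this illustrates is that each task provides only a few demonstrations, so a system must infer the underlying rule for itself rather than retrieve a memorized answer.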
The Human-AI Gap
The benchmarks also shed light on what is referred to as the "human-AI gap." This concept underscores the ability of humans to learn new skills efficiently, a trait that current AI systems have yet to acquire.
To better understand the limitations of these AI models, individuals can try the ARC-AGI puzzles firsthand. They resemble popular brain teasers such as crosswords or Wordle: the difficulty lies in the logical reasoning they require, something the human brain is naturally suited for.
Progress Through Challenges
OpenAI’s performance drop from the first edition of ARC-AGI to the second highlights the significant hurdles that remain on the path to AGI. Although AI has made strides in specialized domains, the consensus remains that reaching human-comparable intelligence will require extensive research and breakthroughs in understanding general intelligence. As the field evolves, benchmarks like this one will serve as crucial tools for measuring progress and redefining what is possible with artificial intelligence.