OpenAI Introduces BrowseComp: A New Benchmark for Assessing AI Web Search Performance

OpenAI’s BrowseComp: A New Standard for Measuring Web Search Capabilities
In the rapidly evolving landscape of artificial intelligence, OpenAI has unveiled a new benchmark called BrowseComp, designed specifically to evaluate the web-search capabilities of AI agents. The benchmark measures how effectively AI can track down hard-to-find, entangled information online, going beyond earlier evaluations such as SimpleQA, introduced in late 2024.
Understanding BrowseComp
What is BrowseComp?
BrowseComp, short for Browsing Competition, is a benchmark of 1,266 questions that are challenging to answer yet easy to grade. It was engineered to test whether AI models can track down specific pieces of information that are genuinely difficult to locate online.
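To make "easy to grade" concrete: each item pairs a hard-to-find question with a short reference answer, so checking a response reduces to comparing two short strings. The sketch below is purely illustrative (OpenAI's actual grading may rely on a model-based judge rather than simple string matching).

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a lenient comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def grade(predicted: str, reference: str) -> bool:
    """Count a prediction as correct only if it matches the short reference answer."""
    return normalize(predicted) == normalize(reference)

print(grade("Christopher Nolan.", "christopher nolan"))  # True
print(grade("Steven Spielberg", "Christopher Nolan"))    # False
```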
Criteria for Questions
The questions in BrowseComp are created under strict guidelines to ensure they remain challenging for both AI and humans (the sketch after this list encodes them as a simple checklist). The criteria include:
- Unsolvable by Existing Models: Each question cannot be answered by models such as GPT-4o, OpenAI o1, or an early version of Deep Research.
- Search Depth: A human trainer performs five searches on various search engines and confirms that the answer does not appear on the first page of results.
- Time Constraint: Questions are designed to take a human longer than 10 minutes to solve; if more than 40% of trainers answer a question correctly, it is reworked to increase the difficulty.
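A minimal sketch that restates these gates as a checklist. The class, field names, and thresholds below simply mirror the criteria above for illustration; they are not OpenAI's internal tooling.

```python
from dataclasses import dataclass

@dataclass
class CandidateQuestion:
    unsolved_by_reference_models: bool  # GPT-4o, o1, and early Deep Research all fail
    searches_without_answer: int        # searches whose first page lacks the answer
    minutes_to_solve: float             # time a human needed to find the answer
    trainer_solve_rate: float           # fraction of trainers who answered correctly

def passes_difficulty_gates(q: CandidateQuestion) -> bool:
    """Return True only if the question clears every difficulty criterion above."""
    return (
        q.unsolved_by_reference_models
        and q.searches_without_answer >= 5
        and q.minutes_to_solve > 10
        and q.trainer_solve_rate <= 0.40
    )

print(passes_difficulty_gates(CandidateQuestion(True, 5, 25.0, 0.2)))  # True
```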
The Makeup of the Questions
The questions span a wide array of subjects, ensuring a comprehensive test of AI capabilities. The categories include:
- TV shows and movies: 16.2%
- Science and technology: 13.7%
- Art: 10%
- History: 9.9%
- Sports: 9.7%
- Music: 9.2%
- Games: 5.6%
- Geography: 5.5%
- Politics: 4.7%
- Other topics: 15.6%
Performance Insights
In a series of tests, human trainers tackled 1,255 BrowseComp questions. Among these:
- 367 questions (approximately 29.2%) were answered within the two-hour time limit.
- 317 of those solved questions (86.4%) were answered correctly.
This spread points to a wide range of question difficulty: some items could be resolved fairly quickly, while others took far longer or were not solved within the time limit at all.
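The percentages follow directly from these counts; a quick sanity check in Python:

```python
attempted = 1255          # BrowseComp questions attempted by human trainers
solved = 367              # answered within the two-hour limit
answered_correctly = 317  # solved questions whose answer was correct

print(f"solved rate: {solved / attempted:.1%}")                 # 29.2%
print(f"correct among solved: {answered_correctly / solved:.1%}")  # 86.4%
```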
AI Models’ Results
The performance of various AI models on the BrowseComp test varied greatly:
| Model | Correct Answer Rate |
| --- | --- |
| GPT-4o | 0.6% |
| GPT-4o with web search | 1.9% |
| GPT-4.5 | 0.9% |
| OpenAI o1 | 9.9% |
| Deep Research | 51.5% |
Notably, Deep Research far outperformed the other models tested, answering 51.5% of the questions correctly, while the rest lagged well behind.
Analyzing AI Performance
The BrowseComp results also show how performance scales with test-time compute: the more computation a model spends on inference, the higher its score. Furthermore, when multiple candidate answers were generated and one was selected by an aggregation method such as best-of-N or majority voting, accuracy improved significantly; a rough sketch of these strategies follows.
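As an illustration of those aggregation strategies, the sketch below picks one answer from several sampled attempts, either by majority vote or by a confidence score. It is a generic stand-in; the exact selection criteria used for BrowseComp may differ.

```python
from collections import Counter
from typing import Callable

def majority_vote(answers: list[str]) -> str:
    """Return the answer produced most often across N independent attempts."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

def best_of_n(answers: list[str], confidence: Callable[[str], float]) -> str:
    """Return the single answer that a scoring function rates as most likely correct."""
    return max(answers, key=confidence)

# Stand-in data: five sampled attempts at the same question.
samples = ["1997", "1996", "1997", "1997", "2001"]
scores = {"1997": 0.8, "1996": 0.3, "2001": 0.1}  # hypothetical model confidences
print(majority_vote(samples))                     # 1997
print(best_of_n(samples, confidence=scores.get))  # 1997
```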
A closer per-question analysis of repeated attempts found that (see the tally sketch after this list):
- 16% of the problems were answered correctly on every attempt.
- 14% of the problems were never answered correctly on any attempt.
- Even for questions with zero correct attempts, evidence could often be found to substantiate the correct answer once the right supporting information was available.
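Those per-question figures come from tallying repeated attempts at the same question; below is a hedged sketch of that tally, using placeholder data rather than the actual run logs.

```python
def solve_rate_breakdown(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Fraction of questions solved on every attempt vs. solved on no attempt."""
    total = len(results)
    always = sum(all(r) for r in results.values()) / total
    never = sum(not any(r) for r in results.values()) / total
    return always, never

# Placeholder data: per-question correctness across repeated runs of the same model.
runs = {
    "q1": [True, True, True, True],
    "q2": [False, False, False, False],
    "q3": [True, False, True, False],
}
always, never = solve_rate_breakdown(runs)
print(f"always solved: {always:.0%}, never solved: {never:.0%}")  # 33%, 33%
```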
The Significance of BrowseComp
BrowseComp not only assesses the direct ability to find answers but also evaluates an AI's skill in reformulating searches and combining information from diverse sources. It is a useful tool for advancing AI development, offering insight into how effectively AI can handle complex queries. The benchmark joins OpenAI's evolving set of evaluation tools, including the simple-evals framework, aimed at strengthening future AI agents' ability to navigate the vast amount of information online.
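For a feel of how such a benchmark is consumed in practice, the loop below shows the general shape of an accuracy evaluation. It is a generic sketch, not the simple-evals API; `ask_agent`, the item schema, and the string-match check are all assumptions made for illustration.

```python
def evaluate(questions: list[dict], ask_agent) -> float:
    """Run a browsing agent over benchmark items and report overall accuracy.

    Items are assumed to look like {"problem": ..., "answer": ...}, and
    `ask_agent` is any callable returning the agent's short final answer.
    """
    correct = sum(
        ask_agent(item["problem"]).strip().lower() == item["answer"].strip().lower()
        for item in questions
    )
    return correct / len(questions)

# Usage with a trivial stand-in agent:
demo = [{"problem": "2 + 2 = ?", "answer": "4"}]
print(evaluate(demo, ask_agent=lambda q: "4"))  # 1.0
```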
For more in-depth details on the benchmark, OpenAI has provided additional resources, including a comprehensive paper available in PDF format.