DeepMind Unveils QuestBench for Assessing LLMs in Logic and Mathematics Tasks

Understanding QuestBench: Evaluating Large Language Models

Google DeepMind has developed a new benchmark called QuestBench, designed to evaluate whether large language models (LLMs) can identify the critical clarifying question needed to solve logic, planning, and math problems. The DeepMind team recently published a paper detailing QuestBench, a collection of reasoning tasks that require asking at most one question.

The Role of Large Language Models

Large Language Models are increasingly used for various reasoning tasks, including math, logic, and coding. Typically, these tasks assume that all necessary information has been provided. However, in the real world, these problems are often underspecified, meaning critical details may be missing. For instance, someone might leave out important information in a math problem, or robots in a factory might operate in environments where they can only see part of their surroundings. In such cases, LLMs must be able to ask clarifying questions to gather the missing information.

Investigating Information-Gathering in LLMs

The DeepMind team’s research aims to determine if LLMs can generate accurate clarifying questions for underspecified reasoning problems. They want to assess how well an LLM can identify the minimal necessary question to ask and gauge the difficulty levels of different problems.

The Concept of Underspecified Problems

The information-gathering challenge is framed as an underspecified Constraint Satisfaction Problem (CSP). A CSP is a mathematical problem in which a set of variables must be assigned values that satisfy a given set of constraints. A problem is underspecified when the value of the target variable cannot be determined from the information available.

This distinction is crucial: semantic ambiguity refers to situations in which multiple interpretations exist but each interpretation can still be solved, whereas underspecification means that additional information is required before any solution is possible. QuestBench focuses specifically on cases where the user has not provided enough information for the language model to produce a solution.
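To make the idea concrete, here is a minimal sketch in Python, assuming a toy chain of constraints rather than the paper’s own formalization: the target variable cannot be computed until the user supplies one more value, and that single missing variable is exactly the clarifying question to ask.

```python
# Toy sketch of an underspecified CSP (illustrative assumption, not QuestBench code).
# Constraints: y = x + 2, z = 3 * y. The user asks for z but never supplies x,
# so z cannot be determined until one clarifying question ("What is x?") is answered.

known_values = {}                          # no numeric values supplied by the user
dependencies = {"y": ("x",), "z": ("y",)}  # variable -> variables its constraint uses

def missing_inputs(target, known, deps):
    """Return the unassigned leaf variables that the target ultimately depends on."""
    if target in known:
        return set()
    if target not in deps:                 # a leaf variable with no defining constraint
        return {target}
    needed = set()
    for parent in deps[target]:
        needed |= missing_inputs(parent, known, deps)
    return needed

print(missing_inputs("z", known_values, dependencies))  # {'x'}: ask for x and z becomes solvable
```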

QuestBench Categories

QuestBench evaluates LLMs on structured reasoning tasks with a clearly defined correct answer. Each task requires the model to ask exactly one question, which keeps the assessment of the model’s information-gathering ability straightforward. The researchers grouped these tasks into four domains:

  1. Logic-Q: Involves logical reasoning with one missing proposition.
  2. Planning-Q: Relates to planning problems defined in the Planning Domain Definition Language (PDDL) with partially observed initial states.
  3. GSM-Q: Consists of human-annotated grade school math problems that have one missing piece of information.
  4. GSME-Q: Similar to GSM-Q, but these problems are structured as equations.

The Dataset and Evaluation Settings

The problems in the QuestBench dataset were constructed to reflect realistic scenarios by building 1-sufficient CSPs, problems that become solvable after exactly one clarifying question, across the four reasoning domains. Each instance comprises a user query, a complete set of candidate questions, and the subset of questions that are correct to ask.
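As a rough illustration of that instance format, the sketch below uses hypothetical field names and a made-up GSME-Q-style problem; it is not the actual dataset schema.

```python
# Hypothetical QuestBench-style instance; field names and the example problem
# are illustrative assumptions, not the dataset's real schema.

instance = {
    # The underspecified user query (here, a GSME-Q-style equation problem).
    "query": "Given apples_per_bag = 6 and total_apples = bags * apples_per_bag, "
             "what is total_apples?",
    # The complete set of candidate clarifying questions offered to the model.
    "candidate_questions": [
        "What is the value of bags?",
        "What is the value of apples_per_bag?",
        "What is the value of total_apples?",
    ],
    # The subset of candidates that actually resolve the underspecification.
    "correct_questions": ["What is the value of bags?"],
}

def is_correct(model_choice: str, inst: dict) -> bool:
    """Score a model's chosen question against the ground-truth subset."""
    return model_choice in inst["correct_questions"]

print(is_correct("What is the value of bags?", instance))  # True
```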

To ensure a thorough evaluation, several state-of-the-art LLMs were tested, including models like GPT-4o and Claude 3.5 Sonnet. The assessments were conducted in various formats: zero-shot, chain-of-thought, and four-shot settings. The tests aimed to understand the LLMs’ capacity to recognize and address underspecified information in structured tasks.
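A minimal sketch of such an evaluation loop is shown below, reusing the instance fields from the sketch above; ask_model is a hypothetical placeholder for whichever LLM is under test, and the prompt wording is an assumption rather than the paper’s exact template.

```python
# Minimal sketch of an evaluation loop over the settings mentioned above
# (zero-shot, chain-of-thought, four-shot). Everything here is an illustrative
# assumption; `ask_model` stands in for a real LLM client.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to GPT-4o, Claude 3.5 Sonnet, or another model."""
    raise NotImplementedError

def build_prompt(query, candidates, setting, few_shot_examples=()):
    prompt = ("You are given an underspecified problem. Choose the single "
              "clarifying question needed before the problem can be solved.\n")
    if setting == "four_shot":
        prompt += "\n".join(few_shot_examples[:4]) + "\n"
    prompt += f"Problem: {query}\nCandidate questions:\n"
    prompt += "\n".join(f"- {c}" for c in candidates) + "\n"
    if setting == "chain_of_thought":
        prompt += "Think step by step before stating your chosen question.\n"
    return prompt

def evaluate(dataset, setting, few_shot_examples=()):
    """Fraction of instances where the model selects a correct clarifying question."""
    hits = 0
    for inst in dataset:
        prompt = build_prompt(inst["query"], inst["candidate_questions"],
                              setting, few_shot_examples)
        answer = ask_model(prompt)
        hits += any(q in answer for q in inst["correct_questions"])
    return hits / len(dataset)
```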

Findings from the Research

The team’s evaluation revealed that LLMs performed well when sufficient information was present, and that they could often discern whether a problem was underspecified. The models’ ability to identify the critical question also correlated with task complexity: although LLMs reached roughly 80% accuracy on the GSM-Q and GSME-Q categories, their performance dropped below 50% on Logic-Q and Planning-Q tasks.

Interestingly, performance also varied by problem type: simpler algebraic problems were easier for the models, while more complex logic and planning problems proved challenging.

Implications for Future LLM Performance

Evaluating LLMs through benchmarks like QuestBench is essential for understanding the strengths and weaknesses of these models. It helps refine their performance in specific use cases and informs future improvements in language modeling. With various evaluation frameworks available, QuestBench adds another dimension to measure the effectiveness of LLMs, specifically in scenarios requiring the identification of missing information.

For further details, visit the official research paper or check out the GitHub project for more insights and access to the evaluation framework.
