Maximizing Inference-Time Compute Efficiency in LLMs: Insights from DeepMind and UC Berkeley

Training large language models (LLMs) is slow and costly, which has raised a critical question in the tech community: can spending more compute at inference time improve model performance without retraining? A recent study by researchers at DeepMind and the University of California, Berkeley provides insights into this question. Their findings, outlined in a new research paper, indicate that optimizing inference-time compute can significantly improve the performance of LLMs without resorting to larger models or additional pre-training.

Balancing Inference-Time and Pre-Training Compute

The traditional path to enhance LLM effectiveness has focused on increasing both model size and pre-training compute. However, this method can be restrictive. Larger models require more resources during both training and deployment, which limits their practicality for various applications, especially on devices with limited computing power.

Alternatively, allocating more compute at inference time can produce more accurate outputs on demanding prompts. This strategy makes it possible to deploy smaller LLMs while achieving results close to those of larger, more costly models.

The main question becomes: given a fixed amount of inference-time compute, how should an LLM spend it across different inference techniques to maximize performance, and how does the result compare to simply using a larger pre-trained model?

A common method for scaling test-time computation is best-of-N sampling, where the model generates N outputs in parallel and a selection mechanism picks the best one. But this is not the only option. For instance, instead of generating responses in parallel, the model can revise its output in sequential steps, and combining parallel and sequential sampling with various verification strategies can lead to even better outcomes.

Comparison of Parallel and Sequential Revision Methods (Source: arXiv)
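To make the baseline concrete, here is a minimal Python sketch of best-of-N sampling. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampling call and a learned verifier; neither comes from the paper's code.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Draw n independent samples and keep the best-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```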

The researchers defined an approach called the “test-time compute-optimal scaling strategy,” which adaptively selects the inference hyperparameters that maximize performance for each individual prompt at test time. Ideally, test-time compute should modify the output distribution so the model produces better answers than it would by naively sampling from the LLM itself.
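In spirit, this is an argmax over inference configurations for each prompt and budget. The sketch below illustrates that selection step; the candidate configurations and the `expected_accuracy` estimator are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

def compute_optimal_config(prompt: str,
                           budget: int,
                           candidates: List[Dict],
                           expected_accuracy: Callable[[str, int, Dict], float]) -> Dict:
    """Pick the inference configuration (e.g., revision depth vs. parallel
    width, verifier search method) expected to score best for this prompt
    under a fixed compute budget."""
    return max(candidates, key=lambda cfg: expected_accuracy(prompt, budget, cfg))
```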

Methods for Using Inference-Time Compute

The researchers investigated two key strategies to enhance LLM performance via inference-time compute:

  • Modifying the proposal distribution: fine-tuning the LLM so it can iteratively revise its own responses, which is especially useful in complex reasoning scenarios.
  • Optimizing the verifier: selecting the best response from the generated options, which can be improved by training a process-based reward model that evaluates accuracy at each step of the response generation process (a minimal sketch of both mechanisms follows this list).
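The sketch below shows, under stated assumptions, how the two mechanisms can work together: `revise` stands in for a revision-tuned model that conditions on prior attempts, and `prm_score` for a process reward model. Both names are hypothetical, not the paper's code.

```python
from typing import Callable, List

def sequential_revisions(prompt: str,
                         revise: Callable[[str, List[str]], str],
                         prm_score: Callable[[str, str], float],
                         steps: int = 4) -> str:
    """Generate a chain of answers, each conditioned on earlier attempts
    (modified proposal distribution), then let the verifier pick the
    best one (verifier optimization)."""
    attempts: List[str] = []
    for _ in range(steps):
        attempts.append(revise(prompt, attempts))
    return max(attempts, key=lambda a: prm_score(prompt, a))
```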

To test these methods, the researchers applied both strategies to the challenging MATH benchmark using PaLM-2 models.

Interestingly, the results indicated that the effectiveness of a specific test-time compute strategy depends heavily on the nature of the problem and the LLM being used. For simpler problems, where the base LLM can already produce reasonable responses, iteratively refining a single answer was more effective than generating multiple samples in parallel. Conversely, for difficult problems that require exploring varied solution strategies, parallel response generation or tree search against a process-based reward model yielded better outcomes.

Exploration of Various Answer Verification Strategies (Source: arXiv)
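One way to picture this finding is a budget allocator that shifts compute from sequential revisions toward parallel sampling as estimated difficulty grows. The linear rule and the 0-to-1 difficulty scale below are assumptions for illustration, not the paper's fitted policy.

```python
def allocate_budget(difficulty: float, budget: int = 16) -> dict:
    """Split a fixed sample budget between depth (revisions) and
    breadth (parallel samples); difficulty is in [0, 1]."""
    parallel = max(1, round(budget * difficulty))  # more breadth when harder
    revisions = budget - parallel                  # more depth when easier
    return {"parallel_samples": parallel, "revision_steps": revisions}

# e.g. allocate_budget(0.2) -> {'parallel_samples': 3, 'revision_steps': 13}
```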

The study emphasizes the need for a flexible “compute-optimal” strategy that tailors test-time computation to the specific prompt, optimizing the allocation of computational resources. With this approach, the researchers achieved a significant performance boost, surpassing the traditional best-of-N baseline while using only about 25% of the computation.

Optimizing Compute Resources

The researchers further examined how test-time computation might substitute for additional pre-training. In a FLOPs-matched comparison, they pitted a smaller model augmented with extra test-time compute against a roughly 14x larger model given more pre-training.

For easy and medium-difficulty questions, the smaller model with additional test-time computation matched the performance of the larger pre-trained model. This suggests that, rather than focusing solely on scaling pre-training compute, it can sometimes be preferable to train smaller models with less compute and then spend extra computation at inference to improve their outputs.
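For a back-of-the-envelope sense of this trade-off, one can use the common approximations of roughly 6*N*D FLOPs to pre-train an N-parameter model on D tokens and roughly 2*N FLOPs per generated token at inference. These rules of thumb and the example numbers are illustrative, not the paper's exact accounting.

```python
def matched_inference_tokens(n_params: float,
                             scale_up: float,
                             pretrain_tokens: float) -> float:
    """Extra inference tokens an n_params-sized model could spend for the
    same FLOPs as the extra cost of pre-training a scale_up-times-larger
    model instead (~6*N*D for training, ~2*N per generated token)."""
    extra_pretrain_flops = 6 * (scale_up - 1) * n_params * pretrain_tokens
    return extra_pretrain_flops / (2 * n_params)

# Example: a 7B-parameter model vs. a 14x larger one trained on 1T tokens.
print(f"{matched_inference_tokens(7e9, 14, 1e12):.2e}")  # ~3.90e+13 tokens
```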

However, for the most complex challenges, adding pre-training compute proved to be more effective, suggesting that an increase in test-time computation does not completely replace the benefits of pre-training in every case.

The researchers encourage further exploration into combining different revision and search techniques and refining methods for estimating question difficulty. Overall, the study indicates that leveraging test-time compute could become a more effective strategy than solely increasing pre-training compute, paving the way towards maximizing efficiency in future developments of LLMs.
