WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Simple Test-Time Scaling

Tags: Test-time Scaling, Language Models

This note is for Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling (No. arXiv:2501.19393). arXiv. https://doi.org/10.48550/arXiv.2501.19393

test-time scaling is a new approach to language modeling that uses extra test-time compute to improve performance

  1. curate a small dataset s1K of 1000 questions paired with reasoning traces, selected by three criteria validated through ablations: difficulty, diversity, and quality
  2. develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end
    • this can lead the model to double-check its answer, often fixing incorrect reasoning steps
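The budget-forcing control loop above can be sketched as follows. This is a minimal simulation, not the authors' implementation: `model_step`, the `</think>` delimiter string, and the toy model are all hypothetical stand-ins for a real decoding loop.

```python
# Sketch of budget forcing (hypothetical interface, not the authors' code).
# Thinking is capped by forcing the end-of-thinking delimiter at max_tokens,
# and extended by appending "Wait" whenever the model tries to stop before
# min_tokens have been generated.

END_OF_THINKING = "</think>"  # assumed delimiter token

def budget_force(model_step, prompt, min_tokens=0, max_tokens=32):
    """Generate a thinking trace under a test-time compute budget.

    model_step(prompt, tokens) -> next token string (hypothetical one-step API).
    """
    tokens = []
    while True:
        if len(tokens) >= max_tokens:
            tokens.append(END_OF_THINKING)  # forcefully terminate thinking
            break
        tok = model_step(prompt, tokens)
        if tok == END_OF_THINKING and len(tokens) < min_tokens:
            tokens.append("Wait")  # suppress the stop; encourage double-checking
            continue
        tokens.append(tok)
        if tok == END_OF_THINKING:
            break
    return tokens

# Toy model: emits 5 reasoning steps, then tries to end its thinking.
def toy_model(prompt, tokens):
    steps = sum(1 for t in tokens if t != "Wait")
    return "step" if steps < 5 else END_OF_THINKING

trace = budget_force(toy_model, "2+2?", min_tokens=8, max_tokens=32)
# the trace contains 5 "step" tokens, 3 forced "Wait" tokens, then "</think>"
```

With `min_tokens=8`, the loop injects "Wait" three times before letting the model close its thinking, illustrating how a single scalar budget extends the reasoning trace.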

after supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, the resulting model s1-32B exceeds o1-preview on competition math questions

scaling test-time compute of s1-32B with budget forcing allows it to extrapolate beyond its performance without test-time intervention.

Introduction

  • performance improvements of language models over the past years have largely relied on scaling up train-time compute using large-scale self-supervised pretraining
  • a new scaling paradigm: test-time scaling. The aim is to increase the compute at test time to get better results
  • OpenAI-o1 demonstrated strong reasoning performance with consistent gains from scaling test-time compute. OpenAI describes their approach as using large-scale reinforcement learning, implying the use of a sizable amount of data

this led to various attempts to replicate their models relying on techniques like

  • Monte Carlo Tree Search
  • multi-agent approaches
  • others

Among them, DeepSeek R1 has successfully replicated o1-level performance, also employing reinforcement learning via millions of samples and multiple training stages.

however, despite the large number of o1 replication attempts, none have openly replicated a clear test-time scaling behavior.

Question: what is the simplest approach to achieve both test-time scaling and strong reasoning performance?

the paper shows that training on only 1000 samples with next-token prediction and controlling thinking duration via a simple test-time technique (budget forcing) leads to a strong reasoning model that scales in performance with more test-time compute.


