Simple Test-Time Scaling
Test-time scaling is a new approach to language modeling that uses extra test-time compute to improve performance. The paper's contributions:
- curate a small dataset, s1K, of 1,000 questions paired with reasoning traces, selected via three criteria validated through ablations: difficulty, diversity, and quality
- develop budget forcing to control test-time compute by forcefully terminating the model's thinking process, or by lengthening it by appending "Wait" multiple times to the model's generation when it tries to end (see the sketch after this list)
- lengthening can lead the model to double-check its answer, often fixing incorrect reasoning steps
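A minimal sketch of what a budget-forcing decoding loop could look like. The `generate` and `count_tokens` callables, the `</think>` delimiter, and the default budgets are hypothetical stand-ins for illustration, not the paper's exact implementation:

```python
THINK_END = "</think>"  # assumed end-of-thinking delimiter

def budget_forced_generation(generate, count_tokens, prompt,
                             max_thinking_tokens=4096, num_waits=2):
    """Sketch of budget forcing: cap thinking at a token budget, and
    extend it by appending "Wait" each time the model tries to stop.

    generate(text, stop, max_new_tokens) -> str and
    count_tokens(text) -> int are assumed wrappers around any
    language-model sampling backend.
    """
    thinking = ""
    extensions = 0
    while True:
        remaining = max_thinking_tokens - count_tokens(thinking)
        if remaining <= 0:
            break  # budget spent: forcefully terminate the thinking phase
        # Sample until the model emits the end-of-thinking delimiter
        # or exhausts the remaining budget.
        thinking += generate(prompt + thinking, stop=THINK_END,
                             max_new_tokens=remaining)
        if extensions < num_waits:
            # Suppress the stop and nudge the model to keep reasoning;
            # this often makes it double-check its previous steps.
            thinking += "Wait"
            extensions += 1
        else:
            break
    # Append the delimiter so the model moves on to its final answer.
    return generate(prompt + thinking + THINK_END,
                    stop=None, max_new_tokens=1024)
```

The two knobs, `max_thinking_tokens` and `num_waits`, give an upper and a lower handle on thinking length; scaling test-time compute amounts to sweeping them upward.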
After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, the resulting model, s1-32B, exceeds o1-preview on competition math questions.
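A minimal sketch of that finetuning step, assuming the Hugging Face `transformers`/`datasets` stack; the `s1k.jsonl` file, its field names, the `<think>` formatting, and the hyperparameters are illustrative assumptions, not the paper's exact recipe:

```python
# Plain next-token-prediction SFT on an s1K-style dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

def format_example(ex):
    # One training string per example: question, reasoning trace, answer.
    text = f"{ex['question']}\n<think>{ex['trace']}</think>\n{ex['answer']}"
    return tokenizer(text, truncation=True, max_length=8192)

dataset = load_dataset("json", data_files="s1k.jsonl", split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="s1-32b", num_train_epochs=5,
                           per_device_train_batch_size=1,
                           learning_rate=1e-5, bf16=True),
    train_dataset=dataset,
    # mlm=False gives the standard causal next-token prediction loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```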
Scaling s1-32B with budget forcing allows it to extrapolate beyond its performance without test-time intervention.
Introduction
- performance improvements of language models in recent years have largely relied on scaling up train-time compute via large-scale self-supervised pretraining
- a new scaling paradigm has emerged: test-time scaling, which increases compute at test time to get better results
- OpenAI o1 demonstrated strong reasoning performance with consistent gains from scaling test-time compute. OpenAI describes their approach as using large-scale reinforcement learning, implying the use of a sizable amount of data
This led to various attempts to replicate o1, relying on techniques like:
- Monte Carlo Tree Search
- multi-agent approaches
- others
Among them, DeepSeek R1 has successfully replicated o1-level performance, also employing reinforcement learning on millions of samples across multiple training stages.
However, despite the large number of o1 replication attempts, none have openly replicated clear test-time scaling behavior.
Question: What is the simplest approach to achieve both test-time scaling and strong reasoning performance?
The paper shows that training on only 1,000 samples with next-token prediction, combined with controlling thinking duration via a simple test-time technique (budget forcing), yields a strong reasoning model whose performance scales with more test-time compute.
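To make the scaling claim concrete, one could trace a test-time scaling curve by sweeping the number of forced "Wait" extensions and recording accuracy at each budget. This reuses the `budget_forced_generation` sketch above; `problems` and `is_correct` are hypothetical stand-ins for an evaluation set and its grader:

```python
def scaling_curve(generate, count_tokens, problems, is_correct,
                  wait_settings=(0, 1, 2, 4)):
    """Accuracy vs. test-time compute: more "Wait" extensions means
    longer thinking, i.e., more compute spent per question."""
    curve = []
    for num_waits in wait_settings:
        correct = 0
        for problem in problems:
            answer = budget_forced_generation(
                generate, count_tokens, problem["prompt"],
                num_waits=num_waits)
            correct += is_correct(problem, answer)
        curve.append((num_waits, correct / len(problems)))
    return curve  # e.g. [(0, acc0), (1, acc1), ...] for plotting
```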