Overview of Our Study
Let's quickly go over our experimental setup, how we measure verification ability, and the different
verification settings we consider.
Models
We treat every model as both a solver and a verifier. As a solver, each model generates a
chain-of-thought solution per problem; as a verifier, it reads the original problem and a candidate
solution to decide whether the solution is correct (also with chain-of-thought). We evaluate 37 models,
comprising 21 post-trained and 16 base models from the Llama3, Qwen2.5, Qwen3, and DeepSeek-R1
families, spanning 0.5B-72B parameters.
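To make the two roles concrete, here is a minimal sketch of a model acting as solver and as verifier. The prompt wording and the `generate` helper are hypothetical stand-ins for any chat-LLM call, not the exact prompts used in our study.

```python
# Hypothetical sketch: the same model plays both roles via different prompts.
# `generate(prompt: str) -> str` is an assumed wrapper around an LLM API.

def solve(generate, problem: str) -> str:
    """Solver role: produce a chain-of-thought solution for the problem."""
    prompt = (
        "Solve the following problem step by step, then state the final answer.\n\n"
        f"Problem: {problem}"
    )
    return generate(prompt)

def verify(generate, problem: str, solution: str) -> bool:
    """Verifier role: read the problem and a candidate solution, reason step by
    step, and end with a CORRECT / INCORRECT verdict."""
    prompt = (
        "You are given a problem and a candidate solution. Think step by step about\n"
        "whether the solution is correct, then end your answer with the single word\n"
        "CORRECT or INCORRECT.\n\n"
        f"Problem: {problem}\n\nCandidate solution: {solution}"
    )
    verdict = generate(prompt).strip().upper()
    last_word = verdict.split()[-1].strip(".") if verdict else ""
    return last_word == "CORRECT"
```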
Datasets
We evaluate our models on 6 real-world tasks (GSM8K, AIME,
CSQA, GPQA, MMLU-STEM, and MMLU-Social-Sciences) and 3 synthetic tasks
(3SAT, Sudoku, and Matrix Multiplication).
Metrics
Verifier performance has multiple dimensions, so we track classification metrics such as
accuracy, false positive rate (FPR), and false negative rate (FNR), as well as the downstream effect of verifier-based
rejection sampling, which we measure as verifier gain, defined as
\[
\underbrace{\mathrm{Gain}(S, V; D)}_{\substack{\text{performance improvement} \\ \text{from using verifier}}}
\;=\;
\underbrace{\mathrm{Precision}(S, V; D)}_{\substack{\text{accuracy after test-time} \\ \text{rejection sampling with verifier}}}
\;-\;
\underbrace{\mathrm{SolverAccuracy}(S; D)}_{\text{base solver's accuracy}}
\]
where \(V\) is a verifier, \(S\) is a solver, and \(D\) is a dataset.
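To make these definitions concrete, here is a minimal sketch of how the classification metrics and verifier gain can be computed from per-example records. The record format and function names are our own illustration, not the study's actual evaluation code.

```python
# Illustrative sketch, assuming per-example records of the form:
#   solution_correct: bool  (ground-truth correctness of the solver's solution)
#   verifier_accepts: bool  (the verifier's verdict on that solution)

from typing import List, Tuple

def classification_metrics(records: List[Tuple[bool, bool]]):
    """Accuracy, FPR, FNR of a verifier over (solution_correct, verifier_accepts) pairs."""
    tp = sum(c and a for c, a in records)          # correct solution, accepted
    fp = sum((not c) and a for c, a in records)    # wrong solution, accepted
    fn = sum(c and (not a) for c, a in records)    # correct solution, rejected
    tn = sum((not c) and (not a) for c, a in records)
    accuracy = (tp + tn) / len(records)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0     # wrong solutions mistakenly accepted
    fnr = fn / (fn + tp) if (fn + tp) else 0.0     # correct solutions mistakenly rejected
    return accuracy, fpr, fnr

def verifier_gain(records: List[Tuple[bool, bool]]) -> float:
    """Gain = precision over verifier-accepted solutions minus base solver accuracy."""
    accepted = [c for c, a in records if a]
    precision = sum(accepted) / len(accepted) if accepted else 0.0
    solver_accuracy = sum(c for c, _ in records) / len(records)
    return precision - solver_accuracy
```

A positive gain means rejection sampling with the verifier keeps a more accurate subset of solutions than the solver produces on its own; a negative gain means the verifier filters out more correct solutions than incorrect ones.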
Verification Settings
We compare three ways of pairing solvers and verifiers (see the sketch after this list):
- Self-Verification: the same model acts as both solver and verifier.
- Intra-Family Verification: solver and verifier come from the same model family but have different sizes.
- Cross-Family Verification: solver and verifier come from different model families, or differ in base vs. post-trained status.
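The categorization can be expressed as a small classification function. The `ModelSpec` fields below are assumed attributes for the sketch, not our actual metadata schema.

```python
# Hypothetical sketch of the three verification settings.

from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str           # e.g. "Qwen2.5-7B-Instruct"
    family: str         # e.g. "Qwen2.5"
    size_b: float       # parameter count in billions
    post_trained: bool  # post-trained (instruct/RL) vs. base model

def verification_setting(solver: ModelSpec, verifier: ModelSpec) -> str:
    """Classify a (solver, verifier) pair into one of the three settings."""
    if solver.name == verifier.name:
        return "self-verification"
    if (solver.family == verifier.family
            and solver.post_trained == verifier.post_trained):
        return "intra-family"    # same family, different size
    return "cross-family"        # different family, or base vs. post-trained mismatch

# Example: a post-trained 72B verifier checking a 7B solver from the same family.
solver = ModelSpec("Qwen2.5-7B-Instruct", "Qwen2.5", 7, True)
verifier = ModelSpec("Qwen2.5-72B-Instruct", "Qwen2.5", 72, True)
print(verification_setting(solver, verifier))  # -> "intra-family"
```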