When Does Verification Pay Off?
A Closer Look at LLMs as Solution Verifiers

New York University

Abstract

Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematically or logically structured tasks exhibit the highest inherent verifiability.

Overview of Our Study

Let's quickly go over our experimental setup, how we measure verification ability, and the verification settings we compare.


Models

We treat every model as both a solver and a verifier. As a solver, each model generates a chain-of-thought solution per problem; as a verifier, it reads the original problem and a solution to decide whether the solution is correct (also with chain-of-thought). We evaluate 37 models, including 21 post-trained and 16 base models from the Llama3, Qwen2.5, Qwen3, and DeepSeek-R1 families spanning 0.5B-72B parameters.
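
To make the two roles concrete, here is a minimal sketch of how a single model could be prompted as a solver and as a verifier. The prompt templates and the `generate(model, prompt)` interface are illustrative assumptions, not the exact prompts used in our experiments.

```python
# Illustrative sketch only: the prompt wording and the generate(model, prompt)
# callable are assumptions, not the paper's exact templates.

SOLVER_PROMPT = (
    "Solve the following problem. Think step by step, then state your "
    "final answer on the last line.\n\nProblem: {problem}"
)

VERIFIER_PROMPT = (
    "You are given a problem and a candidate solution. Think step by step, "
    "then end with 'VERDICT: correct' or 'VERDICT: incorrect'.\n\n"
    "Problem: {problem}\n\nCandidate solution: {solution}"
)

def solve(model, problem, generate):
    """Generate a chain-of-thought solution with `model` acting as solver."""
    return generate(model, SOLVER_PROMPT.format(problem=problem))

def verify(model, problem, solution, generate):
    """Return True if `model`, acting as verifier, accepts the solution."""
    judgement = generate(model, VERIFIER_PROMPT.format(problem=problem, solution=solution))
    return "verdict: correct" in judgement.lower()
```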


Datasets

We evaluate our models on 6 real-world tasks (GSM8K, AIME, CSQA, GPQA, MMLU-STEM, and MMLU-Social-Sciences) and 3 synthetic tasks (3SAT, Sudoku, and Matrix Multiplication).


Metrics

Verifier performance has multiple dimensions, so we track classification metrics such as accuracy, false positive rate (FPR), and false negative rate (FNR), as well as the downstream effect of verifier-based rejection sampling, captured by verifier gain, defined as

\( \underbrace{Gain(S, V; D)}_{\substack{\text{performance improvement} \\ \text{from using verifier}}} \;=\; \underbrace{Precision(S, V; D)}_{\substack{\text{accuracy after test-time} \\ \text{rejection sampling with verifier}}} \;-\; \underbrace{SolverAccuracy(S; D)}_{\text{base solver's accuracy}} \)

where \(V\) is a verifier, \(S\) is a solver, and \(D\) is a dataset.
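
Reading \(Precision\) as the conditional accuracy of solutions the verifier accepts, verifier gain can be estimated from the solver's accuracy and the verifier's FPR and FNR alone, without running rejection sampling. The sketch below is our reading of that estimate, not verbatim code from the paper.

```python
def estimated_verifier_gain(solver_acc: float, fpr: float, fnr: float) -> float:
    """
    Estimate Gain(S, V; D) = Precision(S, V; D) - SolverAccuracy(S; D),
    assuming Precision is the probability that a solution is correct given
    that the verifier accepts it (our reading of the definition above).
    """
    accepted_correct = solver_acc * (1.0 - fnr)      # correct and accepted
    accepted_incorrect = (1.0 - solver_acc) * fpr    # incorrect but accepted
    if accepted_correct + accepted_incorrect == 0.0:
        return 0.0  # verifier accepts nothing; no accepted pool to score
    precision = accepted_correct / (accepted_correct + accepted_incorrect)
    return precision - solver_acc

# Example: a solver at 60% accuracy paired with a verifier at FPR = 0.2 and
# FNR = 0.1 gives Precision ≈ 0.87, i.e. an estimated gain of about +0.27.
```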


Verification Settings

We compare three ways of pairing solvers and verifiers:

  • Self-Verification: the same model acts as both solver and verifier.
  • Intra-Family Verification: solver and verifier come from the same model family but have different sizes.
  • Cross-Family Verification: solver and verifier are drawn from different model families, or differ in whether they are base or post-trained models.
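
To make the bucketing concrete, here is a minimal sketch of how a solver-verifier pair can be assigned to one of these settings; the `ModelInfo` fields are illustrative metadata, not an interface from our codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelInfo:
    name: str           # e.g. "Qwen2.5-7B-Instruct" (illustrative)
    family: str         # e.g. "Qwen2.5"
    post_trained: bool  # False for a base model, True for a post-trained variant

def verification_setting(solver: ModelInfo, verifier: ModelInfo) -> str:
    """Bucket a solver-verifier pair into one of the three settings above."""
    if solver.name == verifier.name:
        return "self"
    if solver.family == verifier.family and solver.post_trained == verifier.post_trained:
        return "intra-family"   # same family, different size
    return "cross-family"       # different family, or base vs. post-trained
```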

Do Better Solvers Make Better Verifiers?

We first analyze whether a model's solver performance correlates with its performance as a verifier. For each of our 21 post-trained models and each dataset, we evaluate verification on the same set of solver models to obtain verifier accuracy, FPR, FNR, and gain for every solver-verifier pair. For each verifier, we then split these metrics by verification setting and average within each setting over solvers and datasets. The figure below shows that the answer depends on the verification setting and leads to the following takeaways.

  • Verifier models are biased toward accepting incorrect solutions when performing self-verification or intra-family verification.
  • Verification accuracy alone is not a reliable predictor of how much a verifier can improve a solver at test time. Instead, computing verifier gain using solver accuracy, verifier FPR, and verifier FNR provides a more reliable metric.
  • While model families like Llama3 and Qwen2.5 show some ability to self-improve based on their verifier gains, stronger model families like DeepSeek and Qwen3 do not.

Is Verifier Gain a Good Predictor for Improvements from Resampling?

Our verifier gain metric estimates the expected improvement in a solver's accuracy when using a verifier for rejection sampling. To assess how well this metric predicts real performance, we conduct rejection sampling experiments across all solver-verifier pairs from a 12-model subset of our post-trained models. For each problem in each dataset, the solver generates solutions until the verifier labels one as correct, for up to 9 attempts; if no such solution is found, we retain the final attempt.
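
A minimal sketch of this rejection-sampling loop is below; `solve(problem)` and `verify(problem, solution)` wrap the solver and verifier roles sketched earlier, with the models and generation backend already bound, and the function name is illustrative.

```python
def rejection_sample(solve, verify, problem, max_attempts=9):
    """
    Resample the solver until the verifier accepts a solution, up to
    `max_attempts`; if no solution is accepted, retain the final attempt.
    """
    solution = None
    for _ in range(max_attempts):
        solution = solve(problem)
        if verify(problem, solution):
            return solution   # first verifier-approved solution
    return solution           # nothing accepted; keep the final attempt
```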

The verifier gain is a reliable predictor of performance improvements under rejection sampling. Crucially, it can be estimated from one round of verification without requiring computationally expensive rejection sampling experiments.

Are Verifiers Biased Toward Solutions That Resemble Their Own?

In the previous sections, we saw that reasoning models benefit less from self- and intra-family verification than from cross-family verification because of their higher FPR, hinting at a bias toward accepting incorrect solutions that resemble their own. To investigate this behavior directly, we conduct cross-verification experiments using 12 post-trained models and compute all verifier metrics for each pair. For each pair, we plot the verifier metric against the solver-verifier similarity score, defined as the average cosine similarity between the two models' solution embeddings across all dataset problems. The figure below confirms this hypothesis.

Higher similarity between solver and verifier solution distributions increases the verifier's tendency to accept incorrect solver outputs, reducing verifier gain. Using a verifier with a meaningfully different solution distribution mitigates this bias.
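
To make the similarity score concrete, here is a minimal sketch of the averaged embedding similarity; the `embed` callable stands in for a generic sentence-embedding model and is an assumption here, not a pointer to the specific embedder used in the paper.

```python
import numpy as np

def solver_verifier_similarity(solutions_a, solutions_b, embed):
    """
    Average cosine similarity between two models' solutions to the same
    problems: solutions_a[i] and solutions_b[i] both answer problem i.
    `embed(text) -> np.ndarray` is any sentence-embedding model (assumed).
    """
    sims = []
    for a, b in zip(solutions_a, solutions_b):
        va, vb = embed(a), embed(b)
        sims.append(float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))))
    return float(np.mean(sims))
```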

How Does Post-Training Affect Verifier Performance?

Our analysis focuses on the Qwen2.5-Base/Qwen2.5 and Qwen3-Base/Qwen3 model pairs. For each model, we compute verifier metrics against all solvers and datasets, partition the results by verification setting, and average within families. From the figure below, we draw the following takeaway.

Post-training significantly enhances a base model's problem-solving ability but can reduce its self- or intra-family improvement potential. In contrast, it boosts models' performance in cross-family verification.

Which Datasets Are Easy to Verify?

Thus far, we have examined verifier performance and its contribution to solver accuracy through rejection sampling. We now shift to a task-level perspective and ask two questions:

  • Are tasks that are easy to solve also easy to verify?
  • Are some tasks inherently easier to verify than others?

For each of our 21 post-trained models and each dataset, we evaluate verification on the same set of solver models to obtain verifier accuracy and gain for every solver-verifier pair, average them across all verifier models, and plot them against solver accuracies. From the figure below, we draw the following takeaway.

Although tasks that are easy to solve are typically easier to verify, some tasks are inherently easier to verify. These include synthetic problems with logical or structured reasoning (e.g., 3SAT, Sudoku) and real-world tasks relying primarily on mathematical reasoning rather than extensive factual recall (e.g., GSM8K, AIME). Such tasks also yield larger gains from test-time rejection sampling with verifiers.

A Practical Checklist on Selecting Verifiers

Altogether, our results suggest the following practical checklist:

  • Check whether the task is easier to verify than to solve. Tasks involving logical or mathematical reasoning often yield higher verifier gains, whereas knowledge-recall tasks (e.g., factual QA, domain-specific problems) may offer little benefit from verification relative to simply using the solver.
  • Use verifier gain, not accuracy, to evaluate a solver-verifier pair. Verification accuracy can be misleading when assessing whether a verifier will improve a solver's accuracy through rejection sampling, whereas our resampling experiments (Section 5.2 of the paper) demonstrate that verifier gain reliably predicts the actual boost.
  • Prefer verifiers that “think differently” from the solver. Solution-distribution similarity increases false positives and reduces gains. Practitioners should therefore prefer verifiers from different model families or training distributions than the solver.
  • Avoid using strong reasoning models as their own verifiers. State-of-the-art models such as Qwen3 and DeepSeek achieve minimal self-verification gain, despite being strong solvers.

BibTeX

@misc{lu2025llmverification,
      title={When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers},
      author={Jack Lu and Ryan Teehan and Jinran Jin and Mengye Ren},
      year={2025},
      eprint={2512.02304},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}