All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models - AI Paper In-Depth Analysis

TL;DR
The paper argues that RL-trained vision-language models (VLMs) lose divergent thinking because GRPO training collapses reasoning diversity within roughly the first 20 steps. RL models are more accurate on a single attempt, but base models solve more problems when multiple samples are allowed.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

70%

Core Question

The paper investigates whether RL-trained vision-language models truly outperform their base counterparts, and why RL training diminishes divergent thinking.

Core Method

The authors compare RL models with base models at 3B and 7B scales using acc@k metrics and diversity measurements computed with Qwen3-Embedding. They propose MUPO, which partitions responses into K groups using constrained clustering, performs localized advantage estimation within each group, and adds a diversity reward based on the cosine distance between groups, with a cosine annealing schedule.
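
To make the diversity-reward idea concrete, here is a minimal sketch under stated assumptions. The summary above only says the reward is based on cosine distance between groups and follows a cosine annealing schedule. Representing each group by the centroid of its response embeddings, averaging over group pairs, and decaying the weight from `w_max` to `w_min` are all assumptions made for illustration, and the function names are hypothetical, not the paper's.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def inter_group_diversity(groups):
    """Mean pairwise cosine distance between group centroids.

    ASSUMPTION: the centroid representation and the pairwise mean are
    illustrative choices, not taken from the paper.
    """
    cents = [centroid(g) for g in groups]
    pairs = [(i, j) for i in range(len(cents)) for j in range(i + 1, len(cents))]
    if not pairs:
        return 0.0  # a single group has no inter-group distance
    return sum(cosine_distance(cents[i], cents[j]) for i, j in pairs) / len(pairs)

def annealed_weight(step, total_steps, w_max=1.0, w_min=0.0):
    """Cosine annealing from w_max at step 0 to w_min at total_steps.

    ASSUMPTION: the decay direction and endpoints are hypothetical.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, a diversity bonus at a given training step would be `annealed_weight(step, total_steps) * inter_group_diversity(groups)`, so the pressure toward separated groups fades as training proceeds.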

Claim Verification

Verified (75%) Our results indicate that, when limited to a single attempt, RL models generally achieve higher accuracy than their base counterparts. However, when multiple samplings are permitted, base models are consistently capable of solving a broader number of problems.
The claim is supported by Figure 2 (A) and (B), which show acc@k results for k=1,2,4 across multiple model scales and benchmarks. The paper describes the pattern: RL models outperform at k=1 but base models solve more problems as k increases. However

Insufficient evidence (50%) Notably, in failure cases where RL models are unable to handle despite multiple attempts, we find that base models often succeed by leveraging through diverse and alternative reasoning pathways that are not captured by RL models.
The claim about base models 'often' succeeding where RL models fail is made without quantitative evidence. While Figure 4 shows t-SNE visualizations suggesting base models have more diverse reasoning spaces, no statistics are provided on how frequent

Insufficient evidence (45%) For instance, as shown in Fig. 1, for geometry problems, RL models tend to rely exclusively on equation solving, which is prone to logical errors. In contrast, base models are capable of proposing simpler, verification-based strategies.
This is presented as a single illustrative example from Figure 1, not a systematic finding. The claim generalizes ('RL models tend to rely exclusively on equation solving') from one or a few examples. No quantitative analysis is provided showing this

Insufficient evidence (35%) Similarly, in object counting involving large quantities, RL models consistently adopt tedious sequential enumeration. Base models, however, are often able to employ efficient elimination strategies that reach the correct answer in significantly fewer steps.
This claim about object counting strategies is made without any cited figure, table, or quantitative data. The claim about 'significantly fewer steps' lacks numerical support. No systematic analysis of counting problems is presented. This appears to

Insufficient evidence (55%) These observations reveal a fundamental distinction between RL models and their vanilla base counterparts. During reasoning, RL models, despite demonstrating deeper deliberation, tend to be conservative, often adhering to a dominant strategy. In contrast, base models, although less refined along a single path, display divergent thinking, frequently exploring potential alternative solutions.
This is a synthesis claim combining multiple observations. The 'deeper deliberation' aspect is not quantified. The 'divergent thinking' claim is partially supported by diversity measurements and t-SNE visualizations, but the characterization of RL mo

Insufficient evidence (60%) Our further experiments reveal that models also exhibit similar patterns, where greater diversity in reasoning significantly enhances the probability of reaching correct answers.
Figure 2 (C) and (D) show scatter plots with regression lines suggesting positive correlation between diversity and acc@4. However, no correlation coefficient, p-value, or statistical significance is reported. The claim of 'significantly enhances' la

Verified (80%) We observe that during the early training stage, diversity declines sharply to a negligible level, suggesting the model rapidly converges on a narrow set of strategies while discarding the vast majority of potential paths.
Figure 3 (A) is referenced showing diversity decline during early training. The specific claim about 'first 20 steps' provides quantitative detail. The observation is directly supported by the training dynamics visualization. However, 'negligible lev

Insufficient evidence (45%) This collapse in diversity has two issues: 1) exploitation over exploration, where the model prioritizes a dominant strategy over exploring alternative modes, leading to local optima; 2) poor scalability, where convergent reasoning struggles to cover the broad spectrum of problems, thereby constraining test-time scaling capabilities.
This limitation claim makes two assertions: (1) exploitation over exploration leading to local optima, and (2) poor scalability. The 'local optima' claim is theoretical and not directly demonstrated experimentally. The scalability argument is inferre

Verified (85%) We propose Multi-Group Policy Optimization (MUPO), a drop-in replacement of GRPO to incentivize divergent reasoning.
MUPO is clearly proposed as a method in paragraphs 7 and 23-29. The method is fully specified with equations, and experimental validation demonstrates it works as a replacement for GRPO using the same training setup. The 'drop-in replacement' aspect

Verified (85%) Inspired by the diversity collapse in GRPO, we partition model responses into multiple groups. Instead of computing advantages globally, MUPO performs localized advantage estimation within each group and introduces diversity reward to promote greater separation among groups.
The methodological contribution is fully specified: partitioning into groups (Eq. 4-5), localized advantage estimation (Eq. 5), and diversity reward (Eq. 6-8). The approach is clearly described with mathematical formulations and implementation detail
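
The difference between global and localized advantage estimation can be sketched as follows. This is a hedged illustration, not the paper's Eq. 4-5: it assumes the GRPO-style standardization (reward minus mean, divided by standard deviation) and simply applies it per group. The group assignment is taken as given rather than produced by constrained clustering, and the function names, the population standard deviation, and `eps` are hypothetical choices.

```python
import statistics

def normalized_advantages(rewards, eps=1e-6):
    """GRPO-style standardization of rewards over one set of responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # ASSUMPTION: population std
    return [(r - mean) / (std + eps) for r in rewards]

def localized_advantages(rewards, group_ids, eps=1e-6):
    """Standardize each reward against the statistics of its own group.

    ASSUMPTION: group_ids come from some upstream partitioning step
    (the paper uses constrained clustering, which is not reproduced here).
    """
    members_by_group = {}
    for idx, gid in enumerate(group_ids):
        members_by_group.setdefault(gid, []).append(idx)

    advantages = [0.0] * len(rewards)
    for members in members_by_group.values():
        local = normalized_advantages([rewards[i] for i in members], eps)
        for i, adv in zip(members, local):
            advantages[i] = adv
    return advantages
```

With per-group statistics, a correct response inside a mostly incorrect group receives a larger advantage than a correct response inside a mostly correct group, which is one plausible way that minority strategies could keep receiving learning signal.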

Verified (80%) Our result model, MUPO-Thinker-7B, demonstrates the ability to explore diverse reasoning paths in search of globally optimal solutions and achieves average gains of 2 ∼ 7% over strong baselines on established benchmarks, setting a new state of the art.
Specific numerical improvements are provided: 2.5 percentage points (49.1%→51.6%) on mathematical benchmarks and 2.3 percentage points (63.3%→65.6%) on general-purpose benchmarks for 7B model. The '2~7%' range encompasses the reported gains. Tables 1-3 provide detailed benchmark resu

Verified (70%) We highlight a fundamental difference in reasoning behavior between RL models and base models: the former engage in deeper yet narrowly focused reasoning, whereas the latter, despite being less sophisticated, exhibit broader and more diverse thought patterns.
This contribution claim summarizes the behavioral analysis findings. The evidence includes Figure 2 (acc@k patterns), diversity measurements, and t-SNE visualizations. The characterization is supported by data, though some aspects ('deeper deliberati

Verified (70%) We find that GRPO is prone to diversity collapse, causing the model to search within a narrow set of strategies while disregarding the majority of potential alternatives, leading to local optima and limited scaling capabilities.
The diversity collapse finding is supported by Figure 3 (A) showing sharp decline in early training. The consequences (local optima, limited scaling) are argued based on this observation and the acc@k results, though not directly proven experimentall

Verified (85%) We propose MUPO, a straightforward yet effective policy algorithm designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness through comprehensive experimental validation.
MUPO is proposed with full methodological specification and validated through comprehensive experiments (Tables 1-3, ablation studies in Fig. 9, Table 4). The method is demonstrated to be effective with specific numerical improvements reported.

Verified (90%) We consider models at two scales: 3B and 7B. For the 3B setting, we select VLAA-Thinker and LMM-R1, while for the 7B setting, we consider Vision-R1 and R1-OneVision, with Qwen2.5-VL-3B and Qwen2.5-VL-7B being the base models.
This is a straightforward methodological choice clearly specified in paragraph 14. The model selections (VLAA-Thinker, LMM-R1, Vision-R1, R1-OneVision) and base models (Qwen2.5-VL-3B/7B) are explicitly stated. No justification needed for model select

Verified (90%) All models are evaluated on a suite of reasoning-centric benchmarks, including MathVerse, LogicVista, We-Math and HallusionBench. The sampling temperature is set to 1.0 to enable generation of multiple responses.
The benchmark selection and temperature setting are clearly specified in paragraph 14. The temperature=1.0 choice is justified as enabling generation of multiple responses for the acc@k metric.

Verified (90%) Beyond standard accuracy that only evaluates a single path, we introduce a more relaxed metric, acc@k, which is considered positive if at least one of k sampled trajectories leads to the correct answer. We set k = 1, 2 and 4 to assess the model's ability to reach the correct answer when given multiple attempts.
The acc@k metric is clearly defined in paragraph 15 with specific k values (1, 2, 4). The metric definition is standard and well-specified for evaluating multi-sample performance.
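
The acc@k definition quoted above can be written down directly. The sketch below assumes that per-sample correctness has already been judged and stored as booleans. The function name `acc_at_k` and the choice to use the first k samples of each problem are illustrative, since the paper's exact evaluation code is not available.

```python
def acc_at_k(correct_flags, k):
    """Fraction of problems with at least one correct answer among k samples.

    correct_flags[i][j] is True if sample j for problem i was judged correct.
    ASSUMPTION: the first k samples of each problem are the k attempts.
    """
    if not correct_flags:
        raise ValueError("need at least one problem")
    solved = 0
    for flags in correct_flags:
        if len(flags) < k:
            raise ValueError("every problem needs at least k samples")
        if any(flags[:k]):
            solved += 1
    return solved / len(correct_flags)
```

For example, a problem whose only correct response is its third sample counts as solved for acc@4 but not for acc@1 or acc@2, which is how a model with diverse but individually less reliable samples can gain ground as k grows.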

Verified (85%) To quantify reasoning diversity, we employ Qwen3-Embedding-0.6B to encode the reasoning segments of generated responses and compute cosine distances to measure differences. The diversity across multiple trajectories is then calculated as the pairwise average.
The diversity quantification method is fully specified: Qwen3-Embedding-0.6B for encoding, cosine distance for measuring differences, pairwise average for diversity calculation. The approach is clearly described and reproducible.
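
The pairwise-average computation described above can be sketched as follows. The embedding step is deliberately left out: the code assumes each reasoning segment has already been encoded into a vector (by Qwen3-Embedding-0.6B in the paper, by anything in this sketch), and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def reasoning_diversity(embeddings):
    """Average cosine distance over all unordered pairs of trajectories.

    ASSUMPTION: embeddings are precomputed vectors, one per sampled
    reasoning trajectory for the same problem.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two trajectories: no pairwise distance
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

With k = 4 trajectories per problem this averages over 6 pairs. Identical reasoning yields a score of 0, and the score grows as the trajectories point in more different directions in embedding space.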

Verified (75%) As shown in Fig. 2 (A) and (B), when k = 1, RL models markedly outperform their base counterparts, reflecting sophisticated reasoning along a single trajectory. However, as k increases, a notable shift occurs: base models succeed in solving substantially more problems, while the gains of RL models remain marginal.
Same as the first claim: supported by Figure 2 (A) and (B) showing acc@k patterns. The qualitative description ('markedly outperform', 'substantially more problems', 'marginal gains') lacks specific numerical values in text but is consistent with the refere

Insufficient evidence (55%) We observe a strong positive correlation: as the reasoning diversity increases, acc@4 improves substantially. Intuitively, this indicates that tackling a problem through diverse paths rather than adhering to a single strategy facilitates the discovery of correct answers.
Figure 2 (C) and (D) show scatter plots with regression lines, but no correlation coefficient or statistical significance is reported. The claim of 'strong positive correlation' and 'improves substantially' lacks quantitative validation. Visual inspe
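
The missing statistic noted above is simple to compute once per-point diversity and acc@4 values are available. The sketch below implements the Pearson correlation coefficient in plain Python. The function name is hypothetical and no values from the paper are used; a significance test would additionally need a p-value, which is not covered here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Reporting this coefficient alongside the scatter plots, with `xs` as per-point diversity and `ys` as acc@4, would let readers judge whether "strong positive correlation" is warranted.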

... 49 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproducibility Details

No code repository available
No training/evaluation data (ViRL39K) publicly available
Random seeds not specified for reproducibility
Hardware specifications (GPU type, number of GPUs, memory) not provided
Batch size not specified
Optimizer type and additional optimizer settings (weight decay, gradient clipping, etc.) not mentioned
Training time and computational resources not reported
Data preprocessing steps not described
Appendix A referenced for more configuration details but not accessible
Exact implementation of evaluation metrics (acc@1, acc@4) not provided

Limitations (as stated by the authors)

This collapse in diversity has two issues: 1) exploitation over exploration, where the model prioritizes a dominant strategy over exploring alternative modes, leading to local optima; 2) poor scalability, where convergent reasoning struggles to cover the broad spectrum of problems, thereby constraining test-time scaling capabilities.
Such phenomenon has two key issues. 1) Exploitation over exploration: the model favors unimodal optimization, neglecting alternative modes and thereby becoming susceptible to local optima; 2) Limited scalability: as discussed in Section 3.1, this convergent reasoning fails to generalize across a broad range of questions, constraining the scaling capabilities.
More results such as ablation of β and limitation analysis are provided in Appendix.

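This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.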

Analysis time: 2026-04-21T13:28:49+00:00 · Data source: Paper Collector