The paper argues that RL-trained Vision-Language Models (VLMs) lose divergent thinking because GRPO training collapses response diversity within the first 20 steps. RL models do better on a single attempt, but base models solve more problems when sampled multiple times.
Core Question
The paper investigates whether RL models truly outperform their base counterparts in Vision-Language Models, and why RL training diminishes divergent thinking capabilities.
Core Method
The authors compare RL models with base models at 3B and 7B scales using acc@k metrics and diversity measurements via Qwen3-Embedding. They propose MUPO, which partitions responses into K groups using constrained clustering, performs localized advantage estimation within each group, and introduces diversity rewards based on the cosine distance between groups, weighted by a cosine annealing schedule. (Source section: sec_6; no key components were extracted.)
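The three MUPO ingredients named above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes group assignments are already available (the constrained clustering step is not reproduced), that advantages are mean/std normalized within each group in the style of GRPO, that the diversity reward is the mean cosine distance between group centroids, and that the annealing schedule decays a weight from `beta_max` to 0. The function names and the `beta_max` and `eps` parameters are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def local_advantages(rewards, groups, eps=1e-6):
    """Normalize each reward against the mean/std of its own group only
    (assumed form of 'localized advantage estimation')."""
    adv = [0.0] * len(rewards)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        vals = [rewards[i] for i in idx]
        mu, sigma = mean(vals), pstdev(vals)
        for i in idx:
            adv[i] = (rewards[i] - mu) / (sigma + eps)
    return adv


def group_diversity(embeddings, groups):
    """Mean pairwise cosine distance between group centroids
    (assumed form of the between-group diversity reward)."""
    centroids = []
    for g in sorted(set(groups)):
        members = [embeddings[i] for i, gi in enumerate(groups) if gi == g]
        centroids.append([mean(col) for col in zip(*members)])
    pairs = [(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    return mean(cosine_distance(a, b) for a, b in pairs) if pairs else 0.0


def annealed_weight(step, total_steps, beta_max=0.1):
    """Cosine annealing from beta_max at step 0 to 0 at the final step
    (the direction and range of the schedule are assumptions)."""
    return 0.5 * beta_max * (1.0 + math.cos(math.pi * step / total_steps))
```

In a training loop, one plausible use would be to add `annealed_weight(step, total_steps) * group_diversity(embeddings, groups)` to the task reward before computing `local_advantages`; how the paper actually combines these terms is given by its Eq. 4-8, which this report does not reproduce.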
Claim Verification
The claim is supported by Figure 2 (A) and (B), which show acc@k results for k=1,2,4 across multiple model scales and benchmarks. The paper describes the pattern: RL models outperform at k=1 but base models solve more problems as k increases. However
The claim about base models 'often' succeeding where RL models fail is made without quantitative evidence. While Figure 4 shows t-SNE visualizations suggesting base models have more diverse reasoning spaces, no statistics are provided on how frequent
This is presented as a single illustrative example from Figure 1, not a systematic finding. The claim generalizes ('RL models tend to rely exclusively on equation solving') from one or a few examples. No quantitative analysis is provided showing this
This claim about object counting strategies is made without any cited figure, table, or quantitative data. The claim about 'significantly fewer steps' lacks numerical support. No systematic analysis of counting problems is presented. This appears to
This is a synthesis claim combining multiple observations. The 'deeper deliberation' aspect is not quantified. The 'divergent thinking' claim is partially supported by diversity measurements and t-SNE visualizations, but the characterization of RL mo
Figure 2 (C) and (D) show scatter plots with regression lines suggesting positive correlation between diversity and acc@4. However, no correlation coefficient, p-value, or statistical significance is reported. The claim of 'significantly enhances' la
Figure 3 (A) is referenced showing diversity decline during early training. The specific claim about 'first 20 steps' provides quantitative detail. The observation is directly supported by the training dynamics visualization. However, 'negligible lev
This limitation claim makes two assertions: (1) exploitation over exploration leading to local optima, and (2) poor scalability. The 'local optima' claim is theoretical and not directly demonstrated experimentally. The scalability argument is inferre
MUPO is clearly proposed as a method in paragraphs 7 and 23-29. The method is fully specified with equations, and experimental validation demonstrates it works as a replacement for GRPO using the same training setup. The 'drop-in replacement' aspect
The methodological contribution is fully specified: partitioning into groups (Eq. 4-5), localized advantage estimation (Eq. 5), and diversity reward (Eq. 6-8). The approach is clearly described with mathematical formulations and implementation detail
Specific numerical improvements are provided: 2.5% (49.1%→51.6%) on mathematical benchmarks and 2.3% (63.3%→65.6%) on general-purpose benchmarks for 7B model. The '2~7%' range encompasses the reported gains. Tables 1-3 provide detailed benchmark resu
This contribution claim summarizes the behavioral analysis findings. The evidence includes Figure 2 (acc@k patterns), diversity measurements, and t-SNE visualizations. The characterization is supported by data, though some aspects ('deeper deliberati
The diversity collapse finding is supported by Figure 3 (A) showing sharp decline in early training. The consequences (local optima, limited scaling) are argued based on this observation and the acc@k results, though not directly proven experimentall
MUPO is proposed with full methodological specification and validated through comprehensive experiments (Tables 1-3, ablation studies in Fig. 9, Table 4). The method is demonstrated to be effective with specific numerical improvements reported.
This is a straightforward methodological choice clearly specified in paragraph 14. The model selections (VLAA-Thinker, LMM-R1, Vision-R1, R1-OneVision) and base models (Qwen2.5-VL-3B/7B) are explicitly stated. No justification needed for model select
The benchmark selection and temperature setting are clearly specified in paragraph 14. The temperature=1.0 choice is justified as enabling generation of multiple responses for the acc@k metric.
The acc@k metric is clearly defined in paragraph 15 with specific k values (1, 2, 4). The metric definition is standard and well-specified for evaluating multi-sample performance.
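As a rough illustration of the metric, the sketch below assumes that acc@k counts a problem as solved when at least one of its first k sampled responses is correct; the paper's exact definition is in its paragraph 15 and is not quoted in this report. The function name `acc_at_k` and the input layout are hypothetical.

```python
def acc_at_k(correct, k):
    """Fraction of problems with at least one correct response among the
    first k samples (assumed definition of acc@k).

    correct: one list per problem, holding a boolean per sampled response.
    """
    if not correct:
        raise ValueError("need at least one problem")
    if any(len(samples) < k for samples in correct):
        raise ValueError("every problem needs at least k samples")
    solved = sum(1 for samples in correct if any(samples[:k]))
    return solved / len(correct)


# Three problems, four samples each (made-up correctness flags).
flags = [
    [True, False, False, False],   # solved on the first sample
    [False, False, True, False],   # solved only on the third sample
    [False, False, False, False],  # never solved
]
scores = {k: acc_at_k(flags, k) for k in (1, 2, 4)}
```

Under this definition acc@k can only stay flat or rise as k grows, which is why a model with more varied responses can overtake one that is stronger at k=1.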
The diversity quantification method is fully specified: Qwen3-Embedding-0.6B for encoding, cosine distance for measuring differences, pairwise average for diversity calculation. The approach is clearly described and reproducible.
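The pairwise-average step described above can be sketched as follows. The sketch takes embedding vectors as given and does not call Qwen3-Embedding-0.6B; the function name `response_diversity` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def response_diversity(embeddings):
    """Average cosine distance over all unordered pairs of responses
    to the same question; 0.0 when fewer than two responses are given."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" standing in for encoded responses.
diverse = response_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
collapsed = response_diversity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
```

Responses that all point in the same direction score near 0, so a collapse in reasoning variety shows up as a falling average.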
Same as claim_1 - supported by Figure 2 (A) and (B) showing acc@k patterns. The qualitative description ('markedly outperform', 'substantially more problems', 'marginal gains') lacks specific numerical values in text but is consistent with the refere
Figure 2 (C) and (D) show scatter plots with regression lines, but no correlation coefficient or statistical significance is reported. The claim of 'strong positive correlation' and 'improves substantially' lacks quantitative validation. Visual inspe
... 49 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available
- No training/evaluation data (ViRL39K) publicly available
- Random seeds not specified for reproducibility
- Hardware specifications (GPU type, number of GPUs, memory) not provided
- Batch size not specified
- Optimizer type and additional optimizer settings (weight decay, gradient clipping, etc.) not mentioned
- Training time and computational resources not reported
- Data preprocessing steps not described
- Appendix A referenced for more configuration details but not accessible
- Exact implementation of evaluation metrics (acc@1, acc@4) not provided
Limitations (as stated by the authors)
- This collapse in diversity has two issues: 1) exploitation over exploration, where the model prioritizes a dominant strategy over exploring alternative modes, leading to local optima; 2) poor scalability, where convergent reasoning struggles to cover the broad spectrum of problems, thereby constraining test-time scaling capabilities.
- Such phenomenon has two key issues. 1) Exploitation over exploration: the model favors unimodal optimization, neglecting alternative modes and thereby becoming susceptible to local optima; 2) Limited scalability: as discussed in Section 3.1, this convergent reasoning fails to generalize across a broad range of questions, constraining the scaling capabilities.
- More results such as ablation of β and limitation analysis are provided in Appendix.
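This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment come from automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-21T13:28:49+00:00 · Data source: Paper Collector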