RubricEM introduces rubric-guided RL for deep research agents via stagewise policy decomposition, Stage-Structured GRPO for credit assignment, and reflection meta-policy training. RubricEM-8B achieves SOTA among open models (55.5 avg) on four benchmarks, surpassing prior work and its teacher.
Core Question
How can reinforcement learning train deep research agents beyond verifiable rewards, while enabling long-horizon credit assignment and learning from experience?
Core Method
Approach: RubricEM introduces a rubric-guided structured reasoning scaffold with four stages (Plan, Research, Review, Answer), Stage-Structured GRPO (SS-GRPO) for fine-grained credit assignment using stage-specific rubric-based evaluation, and reflection meta-policy training that enables experience reuse through a rubric bank with both within-episode and cross-episode retrieval modes.
Key components:
- Gemini-3.1-Pro serves as the teacher model, with ~13K queries drawn from the DR Tulu dataset.
- ...
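The stage-level credit assignment described above can be illustrated with a minimal sketch. It assumes that each rollout receives one rubric score per stage, that the return credited to stage k is a weighted sum of stage scores given by row k of a stage-dependence matrix, and that returns are normalized per stage across the rollout group. The matrix values, the function names, and the exact form of the return are illustrative assumptions, not the paper's definitions, which this summary does not reproduce.

```python
from statistics import mean, pstdev

STAGES = ["plan", "research", "review", "answer"]

# Hypothetical stage-dependence matrix (NOT taken from the paper).
# LAMBDA[k][j] is the weight of stage j's rubric score in the return
# credited to stage k. Upper-triangular: a stage is credited for its own
# score and, at half weight, for the scores of later stages.
LAMBDA = [
    [1.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]


def stage_returns(rewards, lam=LAMBDA):
    """Assumed return: G[k] = sum_j lam[k][j] * rewards[j] for one rollout."""
    return [
        sum(lam[k][j] * rewards[j] for j in range(len(rewards)))
        for k in range(len(lam))
    ]


def stagewise_advantages(group_rewards, lam=LAMBDA, eps=1e-6):
    """Normalize each stage's return across the rollout group (GRPO-style).

    group_rewards: one list of per-stage rubric scores per rollout.
    Returns advantages[i][k] for rollout i and stage k.
    """
    returns = [stage_returns(r, lam) for r in group_rewards]
    n_stages = len(lam)
    advantages = [[0.0] * n_stages for _ in returns]
    for k in range(n_stages):
        column = [g[k] for g in returns]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            advantages[i][k] = (value - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    # Two rollouts: one scores only on Plan, the other only on Answer.
    group = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    for row in stagewise_advantages(group):
        print([round(a, 3) for a in row])
```

Normalizing per stage rather than per trajectory means that tokens in the Plan stage of a rollout can receive a positive advantage while tokens in its Answer stage receive a negative one, which is the kind of fine-grained signal a single trajectory-level reward cannot express.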
Claim Verification
The paper provides a complete specification of the RubricEM framework across multiple sections (p_2-p_7). The three main components are fully described: rubric-guided reasoning scaffold (p_4, p_12-p_17), SS-GRPO (p_5, p_20-p_25), and reflection meta-
The paper provides detailed specification of the rubric-guided reasoning scaffold in p_12-p_17, including the four stages (Plan, Research, Review, Answer), XML schema, and how rubrics are generated during planning and carried through subsequent stage
The paper provides complete mathematical formulation of SS-GRPO in p_20-p_25, including the stage-dependence matrix Λ, return formula G_Λ, stagewise normalization, and the objective function. The method is fully specified with equations and experimen
The paper provides detailed specification of the reflection meta-policy in p_26-p_30 and p_114-p_153, including the shared backbone architecture, reflection candidate generation, judge scoring, and the training procedure. The method is fully specifie
The paper describes the asynchronous reflection branch design in p_30 and p_114-p_133, including the three concurrent threads and one-step staleness approach. The design is justified as avoiding sequential bottlenecks that plagued prior meta-RL work.
The paper explicitly states the model architecture (8B backbone from Qwen3-8B) and training steps (1400 RL steps) in p_7 and p_33. These are factual claims about the experimental setup that are clearly specified.
The paper provides detailed specification of the four stages and their XML schema in p_12-p_17, including specific tags like
The paper clearly specifies the teacher-student distillation approach from Gemini-3.1-Pro in p_19 and p_67-p_91, including the data generation pipeline and the prompting strategy.
The paper provides detailed rejection sampling criteria in p_81-p_86, including specific failure modes like missing tag, no valid tool call, missing structural elements, and consecutive tool errors.
The paper explicitly states the search engines used in p_31.
The paper explicitly states the data split between SFT and RL stages in p_31.
The paper provides quantitative results in Table 1 (referenced in p_32-p_33) showing RubricEM-8B-RL achieving 55.5 average score. The claim about being highest among non-proprietary systems is supported by the reported numbers. Self-reported results
The paper provides specific scores in Table 1 showing RubricEM-8B-RL (55.5) surpassing DR Tulu-8B-RL (51.6), Tongyi DeepResearch-30B-A3B (51.7), and WebThinker-32B-DPO (50.0). Self-reported results cap confidence at 0.85.
The paper states this comparison in p_33. The claim about outperforming Perplexity Deep Research on average and being within 4.4 points of OpenAI Deep Research while outperforming on DRB is provided. However, the exact benchmark-by-benchmark breakdow
The paper provides specific numbers in p_34: RL improves average score from 49.2 to 55.5. The claim about gains on all four benchmarks is stated. Self-reported results cap confidence at 0.85.
The paper states this interesting finding in p_34 - that the student (after RL) surpasses the teacher (Gemini-3.1-Pro). This is a significant claim supported by the reported average scores. Self-reported results cap confidence.
The paper provides specific comparison numbers in p_34: RubricEM uses 1400 RL steps vs DR Tulu's 1900 steps, starts from a stronger SFT checkpoint, and reaches a higher final score. Self-reported results cap confidence.
The paper describes the ablation study in p_35 with four recipes compared under a 600-step budget. The finding that both SS-GRPO and Meta-Policy improve over Baseline-RL, and the full recipe performs best, is supported by Fig. 5. Self-reported ablati
The paper describes the scaffold comparison results in p_36, referencing Fig. 6(a) and 6(b). The claims about improved distillation quality and more effective RL are supported by the figure. Self-reported results cap confidence.
The paper states this observation in p_36, referencing Fig. 6(b). The empirical observation about small and unstable RL gains without the scaffold is supported. The interpretation about structure for exploration and credit assignment is the authors'
... 49 claims in total
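The average scores quoted in the verification notes can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the dictionary keys are labels chosen here for readability, and the label for the pre-RL checkpoint in particular is an assumption about naming.

```python
# Average scores as quoted in the verification notes above.
scores = {
    "RubricEM-8B-RL": 55.5,
    "RubricEM-8B before RL": 49.2,  # label assumed; the notes only give the score
    "DR Tulu-8B-RL": 51.6,
    "Tongyi DeepResearch-30B-A3B": 51.7,
    "WebThinker-32B-DPO": 50.0,
}


def margin(system, baseline):
    """Difference in average score, rounded to one decimal place."""
    return round(scores[system] - scores[baseline], 1)


if __name__ == "__main__":
    for name in scores:
        if name != "RubricEM-8B-RL":
            print(f"RubricEM-8B-RL vs {name}: +{margin('RubricEM-8B-RL', name)}")
```

On these figures the RL stage accounts for a larger gain over the model's own starting checkpoint than the final margin over any of the listed open baselines.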
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available - implementation details cannot be verified or reused
- No training data available despite mention of 'data available' - unclear how to access the ~13K training queries
- Specific hyperparameters not provided (learning rate, batch size, number of epochs/training steps, optimizer settings)
- Random seeds not specified for reproducibility of experiments
- Hardware specifications not provided (GPU type, number of devices, memory requirements)
- Exact training/validation/test data splits not specified
- RL-specific hyperparameters missing (GRPO parameters, reward scaling, discount factors, policy gradient settings)
- Appendix F referenced for full details but not available in provided content
- Evaluation metrics implementation details not fully specified
- vLLM configuration and inference parameters not detailed
Limitations (as stated by the authors)
- Because Gemini often fails to produce a well-formed closing tag, a substantial fraction of trajectories (~15-25%) are discarded at this stage.
- Finally, this process yields around 11k SFT samples, approximately 2k fewer than DR Tulu due to the repeated errors described above, which are filtered out.
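The discard step behind these numbers can be sketched as a simple trajectory filter covering the failure modes named in the verification notes: a missing tag, no valid tool call, missing structural elements, and consecutive tool errors. The tag names, the error threshold, and the function name below are hypothetical placeholders; the paper's actual schema and cut-offs are not given in this summary.

```python
# Hypothetical tag names, one per stage; the real XML schema is not
# reproduced in this summary.
REQUIRED_TAGS = ("plan", "research", "review", "answer")
# Hypothetical cut-off for a run of failed tool calls.
MAX_CONSECUTIVE_TOOL_ERRORS = 3


def rejection_reason(trajectory_text, tool_results):
    """Return a reason string if the trajectory is rejected, else None.

    trajectory_text: raw teacher output for one trajectory.
    tool_results: list of booleans, True when a tool call succeeded.
    """
    for tag in REQUIRED_TAGS:
        if f"<{tag}>" not in trajectory_text:
            return f"missing structural element: <{tag}>"
        if f"</{tag}>" not in trajectory_text:
            return f"missing closing tag: </{tag}>"
    if not any(tool_results):
        return "no valid tool call"
    streak = 0
    for succeeded in tool_results:
        streak = 0 if succeeded else streak + 1
        if streak >= MAX_CONSECUTIVE_TOOL_ERRORS:
            return "consecutive tool errors"
    return None


if __name__ == "__main__":
    ok = "<plan>p</plan><research>r</research><review>v</review><answer>a</answer>"
    print(rejection_reason(ok, [True, False, True]))
    print(rejection_reason(ok.replace("</answer>", ""), [True]))
```

Returning the reason instead of a bare boolean makes it easy to tally how many trajectories each failure mode removes, which is the breakdown needed to explain a gap such as the roughly 2k samples mentioned above.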
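This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain inaccuracies. For the original paper, see arXiv.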
Analysis time: 2026-05-14T07:22:44+00:00 · Data source: Paper Collector