SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization - AI Paper In-Depth Analysis

TL;DR
SKILL0 introduces the first RL framework that internalizes agent skills into model parameters through In-Context Reinforcement Learning and Dynamic Curriculum. By providing skills during training but removing them at inference, it achieves +9.7% improvement on ALFWorld and +6.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

79%

Core Question

Can agent skills be internalized into model parameters rather than relying on inference-time skill augmentation, thereby addressing limitations of retrieval noise, token overhead, and lack of competence internalization?

Core Method

SKILL0 uses In-Context Reinforcement Learning (ICRL), which provides skills during training rollouts but removes them entirely at inference time, combined with a Dynamic Curriculum that adaptively withdraws skills based on on-policy helpfulness evaluation. The framework employs a composite reward that jointly optimizes task success and compression efficiency through PPO-style training on Qwen2.5-VL models.

Claim Verification

Insufficient evidence (50%) We propose SKILL0, the first RL framework that formulates skill internalization as an explicit training objective, moving agents from inference-time skill dependence to fully autonomous zero-shot behavior.
The paper proposes SKILL0 and demonstrates it works experimentally, but two aspects are problematic: (1) The 'first' claim cannot be verified from the paper alone - it requires exhaustive knowledge of all prior work. (2) 'Zero-shot behavior' is misch

Verified (80%) We introduce in-context reinforcement learning, which provides structured skill guidance during training rollouts and removes it entirely at inference, directly optimizing the transition from context-dependent execution to intrinsic competence.
The ICRL mechanism is clearly specified: skills are provided during training rollouts and removed at inference. This is demonstrated through the experimental setup where models are evaluated both with and without skills (Figure 5, Table 1). The trans

Verified (85%) We propose Dynamic Curriculum, a helpfulness-driven annealing mechanism that withdraws each skill only when the current policy no longer benefits from it, replacing rigid schedules with adaptive internalization.
Dynamic Curriculum is well-specified with Algorithm 1 and Section 3.3 providing implementation details. The helpfulness metric Δk is computed by comparing performance with/without each skill. Ablation studies in Table 2 show the three-step Filter/Ran

Verified (80%) SKILL0 realized this curriculum through In-Context Reinforcement Learning (ICRL): skills are provided as in-context guidance during training rollouts but removed entirely at inference, so that RL optimization directly drives the transition from context-dependent execution to autonomous behavior.
This restates the ICRL contribution. The paper clearly specifies that skills are provided during training (via skill budget M(s) and selection process) and removed at inference (S^(NS) = ∅). The experimental results in Figure 5 and Table 1 show model
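
The train/inference asymmetry described above can be illustrated with a minimal sketch. It assumes a plain-text prompt for readability (the paper renders the context to an image), and the function name, section headers, and example strings are all hypothetical rather than taken from the paper.

```python
from typing import Sequence


def build_context(history: Sequence[str], skills: Sequence[str], observation: str) -> str:
    """Assemble the prompt for one step; an empty skill list yields a skill-free prompt."""
    parts = []
    if skills:  # training rollouts while the skill budget is still non-zero
        parts.append("## Skills\n" + "\n".join(f"- {s}" for s in skills))
    if history:
        parts.append("## History\n" + "\n".join(history))
    parts.append("## Observation\n" + observation)
    return "\n\n".join(parts)


# Hypothetical example strings, not taken from the paper's SkillBank.
history = ["look -> you see a desk"]
train_prompt = build_context(history, ["Check receptacles one at a time."], "You are in a room.")
infer_prompt = build_context(history, [], "You are in a room.")
```

Passing an empty skill list at inference corresponds to the S^(NS) = ∅ condition mentioned above: the same policy is queried, only the skill section disappears from its context.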

Verified (85%) Dynamic Curriculum evaluates each skill file's on-policy helpfulness by comparing agent performance with and without it on a matched validation sub-task. Skills are retained only where the current policy still benefits, and discarded otherwise, until the budget reaches zero and the agent operates without any skill context.
The helpfulness evaluation mechanism is clearly specified in Section 3.3 and Algorithm 1. Each skill file Sk is evaluated on its matched validation sub-task Tk under two conditions (with/without skill). The budget annealing is demonstrated with speci
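
A rough sketch of the helpfulness-driven selection follows. It is not Algorithm 1 from the paper: the positive-Δ threshold, the tie-breaking, the `evaluate` callback, and the example success rates are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def select_active_skills(
    skill_ids: List[str],
    evaluate: Callable[[str, bool], float],
    budget: int,
) -> List[str]:
    """Return at most `budget` skills that the current policy still benefits from."""
    # Helpfulness: success on the matched sub-task with the skill minus without it.
    delta: Dict[str, float] = {
        k: evaluate(k, True) - evaluate(k, False) for k in skill_ids
    }
    # Keep only skills that still help (assumed threshold: delta > 0).
    helpful = [k for k in skill_ids if delta[k] > 0]
    # Prefer the most helpful skills when the budget is tighter than the candidate set.
    helpful.sort(key=lambda k: delta[k], reverse=True)
    return helpful[:budget]


# Hypothetical (with skill, without skill) success rates per skill file.
rates = {"pick": (0.9, 0.5), "heat": (0.7, 0.7), "clean": (0.8, 0.6)}


def evaluate(skill_id: str, with_skill: bool) -> float:
    return rates[skill_id][0 if with_skill else 1]


active = select_active_skills(list(rates), evaluate, budget=2)
```

In this toy run the "heat" skill is dropped because the policy performs equally well without it, which is the condition under which a skill counts as internalized.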

Verified (90%) SKILL0 achieves substantial improvements over strong baselines like AgentOCR (+9.7% for ALFWorld and +6.6% for Search-QA), and competitive performance against skill-augmented methods like SkillRL.
Specific numerical improvements are reported: +9.7% for ALFWorld and +6.6% for Search-QA over AgentOCR. These appear to be derived from Table 1 results (though the table itself is referenced but not fully shown in the excerpts). The claim of 'competi

Verified (90%) By eliminating skill reliance at inference time, SKILL0 maintains a highly efficient context of fewer than 0.5k tokens per step, significantly reducing inference overhead without sacrificing task performance.
Specific token counts are provided: 0.38k tokens/step on ALFWorld and 0.18k on Search-QA for 3B models, both under 0.5k. These are compared against SkillRL's 2.21k and 0.87k respectively. The paper demonstrates this efficiency is achieved while maint

Insufficient evidence (60%) Skills are organized in a directory structure "skills/{task_name}/{skill_category}.md", where each Markdown file S_k stores a group of related skills sharing the same task and skill category.
The directory structure is stated as a design choice, but no justification or ablation is provided for why this particular organization was chosen over alternatives. The paper mentions it but doesn't demonstrate its necessity or compare against other
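
To make the layout concrete, here is a minimal sketch of a loader for a `skills/{task_name}/{skill_category}.md` tree. The function name and the example task and category names are hypothetical; only the path pattern comes from the text above.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def load_skill_files(root: Path) -> Dict[Tuple[str, str], str]:
    """Map (task_name, skill_category) to the Markdown text of that skill file."""
    skills = {}
    for path in sorted(root.glob("*/*.md")):
        skills[(path.parent.name, path.stem)] = path.read_text(encoding="utf-8")
    return skills


# Build a throwaway tree with one hypothetical skill file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "skills"
    (root / "alfworld").mkdir(parents=True)
    (root / "alfworld" / "navigation.md").write_text(
        "- Check receptacles one at a time.\n", encoding="utf-8"
    )
    loaded = load_skill_files(root)
```

Keying each file by its (task, category) pair gives every skill file S_k a stable identifier that a curriculum can later use to add or withdraw it.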

Verified (75%) We introduce a context rendering mechanism that maps the textual interaction context (including history h_t and retrieved skills S) to a compact RGB image.
The context rendering mechanism is described in p_11, mapping textual context to RGB image via vision encoder. However, the paper credits this to Feng et al. (2026), so this is more of an adaptation than a novel contribution. The mechanism is specifi

Insufficient evidence (50%) Rather than treating the compression ratio c_t ∈ (0, 1] as a fixed hyperparameter, we allow the policy to self-generate c_t at each step alongside the task action a_t.
The policy self-generating compression ratio ct is stated, but no ablation study compares this against fixed compression ratios. The paper doesn't demonstrate why this design choice is superior to alternatives or provide analysis of how the self-gene

Verified (80%) We introduce a composite reward following Feng et al. (2026), which jointly optimizes task success and compression efficiency. Let I_succ(τ) ∈ {0, 1} denote the binary success indicator for trajectory τ.
The composite reward formulation is clearly specified in Equation 4 (referenced as Eq. 4 in p_24). It combines task success indicator Isucc(τ) with compression efficiency via logarithmic formulation. The trade-off parameter λ is mentioned. This is a
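
The excerpts do not reproduce Eq. 4, so the following is only one plausible shape for a reward that combines a binary success indicator with a logarithmic compression term weighted by λ. The success gating, the averaging over steps, the default λ, and the convention that a smaller c_t means stronger compression are all assumptions, not the paper's formula.

```python
import math
from typing import Sequence


def composite_reward(success: bool, ratios: Sequence[float], lam: float = 0.1) -> float:
    """Illustrative reward: task success plus a success-gated log compression bonus."""
    if not ratios or any(not (0.0 < c <= 1.0) for c in ratios):
        raise ValueError("each compression ratio must lie in (0, 1]")
    # Mean of log(1/c_t): zero when nothing is compressed, larger for smaller ratios.
    bonus = sum(math.log(1.0 / c) for c in ratios) / len(ratios)
    i_succ = 1.0 if success else 0.0
    # Gating by i_succ keeps the policy from trading task success for compression.
    return i_succ * (1.0 + lam * bonus)


reward_uncompressed = composite_reward(True, [1.0, 1.0])
reward_compressed = composite_reward(True, [0.5, 0.25])
reward_failed = composite_reward(False, [0.5, 0.25])
```

Under these assumptions a failed trajectory earns nothing regardless of how aggressively it compresses, while a successful one is rewarded more for using smaller ratios.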

Verified (85%) We formulate this curriculum as a linear decay of the skill budget M(s) at each stage s ∈ {1, ..., N_S}.
The linear decay formulation is specified in p_19, with concrete examples provided: [6,3,0] for ALFWorld and [5,3,0] for Search-QA. The paper explains this bounds the step-wise reduction and limits distribution shift. Ablation in Figure 7 compares ag
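
The two reported schedules can be reproduced with a simple linear interpolation from the initial budget down to zero. Rounding up at intermediate stages is an assumption chosen because it matches both [6,3,0] and [5,3,0]; the paper's exact rounding rule is not shown in the excerpts, and the function name is hypothetical.

```python
from typing import List


def skill_budget_schedule(initial_budget: int, num_stages: int) -> List[int]:
    """Linearly decay the skill budget from `initial_budget` to 0 over `num_stages` stages."""
    if num_stages < 2:
        raise ValueError("need at least two stages to decay to zero")
    last = num_stages - 1
    # Integer ceiling division avoids floating-point rounding surprises.
    return [-(-initial_budget * (last - s) // last) for s in range(num_stages)]


alfworld_schedule = skill_budget_schedule(6, 3)
search_qa_schedule = skill_budget_schedule(5, 3)
```

With three stages each step removes roughly half of the initial budget, which is the sense in which the linear schedule bounds the per-stage change in context.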

Verified (85%) Our curriculum operates in two phases: (a) an offline Relevance-Driven Skill Grouping that associates each skill file S_k with a dedicated validation sub-task; and (b) an online Helpfulness-Driven Dynamic Curriculum that adaptively selects the active skill subset S based on the current policy's learning state during the training process.
The two-phase curriculum is clearly described: (a) offline relevance-driven skill grouping in p_22, and (b) online helpfulness-driven dynamic curriculum in p_23. Both phases are specified with implementation details and the overall framework is valid

Insufficient evidence (55%) We define the relevance between a validation subtask and a skill file S_k as whether the subtask's domain and objective align with the skill category encoded in S_k. Based on this relevance, we partition the validation set into N sub-tasks {T_k}_{k=1}^{N} prior to training, where T_k groups all validation instances whose skill requirements correspond to S_k.
The relevance definition and partitioning process are described, but the paper doesn't provide examples of how this matching is done in practice or validate that this particular grouping strategy is optimal. The claim is stated but not rigorously dem
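
Since the excerpts give no example of the matching, the sketch below shows only the partitioning step, under the strong assumption that each validation instance already carries a task name and skill category label. The field names and example instances are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SkillKey = Tuple[str, str]  # (task_name, skill_category), mirroring the skill file path


def partition_validation(instances: Iterable[dict]) -> Dict[SkillKey, List[dict]]:
    """Group validation instances into sub-tasks T_k keyed by their matching skill file."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["task_name"], inst["skill_category"])].append(inst)
    return dict(groups)


# Hypothetical labelled validation instances.
validation = [
    {"id": 1, "task_name": "alfworld", "skill_category": "heat"},
    {"id": 2, "task_name": "alfworld", "skill_category": "clean"},
    {"id": 3, "task_name": "alfworld", "skill_category": "heat"},
]
sub_tasks = partition_validation(validation)
```

How the labels are obtained in the first place is exactly the step the analysis flags as under-documented; this sketch does not address it.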

Verified (85%) We quantify the helpfulness metric Δ_k of each skill file S_k to the current policy π_θ by evaluating T_k under two conditions: with S_k provided (w/ skill) and without it (w/o skill) every d training steps.
The helpfulness metric Δk is clearly defined and its computation (evaluating Tk with/without Sk) is specified. The validation interval d is explored in Table 3, with d=10 selected as optimal. This is a well-specified and validated design choice.

Verified (90%) We train the Qwen2.5-VL series using SKILL0 for at most 180 steps on 4 H800 GPUs. For ALFWorld, we adopt the training data split from GiGPO, with each batch sampling 16 tasks and 8 rollouts per prompt, and a maximum prompt length of 3,072 tokens.
Specific implementation details are provided: Qwen2.5-VL series, max 180 steps, 4 H800 GPUs, batch sampling 16 tasks with 8 rollouts, max prompt length 3,072 tokens for ALFWorld. These are concrete, verifiable specifications.

Verified (90%) For the curriculum learning schedule, we set the validation subset size to 1,000, the number of curriculum stages to N_S = 3, and initialize SkillBank from SkillRL for both environments.
Specific hyperparameter values are provided: validation subset size 1,000, NS=3 curriculum stages, SkillBank initialized from SkillRL. These are concrete implementation details.
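
The reported ALFWorld settings can be gathered into a single configuration object, sketched below. The class and field names are hypothetical; the values are the ones quoted in the claims above, and anything not quoted (learning rate, optimizer, seeds) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill0AlfworldConfig:
    """Reported ALFWorld training settings; field names are illustrative only."""
    max_steps: int = 180
    num_gpus: int = 4  # reported as H800 GPUs
    tasks_per_batch: int = 16
    rollouts_per_prompt: int = 8
    max_prompt_tokens: int = 3072
    validation_subset_size: int = 1000
    num_curriculum_stages: int = 3  # N_S

    @property
    def rollouts_per_batch(self) -> int:
        # 16 tasks x 8 rollouts each
        return self.tasks_per_batch * self.rollouts_per_prompt


config = Skill0AlfworldConfig()
```

Here `config.rollouts_per_batch` evaluates to 128 trajectories per training batch, a figure derived from the quoted values rather than stated in the excerpts.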

Verified (90%) Using 3B models, SKILL0 consumes only 0.38k tokens per step on ALFWorld and 0.18k on Search-QA. This is a massive reduction compared to text-based or skill-augmented methods like SkillRL, which costs 2.21k and 0.87k tokens per step respectively (more than 5× higher).
Specific token consumption numbers are provided: SKILL0 uses 0.38k (ALFWorld) and 0.18k (Search-QA), compared to SkillRL's 2.21k and 0.87k. The "more than 5×" claim can be checked from these numbers: it holds on ALFWorld (2.21/0.38 ≈ 5.8×) but is slightly overstated on Search-QA (0.87/0.18 ≈ 4.8×).
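
The ratios can be recomputed directly from the four reported per-step token counts; the dictionary layout is just a convenience for this check.

```python
# Reported tokens per step (in thousands) for 3B models.
tokens_per_step = {
    "ALFWorld": {"SKILL0": 0.38, "SkillRL": 2.21},
    "Search-QA": {"SKILL0": 0.18, "SkillRL": 0.87},
}

# How many times more tokens SkillRL consumes per step than SKILL0.
ratios = {
    env: round(counts["SkillRL"] / counts["SKILL0"], 1)
    for env, counts in tokens_per_step.items()
}
```

The result is about 5.8× on ALFWorld and about 4.8× on Search-QA.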

Verified (85%) Throughout RL optimization, SKILL0 maintains consistently higher reward curves on both the 3B and 7B backbones compared to the AgentOCR baseline.
The claim references Figure 3 and 4 showing reward curves. While the figures themselves aren't fully reproduced in the excerpts, the paper explicitly states SKILL0 maintains 'consistently higher reward curves on both the 3B and 7B backbones compared

Verified (80%) When validated with skill augmentation, the model achieves faster early-stage performance improvement; while validation without skill prompts yields lower initial performance, it gradually catches up toward the end of optimization, revealing a clear trend of skill internalization.
This finding is supported by Figure 5(a) which shows validation accuracy with and without skill prompts over training. The paper describes the pattern: faster early improvement with skills, gradual catch-up without skills. This demonstrates skill int

... 37 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing reproduction details

No code repository available - core algorithm implementation not accessible
No data available - training/evaluation datasets not provided
Model architecture details not specified (vision encoder, language model, etc.)
Training hyperparameters missing (learning rate, batch size, epochs, optimizer)
Random seeds not reported for reproducibility
Hardware/environment specifications not provided
Training/evaluation data splits not described
Full prompts referenced in Figure 11 and 12 but not included in text
SkillBank contents only referenced in Table 7 - not fully detailed
Baseline implementation details not provided

Limitations (as stated by the authors)

SKILL0 relies on the quality of the initial SkillBank, and the offline relevance-driven skill grouping requires re-partitioning when applied to new task domains.

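This analysis was generated automatically by a PDF reading assistant and is for reference only; it does not constitute an academic review. The verification verdicts and the reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.

Analysis time: 2026-04-03T16:30:25+00:00 · Data source: Paper Collector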