Automated claim verification, reproducibility assessment, and confidence scoring for trending papers from Paper Collector.
Skill1 unifies skill selection, utilization, and distillation in RL agents through a single policy optimized via reward
OpenSearch-VL presents a fully open recipe for training multimodal deep search agents with publicly released data, code,
Stream-T1 pioneers Test-Time Scaling for streaming video generation through Noise Propagation, Reward Pruning, and Memor
OpenSeeker-v2 shows simple supervised fine-tuning achieves state-of-the-art search agent performance using high-quality,
Stream-R1 introduces reliability-perplexity aware distillation for streaming video generation, reweighting supervision v
RLDX-1 is a Vision-Language-Action model for human-like dexterous manipulation that integrates motion awareness, long-te
ARIS introduces a three-layer autonomous research architecture with cross-family executor/reviewer separation and a thre
MolmoAct2 introduces an open-source action reasoning model for real-world robot deployment, featuring flow-matching acti
AcademiClaw introduces the first academic-level AI agent benchmark with 80 bilingual tasks from real university workflow
This paper introduces Direct Corpus Interaction (DCI), where agents search raw corpora using terminal tools instead of s
UniVidX leverages Video Diffusion Model priors for unified video generation across 15 tasks using Stochastic Condition M
RecursiveMAS transforms multi-agent collaboration from text to latent space using RecursiveLink modules, achieving 8.3%
Programming with Data introduces a test-driven framework for LLM data engineering using three-level knowledge structures
World-R1 introduces an RL framework that injects 3D geometric understanding into video generation models without archite
The paper identifies validity issues in VSI-Bench for VLM spatial reasoning: annotation drift and scene-observability mi
The paper introduces the Semantic Progress Function (SPF) for quantifying semantic change in videos and semantic lineari
This paper introduces OneManCompany (OMC), a framework for organizing heterogeneous AI agents through three pillars: Tal
LLaDA2.0-Uni unifies multimodal understanding and generation using a 16B MoE diffusion language model with a SigLIP-VQ t
NPO optimizes RLVR by using near-future checkpoints to guide current policy, formalizing the quality-variance trade-off
Tstars-Tryon 1.0 presents a commercial-grade virtual try-on system addressing robustness, realism, multi-item flexibilit
CoInteract synthesizes physically-consistent human-object interaction videos using a Diffusion Transformer with embedded
OneVL enables efficient one-step latent reasoning for autonomous driving via dual-modal auxiliary decoders and prefill i
OpenGame introduces the first open-source agentic framework for end-to-end web game creation, combining GameCoder-27B wi
Agent-World scales agent training by mining 1,978 real-world environments with 19,822 tools from the web. Using multi-en
Qwen3.5-Omni is a fully omnimodal LLM unifying understanding, reasoning, generation, and action across text, images, aud
HY-World 2.0 presents the first open-source multi-modal world model unifying 3D generation and reconstruction through a
Seedance 2.0 introduces a unified multi-modal architecture for audio-video generation, supporting text, image, audio, an
SpatialEvo pioneers self-evolving 3D spatial reasoning through a Deterministic Geometric Environment providing zero-nois
DiPO addresses exploration-exploitation trade-offs in LLM post-training through perplexity space disentanglement and bid
AgentSPEX introduces a YAML-based declarative language for AI agent workflows with explicit control flow and modular com
This paper identifies two conditions governing On-Policy Distillation effectiveness: thinking-pattern consistency and ne
KnowRL addresses reward sparsity in LLM reasoning RL by identifying minimal sufficient knowledge point subsets via Const
OmniShow introduces the first unified framework for Human-Object Interaction Video Generation, simultaneously conditioni
RationalRewards introduces reasoning-based reward models that generate structured critiques before scores, enabling opti
MEDS prevents policy collapse in LLM reinforcement learning by recording historical error patterns and penalizing repeti
OCCUBENCH benchmarks AI agents on 100 real-world professional tasks using Language Environment Simulators. Testing 15 fr
SATO introduces strip-based tokenization for generating artist-quality 3D meshes with native UV segmentation. By seriali
EXAONE 4.5 is LG's first open-weight Vision-Language Model combining a 1.2B vision encoder with a 32B language model for
WildDet3D introduces an open-vocabulary monocular 3D detector unifying text, point, and box prompts with geometry-aware
NUMINA improves numerical alignment in text-to-video diffusion models through a training-free approach that detects coun
CLAWBENCH evaluates AI agents on 153 real-world web tasks across 144 live platforms, revealing dramatic performance gaps
SkillClaw enables collective skill evolution in multi-user LLM agent ecosystems through centralized trajectory aggregati
MegaStyle leverages consistent T2I style mapping to build MegaStyle-1.4M dataset with intra-style consistency and inter-
DMax enables aggressive parallel decoding in diffusion language models through On-Policy Uniform Training and Soft Paral
This paper proposes externalization—relocating cognitive burdens into persistent external structures—as the unifying log
LPM 1.0 introduces a video-generative system for conversational performance, solving the trilemma of expressiveness, rea
HY-Embodied-0.5 presents embodied foundation models using Mixture-of-Transformers architecture with visual latent tokens
GameWorld introduces a standardized benchmark for multimodal game agents with 34 browser games and 170 tasks, featuring
FORGE benchmarks 18 MLLMs on manufacturing tasks using 2D/3D data across three scenarios. Models excel at macroscopic re
This work challenges the 'SFT memorizes' narrative by demonstrating that cross-domain generalization in reasoning SFT is
The paper identifies
Claw-Eval introduces a trustworthy evaluation framework for autonomous LLM agents through full-trajectory auditing and m
Video-MME-v2 introduces a comprehensive benchmark with three-level hierarchy and group-based non-linear scoring to evalu
TriAttention exploits Q/K vector concentration in pre-RoPE space to model attention via trigonometric series for KV cach
SpatialEdit introduces a comprehensive framework for fine-grained spatial image editing with a benchmark, 500k dataset,
MinerU2.5-Pro proves document parsing bottlenecks stem from data quality, not architecture. Using the same 1.2B-paramete
This paper reformulates image generation as iterative Plan-Sketch-Inspect-Refine cycles using BAGEL-7B with scene-graph
OpenWorldLib introduces a unified framework standardizing world models as perception-centered systems with interaction a
MIA integrates brain-inspired memory mechanisms into Deep Research Agents via a Manager-Planner-Executor architecture wi
AURA enables VideoLLMs to process continuous video streams with real-time responses through dual sliding windows and sil
ACES breaks the circular dependency in code-test evaluation using leave-one-out AUC to measure test quality without know
InCoder-32B-Thinking integrates Error-driven Chain-of-Thought synthesis with an Industrial Code World Model for hardware
This paper identifies why on-policy self-distillation fails in RLVR (irreducible information gaps causing leakage) and proposes RLS
GrandCode introduces a multi-agent reinforcement learning system with Agentic GRPO for competitive programming. It achie
This paper presents a 4M-frame dataset from video games with synchronized RGB and G-buffer channels, captured via a non-
SteerViT enables text-steerable visual representations by inserting cross-attention layers into frozen Vision Transforme
SIMPLESTREAM shows that a simple baseline using only recent frames without complex memory mechanisms achieves state-of-t
VOID extends video object removal to causal interactions where removing objects affects other scene elements. Using coun
SKILL0 introduces the first RL framework that internalizes agent skills into model parameters through In-Context Reinfor
Textual Frequency Law (TFL) proposes preferring higher-frequency data for LLM training when meanings are identical. The
This survey examines latent space computation in language models, proposing a taxonomy across Mechanism and Ability dime
CORAL introduces autonomous multi-agent evolution where agents control decisions through shared memory and heartbeats. I
The paper reveals RL models in VLMs lose divergent thinking due to GRPO's diversity collapse within 20 steps. While RL m
PixelSmile introduces a diffusion framework for fine-grained facial expression editing, addressing semantic overlap betw
Intern-S1-Pro is the first trillion-parameter scientific multimodal foundation model, demonstrating that large generalis
MinerU-Diffusion reformulates document OCR as inverse rendering via diffusion decoding with block-attention architecture