AI Paper In-Depth Analysis | NGJOO 恩筑AI

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Skill1 unifies skill selection, utilization, and distillation in RL agents through a single policy optimized via reward

2026-05-07 | arXiv: 2605.06130

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

OpenSearch-VL presents a fully open recipe for training multimodal deep search agents with publicly released data, code,

2026-05-06 | arXiv: 2605.05185

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-T1 pioneers Test-Time Scaling for streaming video generation through Noise Propagation, Reward Pruning, and Memor

2026-05-06 | arXiv: 2605.04461

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

OpenSeeker-v2 shows simple supervised fine-tuning achieves state-of-the-art search agent performance using high-quality,

2026-05-05 | arXiv: 2605.04036

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Stream-R1 introduces reliability-perplexity aware distillation for streaming video generation, reweighting supervision v

2026-05-05 | arXiv: 2605.03849

RLDX-1 Technical Report

RLDX-1 is a Vision-Language-Action model for human-like dexterous manipulation that integrates motion awareness, long-te

2026-05-05 | arXiv: 2605.03269

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS introduces a three-layer autonomous research architecture with cross-family executor/reviewer separation and a thre

2026-05-04 | arXiv: 2605.03042

MolmoAct2: Action Reasoning Models for Real-world Deployment

MolmoAct2 introduces an open-source action reasoning model for real-world robot deployment, featuring flow-matching acti

2026-05-04 | arXiv: 2605.02881

AcademiClaw: When Students Set Challenges for AI Agents

AcademiClaw introduces the first academic-level AI agent benchmark with 80 bilingual tasks from real university workflow

2026-05-04 | arXiv: 2605.02661

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

This paper introduces Direct Corpus Interaction (DCI), where agents search raw corpora using terminal tools instead of s

2026-05-03 | arXiv: 2605.05242

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

UniVidX leverages Video Diffusion Model priors for unified video generation across 15 tasks using Stochastic Condition M

2026-05-01 | arXiv: 2605.00658

Recursive Multi-Agent Systems

RecursiveMAS transforms multi-agent collaboration from text to latent space using RecursiveLink modules, achieving 8.3%

2026-04-28 | arXiv: 2604.25917

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Programming with Data introduces a test-driven framework for LLM data engineering using three-level knowledge structures

2026-04-27 | arXiv: 2604.24819

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

World-R1 introduces an RL framework that injects 3D geometric understanding into video generation models without archite

2026-04-27 | arXiv: 2604.24764

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

The paper identifies validity issues in VSI-Bench for VLM spatial reasoning: annotation drift and scene-observability mi

2026-04-27 | arXiv: 2604.24300

Video Analysis and Generation via a Semantic Progress Function

The paper introduces the Semantic Progress Function (SPF) for quantifying semantic change in videos and semantic lineari

2026-04-24 | arXiv: 2604.22554

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

This paper introduces OneManCompany (OMC), a framework for organizing heterogeneous AI agents through three pillars: Tal

2026-04-24 | arXiv: 2604.22446

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

LLaDA2.0-Uni unifies multimodal understanding and generation using a 16B MoE diffusion language model with a SigLIP-VQ t

2026-04-22 | arXiv: 2604.20796

Near-Future Policy Optimization

NPO optimizes RLVR by using near-future checkpoints to guide current policy, formalizing the quality-variance trade-off

2026-04-22 | arXiv: 2604.20733

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Tstars-Tryon 1.0 presents a commercial-grade virtual try-on system addressing robustness, realism, multi-item flexibilit

2026-04-21 | arXiv: 2604.19748

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

CoInteract synthesizes physically-consistent human-object interaction videos using a Diffusion Transformer with embedded

2026-04-21 | arXiv: 2604.19636

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

OneVL enables efficient one-step latent reasoning for autonomous driving via dual-modal auxiliary decoders and prefill i

2026-04-20 | arXiv: 2604.18486

OpenGame: Open Agentic Coding for Games

OpenGame introduces the first open-source agentic framework for end-to-end web game creation, combining GameCoder-27B wi

2026-04-20 | arXiv: 2604.18394

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Agent-World scales agent training by mining 1,978 real-world environments with 19,822 tools from the web. Using multi-en

2026-04-20 | arXiv: 2604.18292

Qwen3.5-Omni Technical Report

Qwen3.5-Omni is a fully omnimodal LLM unifying understanding, reasoning, generation, and action across text, images, aud

2026-04-17 | arXiv: 2604.15804

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

HY-World 2.0 presents the first open-source multi-modal world model unifying 3D generation and reconstruction through a

2026-04-15 | arXiv: 2604.14268

Seedance 2.0: Advancing Video Generation for World Complexity

Seedance 2.0 introduces a unified multi-modal architecture for audio-video generation, supporting text, image, audio, an

2026-04-15 | arXiv: 2604.14148

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

SpatialEvo pioneers self-evolving 3D spatial reasoning through a Deterministic Geometric Environment providing zero-nois

2026-04-15 | arXiv: 2604.14144

DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

DiPO addresses exploration-exploitation trade-offs in LLM post-training through perplexity space disentanglement and bid

2026-04-15 | arXiv: 2604.13902

AgentSPEX: An Agent SPecification and EXecution Language

AgentSPEX introduces a YAML-based declarative language for AI agent workflows with explicit control flow and modular com

2026-04-14 | arXiv: 2604.13346

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

This paper identifies two conditions governing On-Policy Distillation effectiveness: thinking-pattern consistency and ne

2026-04-14 | arXiv: 2604.13016

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

KnowRL addresses reward sparsity in LLM reasoning RL by identifying minimal sufficient knowledge point subsets via Const

2026-04-14 | arXiv: 2604.12627

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

OmniShow introduces the first unified framework for Human-Object Interaction Video Generation, simultaneously conditioni

2026-04-13 | arXiv: 2604.11804

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

RationalRewards introduces reasoning-based reward models that generate structured critiques before scores, enabling opti

2026-04-13 | arXiv: 2604.11626

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

MEDS prevents policy collapse in LLM reinforcement learning by recording historical error patterns and penalizing repeti

2026-04-13 | arXiv: 2604.11297

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

OccuBench benchmarks AI agents on 100 real-world professional tasks using Language Environment Simulators. Testing 15 fr

2026-04-13 | arXiv: 2604.10866

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

SATO introduces strip-based tokenization for generating artist-quality 3D meshes with native UV segmentation. By seriali

2026-04-10 | arXiv: 2604.09132

EXAONE 4.5 Technical Report

EXAONE 4.5 is LG's first open-weight Vision-Language Model combining a 1.2B vision encoder with a 32B language model for

2026-04-09 | arXiv: 2604.08644

WildDet3D: Scaling Promptable 3D Detection in the Wild

WildDet3D introduces an open-vocabulary monocular 3D detector unifying text, point, and box prompts with geometry-aware

2026-04-09 | arXiv: 2604.08626

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

NUMINA improves numerical alignment in text-to-video diffusion models through a training-free approach that detects coun

2026-04-09 | arXiv: 2604.08546

ClawBench: Can AI Agents Complete Everyday Online Tasks?

ClawBench evaluates AI agents on 153 real-world web tasks across 144 live platforms, revealing dramatic performance gaps

2026-04-09 | arXiv: 2604.08523

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

SkillClaw enables collective skill evolution in multi-user LLM agent ecosystems through centralized trajectory aggregati

2026-04-09 | arXiv: 2604.08377

MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

MegaStyle leverages consistent T2I style mapping to build MegaStyle-1.4M dataset with intra-style consistency and inter-

2026-04-09 | arXiv: 2604.08364

DMax: Aggressive Parallel Decoding for dLLMs

DMax enables aggressive parallel decoding in diffusion language models through On-Policy Uniform Training and Soft Paral

2026-04-09 | arXiv: 2604.08302

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

This paper proposes externalization—relocating cognitive burdens into persistent external structures—as the unifying log

2026-04-09 | arXiv: 2604.08224

LPM 1.0: Video-based Character Performance Model

LPM 1.0 introduces a video-generative system for conversational performance, solving the trilemma of expressiveness, rea

2026-04-09 | arXiv: 2604.07823

HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

HY-Embodied-0.5 presents embodied foundation models using Mixture-of-Transformers architecture with visual latent tokens

2026-04-08 | arXiv: 2604.07430

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

GameWorld introduces a standardized benchmark for multimodal game agents with 34 browser games and 170 tasks, featuring

2026-04-08 | arXiv: 2604.07429

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

FORGE benchmarks 18 MLLMs on manufacturing tasks using 2D/3D data across three scenarios. Models excel at macroscopic re

2026-04-08 | arXiv: 2604.07413

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

This work challenges the 'SFT memorizes' narrative by demonstrating that cross-domain generalization in reasoning SFT is

2026-04-08 | arXiv: 2604.06628

RAGEN-2: Reasoning Collapse in Agentic RL

The paper identifies

2026-04-07 | arXiv: 2604.06268

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Claw-Eval introduces a trustworthy evaluation framework for autonomous LLM agents through full-trajectory auditing and m

2026-04-07 | arXiv: 2604.06132

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Video-MME-v2 introduces a comprehensive benchmark with three-level hierarchy and group-based non-linear scoring to evalu

2026-04-06 | arXiv: 2604.05015

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

TriAttention exploits Q/K vector concentration in pre-RoPE space to model attention via trigonometric series for KV cach

2026-04-06 | arXiv: 2604.04921

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

SpatialEdit introduces a comprehensive framework for fine-grained spatial image editing with a benchmark, 500k dataset,

2026-04-06 | arXiv: 2604.04911

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

MinerU2.5-Pro proves document parsing bottlenecks stem from data quality, not architecture. Using the same 1.2B-paramete

2026-04-06 | arXiv: 2604.04771

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

This paper reformulates image generation as iterative Plan-Sketch-Inspect-Refine cycles using BAGEL-7B with scene-graph

2026-04-06 | arXiv: 2604.04746

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

OpenWorldLib introduces a unified framework standardizing world models as perception-centered systems with interaction a

2026-04-06 | arXiv: 2604.04707

Memory Intelligence Agent

MIA integrates brain-inspired memory mechanisms into Deep Research Agents via a Manager-Planner-Executor architecture wi

2026-04-06 | arXiv: 2604.04503

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

AURA enables VideoLLMs to process continuous video streams with real-time responses through dual sliding windows and sil

2026-04-05 | arXiv: 2604.04184

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

ACES breaks the circular dependency in code-test evaluation using leave-one-out AUC to measure test quality without know

2026-04-05 | arXiv: 2604.03922

InCoder-32B-Thinking: Industrial Code World Model for Thinking

InCoder-32B-Thinking integrates Error-driven Chain-of-Thought synthesis with an Industrial Code World Model for hardware

2026-04-03 | arXiv: 2604.03144

Self-Distilled RLVR

Identifies why on-policy self-distillation fails in RLVR (irreducible information gaps causing leakage) and proposes RLS

2026-04-03 | arXiv: 2604.03128

GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

GrandCode introduces a multi-agent reinforcement learning system with Agentic GRPO for competitive programming. It achie

2026-04-03 | arXiv: 2604.02721

Generative World Renderer

This paper presents a 4M-frame dataset from video games with synchronized RGB and G-buffer channels, captured via a non-

2026-04-02 | arXiv: 2604.02329

Steerable Visual Representations

SteerViT enables text-steerable visual representations by inserting cross-attention layers into frozen Vision Transforme

2026-04-02 | arXiv: 2604.02327

A Simple Baseline for Streaming Video Understanding

SimpleStream shows that a simple baseline using only recent frames without complex memory mechanisms achieves state-of-t

2026-04-02 | arXiv: 2604.02317

VOID: Video Object and Interaction Deletion

VOID extends video object removal to causal interactions where removing objects affects other scene elements. Using coun

2026-04-02 | arXiv: 2604.02296

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

SKILL0 introduces the first RL framework that internalizes agent skills into model parameters through In-Context Reinfor

2026-04-02 | arXiv: 2604.02268

Adam's Law: Textual Frequency Law on Large Language Models

Textual Frequency Law (TFL) proposes preferring higher-frequency data for LLM training when meanings are identical. The

2026-04-02 | arXiv: 2604.02176

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

This survey examines latent space computation in language models, proposing a taxonomy across Mechanism and Ability dime

2026-04-02 | arXiv: 2604.02029

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

CORAL introduces autonomous multi-agent evolution where agents control decisions through shared memory and heartbeats. I

2026-04-02 | arXiv: 2604.01658

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

The paper reveals that RL-trained VLMs lose divergent thinking due to GRPO's diversity collapse within 20 steps. While RL m

2026-04-01 | arXiv: 2604.00479

PixelSmile: Toward Fine-Grained Facial Expression Editing

PixelSmile introduces a diffusion framework for fine-grained facial expression editing, addressing semantic overlap betw

2026-03-26 | arXiv: 2603.25728

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Intern-S1-Pro is the first trillion-parameter scientific multimodal foundation model, demonstrating that large generalis

2026-03-26 | arXiv: 2603.25040

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

MinerU-Diffusion reformulates document OCR as inverse rendering via diffusion decoding with block-attention architecture

2026-03-23 | arXiv: 2603.22458